Preparing an NLP dataset for MLM

Hi, I am trying to use nlp datasets to train a RoBERTa model from scratch, and I am not sure how to prepare the dataset to pass it to the Trainer:

!pip install datasets
from datasets import load_dataset
dataset = load_dataset('wikicorpus', 'raw_en')

from transformers import DataCollatorForLanguageModeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, 
    mlm=True, 
    mlm_probability=0.15)

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./",
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,  # replaces the deprecated per_gpu_train_batch_size
    save_steps=10_000,
    save_total_limit=2)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset)

How should I call dataset.set_format() so that it only takes the text of the dataset, line by line?
Or what’s the proper way to prepare the dataset for MLM?

In the past I have been doing it with:

from transformers import LineByLineTextDataset
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path="/dataset.txt"
)

but that class will be removed soon and does not support multiple .txt files.

Thanks

You should have a look at the preprocessing done in the run_mlm example. There is also the corresponding notebook that can help.
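For readers who cannot open the example, here is a minimal sketch of the kind of grouping step that script applies after tokenization: concatenate the tokenized texts of a batch and cut them into fixed-size blocks. It is written from memory as an illustration, not copied from run_mlm; the block_size default of 128 and the toy batch are assumptions.

```python
from itertools import chain


def group_texts(examples, block_size=128):
    """Concatenate tokenized texts in a batch and split them into fixed-size blocks.

    `examples` is a dict of columns (e.g. input_ids, attention_mask), each a
    list of token lists, which is the shape a batched map function receives.
    block_size=128 is an illustrative default, not a recommended value.
    """
    # Flatten every column into one long list of tokens.
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Drop the tail that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    return {
        k: [tokens[i:i + block_size] for i in range(0, total_length, block_size)]
        for k, tokens in concatenated.items()
    }


# Toy batch standing in for tokenizer output (values are made up).
batch = {
    "input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]],
    "attention_mask": [[1, 1, 1], [1, 1], [1, 1, 1, 1]],
}
grouped = group_texts(batch, block_size=4)
print(grouped["input_ids"])
```

With the datasets library, the usual pattern would be to tokenize the text column first with a batched dataset.map() call (dropping the original columns via remove_columns), then apply a function like this with another dataset.map(..., batched=True), and finally pass a single split such as dataset["train"] to the Trainer. For a line-by-line setup, the alternative is to skip the grouping and tokenize each non-empty line with truncation instead. The exact column names to remove depend on the dataset, so check dataset["train"].column_names first.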

Hi there,
The first link does not work, and the notebook you mentioned does not run in Colab either.

This is the link: run_mlm.py
I’m not sure about the notebook though.

If you’re preparing an NLP dataset for a Masked Language Model (MLM), it’s important to have high-quality, diverse data so the model can learn to predict masked tokens from context. For a list of NLP datasets to help you get started, check out this blog: Top NLP Datasets to Supercharge Your Machine Learning Models. These datasets offer a variety of text sources that can support a range of NLP tasks, including MLM training.