Create a Tokenizer and Train a Huggingface RoBERTa Model from Scratch

Eduardo Muñoz · Published in Analytics Vidhya · Aug 16, 2021 · 7 min read


Brief Introduction

This blog post is the first part of a series in which we build a product name generator using a transformer model. For a few weeks I investigated different models and alternatives in Huggingface for training a text generation model. We have a list of products with their descriptions, and our goal is to generate the name of each product. I ran some experiments with a Transformer model in TensorFlow as well as with the T5 summarizer. Finally, in order to deepen my knowledge of Huggingface transformers, I decided to tackle the problem with a somewhat more complex approach, an encoder-decoder model. Maybe it was not the best option, but I wanted to learn new things about Huggingface Transformers. We will go deeper into that model in the next post of the series.

Here, in this first part, we show how to train a tokenizer from scratch and how to use the Masked Language Modeling technique to create a RoBERTa model. This custom model will become the base model for our future encoder-decoder model.

Our own solution

For our experiment, we are going to train a RoBERTa model from scratch; it will become the encoder and the decoder of a future model. But our domain is very specific: words and concepts about clothes, shapes, colors, and so on. Therefore, we are interested in defining our own tokenizer, created from our specific vocabulary, avoiding more common words from other domains or use cases that are irrelevant for our final purpose.

We can describe our training phase in two main steps:

  • Create and train a byte-level Byte-Pair Encoding tokenizer with the same special tokens as RoBERTa.
  • Train a RoBERTa model from scratch using Masked Language Modeling (MLM).

The code is available in this Github repository. In this post, we will only show you the main code sections and some explanations for them.

The Dataset

As we mentioned before, our dataset contains around 31,000 items about clothes from a major retailer, each with a long product description and a short product name, our target variable. First, we run an exploratory data analysis and observe that the number of rows with outlier values is small. The word count of the descriptions is concentrated, with 75% of rows in the range of 50–60 words and a maximum of about 125 words. The target variable contains about 3 to 6 words.


Train a Tokenizer

The Stanford NLP group defines tokenization as:

“Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation.”

A tokenizer breaks a string of characters, usually sentences of text, into tokens and maps each token to an integer ID, often by looking for whitespace (tabs, spaces, newlines). It usually splits a sentence into words, but there are many other options, such as subwords.

We will use a byte-level Byte-Pair Encoding tokenizer. “Byte pair encoding (BPE) is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data” (“Byte pair encoding”, Wikipedia). The benefit of this method is that it starts building its vocabulary from an alphabet of single characters, so every word can be decomposed into tokens and we avoid the presence of unknown (UNK) tokens.

A great explanation of tokenizers can be found on the Huggingface documentation, https://huggingface.co/transformers/tokenizer_summary.html.

To train a tokenizer we need to save our dataset as a set of plain text files: we create one text file for every description value.
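For illustration, here is a minimal sketch of this step; the input file products.csv, the column name description and the folder ./text_data are assumptions, not names taken from the repository.

# Sketch: write one plain-text file per product description.
from pathlib import Path
import pandas as pd

df = pd.read_csv("products.csv")            # assumed input file
out_dir = Path("./text_data")               # assumed output folder
out_dir.mkdir(parents=True, exist_ok=True)

for i, text in enumerate(df["description"].astype(str)):
    (out_dir / f"description_{i}.txt").write_text(text, encoding="utf-8")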

Now we can train our tokenizer on the text files that contain our vocabulary. We need to specify the vocabulary size, the minimum frequency for a token to be included, and the special tokens. We choose a vocab size of 8,192 and a min frequency of 2 (you can tune these values depending on your maximum vocabulary size). The special tokens depend on the model; for RoBERTa we include a shortlist:

  • <s> or BOS, the Beginning of Sentence token
  • </s> or EOS, the End of Sentence token
  • <pad>, the padding token
  • <unk>, the unknown token
  • <mask>, the masking token

The number of samples is small, so the tokenizer trains very fast. Now we can save the tokenizer to disk; later we will use it to train the language model.
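A sketch of the training and saving steps using the Huggingface tokenizers library (the ./text_data and ./tokenizer paths are assumptions):

# Sketch: train a byte-level BPE tokenizer on the text files and save it.
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

paths = [str(p) for p in Path("./text_data").glob("*.txt")]

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=paths,
    vocab_size=8192,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Writes vocab.json and merges.txt into the target folder
tokenizer.save_model("./tokenizer")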

We now have both a vocab.json, a list of the most frequent tokens ranked by frequency that is used to convert tokens to IDs, and a merges.txt file, which contains the BPE merge rules used to split text into tokens.

# vocab.json
{
"<s>": 0,
"<pad>": 1,
"</s>": 2,
"<unk>": 3,
"<mask>": 4,
"!": 5,
"\"": 6,
"#": 7,
"$": 8,
"%": 9,
"&": 10,
"'": 11,
"(": 12,
")": 13,
# ...
}

# merges.txt
l a
Ġ k
o n
Ġ la
t a
Ġ e
Ġ d
Ġ p
# ...

What is great is that our tokenizer is optimized for our very specific vocabulary, instead of a generic tokenizer trained for common English.

We can instantiate our tokenizer using both files and test it with some text from our dataset.
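A possible sketch of this check, reusing the two files saved above (the sample sentence is invented):

# Sketch: reload the trained tokenizer and encode a sample description.
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

tokenizer = ByteLevelBPETokenizer("./tokenizer/vocab.json", "./tokenizer/merges.txt")

# RoBERTa-style post-processing: wrap every encoding in <s> ... </s>
tokenizer._tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),
    ("<s>", tokenizer.token_to_id("<s>")),
)
tokenizer.enable_truncation(max_length=512)

encoding = tokenizer.encode("Blue cotton t-shirt with printed logo")
print(encoding.tokens)
print(encoding.ids)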

Later, when training the model, we will initialize the tokenizer using the from_pretrained method.


Train a language model from scratch

We’ll train a RoBERTa model, which is BERT-like with a couple of changes (check the documentation for more details). In summary: “It builds on BERT and modifies key hyperparameters, removing the next-sentence pretraining objective and training with much larger mini-batches and learning rates”, Huggingface documentation on RoBERTa.

As the model is BERT-like, we’ll train it on a Masked Language Modeling task. It involves masking part of the input, about 10–20% of the tokens, and training the model to predict the masked tokens. MLM is often used as a pre-training task, giving the model the opportunity to learn textual patterns from unlabeled data before being fine-tuned on a particular downstream task. The main benefit is that we do not need labeled data, which is hard to obtain: no text needs to be labeled by human annotators in order to predict the missing values.

We are going to train the model from scratch, not from a pretrained one. We create a model configuration for our RoBERTa model, setting the main parameters:

  • Vocabulary size
  • Attention heads
  • Hidden layers

Finally, let’s initialize our model using the configuration. As we are training from scratch, we initialize from a config that defines the architecture of the model without restoring previously trained weights; the weights are randomly initialized.
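A sketch of the configuration and the random initialization; the concrete hyperparameter values below are illustrative assumptions, except the vocabulary size, which matches the tokenizer trained above.

# Sketch: define the architecture and initialize it with random weights.
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=8192,              # must match the tokenizer vocabulary
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

model = RobertaForMaskedLM(config=config)
print(f"Number of parameters: {model.num_parameters():,}")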

And we recreate our tokenizer, using the one trained and saved in the previous step. We use a RobertaTokenizerFast object and the from_pretrained method to initialize it.
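For example (assuming ./tokenizer is the folder where vocab.json and merges.txt were saved):

# Sketch: load the trained tokenizer as a fast Huggingface tokenizer.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./tokenizer", model_max_length=512)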

Building the training dataset

We’ll build a PyTorch dataset by subclassing the Dataset class. The CustomDataset receives a Pandas Series with the description values and the tokenizer used to encode them, and returns the list of token IDs for every product description in the Series.
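A minimal sketch of such a dataset class; the maximum sequence length and the padding strategy are assumptions.

# Sketch: a PyTorch Dataset that tokenizes the product descriptions.
import torch
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, descriptions, tokenizer, max_length=256):
        # descriptions: a Pandas Series (or list) of description strings
        self.encodings = tokenizer(
            list(descriptions),
            truncation=True,
            max_length=max_length,
            padding="max_length",
        )

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        # Return the token ids of one product description
        return torch.tensor(self.encodings["input_ids"][idx])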

In order to evaluate the model during training, we will generate a training dataset and an evaluation dataset.
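For instance, using a simple random split (train_test_split and the 90/10 ratio are assumptions, not taken from the repository):

# Sketch: build the training and evaluation datasets.
from sklearn.model_selection import train_test_split

train_texts, eval_texts = train_test_split(
    df["description"], test_size=0.1, random_state=42
)

train_dataset = CustomDataset(train_texts, tokenizer)
eval_dataset = CustomDataset(eval_texts, tokenizer)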

Once we have the dataset, a data collator will help us mask our training texts. Data collators are objects that form a batch from a list of dataset elements, possibly applying some processing such as padding or random masking, and produce an object that PyTorch knows how to run backprop on. The DataCollatorForLanguageModeling class lets us set the probability with which tokens in the input are randomly masked.
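A sketch of the collator; a masking probability of 0.15 is the usual default and an assumption here.

# Sketch: collator that batches examples and randomly masks tokens.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.15,
)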

Initialize and train our Trainer

When we want to train a transformer model, the basic approach is to create a Trainer, a class that provides an API for feature-complete training and contains the basic training loop. First, we define the training arguments; there are many of them, but the most relevant are listed below (a sketch follows the list):

  • output_dir, where the model artifacts will be saved
  • evaluation_strategy, when the validation loss will be computed
  • num_train_epochs, the number of training epochs
  • per_device_train_batch_size, the batch size for training
  • per_device_eval_batch_size, the batch size for evaluation
  • learning_rate, initialized to 1e-4
  • weight_decay, set to 0.01
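A sketch of these arguments; apart from the learning rate and weight decay mentioned above, the concrete values are illustrative assumptions.

# Sketch: training arguments for the Trainer.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./roberta-retail",        # assumed output folder
    evaluation_strategy="epoch",
    num_train_epochs=10,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=1e-4,
    weight_decay=0.01,
    save_strategy="epoch",
)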

Finally, we create a Trainer object using the arguments, the training dataset, the evaluation dataset, and the data collator defined above. Now we are ready to train our model.
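Putting the pieces together (a sketch):

# Sketch: create the Trainer and launch training.
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()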

As a result, we can watch the loss decrease during training.

Finally, we save the model and the tokenizer so that they can be restored for a future downstream task, our encoder-decoder model.
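For example (the output folder is an assumption):

# Sketch: persist the model and the tokenizer for later reuse.
trainer.save_model("./roberta-retail")
tokenizer.save_pretrained("./roberta-retail")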

Checking the trained model using a Pipeline

Watching the training and eval losses go down is not enough; we would like to apply our model to check whether it has learned anything interesting. An easy way to do so is via the FillMaskPipeline.

Pipelines are simple wrappers around tokenizers and models. We can use the ‘fill-mask’ pipeline: we input a sequence containing a masked token (<mask>) and it returns a list of the most probable filled sequences, with their probabilities. It only works for inputs with exactly one masked token (see the Huggingface official documentation).
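A sketch of such a check; the example sentence is invented and the model path is an assumption.

# Sketch: sanity-check the model with the fill-mask pipeline.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="./roberta-retail",
    tokenizer="./roberta-retail",
)

# Exactly one <mask> token per input sequence
for prediction in fill_mask("Long sleeve <mask> shirt in organic cotton"):
    print(prediction["token_str"], prediction["score"])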

And that’s all for this first part. I hope you’ve found inspiration for your future projects, and we’ll be back with the second part in a few days.

The code is available in this Github repository.

References

“How to train a new language model from scratch using Transformers and Tokenizers”, Huggingface Blog, Feb 2020.

“Encoder-Decoder Models”, Huggingface official documentation.

“RoBERTa”, Huggingface official documentation.

“Byte pair encoding”, Wikipedia.
