Fine-tune a RoBERTa Encoder-Decoder model trained on MLM for Text Generation

Eduardo Muñoz · Published in Analytics Vidhya · 9 min read · Oct 4, 2021

Part 2 of a model to generate names from product descriptions

Photo by Xavier von Erlach on Unsplash

Our problem description

As I mentioned in my previous post, I spent a few weeks investigating different models and alternatives in Huggingface to train a text generation model. We have a list of products with their descriptions, and our goal is to generate the name of each product. Finally, in order to deepen my knowledge of Huggingface transformers, I decided to tackle the problem with a different approach: an encoder-decoder model. Maybe it was not the best option, but I wanted to learn new things about Huggingface Transformers.

First, I must admit that a text generation problem is not usually approached with this kind of solution, using encoder models like BERT or RoBERTa. But in this problem we are not going to generate “free text”, so we can simplify our task. We are looking for a subset of words from the product description to compose the product name, and our full vocabulary is present in the input data. From this point of view, we can encode the product description into a vector representation and decode it into a product name. Therefore, an encoder-decoder is an option worth evaluating.

Our problem can be represented as a sequence-to-sequence problem, where we need to find a mapping from an input sequence (the product description) to an output sequence (the product name). In the Huggingface blog post “Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models” you can find a deep explanation and experiments building many encoder-decoder models using BERT or GPT-2 transformer models. I highly recommend reading it.

In a previous Medium post, “Create a Tokenizer and Train a Huggingface RoBERTa Model from Scratch”, we created a custom tokenizer and trained a RoBERTa model. Now we will use that trained model to build an encoder-decoder model and fine-tune it on our dataset. We will only describe the most interesting sections of code; you can find the full code in my GitHub repo.

The Dataset

As mentioned in the previous post, our dataset contains around 31,000 items, clothing products from a major retailer, each with a long product description and a short product name, our target variable. First, we run an exploratory data analysis and observe that the number of rows with outlier values is small. It is a private dataset, so we cannot make it publicly available.

Encoder-Decoder Architecture

In many machine learning solutions, we find architectures where the first section of the deep neural network produces a vector representation of the input and the second half takes that vector as input and generates the expected result. This is basically an encoder-decoder model, and combined with RNNs they became a very powerful tool for solving NLP tasks.

“Analogous to RNN-based encoder-decoder models, transformer-based encoder-decoder models consist of an encoder and a decoder which are both stacks of residual attention blocks. The key innovation of transformer-based encoder-decoder models is that such residual attention blocks can process an input sequence of variable length without exhibiting a recurrent structure. Not relying on a recurrent structure allows transformer-based encoder-decoders to be highly parallelizable,…” [2], “Transformer-based Encoder-Decoder Models”.

The purpose of this architecture is to find a mapping function between an input sequence and its target output sequence. In the case of transformers, the encoder encodes the input sequence into a sequence of hidden states; the decoder then takes the target sequence and the encoded hidden states to model or learn the mapping function. The encoder only works in the first step; in the following steps the decoder receives the next target token and reuses the hidden states the encoder produced in step 1. A deep explanation can be found in the aforementioned post [2], “Transformer-based Encoder-Decoder Models”.

With this idea in mind, we can build an encoder-decoder model by combining an encoder-only model, such as BERT, with a decoder-only model, such as GPT-2, to produce a target sequence. Nowadays, many pretrained models are available in Huggingface’s model hub, so the first option to evaluate is using these pretrained checkpoints to build our encoder-decoder and fine-tune it for our specific task.

In the post “Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models” [1], many combinations are explained and evaluated, but in summary you need to consider how to initialize the models:

  • Initialize both the encoder and the decoder from a pretrained encoder-only checkpoint.
  • Initialize the encoder from an encoder-only checkpoint and the decoder from a decoder-only checkpoint.
  • Initialize only the encoder or only the decoder from a pretrained model, leaving the other randomly initialized.
  • Share or not share the encoder weights with the decoder.
  • Choose which transformer model to use for the encoder and the decoder; it could be a BERT, GPT-2, or RoBERTa model.

In our experiment we will try a RoBERTaShared strategy, where both the encoder and the decoder are RoBERTa-based and share their weights. This combination has shown great results in many tasks.

Photo by Danist Soh on Unsplash

Create the encoder-decoder model from a pretrained RoBERTa model

Load the trained tokenizer on our specific language

As mentioned previously, we trained a tokenizer and a RoBERTa model from scratch using the Masked Language Modeling technique, to focus the model on our specific domain. Now we can configure our encoder-decoder using this pretrained model.

Let me show you which libraries we will import for this project.
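A minimal set of imports for this kind of pipeline could look like the following sketch (the exact list depends on your environment and library versions):

```python
import torch
import datasets                          # dataset handling and the ROUGE metric

from transformers import (
    RobertaTokenizerFast,                # the fast tokenizer trained in Part 1
    EncoderDecoderModel,                 # wrapper combining an encoder and a decoder
    Seq2SeqTrainer,                      # Trainer subclass for sequence-to-sequence tasks
    Seq2SeqTrainingArguments,
)
```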

The first step is loading the tokenizer, which we will use to generate our input and target tokens and transform them into a vector representation of the text data.
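A sketch of that step, assuming the tokenizer from Part 1 was saved to a local folder (the path below is just a placeholder):

```python
from transformers import RobertaTokenizerFast

# Placeholder path: wherever the tokenizer from Part 1 was saved
tokenizer_path = "./roberta-tokenizer"
tokenizer = RobertaTokenizerFast.from_pretrained(tokenizer_path)

# Quick sanity check of the special tokens we will rely on later
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.sep_token, tokenizer.pad_token)
```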

Prepare and create the Dataset

In the next step, we need to generate the dataset for model training. Using the loaded tokenizer, we tokenize the text data, apply padding, and truncate the input and output sequences. Remember that we can define one maximum length for the input data and a different one for the output. Finally, we define a batch size to split the dataset into batches for training and evaluation.
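A possible implementation of this step, assuming the data sits in a pandas DataFrame train_df with “description” and “name” columns (both names are placeholders), and assuming a recent transformers version that builds the decoder inputs from the labels automatically:

```python
import datasets

# Maximum sequence lengths and batch size (illustrative values, not tuned)
encoder_max_length = 60   # product descriptions
decoder_max_length = 20   # product names
batch_size = 16

def process_data_to_model_inputs(batch):
    # Tokenize the product descriptions (encoder inputs)
    inputs = tokenizer(batch["description"], padding="max_length",
                       truncation=True, max_length=encoder_max_length)
    # Tokenize the product names (decoder targets)
    outputs = tokenizer(batch["name"], padding="max_length",
                        truncation=True, max_length=decoder_max_length)

    batch["input_ids"] = inputs.input_ids
    batch["attention_mask"] = inputs.attention_mask
    # Use the target token ids as labels, masking padding with -100
    # so it is ignored by the cross-entropy loss
    batch["labels"] = [
        [-100 if token == tokenizer.pad_token_id else token for token in labels]
        for labels in outputs.input_ids
    ]
    return batch

# train_df: pandas DataFrame with "description" and "name" columns (placeholder names)
train_dataset = datasets.Dataset.from_pandas(train_df)
train_dataset = train_dataset.map(
    process_data_to_model_inputs,
    batched=True,
    batch_size=batch_size,
    remove_columns=["description", "name"],
)
train_dataset.set_format(type="torch",
                         columns=["input_ids", "attention_mask", "labels"])
```

The validation split is prepared in exactly the same way.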

RoBERTa is a variant of BERT, so the expected inputs are similar: the input_ids and the attention_mask. However, RoBERTa doesn’t have the token_type_ids parameter that BERT uses; it was not trained on Next Sentence Prediction, so you don’t need to indicate which token belongs to which segment. Just separate your segments with the separation token tokenizer.sep_token (or </s>).
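A quick check illustrates this: the RoBERTa tokenizer returns only input_ids and an attention_mask.

```python
sample = tokenizer("loose-fitting dress with round neckline",
                   padding="max_length", truncation=True, max_length=12)
print(sample.keys())   # input_ids and attention_mask only, no token_type_ids
```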

Create the RoBERTa Encoder-Decoder model

We are building our model based on the pretrained model we built in Part 1 of this series. Thanks to Huggingface’s libraries and wrappers, it is very simple to create the model: we just need to invoke the from_encoder_decoder_pretrained method of the EncoderDecoderModel class from the Huggingface transformers library. We set the encoder and decoder pretrained models by passing the folder where we saved our trained RoBERTa model, and indicate that we want to "tie" the weights of both models so they are shared.
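A minimal sketch of that call, assuming the Part 1 checkpoint was saved to a local folder (the path is a placeholder):

```python
from transformers import EncoderDecoderModel

# Placeholder path: the folder where the RoBERTa model from Part 1 was saved
model_path = "./roberta-mlm-model"

# Use the same checkpoint for encoder and decoder and tie (share) their weights
roberta_shared = EncoderDecoderModel.from_encoder_decoder_pretrained(
    model_path, model_path, tie_encoder_decoder=True
)
```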

Once the model is built, we need to specify a few parameters, such as the special tokens that start a sentence (bos) and end a sentence (eos).
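For instance, something along these lines, following the usual Huggingface encoder-decoder setup:

```python
# Special tokens the decoder needs for generation
roberta_shared.config.decoder_start_token_id = tokenizer.bos_token_id
roberta_shared.config.eos_token_id = tokenizer.eos_token_id
roberta_shared.config.pad_token_id = tokenizer.pad_token_id

# Vocabulary size taken from the shared RoBERTa checkpoint
roberta_shared.config.vocab_size = roberta_shared.config.encoder.vocab_size
```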

Very important parameters to declare are those related to the decoding strategy. Any text generator needs to define how the next output token is selected from all the available possibilities. This kind of architecture models the probability of the next token given the sequence of previous tokens. At each step we obtain the probabilities of every candidate token being the next one, and we need a method to select one of them.

Decoding strategies

We are not going to analyze all the possibilities but we want to mention some of the alternatives that the Huggingface library provides.

Our first and most intuitive approximation is greedy search, where we take the word with the highest probability in our vocabulary at every step. Unfortunately, this strategy tends to produce repetitive patterns and get caught in cycles. It is a very deterministic method, which is not what a text generator should be. It also creates the problem of a high-probability token “hiding behind” a low-probability token that precedes it in the sentence, so we can get stuck in a suboptimal solution.

Another approach that tries to mitigate this problem is beam search, which maintains multiple possible paths or sequences to approximate the solution. From this beam of candidates, the most likely sequence of words is selected deterministically. It is not a very creative procedure, but it works fine for translation tasks; translating sentences between languages does not require great creativity. If we want to generate a story or a dialogue, we need to “explore” alternatives and reduce determinism.

Sampling techniques are a way to introduce random selection into the text generation process. But with pure random sampling we cannot control how the output is produced, so a “guided” sampling is a better approach: “A simple, but very powerful sampling scheme, called Top-K sampling. In Top-K sampling, the K most likely next words are filtered and the probability mass is redistributed among only those K next words” [3]. “This has the effect of constraining the generation process to select words that the model itself has deemed to be more sensible in the context.”

During our training stage, we apply a beam search strategy using 10 beams and penalize repeated n-grams and lengthy outputs.
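In terms of configuration, that could look like this; the min/max length values are assumptions tied to our short product names:

```python
# Generation parameters used when evaluating during training:
# beam search with 10 beams, no repeated 3-grams, and a length penalty
roberta_shared.config.num_beams = 10
roberta_shared.config.no_repeat_ngram_size = 3
roberta_shared.config.length_penalty = 2.0   # exponent on sequence length when scoring beams
roberta_shared.config.early_stopping = True
roberta_shared.config.max_length = 20        # assumption: product names are short
roberta_shared.config.min_length = 2
```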

Image by congerdesign on Pixabay

Training the Encoder-Decoder

The Trainer component of the Huggingface library will train our new model in a very easy way, in just a few lines of code. The Trainer API provides all the capabilities we need to train almost any transformer model, and we do not have to code the training loop at all.

When training an ML model we need an evaluation metric; we chose the ROUGE score, which is used to evaluate translation and summarization tasks. We are framing our problem as a summarization task, from the product description to the product name, so this metric seems to be the right choice, but there could be other interesting alternatives.

The datasets library provides us with that metric, so we load it and define a compute_metrics method that calculates the metric between the predicted text and the target text, the product name.
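A sketch of that method, following the standard pattern for ROUGE with the datasets library (here we report ROUGE-2, which is one reasonable choice among several):

```python
import datasets

# ROUGE metric from the datasets library
rouge = datasets.load_metric("rouge")

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    # Decode generated ids and labels back to text;
    # restore the padding ids we masked with -100 before decoding
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str,
                                 references=label_str,
                                 rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }
```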

Now it is time to set the training arguments: batch size, number of training epochs, when to save the model, etc. Then we can instantiate a Seq2SeqTrainer, a subclass of the Trainer object we mentioned, selecting the model to train, the training arguments, the metrics computation, and the training and evaluation datasets.
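Something like the following, where most hyperparameter values are illustrative rather than tuned:

```python
training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",          # placeholder output folder
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    predict_with_generate=True,          # evaluate on real generated sequences
    evaluation_strategy="epoch",
    save_strategy="epoch",
    num_train_epochs=3,                  # "a few epochs"; adjust to your needs
    logging_steps=100,
    fp16=True,                           # mixed precision, only if a GPU is available
)

trainer = Seq2SeqTrainer(
    model=roberta_shared,
    args=training_args,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,            # validation split built like train_dataset
)
```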

And we are ready to train our encoder-decoder model for just a few epochs; finally, we save the fine-tuned encoder-decoder model.
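The last two calls are straightforward (the output folder name is a placeholder):

```python
trainer.train()

# Save the fine-tuned encoder-decoder for later inference
roberta_shared.save_pretrained("./roberta-shared-finetuned")
```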

Evaluate the model on the test dataset

Once we have our model trained, we can use it to generate names for our products and check the result of our fine-tuning process on our objective task.

We load a test dataset, without labels, and delete rows containing null values.

Then we load the Tokenizer and the fine-tuned model from a saved version.
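For example, reusing the same placeholder paths as before:

```python
import torch
from transformers import EncoderDecoderModel, RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("./roberta-tokenizer")
model = EncoderDecoderModel.from_pretrained("./roberta-shared-finetuned")
model.to("cuda" if torch.cuda.is_available() else "cpu")
```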

In order to improve the results, we will define two methods to generate the text, one using the beam search decoding strategy and one using random sampling, apply both of them, and compare the results.
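The two generation helpers could look like this; the top_k value and the generation parameters are illustrative choices, and encoder_max_length / decoder_max_length are the lengths defined when preparing the dataset:

```python
def generate_beam_search(texts):
    """Generate product names with beam search decoding."""
    inputs = tokenizer(texts, padding="max_length", truncation=True,
                       max_length=encoder_max_length, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=10,
        no_repeat_ngram_size=3,
        early_stopping=True,
        max_length=decoder_max_length,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)


def generate_top_k_sampling(texts):
    """Generate product names with sampling restricted to the top-k tokens."""
    inputs = tokenizer(texts, padding="max_length", truncation=True,
                       max_length=encoder_max_length, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        do_sample=True,
        top_k=50,                        # illustrative value for top-k filtering
        max_length=decoder_max_length,
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)
```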

Now we can make predictions for the test dataset using the beam search strategy and the top-k sampling technique.

And we are ready to visualize the results, using beam search or a sampling technique as decoding strategies.

Product Description: loosefitting dress with round neckline long sleeves pleat details buttoned opening back.height model 69.6″

Name using Beam Search: flowing dress with pleats

Name using Top-K Sampling: printed dress with pleats

Conclusion

You have seen how you can build an encoder-decoder based on transformers and fine-tune it to solve a specific task. It is a simple but powerful tool you can use to customize your ML solutions. This is just an example, and it can probably be improved considerably. I hope you enjoyed it and have fun doing some research of your own.

The code is available in my GitHub repo, here is the link to the notebook.

References

[1] Patrick von Platen, “Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models”, Nov 2020, Huggingface blog post.

[2] Patrick von Platen, “Transformer-based Encoder-Decoder Models”, Oct 2020, Huggingface blog post.

[3] Patrick von Platen, “How to generate text: using different decoding methods for language generation with Transformers”, Mar 2020, Huggingface blog post.

[4] “Decoding Strategies for Text Generation”, Dec 2020.
