Preparing Data for BERT Training

BERT is an encoder-only transformer model pretrained on the masked language model (MLM) and next sentence prediction (NSP) tasks before being fine-tuned for various NLP tasks. Pretraining requires special data preparation. In this article, you will learn how to:

  • Create masked language model (MLM) training data
  • Create next sentence prediction (NSP) training data
  • Set up labels for BERT training
  • Use Hugging Face datasets to store the training data

Let’s get started.

Photo by Daniel Gimbel. Some rights reserved.

Overview

This article is divided into four parts; they are:

  • Preparing Documents
  • Creating Sentence Pairs from a Document
  • Masking Tokens
  • Saving the Training Data for Reuse

Preparing Documents

BERT’s pretraining is more complex than that of decoder-only models. As mentioned in the previous post, pretraining optimizes the sum of the losses from the MLM and NSP tasks. Therefore, the training data must be labeled for both tasks.

Let’s follow Google’s BERT implementation using the Wikitext-2 or Wikitext-103 dataset. Each line in the dataset is either empty, a title line starting with “=”, or regular text. Only regular text lines are used for training.

BERT training requires two “sentences” per sample. For simplicity, define:

  • A “sentence” is a line of text in the dataset
  • A document is a sequence of consecutive “sentences”, separated by empty lines or title lines

Assuming you have trained a tokenizer as in the previous post, let’s create a function to collect text into a list of documents:
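A minimal sketch of such a function, assuming a tokenizer whose `encode()` method returns an object with an `.ids` list of token IDs (as in the Hugging Face `tokenizers` library); the function name is illustrative:

```python
def collect_documents(lines, tokenizer):
    """Group tokenized text lines into a list of documents.

    Empty lines and title lines (starting with "=") act as delimiters;
    every other line becomes one "sentence", stored as a list of token IDs.
    """
    documents = [[]]
    for line in lines:
        line = line.strip()
        if not line or line.startswith("="):
            # Delimiter: start a new document if the current one has content
            if documents[-1]:
                documents.append([])
        else:
            documents[-1].append(tokenizer.encode(line).ids)
    if documents and not documents[-1]:
        documents.pop()  # drop a trailing empty document
    return documents
```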

This code processes text lines sequentially. When encountering a document delimiter, it creates a new list for subsequent lines. Each line is stored as a list of integers from the tokenizer.

Tracking documents is essential for the NSP task: two sentences form a “next sentence” pair only if both come from the same document.

Creating Sentence Pairs from a Document

The next step extracts sentence pairs from documents. Pairs can be consecutive sentences from the same document or random sentences from different documents. Let’s use the following algorithm to create the pairs:

  1. Scan each sentence from each document as the first sentence
  2. For the second sentence, pick either the next sentence from the same document or a random sentence from another document

But there is a constraint: the total length of the sentence pair must not exceed BERT’s maximum sequence length. You need to truncate the sentences if necessary.

Here’s how you can implement this algorithm in Python:
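A sketch of this algorithm, in the spirit of Google’s implementation. The function name is illustrative, and this version assumes there are at least two documents so a random second sentence can always be drawn from elsewhere:

```python
import random

def create_sentence_pairs(documents, doc_idx, target_len):
    """Create (sentence_a, sentence_b, is_random_next) triples from one document."""
    document = documents[doc_idx]
    pairs = []
    # Copy sentences as chunks so the original document is not mutated
    chunks = [list(sentence) for sentence in document]
    while chunks:
        # Scan chunks until the target token length is reached
        current, current_len, end = [], 0, 0
        while end < len(chunks) and current_len < target_len:
            current.append(chunks[end])
            current_len += len(chunks[end])
            end += 1
        # Randomly split the collected sentences into two segments
        a_end = random.randint(1, max(1, len(current) - 1))
        sentence_a = [tok for sent in current[:a_end] for tok in sent]
        if len(current) == 1 or random.random() < 0.5:
            # 50% of the time: segment B comes from a different document
            is_random_next = True
            other_idx = random.choice([i for i in range(len(documents)) if i != doc_idx])
            other_doc = documents[other_idx]
            start = random.randrange(len(other_doc))
            sentence_b = [tok for sent in other_doc[start:] for tok in sent]
            end = a_end  # put the unused sentences back for the next pair
        else:
            is_random_next = False
            sentence_b = [tok for sent in current[a_end:] for tok in sent]
        pairs.append((sentence_a, sentence_b, is_random_next))
        chunks = chunks[end:]  # keep only the unused portion
    return pairs
```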

This code creates pairs from a given document at index doc_idx. The initial for-loop copies sentences as chunks to avoid mutating the original document. The while-loop scans chunks until it reaches the target token length, then randomly splits them into two segments.

With 50% probability, the second sentence is replaced with a random sentence from another document. This large if-block creates the NSP task label (recorded in is_random_next) and samples a sentence from another document.

At the end of each iteration, chunks is updated to retain unused portions. The document is exhausted when this list empties. Both sentence_a and sentence_b are lists of integer token IDs.

This approach follows Google’s original BERT implementation, though it doesn’t exhaust all possible combinations. The pairs created above may exceed the target sequence length; you need to truncate them. The truncation is implemented as follows:
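A sketch of that truncation step; the chunks are trimmed in place:

```python
import random

def truncate_pair(chunk_a, chunk_b, target_len):
    """Trim the longer chunk, one token at a time, until the pair fits.

    Tokens are removed from the front or the back with equal probability,
    so the survivor may be a chunk from the middle of the original sentence.
    """
    while len(chunk_a) + len(chunk_b) > target_len:
        longer = chunk_a if len(chunk_a) > len(chunk_b) else chunk_b
        if random.random() < 0.5:
            longer.pop(0)   # remove from the front
        else:
            longer.pop()    # remove from the back
    return chunk_a, chunk_b
```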

Truncation is applied iteratively to the longer sentence until the total length falls below the target. Tokens are removed from either end with equal probability. The result may be a chunk from the middle of the original sentence, hence the naming convention “chunk” in the code above.

Masking Tokens

Masked tokens are the most critical part of BERT training data. The original paper specifies that 15% of tokens are selected for prediction, so the model is trained to predict only 15% of its output. For each token within this 15%:

  • 80% of the time, the token is replaced with the [MASK] token.
  • 10% of the time, the token is replaced with a random token from the vocabulary.
  • 10% of the time, the token is unchanged.

Predicting only 15% of tokens, rather than all tokens, is an engineering decision. Masking or distorting too many tokens makes the meaning of the sentence very difficult to recover. But masking only a few tokens while training on all tokens creates an imbalanced dataset that remains largely unchanged, and training becomes inefficient.

The model must correctly predict the original token for all 15% of selected tokens. After creating pairs, let’s implement masking as follows:
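Here is a sketch of the sequence-assembly part: truncation, special tokens, segment labels, and padding. The special-token IDs (pad=0, cls=1, sep=2) are placeholders; a real implementation would read them from the tokenizer:

```python
import random

def build_sequence(sentence_a, sentence_b, max_seq_len=128,
                   cls_id=1, sep_id=2, pad_id=0):
    """Assemble [CLS] <A> [SEP] <B> [SEP], truncated and padded to max_seq_len,
    plus segment labels: 0 for sentence A, 1 for sentence B, -1 for padding."""
    sentence_a, sentence_b = list(sentence_a), list(sentence_b)
    # Reserve room for the three special tokens
    target_len = max_seq_len - 3
    while len(sentence_a) + len(sentence_b) > target_len:
        longer = sentence_a if len(sentence_a) > len(sentence_b) else sentence_b
        longer.pop() if random.random() < 0.5 else longer.pop(0)
    tokens = [cls_id] + sentence_a + [sep_id] + sentence_b + [sep_id]
    segment_ids = [0] * (len(sentence_a) + 2) + [1] * (len(sentence_b) + 1)
    # Pad both lists to the expected length
    pad = max_seq_len - len(tokens)
    return tokens + [pad_id] * pad, segment_ids + [-1] * pad
```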

This function creates a token sequence: [CLS] <text_1> [SEP] <text_2> [SEP], where <text_1> and <text_2> are the sentence pair with masked tokens. The IDs of the special tokens are provided by the tokenizer.

First, the sentence pair is truncated to fit the maximum sequence length, reserving space for three special tokens. The sequence is then padded to the expected length. Segment labels are created to distinguish sentences: 0 for the first sentence, 1 for the second, and -1 for padding.

All non-special tokens are masking candidates. Their indices are shuffled, and the first num_predictions positions are selected. This number depends on mask_prob (default 15%) and is capped at max_predictions_per_seq (default 20):
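A sketch of this selection step; the set of special-token IDs is assumed to come from the tokenizer:

```python
import random

def choose_mask_positions(tokens, special_ids, mask_prob=0.15,
                          max_predictions_per_seq=20):
    """Shuffle all non-special positions and take the first num_predictions."""
    candidates = [i for i, t in enumerate(tokens) if t not in special_ids]
    random.shuffle(candidates)
    num_predictions = min(max_predictions_per_seq,
                          max(1, int(round(len(tokens) * mask_prob))))
    mlm_positions = sorted(candidates[:num_predictions])
    mlm_labels = [tokens[i] for i in mlm_positions]  # original tokens
    return mlm_positions, mlm_labels
```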

The variable mlm_positions is a list of the indices of the masked positions, in ascending order. The variable mlm_labels is a list of the original tokens at the masked positions. When a random token from the vocabulary is needed, you pick one from the tokenizer:
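The 80%/10%/10% replacement rule can be sketched as follows; the `[MASK]` token ID and the number of leading special-token IDs are assumptions about the tokenizer:

```python
import random

def apply_masking(tokens, mlm_positions, mask_id, vocab_size, first_regular_id=4):
    """Replace each chosen position: 80% [MASK], 10% random token, 10% unchanged."""
    tokens = list(tokens)
    for pos in mlm_positions:
        p = random.random()
        if p < 0.8:
            tokens[pos] = mask_id
        elif p < 0.9:
            # Random token, skipping the special-token IDs at the start of the vocab
            tokens[pos] = random.randrange(first_regular_id, vocab_size)
        # else: keep the original token
    return tokens
```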

The first four tokens in the vocabulary are special tokens, so they are excluded when sampling a random replacement. The returned dictionary ret is the “sample” you will use to train the BERT model.

Saving the Training Data for Reuse

So far, you have learned how to process the raw dataset into sentence pairs with masked tokens for MLM and NSP training. With the code above, you can create a list of dictionaries as the training data. However, this may not be the best way to serve data to the training loop.

The standard way to serve the training data in PyTorch code is to use the Dataset class. That is, to define a class like the following:
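A minimal version of such a class might look like this; the class name is illustrative:

```python
from torch.utils.data import Dataset

class BertPretrainDataset(Dataset):
    """A map-style dataset that keeps every sample in memory."""

    def __init__(self, samples):
        self.samples = samples  # list of sample dictionaries

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]
```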

The key is to provide the __len__ and __getitem__ methods to return the total number of samples and the sample at the given index in the dataset, respectively. However, this may not be optimal for BERT training, as you would likely need to load the entire dataset into memory at initialization. This is not efficient when the dataset is large.

An alternative is to use the Dataset class from the Hugging Face datasets library. It hides many data-management details so you can focus on more important tasks. Let’s assume you created a generator function that yields samples:
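For illustration, here is a toy generator with a placeholder field; the real generator would yield the masked sentence-pair dictionaries built in the earlier sections:

```python
def generate_samples(documents):
    """Yield one training sample (a dict) at a time.

    The "input_ids" field here is a placeholder: in the full pipeline, each
    yield would be a complete masked sample with segment and MLM labels.
    """
    for document in documents:
        for sentence in document:
            yield {"input_ids": sentence}
```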

You can create a dataset object with:

These two lines of code will retrieve all samples from the generator function and save them to a parquet file. Depending on the size of your dataset, this may take a while. The lambda construct is needed because from_generator() expects a callable that returns a generator.

Once you have the dataset in parquet format, you can load it back and try to print a few samples:

This is where the parquet format shines. The Hugging Face datasets library also supports JSON and CSV formats. But parquet is a compressed columnar format that is more efficient for data storage and retrieval. Setting streaming=True is optional. This allows you to load only the subset of the dataset you need, rather than loading the entire dataset into memory at once.

Putting everything together, this is the complete code:

Running this code, you will see the output like the following:

The timestamps printed intermittently show the time spent in each stage. This code processes the Wikitext-103 dataset, which takes several hours. Once complete, the parquet file enables fast, efficient iteration over samples. For testing, you can use the smaller Wikitext-2 dataset instead, with which the code runs in a few minutes.

Further Readings

Below are some resources that you may find useful:

Summary

In this article, you learned how to prepare the data for BERT training. You learned how to create masked language model (MLM) training data and next sentence prediction (NSP) training data. You also learned how to save the data in parquet format for reuse.
