A Complete Introduction to Using BERT Models

BERT was one of the first applications of the Transformer architecture in natural language processing (NLP). Its architecture is simple, but it is powerful enough for the tasks it was designed for. In the following, we’ll explore BERT models from the ground up: understanding what they are, how they work, and most importantly, how to use them practically in your projects. We’ll focus on using pre-trained models through the Hugging Face Transformers library, making advanced NLP accessible without requiring deep learning expertise.

Kick-start your project with my book NLP with Hugging Face Transformers. It provides self-study tutorials with working code.

Let’s get started.

Photo by Taton Moïse. Some rights reserved.

Overview

This post is divided into five parts; they are:

  • Why BERT Matters
  • Understanding BERT’s Input/Output Process
  • Your First BERT Project
  • Real-World Projects with BERT
  • Named Entity Recognition System

Why BERT Matters

Imagine you’re teaching someone a new language. As they learn, they need to understand words not just in isolation, but in context. The word “bank” means something completely different in “river bank” versus “bank account.” This is exactly what makes BERT special — it understands language in context, just like humans do.

BERT has revolutionized how computers understand language by:

  1. Processing text bidirectionally (both left-to-right and right-to-left simultaneously)
  2. Understanding context-dependent meanings
  3. Capturing complex relationships between words

Let’s use a simple example to understand this: “The patient needs to be patient to recover.” A traditional model might get confused because “patient” appears both as a noun and as an adjective. BERT, however, understands the different meanings based on their context in the sentence.

Understanding BERT’s Input/Output Process

The code in this post uses the Hugging Face transformers library. Let’s install that with Python’s pip command:
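If the library is not already installed, a typical command would be the following (installing torch as the backend is an assumption; any supported backend works):

```shell
pip install transformers torch
```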

You should think of BERT as a highly skilled translator who needs text formatted in a specific way. Let’s break down this process:
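A minimal sketch of this step might look like the following (the example sentence and the model name bert-base-uncased are assumptions, as the original code is not shown):

```python
from transformers import AutoTokenizer

# Load the tokenizer that matches the BERT model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

text = "The patient needs to be patient to recover."
tokens = tokenizer.tokenize(text)                    # BERT's vocabulary units
token_ids = tokenizer.convert_tokens_to_ids(tokens)  # integers the model processes

print("Original text: ", text)
print("Tokenized text:", tokens)
print("Token IDs:     ", token_ids)
```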

When you run this code, you’ll see three different representations of the same text as follows:

Let’s understand what’s happening:

  • The original text is your raw input text.
  • The tokenized text is the input broken down into BERT’s vocabulary units. Words appear to be split at boundaries between alphabetic and non-alphabetic characters, but a tokenizer may implement a different algorithm.
  • The token IDs are the integers that the BERT model actually processes. It is important to remember that BERT is a neural network that can process only numerical input, so tokenized strings must be converted into numerical form before the model can use them.

As a deep learning model, BERT needs to understand not only the content of your input text but also its structure. To illustrate, see the following:
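A possible sketch, assuming the same bert-base-uncased tokenizer and a short padded sequence (the sentence and the max_length value are assumptions):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encode with special tokens, padding the sequence to a fixed length of 12
encoded = tokenizer("The patient needs to be patient.",
                    padding="max_length", max_length=12, truncation=True)
toks = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(toks)
print(encoded["attention_mask"])
```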

From the output, you can see that the BERT tokenizer adds:

  • [CLS] token at the start (used for classification tasks)
  • [SEP] token at the end (marks sentence boundaries)
  • Padding tokens [PAD] (optional, if padding argument is set to make all sequences the same length)

Your First BERT Project

BERT is a multi-purpose model. In the transformers library, you can refer to BERT by name and let the library load and configure the model automatically.

Let’s start with the simplest BERT application: sentiment classification:
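A minimal version of such a classifier might look like this (the example sentence is an assumption):

```python
from transformers import pipeline

# Create a sentiment-analysis pipeline with a DistilBERT model fine-tuned on SST-2
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

result = classifier("I love how easy BERT is to use!")
print(result)  # a list of dicts with "label" and "score" keys
```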

Running this code will print the predicted sentiment label and its confidence score.

If you run this code for the first time, you will see a progress bar printed.

This is because, while BERT’s model code is implemented in the transformers library, its weights must be downloaded from the Hugging Face Hub on demand. The progress bar is displayed while the model is downloaded into your local cache.

The code above creates a pipeline using a high-level API that (a) handles tokenization of the input, (b) passes the tokenized input to the model, and (c) converts the model output back into a human-readable result. The pretrained model is downloaded, if necessary, when the pipeline is created.

The model used is "distilbert-base-uncased-finetuned-sst-2-english", an uncased version of DistilBERT fine-tuned for sentiment analysis. It runs faster than the original BERT and uses less memory while maintaining similar accuracy. Because it is uncased, the input text is treated as case-insensitive. The model was trained on English data, so you should not expect it to understand other languages.

It is a sentiment model, and its output is either “POSITIVE” or “NEGATIVE”, describing the tone of the input text, with a confidence score between 0 and 1.

Real-World Projects with BERT

The code snippet above works, but it is not robust enough for production use. Let’s enhance it by:

  • Calling each component directly instead of using the pipeline
  • Limiting the input length and truncating it if too long, to avoid overwhelming the model
  • Using a GPU if available
  • Providing more detail, e.g., the confidence for both “POSITIVE” and “NEGATIVE”

Below is the modified code, to make it into a Python class:
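A sketch of such a class, following the description below (the class name SentimentAnalyzer and the argument names are assumptions, not given in the text):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class SentimentAnalyzer:
    def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
        # Use a GPU if one is available
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.model.to(self.device)
        self.model.eval()
        # Output order according to this model's documentation
        self.labels = ["NEGATIVE", "POSITIVE"]

    def preprocess_text(self, text, max_length=512):
        # Clean extra whitespace, then tokenize with BERT's special tokens,
        # truncating long input and padding short input
        text = " ".join(text.split())
        return self.tokenizer(text, truncation=True, padding=True,
                              max_length=max_length, return_tensors="pt")

    def predict(self, text):
        inputs = self.preprocess_text(text)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Convert logits to probabilities for both labels
        probs = torch.softmax(outputs.logits, dim=-1)[0]
        return {label: float(p) for label, p in zip(self.labels, probs)}
```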

Here the preprocess_text() and predict() functions are combined into one class, but the workflow is the same: the tokenizer processes the input text string into token IDs, and the model then runs on them to produce a prediction.

At initialization, the GPU is used if available, as indicated by torch.cuda.is_available(). The tokenizer and model are created using AutoTokenizer and AutoModelForSequenceClassification. According to the documentation of this specific model, the output consists of two values, NEGATIVE and POSITIVE, so you set up the labels in this order.

In the preprocessing function preprocess_text(), extra spaces are removed from the text, which is then tokenized with BERT’s special tokens ([CLS], [SEP]). Truncation of long input or padding of short input is also applied at tokenization. The output of the tokenizer is a dict with the keys input_ids (a tensor of token IDs) and attention_mask (a tensor of 0s and 1s, indicating whether a valid token is present at each location).

The prediction logic in predict() runs the model with the inputs (input_ids and attention_mask), converts the outputs to probabilities, and returns the detailed prediction information.

Let’s test out this implementation:
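An equivalent self-contained test of the same workflow, using the tokenizer and model directly (the test sentences are assumptions):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()
labels = ["NEGATIVE", "POSITIVE"]

test_texts = [
    "This product exceeded all my expectations!",
    "The customer service was disappointing.",
]
results = []
for text in test_texts:
    inputs = tokenizer(text, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)[0]
    scores = {label: float(p) for label, p in zip(labels, probs)}
    results.append(scores)
    print(f"{text!r}: {scores}")
```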

This code prints the detailed prediction for each test sentence.

As you can see, the model accurately predicted the sentiment of each statement.

Named Entity Recognition System

If you read the original BERT paper, you will find that BERT was not designed for sentiment classification but as a general-purpose language model. It can be adapted to other uses.

One example is using BERT for named entity recognition (NER), which identifies proper nouns (names, organizations, locations) in text. It is a difficult problem because, unlike ordinary words, which you can look up in a dictionary to check whether they are verbs or pronouns, named entities are usually not found in a dictionary, so you cannot rely on a lookup table. Furthermore, some named entities span multiple words, such as “European Union”, and should be identified together as one entity.

You can also find a pretrained BERT NER model on the Hugging Face Hub. Below is how you can modify the previous code for NER:
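A sketch of such a class, following the description below (the class name NERSystem is an assumption; the merging of subword tokens is one possible implementation):

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

class NERSystem:
    def __init__(self, model_name="dslim/bert-base-NER"):
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForTokenClassification.from_pretrained(model_name)
        self.model.to(self.device)
        self.model.eval()

    def recognize_entities(self, text):
        # Tokenize and run the model; each token gets a B-I-O label prediction
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True)
        inputs = {k: v.to(self.device) for k, v in inputs.items()}
        with torch.no_grad():
            logits = self.model(**inputs).logits
        predictions = logits.argmax(dim=-1)[0]
        tokens = self.tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        labels = [self.model.config.id2label[int(p)] for p in predictions]

        # Collect entities, merging subword tokens (those starting with ##)
        entities = []
        for token, label in zip(tokens, labels):
            if label == "O" or token in ("[CLS]", "[SEP]"):
                continue
            if token.startswith("##") and entities:
                entities[-1]["word"] += token[2:]
            elif label.startswith("B-") or not entities:
                entities.append({"word": token, "type": label[2:]})
            else:  # an I- tag continuing a multi-word entity
                entities[-1]["word"] += " " + token
        return entities
```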

The model name "dslim/bert-base-NER" is what you can search for on the Hugging Face Hub. It is a model based on BERT and specialized for NER.

In this class, the function recognize_entities() combines the tokenization and the model inference. If you check the output of the tokenizer, you will find that a dict of three tensors is produced, under the keys input_ids, token_type_ids, and attention_mask. This differs from the previous example, hence you should use the tokenizer that matches the model.

The model is created with AutoModelForTokenClassification; token classification means the model tags each input token in its output. Named entity recognition is achieved using the B-I-O (beginning-inside-outside) tagging scheme. “Beginning” marks the token at which a named entity starts. If multiple tokens belong to the same named entity, the second and subsequent tokens are tagged as “inside”. Tokens that are not part of any named entity are tagged as “outside”.

The list labels, converted from the model prediction, will look like the following:
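For a sentence like “John works for the European Union”, the tokens (including [CLS] and [SEP]) might map to labels such as these (an illustrative example, not actual output from the text):

```
['O', 'B-PER', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'O']
```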

The prefixes B- and I- indicate the first and subsequent tokens of an entity. The suffix, such as ORG or PER, indicates the type of the entity, e.g., an organization, a person, a location, or a miscellaneous entity.

The for-loop simply enumerates all the named entities found. Since the model’s tokenizer may create subword tokens, a token that starts with ## is considered a subword and is combined with the previous token in the output.

Let’s see it in action:
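As a quick, self-contained check, the same model can also be run through the high-level pipeline API (the test sentence is an assumption):

```python
from transformers import pipeline

# aggregation_strategy="simple" merges subword tokens and B-/I- tags into whole entities
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

results = ner("Angela Merkel visited the European Union headquarters in Brussels.")
for entity in results:
    print(entity["word"], "->", entity["entity_group"])
```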

Here is what you’ll get after running this code.

The model accurately recognized the entities. Not only are multi-word entities identified, but their types as well.

Summary

In this comprehensive tutorial, you learned about the BERT model and its applications. In particular, you learned:

  • What BERT is and how it processes input and output text.
  • How to set up BERT and build real-world applications with a few lines of code without knowing much about the model architecture.
  • How to build a sentiment analyzer with BERT.
  • How to build a named entity recognition (NER) system with BERT.

