The BERT model is one of the first applications of the Transformer architecture in natural language processing (NLP). Its architecture is simple, but it is sufficient for the tasks it was designed for. In the following, we’ll explore BERT models from the ground up — understanding what they are, how they work, and most importantly, how to use them practically in your projects. We’ll focus on using pre-trained models through the Hugging Face Transformers library, making advanced NLP accessible without requiring deep learning expertise.
Kick-start your project with my book NLP with Hugging Face Transformers. It provides self-study tutorials with working code.
Let’s get started.

A Complete Introduction to Using BERT Models
Photo by Taton Moïse. Some rights reserved.
Overview
This post is divided into five parts; they are:
- Why BERT Matters
- Understanding BERT’s Input/Output Process
- Your First BERT Project
- Real-World Projects with BERT
- Named Entity Recognition System
Why BERT Matters
Imagine you’re teaching someone a new language. As they learn, they need to understand words not just in isolation, but in context. The word “bank” means something completely different in “river bank” versus “bank account.” This is exactly what makes BERT special — it understands language in context, just like humans do.
BERT has revolutionized how computers understand language by:
- Processing text bidirectionally (both left-to-right and right-to-left simultaneously)
- Understanding context-dependent meanings
- Capturing complex relationships between words
Let’s use a simple example to understand this: “The patient needs to be patient to recover.” A traditional model might get confused by “patient” as a noun versus “patient” as an adjective. BERT, however, understands the different meanings based on their context in the sentence.
Understanding BERT’s Input/Output Process
The code in this post uses the Hugging Face transformers library. Let’s install that with Python’s pip command:
```shell
pip install transformers torch
```
You should think of BERT as a highly skilled translator who needs text formatted in a specific way. Let’s break down this process:
```python
from transformers import BertTokenizer

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Text example
text = "I love machine learning!"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print(f"Original text: {text}")
print(f"Tokenized text: {tokens}")

# Convert tokens to IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {input_ids}")
```
When you run this code, you’ll see three different representations of the same text as follows:
```
Original text: I love machine learning!
Tokenized text: ['i', 'love', 'machine', 'learning', '!']
Token IDs: [1045, 2293, 3698, 4083, 999]
```
Let’s understand what’s happening:
- The original text is your raw input text.
- The tokenized text consists of words broken down into BERT’s vocabulary units. It may look as if words are split at boundaries between alphabetic and non-alphabetic characters, but a tokenizer may implement a different algorithm.
- The token IDs are integers that the BERT model actually processes. It is important to remember that BERT is a neural network model that can process only numerical input. Tokenized strings need to be converted into a numerical form before the model can use them.
As a deep learning model, BERT needs to understand not only the content of your input text but also its structure. To illustrate, see the following:
```python
...

# Complete tokenization with special tokens
encoded = tokenizer.encode_plus(
    text,
    add_special_tokens=True,
    padding="max_length",
    max_length=10,
    return_tensors="pt"
)

print("Full encoded sequence:")
for token_id, token in zip(
    encoded["input_ids"][0],
    tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
):
    print(f"{token}: {token_id}")
```
This shows:
```
Full encoded sequence:
[CLS]: 101
i: 1045
love: 2293
machine: 3698
learning: 4083
!: 999
[SEP]: 102
[PAD]: 0
[PAD]: 0
[PAD]: 0
```
From the above, you can see that the BERT tokenizer adds:
- A [CLS] token at the start (used for classification tasks)
- A [SEP] token at the end (marks sentence boundaries)
- [PAD] padding tokens (optional; added if the padding argument is set, to make all sequences the same length)
Your First BERT Project
BERT is a model that serves multiple purposes. In the transformers library, you can refer to BERT by name and let the library load and configure the model automatically.
Let’s start with the simplest BERT application: sentiment classification:
```python
import torch
from transformers import pipeline

# Create a sentiment analysis pipeline
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Test text
text = "I absolutely love this product! Would buy again."

# Get the sentiment
result = sentiment_analyzer(text)
print(f"Sentiment: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```
Running this code will print:
```
Device set to use cuda:0
Sentiment: POSITIVE
Confidence: 0.9999
```
If you run this code for the first time, you will see progress bars printed, like the following:
```
config.json: 100%|███████████████████████████| 629/629 [00:00<00:00, 4.02MB/s]
model.safetensors: 100%|███████████████████| 268M/268M [00:07<00:00, 37.3MB/s]
tokenizer_config.json: 100%|████████████████| 48.0/48.0 [00:00<00:00, 555kB/s]
vocab.txt: 100%|███████████████████████████| 232k/232k [00:00<00:00, 10.1MB/s]
```
This is because the code for BERT is implemented in the transformers library, but the weights are downloaded from the Hugging Face Hub on demand. The progress bars are printed while the model is downloaded into your local cache.
The code above created a pipeline using a high-level API that (a) handles all tokenization of the input, (b) passes the tokenized input to the model, and (c) converts the model output back into a human-readable result. The pretrained model is downloaded, if necessary, when the pipeline is created.
The model used is "distilbert-base-uncased-finetuned-sst-2-english", the uncased version of DistilBERT. It runs faster than the original BERT and uses less memory while maintaining similar accuracy. Because it is uncased, the input text is case-insensitive. This model is trained on English data, and you should not expect it to understand other languages.
It is a sentiment model whose output is either “POSITIVE” or “NEGATIVE”, describing the tone of the input text, together with a confidence level between 0 and 1.
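The pipeline also accepts a list of texts, which is handy for scoring several inputs in one call. A minimal sketch, using the same model as above:

```python
from transformers import pipeline

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Pass a list of texts; the pipeline returns one result dict per text
texts = ["What a fantastic day!", "This is the worst movie I have seen."]
for text, result in zip(texts, sentiment_analyzer(texts)):
    print(f"{text} -> {result['label']} ({result['score']:.4f})")
```

Each result is a dict with a label and a score, in the same order as the input list.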
Real-World Projects with BERT
The code snippets above work but are not robust enough for production use. Let’s enhance them by:
- Calling each component directly instead of using the pipeline
- Limiting the input length and truncating it if too long, to avoid overwhelming the computer
- Using the GPU if available
- Providing more detail, e.g., the confidence for both “POSITIVE” and “NEGATIVE”
Below is the modified code, restructured into a Python class:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class BERTSentimentAnalyzer:
    def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()
        self.labels = ['NEGATIVE', 'POSITIVE']

    def preprocess_text(self, text):
        # Remove extra whitespace and normalize
        text = ' '.join(text.split())
        # Tokenize with BERT-specific tokens
        inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        # Move to GPU if available
        return {k: v.to(self.device) for k, v in inputs.items()}

    def predict(self, text):
        # Prepare text for model
        inputs = self.preprocess_text(text)
        # Get model predictions
        with torch.no_grad():
            outputs = self.model(**inputs)
            probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        # Convert to human-readable format
        prediction_dict = {
            'text': text,
            'sentiment': self.labels[probabilities.argmax().item()],
            'confidence': probabilities.max().item(),
            'probabilities': {
                label: prob.item()
                for label, prob in zip(self.labels, probabilities[0])
            }
        }
        return prediction_dict
```
Here the work is split between the preprocess_text() and predict() functions, but the workflow is the same as before: the tokenizer processes the input text string into tensors, and the model converts those tensors into a prediction.
At initialization, the GPU is used if available, as indicated by torch.cuda.is_available(). The tokenizer and model are created using AutoTokenizer and AutoModelForSequenceClassification. According to the documentation of this specific model, the two output values correspond to NEGATIVE and POSITIVE, so you set up the labels in this order.
In the text preprocessing function preprocess_text(), extra spaces are removed from the text, which is then tokenized with BERT’s special tokens ([CLS], [SEP]). Truncation of long input or padding of short input is also applied at tokenization. The output of the tokenizer is a dict with the keys input_ids (a tensor of token IDs) and attention_mask (a tensor of 0s and 1s, indicating whether a valid token is present at each location).
The prediction logic in predict() runs the model with the inputs (input_ids and attention_mask), converts the outputs into probabilities, and returns the detailed prediction information.
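The softmax step can be illustrated in isolation. The sketch below uses made-up logits for the two classes to show how raw model outputs become probabilities:

```python
import torch

# Hypothetical logits for the two classes [NEGATIVE, POSITIVE]
logits = torch.tensor([[-2.0, 3.0]])

# Softmax maps logits to non-negative values that sum to 1
probs = torch.nn.functional.softmax(logits, dim=-1)
print(probs)                     # e.g. a small NEGATIVE and a large POSITIVE probability
print(probs.argmax(dim=-1))      # index of the predicted class
```

The argmax over the probabilities picks the predicted label, and the corresponding probability serves as the confidence score reported by the class above.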
Let’s test out this implementation:
```python
...

def demonstrate_sentiment_analysis():
    # Initialize analyzer
    analyzer = BERTSentimentAnalyzer()

    # Test texts
    texts = [
        "This product completely transformed my workflow!",
        "Terrible experience, would not recommend.",
        "It's decent for the price, but nothing special."
    ]

    # Analyze each text
    for text in texts:
        result = analyzer.predict(text)
        print(f"\nText: {result['text']}")
        print(f"Sentiment: {result['sentiment']}")
        print(f"Confidence: {result['confidence']:.4f}")
        print("Detailed probabilities:")
        for label, prob in result['probabilities'].items():
            print(f"  {label}: {prob:.4f}")

# Running demonstration
demonstrate_sentiment_analysis()
```
Here is what this code prints:
```
Text: This product completely transformed my workflow!
Sentiment: POSITIVE
Confidence: 0.9997
Detailed probabilities:
  NEGATIVE: 0.0003
  POSITIVE: 0.9997

Text: Terrible experience, would not recommend.
Sentiment: NEGATIVE
Confidence: 0.9934
Detailed probabilities:
  NEGATIVE: 0.9934
  POSITIVE: 0.0066

Text: It's decent for the price, but nothing special.
Sentiment: NEGATIVE
Confidence: 0.9897
Detailed probabilities:
  NEGATIVE: 0.9897
  POSITIVE: 0.0103
```
As you can see, the model predicted a sentiment for each statement; even the lukewarm third review is classified as negative, since the model only outputs binary labels.
Named Entity Recognition System
If you read the original BERT paper, you will find that it was not designed for sentiment classification but as a generic language model. It can be adapted to other uses.
One example is using BERT for named entity recognition (NER). This task identifies proper nouns (names, organizations, locations) in text. It is a difficult problem because, unlike ordinary words, which you can check against a dictionary to see whether they are verbs or pronouns, named entities are usually not found in a dictionary, so you cannot use a lookup table. Furthermore, some named entities span multiple words, such as “European Union”, and should be identified together as one entity.
You can find a pretrained BERT NER model on the Hugging Face Hub, too. Below is how you can modify the previous code for NER:
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

class BERTNamedEntityRecognizer:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
        self.model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()

    def recognize_entities(self, text):
        # Tokenize input text
        inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            return_tensors="pt",
            padding=True,
            truncation=True
        )
        # Move inputs to device
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Get predictions
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = outputs.logits.argmax(-1)

        # Convert predictions to entities
        tokens = self.tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
        labels = [self.model.config.id2label[p.item()] for p in predictions[0]]

        # Extract entities
        entities = []
        current_entity = None
        for token, label in zip(tokens, labels):
            if label.startswith('B-'):
                if current_entity:
                    entities.append(current_entity)
                current_entity = {'type': label[2:], 'text': token}
            elif label.startswith('I-') and current_entity:
                if token.startswith('##'):
                    current_entity['text'] += token[2:]
                else:
                    current_entity['text'] += ' ' + token
            elif label == 'O':
                if current_entity:
                    entities.append(current_entity)
                    current_entity = None
        if current_entity:
            entities.append(current_entity)
        return entities
```
The model name "dslim/bert-base-NER" is what you can search for on the Hugging Face Hub. It is a model based on BERT and specialized for NER.
In this class, the function recognize_entities() combines tokenization and model inference. If you check the output of the tokenizer, you will find that a dict of three tensors is produced, under the keys input_ids, token_type_ids, and attention_mask. This is different from the previous example, which is why you should use a tokenizer that matches the model.
The model is created with AutoModelForTokenClassification; token classification means tagging each input token at the output. Named entity recognition is achieved using the B-I-O (beginning-inside-outside) tagging scheme: “beginning” marks the token where a named entity starts; if multiple tokens belong to the same named entity, the second and subsequent tokens are tagged as “inside”; tokens that are not part of any named entity are tagged as “outside”.
The list labels converted from the model prediction will be like:
```
['O', 'B-ORG', 'O', 'B-PER', 'I-PER', 'O', 'O', 'B-MISC', ...]
```
Here the prefixes B- and I- indicate the first and subsequent tokens of an entity. The suffix, such as ORG or PER, tells the type of the entity, e.g., an organization, a person, a location, or another miscellaneous entity.
The for-loop simply enumerates all the named entities found. Since the model’s tokenizer may create subword tokens, a token that starts with ## is treated as a subword and merged with the previous token in the output.
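The entity-merging loop can be understood without running the model at all. Below is a minimal, model-free sketch of the same B-I-O merging logic, applied to hypothetical tokens and labels:

```python
# Hypothetical tokens and B-I-O labels, as the model would produce them
tokens = ["[CLS]", "Tim", "Cook", "leads", "Apple", "Inc", ".", "[SEP]"]
labels = ["O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O"]

entities = []
current = None
for token, label in zip(tokens, labels):
    if label.startswith("B-"):          # a new entity starts here
        if current:
            entities.append(current)
        current = {"type": label[2:], "text": token}
    elif label.startswith("I-") and current:
        if token.startswith("##"):      # subword: glue to previous token
            current["text"] += token[2:]
        else:                           # new word of the same entity
            current["text"] += " " + token
    elif label == "O":                  # outside: close any open entity
        if current:
            entities.append(current)
            current = None
if current:
    entities.append(current)

print(entities)
# → [{'type': 'PER', 'text': 'Tim Cook'}, {'type': 'ORG', 'text': 'Apple Inc'}]
```

This is exactly the logic inside recognize_entities(), just fed with hand-written labels so you can trace how multi-word entities are assembled.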
Let’s see it in action:
```python
def demonstrate_ner():
    # Initialize recognizer
    ner = BERTNamedEntityRecognizer()

    # Example text
    text = """
    Apple CEO Tim Cook announced new AI features at their headquarters
    in Cupertino, California. Microsoft and Google are also investing
    heavily in artificial intelligence research.
    """

    # Get entities
    entities = ner.recognize_entities(text)

    # Display results
    print("Found entities:")
    for entity in entities:
        print(f"- {entity['text']} ({entity['type']})")

# Running demonstration
demonstrate_ner()
```
Here is what you’ll get after running this code.
```
Found entities:
- Apple (ORG)
- Tim Cook (PER)
- AI (MISC)
- Cupertino (LOC)
- California (LOC)
- Microsoft (ORG)
- Google (ORG)
```
The model accurately recognized the entities, identifying not only multi-word entities but also their types.
Summary
In this comprehensive tutorial, you learned about the BERT model and its applications. In particular, you learned:
- What BERT is and how it processes input and output text
- How to set up BERT and build real-world applications with a few lines of code without knowing much about the model architecture
- How to build a sentiment analyzer with BERT
- How to build a named entity recognition (NER) system with BERT