The BERT model is one of the first applications of the Transformer architecture in natural language processing (NLP). Its architecture is simple, but it is sufficient for the tasks it was designed for. In the following, we’ll explore BERT models from the ground up — understanding what they are, how they work, and most importantly, how to use them practically in your projects. We’ll focus on using pre-trained models through the Hugging Face Transformers library, making advanced NLP accessible without requiring deep learning expertise.
Kick-start your project with my book NLP with Hugging Face Transformers. It provides self-study tutorials with working code.
Let’s get started.

A Complete Introduction to Using BERT Models
Photo by Taton Moïse. Some rights reserved.
Overview
This post is divided into five parts; they are:
- Why BERT Matters
- Understanding BERT’s Input/Output Process
- Your First BERT Project
- Real-World Projects with BERT
- Named Entity Recognition System
Why BERT Matters
Imagine you’re teaching someone a new language. As they learn, they need to understand words not just in isolation, but in context. The word “bank” means something completely different in “river bank” versus “bank account.” This is exactly what makes BERT special — it understands language in context, just like humans do.
BERT has revolutionized how computers understand language by:
- Processing text bidirectionally (both left-to-right and right-to-left simultaneously)
- Understanding context-dependent meanings
- Capturing complex relationships between words
Let’s use a simple example to understand this: “The patient needs to be patient to recover.” A traditional model might get confused by “patient” as a noun versus “patient” as an adjective. BERT, however, understands the different meanings based on their context in the sentence.
Understanding BERT’s Input/Output Process
The code in this post uses the Hugging Face transformers library. Let’s install that with Python’s pip command:
```shell
pip install transformers torch
```
You should think of BERT as a highly skilled translator who needs text formatted in a specific way. Let’s break down this process:
```python
from transformers import BertTokenizer

# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Text example
text = "I love machine learning!"

# Tokenize the text
tokens = tokenizer.tokenize(text)
print(f"Original text: {text}")
print(f"Tokenized text: {tokens}")

# Convert tokens to IDs
input_ids = tokenizer.convert_tokens_to_ids(tokens)
print(f"Token IDs: {input_ids}")
```
When you run this code, you’ll see three different representations of the same text as follows:
```
Original text: I love machine learning!
Tokenized text: ['i', 'love', 'machine', 'learning', '!']
Token IDs: [1045, 2293, 3698, 4083, 999]
```
Let’s understand what’s happening:
- The original text is your raw input text.
- The tokenized text consists of words broken down into BERT’s vocabulary units. It may look as if words are split at boundaries between alphabetic and non-alphabetic characters, but a tokenizer may implement a different algorithm.
- The token IDs are integers that the BERT model actually processes. It is important to remember that BERT is a neural network model that can process only numerical input. Tokenized strings need to be converted into a numerical form before the model can use them.
As a deep learning model, BERT needs to understand not only the content of your input text but also its structure. To illustrate, see the following:
```python
...

# Complete tokenization with special tokens
encoded = tokenizer.encode_plus(
    text,
    add_special_tokens=True,
    padding="max_length",
    max_length=10,
    return_tensors="pt"
)

print("Full encoded sequence:")
for token_id, token in zip(
    encoded["input_ids"][0],
    tokenizer.convert_ids_to_tokens(encoded["input_ids"][0])
):
    print(f"{token}: {token_id}")
```
This shows:
```
Full encoded sequence:
[CLS]: 101
i: 1045
love: 2293
machine: 3698
learning: 4083
!: 999
[SEP]: 102
[PAD]: 0
[PAD]: 0
[PAD]: 0
```
From the above, you can see that the BERT tokenizer adds:
- A [CLS] token at the start (used for classification tasks)
- A [SEP] token at the end (marks sentence boundaries)
- [PAD] padding tokens (optional; added if the padding argument is set, to make all sequences the same length)
Your First BERT Project
BERT is a model that serves multiple purposes. In the transformers library, you can refer to BERT by name and let the library load and configure the model automatically.
Let’s start with the simplest BERT application: sentiment classification:
```python
import torch
from transformers import pipeline

# Create a sentiment analysis pipeline
sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Test text
text = "I absolutely love this product! Would buy again."

# Get the sentiment
result = sentiment_analyzer(text)
print(f"Sentiment: {result[0]['label']}")
print(f"Confidence: {result[0]['score']:.4f}")
```
Running this code will print:
```
Device set to use cuda:0
Sentiment: POSITIVE
Confidence: 0.9999
```
If you run this code for the first time, you will see progress bars printed, like the following:
```
config.json: 100%|███████████████████████████| 629/629 [00:00<00:00, 4.02MB/s]
model.safetensors: 100%|███████████████████| 268M/268M [00:07<00:00, 37.3MB/s]
tokenizer_config.json: 100%|████████████████| 48.0/48.0 [00:00<00:00, 555kB/s]
vocab.txt: 100%|███████████████████████████| 232k/232k [00:00<00:00, 10.1MB/s]
```
This is because the code for BERT is implemented in the transformers library, but the weights are downloaded from the Hugging Face Hub on demand. The progress bars are printed while the model is downloaded into your local cache.
The code above created a pipeline using a high-level API that (a) handles all tokenization of the input, (b) passes the tokenized input to the model, and (c) converts the model output back into a human-readable result. The pretrained model is downloaded, if necessary, when the pipeline is created.
The model used is "distilbert-base-uncased-finetuned-sst-2-english", the uncased version of DistilBERT. It runs faster than the original BERT and uses less memory while maintaining similar accuracy. Because it is uncased, the input text is case-insensitive. This model is trained on English data, and you should not expect it to understand other languages.
It is a sentiment model whose output is either “POSITIVE” or “NEGATIVE”, describing the tone of the input text, together with a confidence level between 0 and 1.
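The pipeline also accepts a list of texts, which is handy for scoring several inputs in one call. A minimal sketch, using the same model as above:

```python
from transformers import pipeline

sentiment_analyzer = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

# Pass a list of texts; the pipeline returns one result dict per text
texts = ["What a fantastic day!", "This is the worst movie I have seen."]
for text, result in zip(texts, sentiment_analyzer(texts)):
    print(f"{text} -> {result['label']} ({result['score']:.4f})")
```

Each result is a dict with a label and a score, in the same order as the input list.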
Real-World Projects with BERT
The code snippets above work but are not robust enough for production use. Let’s enhance them by:
- Calling each component directly instead of using the pipeline
- Limiting the input length and truncating it if too long, to avoid overwhelming the computer
- Using the GPU if available
- Providing more detail, e.g., the confidence for both “POSITIVE” and “NEGATIVE”
Below is the modified code, restructured into a Python class:
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

class BERTSentimentAnalyzer:
    def __init__(self, model_name="distilbert-base-uncased-finetuned-sst-2-english"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForSequenceClassification.from_pretrained(model_name)
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()
        self.labels = ['NEGATIVE', 'POSITIVE']

    def preprocess_text(self, text):
        # Remove extra whitespace and normalize
        text = ' '.join(text.split())
        # Tokenize with BERT-specific tokens
        inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            max_length=512,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        # Move to GPU if available
        return {k: v.to(self.device) for k, v in inputs.items()}

    def predict(self, text):
        # Prepare text for model
        inputs = self.preprocess_text(text)
        # Get model predictions
        with torch.no_grad():
            outputs = self.model(**inputs)
            probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
        # Convert to human-readable format
        prediction_dict = {
            'text': text,
            'sentiment': self.labels[probabilities.argmax().item()],
            'confidence': probabilities.max().item(),
            'probabilities': {
                label: prob.item()
                for label, prob in zip(self.labels, probabilities[0])
            }
        }
        return prediction_dict
```
Here the work is split between the preprocess_text() and predict() functions, but the workflow is the same as before: the tokenizer processes the input text string into tensors, and the model converts those tensors into a prediction.
At initialization, the GPU is used if available, as indicated by torch.cuda.is_available(). The tokenizer and model are created using AutoTokenizer and AutoModelForSequenceClassification. According to the documentation of this specific model, the two output values correspond to NEGATIVE and POSITIVE, so you set up the labels in this order.
In the text preprocessing function preprocess_text(), extra spaces are removed from the text, which is then tokenized with BERT’s special tokens ([CLS], [SEP]). Truncation of long input or padding of short input is also applied at tokenization. The output of the tokenizer is a dict with the keys input_ids (a tensor of token IDs) and attention_mask (a tensor of 0s and 1s, indicating whether a valid token is present at each location).
The prediction logic in predict() runs the model with the inputs (input_ids and attention_mask), converts the outputs into probabilities, and returns the detailed prediction information.
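The softmax step can be illustrated in isolation. The sketch below uses made-up logits for the two classes to show how raw model outputs become probabilities:

```python
import torch

# Hypothetical logits for the two classes [NEGATIVE, POSITIVE]
logits = torch.tensor([[-2.0, 3.0]])

# Softmax maps logits to non-negative values that sum to 1
probs = torch.nn.functional.softmax(logits, dim=-1)
print(probs)                     # e.g. a small NEGATIVE and a large POSITIVE probability
print(probs.argmax(dim=-1))      # index of the predicted class
```

The argmax over the probabilities picks the predicted label, and the corresponding probability serves as the confidence score reported by the class above.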
Let’s test out this implementation:
```python
...

def demonstrate_sentiment_analysis():
    # Initialize analyzer
    analyzer = BERTSentimentAnalyzer()

    # Test texts
    texts = [
        "This product completely transformed my workflow!",
        "Terrible experience, would not recommend.",
        "It's decent for the price, but nothing special."
    ]

    # Analyze each text
    for text in texts:
        result = analyzer.predict(text)
        print(f"\nText: {result['text']}")
        print(f"Sentiment: {result['sentiment']}")
        print(f"Confidence: {result['confidence']:.4f}")
        print("Detailed probabilities:")
        for label, prob in result['probabilities'].items():
            print(f"  {label}: {prob:.4f}")

# Running demonstration
demonstrate_sentiment_analysis()
```
Here is what this code prints:
```
Text: This product completely transformed my workflow!
Sentiment: POSITIVE
Confidence: 0.9997
Detailed probabilities:
  NEGATIVE: 0.0003
  POSITIVE: 0.9997

Text: Terrible experience, would not recommend.
Sentiment: NEGATIVE
Confidence: 0.9934
Detailed probabilities:
  NEGATIVE: 0.9934
  POSITIVE: 0.0066

Text: It's decent for the price, but nothing special.
Sentiment: NEGATIVE
Confidence: 0.9897
Detailed probabilities:
  NEGATIVE: 0.9897
  POSITIVE: 0.0103
```
As you can see, the model predicted a sentiment for each statement; even the lukewarm third review is classified as negative, since the model only outputs binary labels.
Named Entity Recognition System
If you read the original BERT paper, you will find that it was not designed for sentiment classification but as a generic language model. It can be adapted to other uses.
One example is using BERT for named entity recognition (NER). This task identifies proper nouns (names, organizations, locations) in text. It is a difficult problem because, unlike ordinary words, which you can check against a dictionary to see whether they are verbs or pronouns, named entities are usually not found in a dictionary, so you cannot use a lookup table. Furthermore, some named entities span multiple words, such as “European Union”, and should be identified together as one entity.
You can find a pretrained BERT NER model on the Hugging Face Hub, too. Below is how you can modify the previous code for NER:
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

class BERTNamedEntityRecognizer:
    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
        self.model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()

    def recognize_entities(self, text):
        # Tokenize input text
        inputs = self.tokenizer(
            text,
            add_special_tokens=True,
            return_tensors="pt",
            padding=True,
            truncation=True
        )
        # Move inputs to device
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        # Get predictions
        with torch.no_grad():
            outputs = self.model(**inputs)
            predictions = outputs.logits.argmax(-1)

        # Convert predictions to entities
        tokens = self.tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
        labels = [self.model.config.id2label[p.item()] for p in predictions[0]]

        # Extract entities
        entities = []
        current_entity = None
        for token, label in zip(tokens, labels):
            if label.startswith('B-'):
                if current_entity:
                    entities.append(current_entity)
                current_entity = {'type': label[2:], 'text': token}
            elif label.startswith('I-') and current_entity:
                if token.startswith('##'):
                    current_entity['text'] += token[2:]
                else:
                    current_entity['text'] += ' ' + token
            elif label == 'O':
                if current_entity:
                    entities.append(current_entity)
                    current_entity = None
        if current_entity:
            entities.append(current_entity)
        return entities
```
The model name "dslim/bert-base-NER" is what you can search for on the Hugging Face Hub. It is a model based on BERT and specialized for NER.
In this class, the function recognize_entities() combines tokenization and model inference. If you check the output of the tokenizer, you will find that a dict of three tensors is produced, under the keys input_ids, token_type_ids, and attention_mask. This is different from the previous example, which is why you should use a tokenizer that matches the model.
The model is created with AutoModelForTokenClassification; token classification means tagging each input token at the output. Named entity recognition is achieved using the B-I-O (beginning-inside-outside) tagging scheme: “beginning” marks the token where a named entity starts; if multiple tokens belong to the same named entity, the second and subsequent tokens are tagged as “inside”; tokens that are not part of any named entity are tagged as “outside”.
The list labels converted from the model prediction will be like:
```
['O', 'B-ORG', 'O', 'B-PER', 'I-PER', 'O', 'O', 'B-MISC', ...]
```
Here the prefixes B- and I- indicate the first and subsequent tokens of an entity. The suffix, such as ORG or PER, tells the type of the entity, e.g., an organization, a person, a location, or another miscellaneous entity.
The for-loop simply enumerates all the named entities found. Since the model’s tokenizer may create subword tokens, a token that starts with ## is treated as a subword and merged with the previous token in the output.
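The entity-merging loop can be understood without running the model at all. Below is a minimal, model-free sketch of the same B-I-O merging logic, applied to hypothetical tokens and labels:

```python
# Hypothetical tokens and B-I-O labels, as the model would produce them
tokens = ["[CLS]", "Tim", "Cook", "leads", "Apple", "Inc", ".", "[SEP]"]
labels = ["O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG", "O", "O"]

entities = []
current = None
for token, label in zip(tokens, labels):
    if label.startswith("B-"):          # a new entity starts here
        if current:
            entities.append(current)
        current = {"type": label[2:], "text": token}
    elif label.startswith("I-") and current:
        if token.startswith("##"):      # subword: glue to previous token
            current["text"] += token[2:]
        else:                           # new word of the same entity
            current["text"] += " " + token
    elif label == "O":                  # outside: close any open entity
        if current:
            entities.append(current)
            current = None
if current:
    entities.append(current)

print(entities)
# → [{'type': 'PER', 'text': 'Tim Cook'}, {'type': 'ORG', 'text': 'Apple Inc'}]
```

This is exactly the logic inside recognize_entities(), just fed with hand-written labels so you can trace how multi-word entities are assembled.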
Let’s see it in action:
```python
def demonstrate_ner():
    # Initialize recognizer
    ner = BERTNamedEntityRecognizer()

    # Example text
    text = """
    Apple CEO Tim Cook announced new AI features at their headquarters
    in Cupertino, California. Microsoft and Google are also investing
    heavily in artificial intelligence research.
    """

    # Get entities
    entities = ner.recognize_entities(text)

    # Display results
    print("Found entities:")
    for entity in entities:
        print(f"- {entity['text']} ({entity['type']})")

# Running demonstration
demonstrate_ner()
```
Here is what you’ll get after running this code.
```
Found entities:
- Apple (ORG)
- Tim Cook (PER)
- AI (MISC)
- Cupertino (LOC)
- California (LOC)
- Microsoft (ORG)
- Google (ORG)
```
The model accurately recognized the entities, identifying not only multi-word entities but also their types.
Summary
In this comprehensive tutorial, you learned about the BERT model and its applications. In particular, you learned:
- What BERT is and how it processes input and output text
- How to set up BERT and build real-world applications with a few lines of code without knowing much about the model architecture
- How to build a sentiment analyzer with BERT
- How to build a named entity recognition (NER) system with BERT