
A Gentle Introduction to Calculating the BLEU Score for Text in Python

BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations.

Although developed for translation, it can be used to evaluate text generated for a suite of natural language processing tasks.

In this tutorial, you will discover the BLEU score for evaluating and scoring candidate text using the NLTK library in Python.

After completing this tutorial, you will know:

  • A gentle introduction to the BLEU score and an intuition for what is being calculated.
  • How you can calculate BLEU scores in Python using the NLTK library for sentences and documents.
  • How you can use a suite of small examples to develop an intuition for how differences between a candidate and reference text impact the final BLEU score.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • May/2019: Updated to reflect changes to the API in NLTK 3.4.1+.
A Gentle Introduction to Calculating the BLEU Score for Text in Python
Photo by Bernard Spragg. NZ, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Bilingual Evaluation Understudy Score
  2. Calculate BLEU Scores
  3. Cumulative and Individual BLEU Scores
  4. Worked Examples


Bilingual Evaluation Understudy Score

The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric for evaluating a generated sentence against a reference sentence.

A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0.

The score was developed for evaluating the predictions made by automatic machine translation systems. It is not perfect, but does offer 5 compelling benefits:

  • It is quick and inexpensive to calculate.
  • It is easy to understand.
  • It is language independent.
  • It correlates highly with human evaluation.
  • It has been widely adopted.

The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper “BLEU: a Method for Automatic Evaluation of Machine Translation“.

The approach works by counting n-grams in the candidate translation that match n-grams in the reference text, where a 1-gram or unigram is a single token and a bigram comparison uses each word pair. The comparison is made regardless of word order.

The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation is.

BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.

The counting of matching n-grams is modified to ensure that it takes the occurrence of the words in the reference text into account, not rewarding a candidate translation that generates an abundance of reasonable words. This is referred to in the paper as modified n-gram precision.

Unfortunately, MT systems can overgenerate “reasonable” words, resulting in improbable, but high-precision, translations […] Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision.

BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.
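
To make the clipping concrete, here is a small hand-rolled sketch (plain Python, not NLTK) of modified unigram precision, using the repeated-"the" example from the paper:

# sketch of modified (clipped) unigram precision, for intuition only
from collections import Counter

reference = 'the cat is on the mat'.split()
candidate = 'the the the the the the the'.split()

candidate_counts = Counter(candidate)
reference_counts = Counter(reference)

# clip each candidate count at the count observed in the reference
clipped = {word: min(count, reference_counts[word]) for word, count in candidate_counts.items()}
precision = sum(clipped.values()) / len(candidate)
print(precision)  # 2/7: only two of the seven 'the' tokens are credited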

The score is for comparing sentences, but a modified version that normalizes n-grams by their occurrence is also proposed for better scoring blocks of multiple sentences.

We first compute the n-gram matches sentence by sentence. Next, we add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, pn, for the entire test corpus.

BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.

A perfect score is not possible in practice as a translation would have to match the reference exactly. This is not even possible by human translators. The number and quality of the references used to calculate the BLEU score means that comparing scores across datasets can be troublesome.

The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1. […] on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references.

BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.

In addition to translation, we can use the BLEU score for other language generation problems with deep learning methods such as:

  • Language generation.
  • Image caption generation.
  • Text summarization.
  • Speech recognition.

And much more.

Calculate BLEU Scores

The Python Natural Language Toolkit library, or NLTK, provides an implementation of the BLEU score that you can use to evaluate your generated text against a reference.

Sentence BLEU Score

NLTK provides the sentence_bleu() function for evaluating a candidate sentence against one or more reference sentences.

The reference sentences must be provided as a list of sentences where each reference is a list of tokens. The candidate sentence is provided as a list of tokens. For example:
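
The snippet below is a minimal sketch with illustrative tokens; the candidate deliberately matches the first of the two references.

# two references, one candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate)
print(score)  # 1.0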

Running this example prints a perfect score as the candidate matches one of the references exactly.

Corpus BLEU Score

NLTK also provides a function called corpus_bleu() for calculating the BLEU score for multiple sentences such as a paragraph or a document.

The references must be specified as a list of documents where each document is a list of references and each alternative reference is a list of tokens, e.g. a list of lists of lists of tokens. The candidate documents must be specified as a list where each document is a list of tokens, e.g. a list of lists of tokens.

This is a little confusing; here is an example of two references for one document.
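
Below is a minimal sketch with one candidate document and two illustrative references for it.

# one document with two references
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
candidates = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, candidates)
print(score)  # 1.0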

Running the example prints a perfect score as before.

Cumulative and Individual BLEU Scores

The BLEU score calculations in NLTK allow you to specify the weighting of different n-grams in the calculation of the BLEU score.

This gives you the flexibility to calculate different types of BLEU score, such as individual and cumulative n-gram scores.

Let’s take a look.

Individual N-Gram Scores

An individual N-gram score is the evaluation of just matching grams of a specific order, such as single words (1-gram) or word pairs (2-gram or bigram).

The weights are specified as a tuple where each index refers to the gram order. To calculate the BLEU score only for 1-gram matches, you can specify a weight of 1 for 1-gram and 0 for 2, 3 and 4 (1, 0, 0, 0). For example:
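
A sketch of that call, using the illustrative sentence pair also quoted in the comments below ('this is small test' versus 'this is a test'):

# 1-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
print(score)  # 0.75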

Running this example prints a score of 0.75, as three of the four words in the candidate appear in the reference.

We can repeat this example for individual n-grams from 1 to 4 as follows:
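
A sketch of this, reusing the same illustrative sentence pair:

# n-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))
# expect roughly 0.75 for 1-grams and 0.33 for 2-grams; the 3-gram and 4-gram scores
# are effectively zero because the two sentences share no 3-grams or 4-grams, and
# recent NLTK versions also print a warning about the missing overlaps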

Running the example gives a high 1-gram score, a lower 2-gram score, and scores at or near zero for the 3-gram and 4-gram cases.

Although we can calculate the individual BLEU scores, this is not how the method was intended to be used, and the scores do not carry much meaning on their own, nor are they particularly interpretable.

Cumulative N-Gram Scores

Cumulative scores refer to calculating the individual n-gram scores at all orders from 1 to n and combining them with a weighted geometric mean.

By default, the sentence_bleu() and corpus_bleu() scores calculate the cumulative 4-gram BLEU score, also called BLEU-4.

The weights for the BLEU-4 are 1/4 (25%) or 0.25 for each of the 1-gram, 2-gram, 3-gram and 4-gram scores. For example:
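
A sketch of the call with the default BLEU-4 weights made explicit, again with the same illustrative sentence pair:

# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)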

Running this example prints a vanishingly small score (on the order of 1e-154 with NLTK 3.4.1+) together with a warning, because this candidate and reference share no 3-grams or 4-grams; older NLTK versions reported about 0.71 for the same inputs.

The cumulative and individual 1-gram BLEU scores use the same weights, e.g. (1, 0, 0, 0). The 2-gram weights assign 50% to each of the 1-gram and 2-gram scores, and the 3-gram weights assign 33% to each of the 1-, 2- and 3-gram scores.

Let’s make this concrete by calculating the cumulative scores for BLEU-1, BLEU-2, BLEU-3 and BLEU-4:
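
A sketch of this, once more with the same illustrative sentence pair:

# cumulative BLEU scores
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
# expect about 0.75 for BLEU-1 and 0.50 for BLEU-2 (the geometric mean of 3/4 and 1/3);
# without smoothing, BLEU-3 and BLEU-4 collapse towards zero for this pair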

Running the example prints the four cumulative scores. They are quite different and more expressive than the standalone individual n-gram scores.

It is common to report the cumulative BLEU-1 to BLEU-4 scores when describing the skill of a text generation system.

Worked Examples

In this section, we try to develop further intuition for the BLEU score with some examples.

We work at the sentence level with a single reference sentence of the following:

the quick brown fox jumped over the lazy dog

First, let’s look at a perfect score.
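
A sketch of the comparison, with the candidate identical to the reference:

# perfect match
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)  # 1.0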

Running the example prints a perfect match.

Next, let’s change one word, ‘quick‘ to ‘fast‘.
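
A sketch of this comparison:

# one word different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)  # about 0.75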

Running this example results in a slight drop in score.

Try changing two words, both ‘quick‘ to ‘fast‘ and ‘lazy‘ to ‘sleepy‘.
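
A sketch of this comparison:

# two words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)  # about 0.49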

Running the example, we can see a linear drop in skill.

What about if all words are different in the candidate?
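
A sketch of this case, using arbitrary placeholder tokens for the candidate:

# all words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['a', 'completely', 'unrelated', 'sentence', 'with', 'no', 'shared', 'words', 'here']
score = sentence_bleu(reference, candidate)
print(score)  # 0.0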

We get the worst possible score.

Now, let’s try a candidate that has fewer words than the reference (e.g. drop the last two words), but the words are all correct.
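
A sketch of this case, with the last two reference words dropped from the candidate:

# shorter candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the']
score = sentence_bleu(reference, candidate)
print(score)  # about 0.75: every n-gram matches, but the brevity penalty applies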

The score is much like the score when one word was wrong above: every n-gram in the shorter candidate still matches, so only the brevity penalty reduces the score.

How about if we make the candidate two words longer than the reference?
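
A sketch of this case; the two extra tokens ('from', 'space') appended to the candidate are illustrative.

# longer candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', 'from', 'space']
score = sentence_bleu(reference, candidate)
print(score)  # about 0.79: only the n-grams spanning the two extra words are lost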

Again, the score is much like the case where one word was wrong: the brevity penalty does not apply to longer candidates, and only the n-grams that span the two extra words fail to match.

Finally, let’s compare a candidate that is way too short: only two words in length.
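
A sketch of this case:

# very short candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick']
score = sentence_bleu(reference, candidate)
print(score)  # a warning plus a score very close to zero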

Running this example first prints a warning message indicating that the 3-gram and above part of the evaluation (up to 4-gram) cannot be performed. This is fair, given the two-word candidate contains nothing longer than a 2-gram.

Next, we see a score that is very low indeed.

I encourage you to continue to play with examples.

The math is pretty simple and I would also encourage you to read the paper and explore calculating the sentence-level score yourself in a spreadsheet.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered the BLEU score for evaluating and scoring candidate text against reference text in machine translation and other language generation tasks.

Specifically, you learned:

  • A gentle introduction to the BLEU score and an intuition for what is being calculated.
  • How you can calculate BLEU scores in Python using the NLTK library for sentences and documents.
  • How you can use a suite of small examples to develop an intuition for how differences between a candidate and reference text impact the final BLEU score.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


112 Responses to A Gentle Introduction to Calculating the BLEU Score for Text in Python

  1. Avatar
    ngc December 25, 2017 at 10:08 am #

    Good day Dr. Brownlee, I am wondering if I can use BLEU as a criteria for early stopping?

    • Avatar
      Jason Brownlee December 26, 2017 at 5:12 am #

      Sure.

      • Avatar
        Bashayer January 7, 2020 at 1:41 am #

        Please how can i down load bleu score

        How can i down load bleu score

  2. Avatar
    Sasikanth January 8, 2018 at 5:01 pm #

    Hello Jason,

    wonderful to learn about BLEU (Bilingual Evaluation Understudy). Is there a package in R for this?

    thanks

  3. Avatar
    Davin Chern January 31, 2018 at 6:34 pm #

    Hi Jason,

    Thanks for your excellent introduction about BLEU.

    When I try to calculate the BLEU scores for multiple sentences with corpus_bleu(), I found something strange.

    Suppose I have a paragraph with two sentences, and I try to translate them both, the following are two cases:

    case 1:
    references = [[[‘a’, ‘b’, ‘c’, ‘d’]], [[‘e’, ‘f’, ‘g’]]]
    candidates = [[‘a’, ‘b’, ‘c’, ‘d’], [‘e’, ‘f’, ‘g’]]
    score = corpus_bleu(references, candidates)

    case 2:
    references = [[[‘a’, ‘b’, ‘c’, ‘d’, ‘x’]], [[‘e’, ‘f’, ‘g’, ‘y’]]]
    candidates = [[‘a’, ‘b’, ‘c’, ‘d’, ‘x’], [‘e’, ‘f’, ‘g’, ‘y’]]
    score = corpus_bleu(references, candidates)

    I assume both should give me a result of 1.0, but only the second does, while the first is 0.84. Actually when both sentences have a length of 4 or above, the answer is always 1.0, so I think it is because the second sentence of case 1 has no 4-gram.

    In practice, when dealing with sentence whose length is smaller than 4, do we have to make corpus_bleu() ignore the redundant n-gram cases by setting appropriate weights?

    I appreciate your help!

    • Avatar
      Jason Brownlee February 1, 2018 at 7:18 am #

      Yes, ideally. I’d recommend reporting 1,2,3,4 ngram scores separately as well.

  4. Avatar
    Daniel Pietschmann June 25, 2018 at 12:22 am #

    Dear Jason Brownlee,

    thanks a lot for this awesome tutorial, it really helps a lot!

    Sadly I get an error using this part of the code:
    “# n-gram individual BLEU
    from nltk.translate.bleu_score import sentence_bleu
    reference = [[‘this’, ‘is’, ‘a’, ‘test’]]
    candidate = [‘this’, ‘is’, ‘a’, ‘test’]
    print(‘Individual 1-gram: %f’ % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
    print(‘Individual 2-gram: %f’ % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
    print(‘Individual 3-gram: %f’ % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
    print(‘Individual 4-gram: %f’ % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))”

    This is the error message: “Corpus/Sentence contains 0 counts of 3-gram overlaps.
    BLEU scores might be undesirable; use SmoothingFunction().
    warnings.warn(_msg)”

    I tried adding a smoothing Function:
    “from nltk.translate.bleu_score import SmoothingFunction
    chencherry = SmoothingFunction()
    print(‘Cumulative 1-gram: %f’ % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=chencherry.method4))
    print(‘Cumulative 2-gram: %f’ % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0), smoothing_function=chencherry.method4))
    print(‘Cumulative 3-gram: %f’ % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0), smoothing_function=chencherry.method4))
    print(‘Cumulative 4-gram: %f’ % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=chencherry.method4))”

    That helps, now the error message is gone, but no I have different scores from yours:
    “Cumulative 1-gram: 0.750000
    Cumulative 2-gram: 0.500000
    Cumulative 3-gram: 0.358299
    Cumulative 4-gram: 0.286623”

    I don’t really understand what the problem was, and why I get different results now.
    I would be super grateful if you could explain me happened in my code.

    Thanks a lot in advance 🙂

  5. Avatar
    Gaurav Gupta July 24, 2018 at 4:55 am #

    Great tutorial!

  6. Avatar
    sawsan August 18, 2018 at 9:27 pm #

    Thank you ,

  7. Avatar
    Francesco September 7, 2018 at 2:42 am #

    In the lazy fox example, changing quick to fast yields a significant drop in the BLEU score, yet the two sentences mean the same thing.

    I wonder if we can mitigate the effect by using word vectors instead of the words themselves. Are you aware of an algorithm that uses BLEU on word embeddings?

    • Avatar
      Jason Brownlee September 7, 2018 at 8:09 am #

      Agreed.

      Scoring based on meaning is a good idea. I’ve not seen anything on this sorry.

    • Avatar
      Todd September 22, 2020 at 3:36 am #

      Hey, i am thinking just like that..
      have you had any experiment on that regards?i think word embedding is definitely gonna yield more reasonable outcomes.

  8. Avatar
    Aziz October 12, 2018 at 12:10 pm #

    Hi Jason and thanks for a great tutorial.

    I think there is a mistake in the tutorial that contradict with our intuition rather than agreeing with it. The score for having two words shorter or longer is much like the score for having 1 word different rather than 2 words.

    • Avatar
      Jason Brownlee October 13, 2018 at 6:07 am #

      Is it?

      Well, there’s no perfect measure for all cases.

  9. Avatar
    Chen Mei February 1, 2019 at 12:47 am #

    How to calculate ROUGE, CIDEr and METEOR values in Python ?

    • Avatar
      Jason Brownlee February 1, 2019 at 5:39 am #

      Sorry, I don’t have an example of calculating these measures.

      • Avatar
        Zara July 31, 2019 at 7:35 am #

        May you please create a tutorial about how to calculate METEOR, TER, and ROUGE?

        • Avatar
          Jason Brownlee July 31, 2019 at 2:05 pm #

          Great suggestions, thanks!

          • Avatar
            Micky February 12, 2020 at 11:06 am #

            When are you gonna create a tutorial on how to calculate METEOR, TER, and ROUGE Sir?

          • Avatar
            Jason Brownlee February 12, 2020 at 1:36 pm #

            No fixed schedule at this stage.

  10. Avatar
    Dave Howcroft March 8, 2019 at 1:36 am #

    I think it’s misleading to suggest that it makes sense to use BLEU for generation and image captioning. BLEU seems to work well for the task it was designed for (MT eval during development), but there’s not really evidence to support the idea that it’s a good metric for NLG. See, for example, Ehud Reiter’s paper from last year: https://aclanthology.info/papers/J18-3002/j18-3002

    One of the issues is that BLEU is usually calculated using only a few reference texts, but there’s also reason to think that we can’t reasonably expand the set of references to cover enough of the space of valid texts for a given meaning for it to be a good measure (cf. this work on similar metrics for grammatical error correction: https://aclanthology.info/papers/P18-1059/p18-1059).

    • Avatar
      Jason Brownlee March 8, 2019 at 7:53 am #

      I think you might be right, perplexity might be a better measure for language generation tasks.

  11. Avatar
    Madhav April 12, 2019 at 3:54 pm #

    Hi Jason,

    I’m working on Automatic Question Generation. Can I use BLEU as an evaluation metric. If yes, how does it adapt to questions. If not, what other metric would you suggest me?

    • Avatar
      Jason Brownlee April 13, 2019 at 6:21 am #

      Perhaps, or ROGUE or similar scores.

      Perhaps check recent papers on the topic and see what is common.

  12. Avatar
    Shubham May 9, 2019 at 3:04 am #

    >>> from nltk.translate.bleu_score import sentence_bleu
    >>> reference = [[‘this’, ‘is’, ‘small’, ‘test’]]
    >>> candidate = [‘this’, ‘is’, ‘a’, ‘test’]
    >>> print(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
    1.0547686614863434e-154

  13. Avatar
    Justen May 21, 2019 at 5:18 pm #

    I’m also running into the same problem.
    The example you give gives a score of 0.707106781187
    I on the other hand get a extremely strong score of 1.0547686614863434e-154
    What is happening?

    • Avatar
      Justen May 21, 2019 at 5:21 pm #

      Sorry so many typos.
      Shubham gave this code as a example.
      >>> from nltk.translate.bleu_score import sentence_bleu
      >>> reference = [[‘this’, ‘is’, ‘small’, ‘test’]]
      >>> candidate = [‘this’, ‘is’, ‘a’, ‘test’]
      >>> print(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
      1.0547686614863434e-154

      You run the same code, but get a bleu score of 0.707106781187
      What is happening?

      • Avatar
        Jason Brownlee May 22, 2019 at 7:43 am #

        I get the same result, perhaps the API has changed recently?

        I will schedule time to update the post.

    • Avatar
      Jason Brownlee May 22, 2019 at 7:39 am #

      That is surprising, is your library up to date? Did you copy all of the code exactly?

      • Avatar
        Justen May 23, 2019 at 12:31 am #

        Yes, the exact same code with the exact same example.
        Because of this ‘bug’ a lot of bleu scores evaluate to zero or nearly zero, like 1.0547686614863434e-154. I’ve yet to find any reason to why this is happening.

    • Avatar
      Sanjita Suresh July 2, 2019 at 3:29 am #

      Could you find some solution to this? even I am facing the same issue

  14. Avatar
    Pavithra June 12, 2019 at 9:00 pm #

    Is it possible to find the BLEU score of a machine translation model ?

  15. Avatar
    Pavithra June 13, 2019 at 1:50 pm #

    Can you please share how to identify the BLEU score of a model. I have build the machine translation model with Moses.

    Thanks.

  16. Avatar
    Sanjita Suresh July 2, 2019 at 3:27 am #

    Thank you for the great tutorial.
    I am getting different bleu scores in Google colab and Jupyter notebook

    prediction_en = ‘A man in an orange hat presenting something’
    reference_en= ‘A man in an orange hat starring at something’

    Used code,

    from nltk.translate.bleu_score import sentence_bleu
    from nltk.translate.bleu_score import SmoothingFunction
    smoother = SmoothingFunction()

    sentence_bleu(reference_en, prediction_en , smoothing_function=smoother.method4)

    For this I am getting a Bleu score of 0.768521 in google colab but I am getting 1.3767075064676063e-231 score in jupyter notebook without smoothing and with smoothing 0.3157039

    Can you please help me what and where I am going wrong?

    • Avatar
      alex September 14, 2019 at 1:22 am #

      Try adding weights to both colab & on local machine..something like this.

      score = sentence_bleu(reference, candidate, weights=(1, 0,0,0))

      I am getting same results

    • Avatar
      Tohida February 23, 2022 at 1:57 am #

      Now in colab it is giving very low score. Can you tell me the reason?
      prediction_en = ‘A man in an orange hat presenting something.’
      reference_en= ‘A man in an orange hat starring at something.’

      from nltk.translate.bleu_score import sentence_bleu
      from nltk.translate.bleu_score import SmoothingFunction
      smoother = SmoothingFunction()

      sentence_bleu(reference_en, prediction_en,smoothing_function=smoother.method4)
      Output:0.013061727262337088

      • Avatar
        James Carmichael February 23, 2022 at 12:21 pm #

        Hi Tohida…Are you experiencing different results within Colab and another environment?

  17. Avatar
    Quang Le July 16, 2019 at 11:20 pm #

    Hi Jason, I am training 2 neural machine translation model (model A and B with different improvements each model) with fairseq-py. When I evaluate model with bleu score, model A BLEU score is 25.9 and model B is 25.7. Then i filtered data by length into 4 range values such as 1 to 10 words, 11 to 20 words, 21 to 30 words and 31 to 40 words. I re-evaluated on each filtered data and all bleu scores of model B is greater than model A. Do you think this is normal case?

    • Avatar
      Jason Brownlee July 17, 2019 at 8:26 am #

      Yes, the fine grained evaluation might be more relevant to you.

  18. Avatar
    Shyam Yadav December 15, 2019 at 6:13 am #

    What is the difference between BLEU-1, BLEU-2, BLEU-3, BLEU-4? Is it 1-gram, 2-gram,…. Another doubt I had in my mind is that what is the difference between weights=(0.25, 0.25, 0.25, 0.25) and weights=(0, 0, 0, 1) for n = 4 BLEU?

    • Avatar
      Jason Brownlee December 16, 2019 at 6:02 am #

      They evaluate different length sequences of words.

      You can see the different in some of the worked examples above.

      • Avatar
        Shyam Yadav December 17, 2019 at 8:50 pm #

        I am still confused on when to use (0,0,0,1) and (0.25, 0.25, 0.25, 0.25).

        • Avatar
          Jason Brownlee December 18, 2019 at 6:04 am #

          Good question.

          Use 0,0,0,1 when you only are about correct 4-grams

          Use 0.25,0.25,0.25,0.25 when you care about 1-,2-,3-,4- grams all with the same weight.

          • Avatar
            Shyam Yadav December 19, 2019 at 4:58 am #

            Ok sir, thank you.

          • Avatar
            shab May 9, 2022 at 7:38 pm #

            why considering unigrams where unigrams just mean the word itself while cumulative ngram scores…is bi, tri or four grams score better than unigram score

          • Avatar
            James Carmichael May 10, 2022 at 12:09 pm #

            Hi Shab…Please rephrase your question so that we may better assist you.

  19. Avatar
    Shyam Yadav December 26, 2019 at 6:20 am #

    Can you tell me how to do calculation for weights = (0.5,0.5,0,0) through pen and paper for any of the reference and predicted?

    • Avatar
      Shyam Yadav December 26, 2019 at 6:22 am #

      I have used this

      references = [[‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jumped’, ‘over’, ‘the’, ‘lazy’, ‘dog’]]
      candidates = [‘the’, ‘quick’, ‘fox’, ‘jumped’, ‘on’, ‘the’, ‘dog’]

      score = sentence_bleu(references, candidates, weights = (0.5,0.5,0,0), smoothing_function=SmoothingFunction().method0)
      print(score)

      and got the output as

      0.4016815092325757

      Can you please tell me how is it done mathematically step by step? Thank you alot

      • Avatar
        Jason Brownlee December 26, 2019 at 7:43 am #

        Yes, you can see the calculation in the paper.

    • Avatar
      Jason Brownlee December 26, 2019 at 7:42 am #

      Great question!

      The paper referenced in the tutorial will show you the calculation.

      • Avatar
        Shyam Yadav December 26, 2019 at 5:54 pm #

        Where is the tutorial and where can I see the calculation in the paper? Can you please give me the link?

        • Avatar
          Jason Brownlee December 27, 2019 at 6:32 am #

          I do not have a tutorial that goes through the calculation in the paper.

          • Avatar
            Shyam Yadav December 28, 2019 at 12:28 am #

            The paper referenced in the tutorial will show you the calculation.

            What about this then? Any link? Or can you provide one tutorial or just a pic of where you can show me the calculation for cumulative bleu score? Please if possible?

          • Avatar
            Jason Brownlee December 28, 2019 at 7:48 am #

            I may cover the topic in the future.

  20. Avatar
    safia February 21, 2020 at 7:03 pm #

    hello Janson,
    Can you please write answer or validate these ( link) assumption about Bleu?
    https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213

    • Avatar
      Jason Brownlee February 22, 2020 at 6:23 am #

      Sorry, I don’t have the capacity to vet third party tutorials for you.

      Perhaps you can summarize the problem you’re having in a sentence or two?

      • Avatar
        safia March 1, 2020 at 11:58 pm #

        oh, my apologies. Thank you for all the tutorials. It is really a great help for us.

  21. Avatar
    sawsan February 27, 2020 at 4:53 pm #

    please how can i use smooth function with corpus level

    • Avatar
      Jason Brownlee February 28, 2020 at 5:58 am #

      Specify the smoothing when calculating the score after your model has made predictions for your whole corpus.

  22. Avatar
    sawsan March 2, 2020 at 5:06 pm #

    please i apply the equation and example

    from nltk.translate.bleu_score import sentence_bleu
    reference = [[‘the’, ‘cat’, ‘is’, ‘sitting’,’on’, ‘the’ ,’mat’]]
    candidate = [‘on’, ‘the’, ‘mat’, ‘is’,’a’,’cat’]
    score = sentence_bleu(reference, candidate,weights=(0.25, 0.25, 0.25,0.25))
    print(score)
    5.5546715329196825e-78

    but by equation in the blog from in this blog :
    https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4
    i get the score 0.454346419
    bleu=EXP(1-7/6)*EXP(LN(0.83)*0.25+LN(0.4)*0.25+LN(0.25)*0.25)
    why the result is different .
    can you help me please?

    • Avatar
      Jason Brownlee March 3, 2020 at 5:56 am #

      I’m not familiar with that blog, perhaps contact the author directly.

  23. Avatar
    sawsan Asjea March 2, 2020 at 5:47 pm #

    hello Jason.
    i think the result will match when do this .
    from nltk.translate.bleu_score import sentence_bleu
    reference = [[‘the’, ‘cat’, ‘is’, ‘sitting’,’on’, ‘the’ ,’mat’]]
    candidate = [‘on’, ‘the’, ‘mat’, ‘is’,’a’,’cat’]
    score = sentence_bleu(reference, candidate,weights=(0.25, 0.25, 0.25))
    print(score)
    0.4548019047027907

    is it correct to ignore the last weight 0.25, How can I explain this in your opinion?;

  24. Avatar
    amel March 26, 2020 at 4:50 am #

    great tutorial, thank you Jason.

  25. Avatar
    Amrutesh April 14, 2020 at 4:23 am #

    I built a neural net to generate multiple captions
    I’m using flickr8k dataset so I have 5 captions as candidate
    How to generate bleu score for the multiple captions?

  26. Avatar
    Sunny April 25, 2020 at 4:52 am #

    How do i use BLeU to compare the 2 text generation models lets say LSTM and ngram using the generated text generated? what will be the reference in that case?

    • Avatar
      Jason Brownlee April 25, 2020 at 7:04 am #

      Calculate the score for each model on the same dataset, and compare.

      • Avatar
        Sunny April 26, 2020 at 8:11 am #

        Yeah but what will be the reference? I know candidate will be the output text but what will be the reference? If i use all my train set to generate text – will my original text of 50000 rows of text be the reference?

        • Avatar
          Jason Brownlee April 27, 2020 at 5:22 am #

          The expected text output for the test data is used as the reference.

  27. Avatar
    Dooji May 5, 2020 at 6:33 am #

    Hi! thanks for the post. I am using BLEU for evaluating a summarization model. Thus, the sentences generated by my model and the ground truth summary are not aligned and do not have the same count. I want to know if that will be a problem if I wanna use the corpus_bleu. cause in the documentation it seems like each sentence in hyp needs to have a corresponding reference sentence.

    • Avatar
      Jason Brownlee May 5, 2020 at 6:37 am #

      From memory, I think it will fine, as long as there are some n-grams to compare.

      • Avatar
        Dr. Abdulnaser November 21, 2020 at 12:00 pm #

        Thank you Jassin for informative tutorial. Can i use bleu for machine translation post Editing

        • Avatar
          Jason Brownlee November 21, 2020 at 1:05 pm #

          Yes, you can use BLEU for evaluating machine translation models.

  28. Avatar
    ghaith July 22, 2020 at 4:46 am #

    why is loss high ?
    it reach to 2 ?
    i know it must be under 1

  29. Avatar
    Asha October 2, 2020 at 4:15 pm #

    Hello sir,

    Great tutorial !

    Can I calculate BLEU score for translation that handles transliterated text?

  30. Avatar
    DHILIP KUMAR T.P October 16, 2020 at 5:06 am #

    hello jason,

    we can apply bleu score for speech recognition system

    • Avatar
      Jason Brownlee October 16, 2020 at 5:57 am #

      I don’t know off hand sorry, I would guess no, and would recommend you check the literature.

  31. Avatar
    Felipe November 29, 2020 at 3:33 am #

    Hello Dr, Brownlee.

    I want to know if it is normal / correct to use the Bleu Score in the training phase of a Deep Learning Model, or should it only be used in the testing phase?

    I have a deep learning model and three sets of data – training, validation and testing.

  32. Avatar
    Azaz Ur Rehman Butt March 1, 2021 at 9:08 pm #

    Dear Jason, I am working on an image captioning task and the BELU Score I’m getting is under 0.6, will it work fine for my model or I’ll have to improve it?

    • Avatar
      Jason Brownlee March 2, 2021 at 5:43 am #

      Perhaps compare your score to other simpler models on the same dataset to see if it has relative skill.

  33. Avatar
    Stanislav March 17, 2021 at 5:41 am #

    In another sentence_bleu tutorial i’ve noticed, that weights for 3-gram was defined as (1, 0, 1, 0). Can you please clarify this moment, because i have no idea of the purpose of the first digit in the tuple?

    • Avatar
      Jason Brownlee March 17, 2021 at 6:11 am #

      Sorry, I don’t understand your question. Can you please elaborate?

  34. Avatar
    KG17 April 22, 2021 at 11:19 pm #

    This is great information, however I just have a question relating to calculating BLEU scores for entire docuements. The examples you show are with sentences and I am interested in comparing .txt documents with each other. Do you possibly have an example or could explain how I could do this as that is not entirely clear to me from the explaination.
    Thanks so much, any advice would be appreciated!

    • Avatar
      Jason Brownlee April 23, 2021 at 5:04 am #

      Thanks.

      Perhaps averaged over sentences? See the above examples.

  35. Avatar
    srz May 5, 2021 at 5:54 am #

    Hey Jason. Thanks for such a concise and clear explanation.
    However, I’ve been working on language models these days and have observed people getting a bleu score as high as 36 and 50s. How is it even possible if a perfect score is 1?
    An article from google cloud states that a good bleu score is above 50.
    Where am I wrong in understanding?
    Thank You

    • Avatar
      Jason Brownlee May 5, 2021 at 6:14 am #

      You’re welcome.

      Perhaps they are reporting the bleu score multiplied by 100.

      • Avatar
        srz May 6, 2021 at 7:30 am #

        oh, thank you very much. Yes they are presenting as a percentage instead of a decimal number

  36. Avatar
    MUHAMMAD KAMRAN September 2, 2021 at 6:30 pm #

    Hey justin , how r u doing … what if the caption generation model give 0.9 bleu score , is it possible and acceptable or there is something wrong with the model ??

    • Avatar
      Jason Brownlee September 3, 2021 at 5:29 am #

      You must decide whether a given model is appropriate for your specific project or not.

  37. Avatar
    Bambang Setiawan November 15, 2021 at 2:21 pm #

    Hi

    No I’m working on RNN to create a conversation model using Tensorflow/Keras.

    Do you know how do I add BLEU score while compiling the model?

    Thanks

    • Avatar
      Adrian Tam November 16, 2021 at 1:58 am #

      It doesn’t seem to have BLEU in Keras. Probably you need to check if there is a third party implementation, or you need to write your own function to do that.

  38. Avatar
    wajahat February 23, 2022 at 8:24 pm #

    any code implantation of CIDER or Meteor like that.

    • Avatar
      James Carmichael February 24, 2022 at 12:57 pm #

      Hi Wajahat…I am not familiar with either.

  39. Avatar
    Daniel Kleine June 5, 2023 at 10:44 pm #

    “Running this example prints a score of 0.5.”
    -> You mean 0.75, right?
