
A Gentle Introduction to Calculating the BLEU Score for Text in Python

BLEU, or the Bilingual Evaluation Understudy, is a score for comparing a candidate translation of text to one or more reference translations.

Although developed for translation, it can be used to evaluate text generated for a suite of natural language processing tasks.

In this tutorial, you will discover the BLEU score for evaluating and scoring candidate text using the NLTK library in Python.

After completing this tutorial, you will know:

  • A gentle introduction to the BLEU score and an intuition for what is being calculated.
  • How you can calculate BLEU scores in Python using the NLTK library for sentences and documents.
  • How you can use a suite of small examples to develop an intuition for how differences between a candidate and reference text impact the final BLEU score.

Kick-start your project with my new book Deep Learning for Natural Language Processing, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

  • May/2019: Updated to reflect changes to the API in NLTK 3.4.1+.
A Gentle Introduction to Calculating the BLEU Score for Text in Python
Photo by Bernard Spragg. NZ, some rights reserved.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

  1. Bilingual Evaluation Understudy Score
  2. Calculate BLEU Scores
  3. Cumulative and Individual BLEU Scores
  4. Worked Examples


Bilingual Evaluation Understudy Score

The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric for evaluating a generated sentence against a reference sentence.

A perfect match results in a score of 1.0, whereas a perfect mismatch results in a score of 0.0.

The score was developed for evaluating the predictions made by automatic machine translation systems. It is not perfect, but does offer 5 compelling benefits:

  • It is quick and inexpensive to calculate.
  • It is easy to understand.
  • It is language independent.
  • It correlates highly with human evaluation.
  • It has been widely adopted.

The BLEU score was proposed by Kishore Papineni, et al. in their 2002 paper “BLEU: a Method for Automatic Evaluation of Machine Translation“.

The approach works by counting n-grams in the candidate translation that match n-grams in the reference text, where a 1-gram or unigram is a single token and a bigram comparison uses each word pair. The comparison is made regardless of word order.

The primary programming task for a BLEU implementor is to compare n-grams of the candidate with the n-grams of the reference translation and count the number of matches. These matches are position-independent. The more the matches, the better the candidate translation is.

BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.

The counting of matching n-grams is modified to ensure that it takes the occurrence of the words in the reference text into account, not rewarding a candidate translation that generates an abundance of reasonable words. This is referred to in the paper as modified n-gram precision.

Unfortunately, MT systems can overgenerate “reasonable” words, resulting in improbable, but high-precision, translations […] Intuitively the problem is clear: a reference word should be considered exhausted after a matching candidate word is identified. We formalize this intuition as the modified unigram precision.

BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.
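
To make the clipping concrete, here is a small hand-rolled sketch (plain Python, not NLTK) of modified unigram precision, using the repeated-"the" example from the paper:

# sketch of modified (clipped) unigram precision, for intuition only
from collections import Counter

reference = 'the cat is on the mat'.split()
candidate = 'the the the the the the the'.split()

candidate_counts = Counter(candidate)
reference_counts = Counter(reference)

# clip each candidate count at the count observed in the reference
clipped = {word: min(count, reference_counts[word]) for word, count in candidate_counts.items()}
precision = sum(clipped.values()) / len(candidate)
print(precision)  # 2/7: only two of the seven 'the' tokens are credited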

The score is for comparing sentences, but a modified version that normalizes n-grams by their occurrence is also proposed for better scoring blocks of multiple sentences.

We first compute the n-gram matches sentence by sentence. Next, we add the clipped n-gram counts for all the candidate sentences and divide by the number of candidate n-grams in the test corpus to compute a modified precision score, pn, for the entire test corpus.

BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.

A perfect score is not possible in practice as a translation would have to match the reference exactly. This is not even possible by human translators. The number and quality of the references used to calculate the BLEU score means that comparing scores across datasets can be troublesome.

The BLEU metric ranges from 0 to 1. Few translations will attain a score of 1 unless they are identical to a reference translation. For this reason, even a human translator will not necessarily score 1. […] on a test corpus of about 500 sentences (40 general news stories), a human translator scored 0.3468 against four references and scored 0.2571 against two references.

BLEU: a Method for Automatic Evaluation of Machine Translation, 2002.

In addition to translation, we can use the BLEU score for other language generation problems with deep learning methods such as:

  • Language generation.
  • Image caption generation.
  • Text summarization.
  • Speech recognition.

And much more.

Calculate BLEU Scores

The Python Natural Language Toolkit library, or NLTK, provides an implementation of the BLEU score that you can use to evaluate your generated text against a reference.

Sentence BLEU Score

NLTK provides the sentence_bleu() function for evaluating a candidate sentence against one or more reference sentences.

The reference sentences must be provided as a list of sentences where each reference is a list of tokens. The candidate sentence is provided as a list of tokens. For example:
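
The snippet below is a minimal sketch with illustrative tokens; the candidate deliberately matches the first of the two references.

# two references, one candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate)
print(score)  # 1.0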

Running this example prints a perfect score as the candidate matches one of the references exactly.

Corpus BLEU Score

NLTK also provides a function called corpus_bleu() for calculating the BLEU score for multiple sentences such as a paragraph or a document.

The references must be specified as a list of documents where each document is a list of references and each alternative reference is a list of tokens, e.g. a list of lists of lists of tokens. The candidate documents must be specified as a list where each document is a list of tokens, e.g. a list of lists of tokens.

This is a little confusing; here is an example of two references for one document.
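
Below is a minimal sketch with one candidate document and two illustrative references for it.

# one document with two references
from nltk.translate.bleu_score import corpus_bleu
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
candidates = [['this', 'is', 'a', 'test']]
score = corpus_bleu(references, candidates)
print(score)  # 1.0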

Running the example prints a perfect score as before.

Cumulative and Individual BLEU Scores

The BLEU score calculations in NLTK allow you to specify the weighting of different n-grams in the calculation of the BLEU score.

This gives you the flexibility to calculate different types of BLEU score, such as individual and cumulative n-gram scores.

Let’s take a look.

Individual N-Gram Scores

An individual N-gram score is the evaluation of just matching grams of a specific order, such as single words (1-gram) or word pairs (2-gram or bigram).

The weights are specified as a tuple where each index refers to the gram order. To calculate the BLEU score only for 1-gram matches, you can specify a weight of 1 for 1-gram and 0 for 2, 3 and 4 (1, 0, 0, 0). For example:
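
A sketch of that call, using the illustrative sentence pair also quoted in the comments below ('this is small test' versus 'this is a test'):

# 1-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
print(score)  # 0.75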

Running this example prints a score of 0.75, as three of the four words in the candidate appear in the reference.

We can repeat this example for individual n-grams from 1 to 4 as follows:
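
A sketch of this, reusing the same illustrative sentence pair:

# n-gram individual BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))
# expect roughly 0.75 for 1-grams and 0.33 for 2-grams; the 3-gram and 4-gram scores
# are effectively zero because the two sentences share no 3-grams or 4-grams, and
# recent NLTK versions also print a warning about the missing overlaps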

Running the example gives a high 1-gram score, a lower 2-gram score, and scores at or near zero for the 3-gram and 4-gram cases.

Although we can calculate the individual BLEU scores, this is not how the method was intended to be used, and the scores do not carry much meaning on their own, nor are they particularly interpretable.

Cumulative N-Gram Scores

Cumulative scores refer to calculating the individual n-gram scores at all orders from 1 to n and combining them with a weighted geometric mean.

By default, the sentence_bleu() and corpus_bleu() scores calculate the cumulative 4-gram BLEU score, also called BLEU-4.

The weights for the BLEU-4 are 1/4 (25%) or 0.25 for each of the 1-gram, 2-gram, 3-gram and 4-gram scores. For example:
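
A sketch of the call with the default BLEU-4 weights made explicit, again with the same illustrative sentence pair:

# 4-gram cumulative BLEU
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)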

Running this example prints a vanishingly small score (on the order of 1e-154 with NLTK 3.4.1+) together with a warning, because this candidate and reference share no 3-grams or 4-grams; older NLTK versions reported about 0.71 for the same inputs.

The cumulative and individual 1-gram BLEU scores use the same weights, e.g. (1, 0, 0, 0). The 2-gram weights assign 50% to each of the 1-gram and 2-gram scores, and the 3-gram weights assign 33% to each of the 1-, 2- and 3-gram scores.

Let’s make this concrete by calculating the cumulative scores for BLEU-1, BLEU-2, BLEU-3 and BLEU-4:
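
A sketch of this, once more with the same illustrative sentence pair:

# cumulative BLEU scores
from nltk.translate.bleu_score import sentence_bleu
reference = [['this', 'is', 'small', 'test']]
candidate = ['this', 'is', 'a', 'test']
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0)))
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
# expect about 0.75 for BLEU-1 and 0.50 for BLEU-2 (the geometric mean of 3/4 and 1/3);
# without smoothing, BLEU-3 and BLEU-4 collapse towards zero for this pair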

Running the example prints the four cumulative scores. They are quite different and more expressive than the standalone individual n-gram scores.

It is common to report the cumulative BLEU-1 to BLEU-4 scores when describing the skill of a text generation system.

Worked Examples

In this section, we try to develop further intuition for the BLEU score with some examples.

We work at the sentence level with a single reference sentence of the following:

the quick brown fox jumped over the lazy dog

First, let’s look at a perfect score.
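
A sketch of the comparison, with the candidate identical to the reference:

# perfect match
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)  # 1.0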

Running the example prints a perfect match.

Next, let’s change one word, ‘quick‘ to ‘fast‘.
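
A sketch of this comparison:

# one word different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)  # about 0.75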

Running this example results in a slight drop in score.

Try changing two words, both ‘quick‘ to ‘fast‘ and ‘lazy‘ to ‘sleepy‘.
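
A sketch of this comparison:

# two words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog']
score = sentence_bleu(reference, candidate)
print(score)  # about 0.49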

Running the example, we can see a linear drop in skill.

What about if all words are different in the candidate?
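
A sketch of this case, using arbitrary placeholder tokens for the candidate:

# all words different
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['a', 'completely', 'unrelated', 'sentence', 'with', 'no', 'shared', 'words', 'here']
score = sentence_bleu(reference, candidate)
print(score)  # 0.0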

We get the worst possible score.

Now, let’s try a candidate that has fewer words than the reference (e.g. drop the last two words), but the words are all correct.
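
A sketch of this case, with the last two reference words dropped from the candidate:

# shorter candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the']
score = sentence_bleu(reference, candidate)
print(score)  # about 0.75: every n-gram matches, but the brevity penalty applies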

The score is much like the score when one word was wrong above: every n-gram in the shorter candidate still matches, so only the brevity penalty reduces the score.

How about if we make the candidate two words longer than the reference?
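
A sketch of this case; the two extra tokens ('from', 'space') appended to the candidate are illustrative.

# longer candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', 'from', 'space']
score = sentence_bleu(reference, candidate)
print(score)  # about 0.79: only the n-grams spanning the two extra words are lost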

Again, the score is much like the case where one word was wrong: the brevity penalty does not apply to longer candidates, and only the n-grams that span the two extra words fail to match.

Finally, let’s compare a candidate that is way too short: only two words in length.
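
A sketch of this case:

# very short candidate
from nltk.translate.bleu_score import sentence_bleu
reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick']
score = sentence_bleu(reference, candidate)
print(score)  # a warning plus a score very close to zero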

Running this example first prints a warning message indicating that the 3-gram and above part of the evaluation (up to 4-gram) cannot be performed. This is fair, given the two-word candidate contains nothing longer than a 2-gram.

Next, we see a score that is very low indeed.

I encourage you to continue to play with examples.

The math is pretty simple and I would also encourage you to read the paper and explore calculating the sentence-level score yourself in a spreadsheet.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.

Summary

In this tutorial, you discovered the BLEU score for evaluating and scoring candidate text against reference text in machine translation and other language generation tasks.

Specifically, you learned:

  • A gentle introduction to the BLEU score and an intuition for what is being calculated.
  • How you can calculate BLEU scores in Python using the NLTK library for sentences and documents.
  • How you can use a suite of small examples to develop an intuition for how differences between a candidate and reference text impact the final BLEU score.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.


112 Responses to A Gentle Introduction to Calculating the BLEU Score for Text in Python

  1. Avatar
    ngc December 25, 2017 at 10:08 am #

    Good day Dr. Brownlee, I am wondering if I can use BLEU as a criteria for early stopping?

    • Avatar
      Jason Brownlee December 26, 2017 at 5:12 am #

      Sure.

      • Avatar
        Bashayer January 7, 2020 at 1:41 am #

        Please how can i down load bleu score

        How can i down load bleu score

  2. Avatar
    Sasikanth January 8, 2018 at 5:01 pm #

    Hello Jason,

    wonderful to learn about BLEU (Bilingual Evaluation Understudy). Is there a package in R for this?

    thanks

  3. Avatar
    Davin Chern January 31, 2018 at 6:34 pm #

    Hi Jason,

    Thanks for your excellent introduction about BLEU.

    When I try to calculate the BLEU scores for multiple sentences with corpus_bleu(), I found something strange.

    Suppose I have a paragraph with two sentences, and I try to translate them both, the following are two cases:

    case 1:
    references = [[[‘a’, ‘b’, ‘c’, ‘d’]], [[‘e’, ‘f’, ‘g’]]]
    candidates = [[‘a’, ‘b’, ‘c’, ‘d’], [‘e’, ‘f’, ‘g’]]
    score = corpus_bleu(references, candidates)

    case 2:
    references = [[[‘a’, ‘b’, ‘c’, ‘d’, ‘x’]], [[‘e’, ‘f’, ‘g’, ‘y’]]]
    candidates = [[‘a’, ‘b’, ‘c’, ‘d’, ‘x’], [‘e’, ‘f’, ‘g’, ‘y’]]
    score = corpus_bleu(references, candidates)

    I assume both should give me a result of 1.0, but only the second does, while the first is 0.84. Actually when both sentences have a length of 4 or above, the answer is always 1.0, so I think it is because the second sentence of case 1 has no 4-gram.

    In practice, when dealing with sentence whose length is smaller than 4, do we have to make corpus_bleu() ignore the redundant n-gram cases by setting appropriate weights?

    I appreciate your help!

    • Avatar
      Jason Brownlee February 1, 2018 at 7:18 am #

      Yes, ideally. I’d recommend reporting 1,2,3,4 ngram scores separately as well.

  4. Avatar
    Daniel Pietschmann June 25, 2018 at 12:22 am #

    Dear Jason Brownlee,

    thanks a lot for this awesome tutorial, it really helps a lot!

    Sadly I get an error using this part of the code:
    “# n-gram individual BLEU
    from nltk.translate.bleu_score import sentence_bleu
    reference = [[‘this’, ‘is’, ‘a’, ‘test’]]
    candidate = [‘this’, ‘is’, ‘a’, ‘test’]
    print(‘Individual 1-gram: %f’ % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
    print(‘Individual 2-gram: %f’ % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
    print(‘Individual 3-gram: %f’ % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
    print(‘Individual 4-gram: %f’ % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))”

    This is the error message: “Corpus/Sentence contains 0 counts of 3-gram overlaps.
    BLEU scores might be undesirable; use SmoothingFunction().
    warnings.warn(_msg)”

    I tried adding a smoothing Function:
    “from nltk.translate.bleu_score import SmoothingFunction
    chencherry = SmoothingFunction()
    print(‘Cumulative 1-gram: %f’ % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0), smoothing_function=chencherry.method4))
    print(‘Cumulative 2-gram: %f’ % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0), smoothing_function=chencherry.method4))
    print(‘Cumulative 3-gram: %f’ % sentence_bleu(reference, candidate, weights=(0.33, 0.33, 0.33, 0), smoothing_function=chencherry.method4))
    print(‘Cumulative 4-gram: %f’ % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=chencherry.method4))”

    That helps, now the error message is gone, but no I have different scores from yours:
    “Cumulative 1-gram: 0.750000
    Cumulative 2-gram: 0.500000
    Cumulative 3-gram: 0.358299
    Cumulative 4-gram: 0.286623”

    I don’t really understand what the problem was, and why I get different results now.
    I would be super grateful if you could explain me happened in my code.

    Thanks a lot in advance 🙂

  5. Avatar
    Gaurav Gupta July 24, 2018 at 4:55 am #

    Great tutorial!

  6. Avatar
    sawsan August 18, 2018 at 9:27 pm #

    Thank you ,

  7. Avatar
    Francesco September 7, 2018 at 2:42 am #

    In the lazy fox example, changing quick to fast yields a significant drop in the BLEU score, yet the two sentences mean the same thing.

    I wonder if we can mitigate the effect by using word vectors instead of the words themselves. Are you aware of an algorithm that uses BLEU on word embeddings?

    • Avatar
      Jason Brownlee September 7, 2018 at 8:09 am #

      Agreed.

      Scoring based on meaning is a good idea. I’ve not seen anything on this sorry.

    • Avatar
      Todd September 22, 2020 at 3:36 am #

      Hey, i am thinking just like that..
      have you had any experiment on that regards?i think word embedding is definitely gonna yield more reasonable outcomes.

  8. Avatar
    Aziz October 12, 2018 at 12:10 pm #

    Hi Jason and thanks for a great tutorial.

    I think there is a mistake in the tutorial that contradict with our intuition rather than agreeing with it. The score for having two words shorter or longer is much like the score for having 1 word different rather than 2 words.

    • Avatar
      Jason Brownlee October 13, 2018 at 6:07 am #

      Is it?

      Well, there’s no perfect measure for all cases.

  9. Avatar
    Chen Mei February 1, 2019 at 12:47 am #

    How to calculate ROUGE, CIDEr and METEOR values in Python ?

    • Avatar
      Jason Brownlee February 1, 2019 at 5:39 am #

      Sorry, I don’t have an example of calculating these measures.

      • Avatar
        Zara July 31, 2019 at 7:35 am #

        May you please create a tutorial about how to calculate METEOR, TER, and ROUGE?

        • Avatar
          Jason Brownlee July 31, 2019 at 2:05 pm #

          Great suggestions, thanks!

          • Avatar
            Micky February 12, 2020 at 11:06 am #

            When are you gonna create a tutorial on how to calculate METEOR, TER, and ROUGE Sir?

          • Avatar
            Jason Brownlee February 12, 2020 at 1:36 pm #

            No fixed schedule at this stage.

  10. Avatar
    Dave Howcroft March 8, 2019 at 1:36 am #

    I think it’s misleading to suggest that it makes sense to use BLEU for generation and image captioning. BLEU seems to work well for the task it was designed for (MT eval during development), but there’s not really evidence to support the idea that it’s a good metric for NLG. See, for example, Ehud Reiter’s paper from last year: https://aclanthology.info/papers/J18-3002/j18-3002

    One of the issues is that BLEU is usually calculated using only a few reference texts, but there’s also reason to think that we can’t reasonably expand the set of references to cover enough of the space of valid texts for a given meaning for it to be a good measure (cf. this work on similar metrics for grammatical error correction: https://aclanthology.info/papers/P18-1059/p18-1059).

    • Avatar
      Jason Brownlee March 8, 2019 at 7:53 am #

      I think you might be right, perplexity might be a better measure for language generation tasks.

  11. Avatar
    Madhav April 12, 2019 at 3:54 pm #

    Hi Jason,

    I’m working on Automatic Question Generation. Can I use BLEU as an evaluation metric. If yes, how does it adapt to questions. If not, what other metric would you suggest me?

    • Avatar
      Jason Brownlee April 13, 2019 at 6:21 am #

      Perhaps, or ROGUE or similar scores.

      Perhaps check recent papers on the topic and see what is common.

  12. Avatar
    Shubham May 9, 2019 at 3:04 am #

    >>> from nltk.translate.bleu_score import sentence_bleu
    >>> reference = [[‘this’, ‘is’, ‘small’, ‘test’]]
    >>> candidate = [‘this’, ‘is’, ‘a’, ‘test’]
    >>> print(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
    1.0547686614863434e-154

  13. Avatar
    Justen May 21, 2019 at 5:18 pm #

    I’m also running into the same problem.
    The example you give gives a score of 0.707106781187
    I on the other hand get a extremely strong score of 1.0547686614863434e-154
    What is happening?

    • Avatar
      Justen May 21, 2019 at 5:21 pm #

      Sorry so many typos.
      Shubham gave this code as a example.
      >>> from nltk.translate.bleu_score import sentence_bleu
      >>> reference = [[‘this’, ‘is’, ‘small’, ‘test’]]
      >>> candidate = [‘this’, ‘is’, ‘a’, ‘test’]
      >>> print(sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))
      1.0547686614863434e-154

      You run the same code, but get a bleu score of 0.707106781187
      What is happening?

      • Avatar
        Jason Brownlee May 22, 2019 at 7:43 am #

        I get the same result, perhaps the API has changed recently?

        I will schedule time to update the post.

    • Avatar
      Jason Brownlee May 22, 2019 at 7:39 am #

      That is surprising, is your library up to date? Did you copy all of the code exactly?

      • Avatar
        Justen May 23, 2019 at 12:31 am #

        Yes, the exact same code with the exact same example.
        Because of this ‘bug’ a lot of bleu scores evaluate to zero or nearly zero, like 1.0547686614863434e-154. I’ve yet to find any reason to why this is happening.

    • Avatar
      Sanjita Suresh July 2, 2019 at 3:29 am #

      Could you find some solution to this? even I am facing the same issue

  14. Avatar
    Pavithra June 12, 2019 at 9:00 pm #

    Is it possible to find the BLEU score of a machine translation model ?

  15. Avatar
    Pavithra June 13, 2019 at 1:50 pm #

    Can you please share how to identify the BLEU score of a model. I have build the machine translation model with Moses.

    Thanks.

  16. Avatar
    Sanjita Suresh July 2, 2019 at 3:27 am #

    Thank you for the great tutorial.
    I am getting different bleu scores in Google colab and Jupyter notebook

    prediction_en = ‘A man in an orange hat presenting something’
    reference_en= ‘A man in an orange hat starring at something’

    Used code,

    from nltk.translate.bleu_score import sentence_bleu
    from nltk.translate.bleu_score import SmoothingFunction
    smoother = SmoothingFunction()

    sentence_bleu(reference_en, prediction_en , smoothing_function=smoother.method4)

    For this I am getting a Bleu score of 0.768521 in google colab but I am getting 1.3767075064676063e-231 score in jupyter notebook without smoothing and with smoothing 0.3157039

    Can you please help me what and where I am going wrong?

    • Avatar
      alex September 14, 2019 at 1:22 am #

      Try adding weights to both colab & on local machine..something like this.

      score = sentence_bleu(reference, candidate, weights=(1, 0,0,0))

      I am getting same results

    • Avatar
      Tohida February 23, 2022 at 1:57 am #

      Now in colab it is giving very low score. Can you tell me the reason?
      prediction_en = ‘A man in an orange hat presenting something.’
      reference_en= ‘A man in an orange hat starring at something.’

      from nltk.translate.bleu_score import sentence_bleu
      from nltk.translate.bleu_score import SmoothingFunction
      smoother = SmoothingFunction()

      sentence_bleu(reference_en, prediction_en,smoothing_function=smoother.method4)
      Output:0.013061727262337088

      • Avatar
        James Carmichael February 23, 2022 at 12:21 pm #

        Hi Tohida…Are you experiencing different results within Colab and another environment?

  17. Avatar
    Quang Le July 16, 2019 at 11:20 pm #

    Hi Jason, I am training 2 neural machine translation model (model A and B with different improvements each model) with fairseq-py. When I evaluate model with bleu score, model A BLEU score is 25.9 and model B is 25.7. Then i filtered data by length into 4 range values such as 1 to 10 words, 11 to 20 words, 21 to 30 words and 31 to 40 words. I re-evaluated on each filtered data and all bleu scores of model B is greater than model A. Do you think this is normal case?

    • Avatar
      Jason Brownlee July 17, 2019 at 8:26 am #

      Yes, the fine grained evaluation might be more relevant to you.

  18. Avatar
    Shyam Yadav December 15, 2019 at 6:13 am #

    What is the difference between BLEU-1, BLEU-2, BLEU-3, BLEU-4? Is it 1-gram, 2-gram,…. Another doubt I had in my mind is that what is the difference between weights=(0.25, 0.25, 0.25, 0.25) and weights=(0, 0, 0, 1) for n = 4 BLEU?

    • Avatar
      Jason Brownlee December 16, 2019 at 6:02 am #

      They evaluate different length sequences of words.

      You can see the different in some of the worked examples above.

      • Avatar
        Shyam Yadav December 17, 2019 at 8:50 pm #

        I am still confused on when to use (0,0,0,1) and (0.25, 0.25, 0.25, 0.25).

        • Avatar
          Jason Brownlee December 18, 2019 at 6:04 am #

          Good question.

          Use 0,0,0,1 when you only are about correct 4-grams

          Use 0.25,0.25,0.25,0.25 when you care about 1-,2-,3-,4- grams all with the same weight.

          • Avatar
            Shyam Yadav December 19, 2019 at 4:58 am #

            Ok sir, thank you.

          • Avatar
            shab May 9, 2022 at 7:38 pm #

            why considering unigrams where unigrams just mean the word itself while cumulative ngram scores…is bi, tri or four grams score better than unigram score

          • Avatar
            James Carmichael May 10, 2022 at 12:09 pm #

            Hi Shab…Please rephrase your question so that we may better assist you.

  19. Avatar
    Shyam Yadav December 26, 2019 at 6:20 am #

    Can you tell me how to do calculation for weights = (0.5,0.5,0,0) through pen and paper for any of the reference and predicted?

    • Avatar
      Shyam Yadav December 26, 2019 at 6:22 am #

      I have used this

      references = [[‘the’, ‘quick’, ‘brown’, ‘fox’, ‘jumped’, ‘over’, ‘the’, ‘lazy’, ‘dog’]]
      candidates = [‘the’, ‘quick’, ‘fox’, ‘jumped’, ‘on’, ‘the’, ‘dog’]

      score = sentence_bleu(references, candidates, weights = (0.5,0.5,0,0), smoothing_function=SmoothingFunction().method0)
      print(score)

      and got the output as

      0.4016815092325757

      Can you please tell me how is it done mathematically step by step? Thank you alot

      • Avatar
        Jason Brownlee December 26, 2019 at 7:43 am #

        Yes, you can see the calculation in the paper.

    • Avatar
      Jason Brownlee December 26, 2019 at 7:42 am #

      Great question!

      The paper referenced in the tutorial will show you the calculation.

      • Avatar
        Shyam Yadav December 26, 2019 at 5:54 pm #

        Where is the tutorial and where can I see the calculation in the paper? Can you please give me the link?

        • Avatar
          Jason Brownlee December 27, 2019 at 6:32 am #

          I do not have a tutorial that goes through the calculation in the paper.

          • Avatar
            Shyam Yadav December 28, 2019 at 12:28 am #

            The paper referenced in the tutorial will show you the calculation.

            What about this then? Any link? Or can you provide one tutorial or just a pic of where you can show me the calculation for cumulative bleu score? Please if possible?

          • Avatar
            Jason Brownlee December 28, 2019 at 7:48 am #

            I may cover the topic in the future.

  20. Avatar
    safia February 21, 2020 at 7:03 pm #

    hello Janson,
    Can you please write answer or validate these ( link) assumption about Bleu?
    https://towardsdatascience.com/evaluating-text-output-in-nlp-bleu-at-your-own-risk-e8609665a213

    • Avatar
      Jason Brownlee February 22, 2020 at 6:23 am #

      Sorry, I don’t have the capacity to vet third party tutorials for you.

      Perhaps you can summarize the problem you’re having in a sentence or two?

      • Avatar
        safia March 1, 2020 at 11:58 pm #

        oh, my apologies. Thank you for all the tutorials. It is really a great help for us.

  21. Avatar
    sawsan February 27, 2020 at 4:53 pm #

    please how can i use smooth function with corpus level

    • Avatar
      Jason Brownlee February 28, 2020 at 5:58 am #

      Specify the smoothing when calculating the score after your model has made predictions for your whole corpus.

  22. Avatar
    sawsan March 2, 2020 at 5:06 pm #

    please i apply the equation and example

    from nltk.translate.bleu_score import sentence_bleu
    reference = [[‘the’, ‘cat’, ‘is’, ‘sitting’,’on’, ‘the’ ,’mat’]]
    candidate = [‘on’, ‘the’, ‘mat’, ‘is’,’a’,’cat’]
    score = sentence_bleu(reference, candidate,weights=(0.25, 0.25, 0.25,0.25))
    print(score)
    5.5546715329196825e-78

    but by equation in the blog from in this blog :
    https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4
    i get the score 0.454346419
    bleu=EXP(1-7/6)*EXP(LN(0.83)*0.25+LN(0.4)*0.25+LN(0.25)*0.25)
    why the result is different .
    can you help me please?

    • Avatar
      Jason Brownlee March 3, 2020 at 5:56 am #

      I’m not familiar with that blog, perhaps contact the author directly.

  23. Avatar
    sawsan Asjea March 2, 2020 at 5:47 pm #

    hello Jason.
    i think the result will match when do this .
    from nltk.translate.bleu_score import sentence_bleu
    reference = [[‘the’, ‘cat’, ‘is’, ‘sitting’,’on’, ‘the’ ,’mat’]]
    candidate = [‘on’, ‘the’, ‘mat’, ‘is’,’a’,’cat’]
    score = sentence_bleu(reference, candidate,weights=(0.25, 0.25, 0.25))
    print(score)
    0.4548019047027907

    is it correct to ignore the last weight 0.25, How can I explain this in your opinion?;

  24. Avatar
    amel March 26, 2020 at 4:50 am #

    great tutorial, thank you Jason.

  25. Avatar
    Amrutesh April 14, 2020 at 4:23 am #

    I built a neural net to generate multiple captions
    I’m using flickr8k dataset so I have 5 captions as candidate
    How to generate bleu score for the multiple captions?

  26. Avatar
    Sunny April 25, 2020 at 4:52 am #

    How do i use BLeU to compare the 2 text generation models lets say LSTM and ngram using the generated text generated? what will be the reference in that case?

    • Avatar
      Jason Brownlee April 25, 2020 at 7:04 am #

      Calculate the score for each model on the same dataset, and compare.

      • Avatar
        Sunny April 26, 2020 at 8:11 am #

        Yeah but what will be the reference? I know candidate will be the output text but what will be the reference? If i use all my train set to generate text – will my original text of 50000 rows of text be the reference?

        • Avatar
          Jason Brownlee April 27, 2020 at 5:22 am #

          The expected text output for the test data is used as the reference.

  27. Avatar
    Dooji May 5, 2020 at 6:33 am #

    Hi! thanks for the post. I am using BLEU for evaluating a summarization model. Thus, the sentences generated by my model and the ground truth summary are not aligned and do not have the same count. I want to know if that will be a problem if I wanna use the corpus_bleu. cause in the documentation it seems like each sentence in hyp needs to have a corresponding reference sentence.

    • Avatar
      Jason Brownlee May 5, 2020 at 6:37 am #

      From memory, I think it will fine, as long as there are some n-grams to compare.

      • Avatar
        Dr. Abdulnaser November 21, 2020 at 12:00 pm #

        Thank you Jassin for informative tutorial. Can i use bleu for machine translation post Editing

        • Avatar
          Jason Brownlee November 21, 2020 at 1:05 pm #

          Yes, you can use BLEU for evaluating machine translation models.

  28. Avatar
    ghaith July 22, 2020 at 4:46 am #

    why is loss high ?
    it reach to 2 ?
    i know it must be under 1

  29. Avatar
    Asha October 2, 2020 at 4:15 pm #

    Hello sir,

    Great tutorial !

    Can I calculate BLEU score for translation that handles transliterated text?

  30. Avatar
    DHILIP KUMAR T.P October 16, 2020 at 5:06 am #

    hello jason,

    we can apply bleu score for speech recognition system

    • Avatar
      Jason Brownlee October 16, 2020 at 5:57 am #

      I don’t know off hand sorry, I would guess no, and would recommend you check the literature.

  31. Avatar
    Felipe November 29, 2020 at 3:33 am #

    Hello Dr, Brownlee.

    I want to know if it is normal / correct to use the Bleu Score in the training phase of a Deep Learning Model, or should it only be used in the testing phase?

    I have a deep learning model and three sets of data – training, validation and testing.

  32. Avatar
    Azaz Ur Rehman Butt March 1, 2021 at 9:08 pm #

    Dear Jason, I am working on an image captioning task and the BELU Score I’m getting is under 0.6, will it work fine for my model or I’ll have to improve it?

    • Avatar
      Jason Brownlee March 2, 2021 at 5:43 am #

      Perhaps compare your score to other simpler models on the same dataset to see if it has relative skill.

  33. Avatar
    Stanislav March 17, 2021 at 5:41 am #

    In another sentence_bleu tutorial i’ve noticed, that weights for 3-gram was defined as (1, 0, 1, 0). Can you please clarify this moment, because i have no idea of the purpose of the first digit in the tuple?

    • Avatar
      Jason Brownlee March 17, 2021 at 6:11 am #

      Sorry, I don’t understand your question. Can you please elaborate?

  34. Avatar
    KG17 April 22, 2021 at 11:19 pm #

    This is great information, however I just have a question relating to calculating BLEU scores for entire docuements. The examples you show are with sentences and I am interested in comparing .txt documents with each other. Do you possibly have an example or could explain how I could do this as that is not entirely clear to me from the explaination.
    Thanks so much, any advice would be appreciated!

    • Avatar
      Jason Brownlee April 23, 2021 at 5:04 am #

      Thanks.

      Perhaps averaged over sentences? See the above examples.

  35. Avatar
    srz May 5, 2021 at 5:54 am #

    Hey Jason. Thanks for such a concise and clear explanation.
    However, I’ve been working on language models these days and have observed people getting a bleu score as high as 36 and 50s. How is it even possible if a perfect score is 1?
    An article from google cloud states that a good bleu score is above 50.
    Where am I wrong in understanding?
    Thank You

    • Avatar
      Jason Brownlee May 5, 2021 at 6:14 am #

      You’re welcome.

      Perhaps they are reporting the bleu score multiplied by 100.

      • Avatar
        srz May 6, 2021 at 7:30 am #

        oh, thank you very much. Yes they are presenting as a percentage instead of a decimal number

  36. Avatar
    MUHAMMAD KAMRAN September 2, 2021 at 6:30 pm #

    Hey justin , how r u doing … what if the caption generation model give 0.9 bleu score , is it possible and acceptable or there is something wrong with the model ??

    • Avatar
      Jason Brownlee September 3, 2021 at 5:29 am #

      You must decide whether a given model is appropriate for your specific project or not.

  37. Avatar
    Bambang Setiawan November 15, 2021 at 2:21 pm #

    Hi

    No I’m working on RNN to create a conversation model using Tensorflow/Keras.

    Do you know how do I add BLEU score while compiling the model?

    Thanks

    • Avatar
      Adrian Tam November 16, 2021 at 1:58 am #

      It doesn’t seem to have BLEU in Keras. Probably you need to check if there is a third party implementation, or you need to write your own function to do that.

  38. Avatar
    wajahat February 23, 2022 at 8:24 pm #

    any code implantation of CIDER or Meteor like that.

    • Avatar
      James Carmichael February 24, 2022 at 12:57 pm #

      Hi Wajahat…I am not familiar with either.

  39. Avatar
    Daniel Kleine June 5, 2023 at 10:44 pm #

    “Running this example prints a score of 0.5.”
    -> You mean 0.75, right?
