How to Implement a Beam Search Decoder for Natural Language Processing

Natural language processing tasks, such as caption generation and machine translation, involve generating sequences of words.

Models developed for these problems often operate by generating probability distributions across the vocabulary of output words and it is up to decoding algorithms to sample the probability distributions to generate the most likely sequences of words.

In this tutorial, you will discover the greedy search and beam search decoding algorithms that can be used on text generation problems.

After completing this tutorial, you will know:

  • The problem of decoding on text generation problems.
  • The greedy search decoder algorithm and how to implement it in Python.
  • The beam search decoder algorithm and how to implement it in Python.

Let’s get started.

How to Implement Beam Search Decoder for Natural Language Processing

How to Implement Beam Search Decoder for Natural Language Processing
Photo by See1,Do1,Teach1, some rights reserved.

Decoder for Text Generation

In natural language processing tasks such as caption generation, text summarization, and machine translation, the prediction required is a sequence of words.

It is common for models developed for these types of problems to output a probability distribution over each word in the vocabulary for each word in the output sequence. It is then left to a decoder process to transform the probabilities into a final sequence of words.

You are likely to encounter this when working with recurrent neural networks on natural language processing tasks where text is generated as an output. The final layer in the neural network model has one neuron for each word in the output vocabulary and a softmax activation function is used to output a likelihood of each word in the vocabulary being the next word in the sequence.

Decoding the most likely output sequence involves searching through all the possible output sequences based on their likelihood. The size of the vocabulary is often tens or hundreds of thousands of words, or even millions of words. Therefore, the search problem is exponential in the length of the output sequence and is intractable (NP-complete) to search completely.

In practice, heuristic search methods are used to return one or more approximate or “good enough” decoded output sequences for a given prediction.

As the size of the search graph is exponential in the source sentence length, we have to use approximations to find a solution efficiently.

— Page 272, Handbook of Natural Language Processing and Machine Translation, 2011.

Candidate sequences of words are scored based on their likelihood. It is common to use a greedy search or a beam search to locate candidate sequences of text. We will look at both of these decoding algorithms in this post.

Each individual prediction has an associated score (or probability) and we are interested in output sequence with maximal score (or maximal probability) […] One popular approximate technique is using greedy prediction, taking the highest scoring item at each stage. While this approach is often effective, it is obviously non-optimal. Indeed, using beam search as an approximate search often works far better than the greedy approach.

— Page 227, Neural Network Methods in Natural Language Processing, 2017.

Need help with Deep Learning for Text Data?

Take my free 7-day email crash course now (with code).

Click to sign-up and also get a free PDF Ebook version of the course.

Start Your FREE Crash-Course Now

Greedy Search Decoder

A simple approximation is to use a greedy search that selects the most likely word at each step in the output sequence.

This approach has the benefit that it is very fast, but the quality of the final output sequences may be far from optimal.

We can demonstrate the greedy search approach to decoding with a small contrived example in Python.

We can start off with a prediction problem that involves a sequence of 10 words. Each word is predicted as a probability distribution over a vocabulary of 5 words.

We will assume that the words have been integer encoded, such that the column index can be used to look-up the associated word in the vocabulary. Therefore, the task of decoding becomes the task of selecting a sequence of integers from the probability distributions.

The argmax() mathematical function can be used to select the index of an array that has the largest value. We can use this function to select the word index that is most likely at each step in the sequence. This function is provided directly in numpy.

The greedy_decoder() function below implements this decoder strategy using the argmax function.

Putting this all together, the complete example demonstrating the greedy decoder is listed below.

Running the example outputs a sequence of integers that could then be mapped back to words in the vocabulary.

Beam Search Decoder

Another popular heuristic is the beam search that expands upon the greedy search and returns a list of most likely output sequences.

Instead of greedily choosing the most likely next step as the sequence is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

The local beam search algorithm keeps track of k states rather than just one. It begins with k randomly generated states. At each step, all the successors of all k states are generated. If any one is a goal, the algorithm halts. Otherwise, it selects the k best successors from the complete list and repeats.

— Pages 125-126, Artificial Intelligence: A Modern Approach (3rd Edition), 2009.

We do not need to start with random states; instead, we start with the k most likely words as the first step in the sequence.

Common beam width values are 1 for a greedy search and values of 5 or 10 for common benchmark problems in machine translation. Larger beam widths result in better performance of a model as the multiple candidate sequences increase the likelihood of better matching a target sequence. This increased performance results in a decrease in decoding speed.

In NMT, new sentences are translated by a simple beam search decoder that finds a translation that approximately maximizes the conditional probability of a trained NMT model. The beam search strategy generates the translation word by word from left-to-right while keeping a fixed number (beam) of active candidates at each time step. By increasing the beam size, the translation performance can increase at the expense of significantly reducing the decoder speed.

Beam Search Strategies for Neural Machine Translation, 2017.

The search process can halt for each candidate separately either by reaching a maximum length, by reaching an end-of-sequence token, or by reaching a threshold likelihood.

Let’s make this concrete with an example.

We can define a function to perform the beam search for a given sequence of probabilities and beam width parameter k. At each step, each candidate sequence is expanded with all possible next steps. Each candidate step is scored by multiplying the probabilities together. The k sequences with the most likely probabilities are selected and all other candidates are pruned. The process then repeats until the end of the sequence.

Probabilities are small numbers and multiplying small numbers together creates very small numbers. To avoid underflowing the floating point numbers, the natural logarithm of the probabilities are multiplied together, which keeps the numbers larger and manageable. Further, it is also common to perform the search by minimizing the score, therefore, the negative log of the probabilities are multiplied. This final tweak means that we can sort all candidate sequences in ascending order by their score and select the first k as the most likely candidate sequences.

The beam_search_decoder() function below implements the beam search decoder.

We can tie this together with the sample data from the previous section and this time return the 3 most likely sequences.

Running the example prints both the integer sequences and their log likelihood.

Experiment with different k values.

Further Reading

This section provides more resources on the topic if you are looking to go deeper.


In this tutorial, you discovered the greedy search and beam search decoding algorithms that can be used on text generation problems.

Specifically, you learned:

  • The problem of decoding on text generation problems.
  • The greedy search decoder algorithm and how to implement it in Python.
  • The beam search decoder algorithm and how to implement it in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

Develop Deep Learning models for Text Data Today!

Deep Learning for Natural Language Processing

Develop Your Own Text models in Minutes

…with just a few lines of python code

Discover how in my new Ebook:
Deep Learning for Natural Language Processing

It provides self-study tutorials on topics like:
Bag-of-Words, Word Embedding, Language Models, Caption Generation, Text Translation and much more…

Finally Bring Deep Learning to your Natural Language Processing Projects

Skip the Academics. Just Results.

Click to learn more.

18 Responses to How to Implement a Beam Search Decoder for Natural Language Processing

  1. Noam January 5, 2018 at 11:04 pm #

    This python/numpy example is really misleading. It inherently assumes that the generation of words is completely independent, or alternatively – that P(w_{t}|seq_{1:t-1}) = P(w_{t}).

    In this scenario, the results of greedy-search will *always* be the same of the best results of the beam-search.

    I suggest to fix the
    score = best_score[“i prev”] + -log P(word[i]|next)
    to –
    score = best_score[“i prev”] +-log P(next|prev) + -log P(word[i]|next)

    with a real RNN decoding example

    • Jason Brownlee January 6, 2018 at 5:53 am #

      Thanks Noam.

      • Shea February 18, 2018 at 11:06 am #

        I found that confusing too. I’m trying to understand why you wouldn’t always just choose the first candidate of the beam search (ie. just do the greedy search which is faster).

        Or perhaps there is some other way that you would determine that the not-best-greedy-seach-score-candidate is actually the best candidate?

        I’ll have to think a bit on @Noam’s suggestion.

        • Shea February 18, 2018 at 11:08 am #

          Sorry I meant to say:

          Or perhaps there is some other way that you would determine that the not-best-BEAM-seach-score-candidate is actually the best candidate?

          • Jason Brownlee February 19, 2018 at 9:01 am #

            The idea is to search the n-best paths through the probability sequence.

    • Max April 3, 2018 at 6:46 am #

      So Noam, under your comments, -log P(next|prev) means the -log of the probability of the next char given the previous char, correct? For example say your predicting characters of a word, and predict an ‘r’. The next top 3 predictions after this ‘r’ are ‘r’, ‘a’, and ‘t’. Now, we have P(next|’r’), and most likely P(‘r’|’r’) and P(‘t’|’r’) will both be small so that the beam search correctly chooses P(‘a’|’r’) and outputs ‘a’ after our ‘r’, correct?

    • Lawrence April 10, 2018 at 1:12 pm #

      Totally agree with you. I know a simple example is helpful to understanding, but an over-simplified example like this only mislead readers.

  2. Catherine March 6, 2018 at 8:17 pm #


    Can you please give me suggestions to do research work in machine learning.
    I need the problem or specific research current trend in machine learning,


    • Jason Brownlee March 7, 2018 at 6:11 am #

      Sorry, i cannot help you with your research topic.

  3. Surendra March 14, 2018 at 12:54 am #

    How to combine and use Beam Search with ARPA based language model?

  4. Surendra March 15, 2018 at 1:23 am #

    ARPA is a format that is used to represent all possible word sequences from a corpus of text data. The ARPA file lists each possible word sequence and its statistically estimated language probability tagged to it.

    The following link describes ARPA format in detail

    I was wondering if ARPA file would be useful to select the best sequence from the output of beam search?

  5. vikas dixit March 19, 2018 at 8:13 pm #

    Sir is there any other heuristic or meta heuristic search algorithm which can replace beam search decoder ??

  6. Phillip Glau April 28, 2018 at 12:44 pm #

    I’m not sure I understand why the example multiplies log probabilities. Aren’t log probabilities normally added together to get the equivalent of multiplied real-value scalar probabilities?

    0.5 * 0.5 * 0.25 = 0.0625

    log(0.5) + log(0.5) + log(0.25) = -2.772588722239781
    exp(-2.772588722239781) == 0.0624

    log(0.5) * log(0.5) * log(0.25) = -0.6660493039778589
    exp(-0.6660493039778589) =~ 0.51374 != 0.625 ??

    • Linda June 7, 2018 at 3:30 am #

      I think Philip Glau is correct. We’re supposed to add log probabilities and not multiply them.

      Another issue with the code is the line
      ” for j in range(len(row))”

      You’re iterating over all the data in row. In a typical text generation problem, there is a huge amount of words in a vocabulary, we probably don’t want to iterate over all of them? Rather we would be interested in only top k probabilities.

    • TheRightStef July 13, 2018 at 4:41 am #

      +1 For Phillip’s comment.

      Another problem is that log(1) = 0, so prod( log(p_i) ) ~ 0 for any sequence containing any character that is predicted with probability close to 1.0.

      Ultimately, this leading to a pathological degeneracy of solutions during inference. In my results, it manifested itself as long runs of newline characters.

Leave a Reply