How to Implement a Beam Search Decoder for Natural Language Processing

By Jason Brownlee on June 3, 2020 in Deep Learning for Natural Language Processing

Natural language processing tasks, such as caption generation and machine translation, involve generating sequences of words.

Models developed for these problems often output a probability distribution over the vocabulary of output words at each step, and it is up to a decoding algorithm to search these distributions for the most likely sequence of words.

In this tutorial, you will discover the greedy search and beam search decoding algorithms that can be used on text generation problems.

After completing this tutorial, you will know:

The problem of decoding on text generation problems.
The greedy search decoder algorithm and how to implement it in Python.
The beam search decoder algorithm and how to implement it in Python.

Let’s get started.

Update May/2020: Fixed bug in the beam search implementation (thanks everyone who pointed it out, and Constantin Weisser for his clean fix)

Decoder for Text Generation

In natural language processing tasks such as caption generation, text summarization, and machine translation, the prediction required is a sequence of words.

It is common for models developed for these types of problems to output a probability distribution over each word in the vocabulary for each word in the output sequence. It is then left to a decoder process to transform the probabilities into a final sequence of words.

You are likely to encounter this when working with recurrent neural networks on natural language processing tasks where text is generated as an output. The final layer in the neural network model has one neuron for each word in the output vocabulary and a softmax activation function is used to output a likelihood of each word in the vocabulary being the next word in the sequence.
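
As a rough illustration of that final layer, the sketch below applies a softmax to a made-up vector of raw scores for a 5-word vocabulary. The softmax() helper and the logits values are assumptions for illustration and do not come from any particular model.

```python
import numpy as np

def softmax(logits):
    # subtract the max before exponentiating for numerical stability
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    # normalize so the values sum to 1.0
    return exps / exps.sum()

# hypothetical raw scores (one per word in a 5-word vocabulary)
logits = np.array([1.0, 2.0, 0.5, 3.0, 0.1])
probs = softmax(logits)
print(probs)
print(probs.sum())
```

The result is one probability per word, and the word with the largest raw score also has the largest probability.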

Decoding the most likely output sequence involves searching through all the possible output sequences based on their likelihood. The size of the vocabulary is often tens or hundreds of thousands of words, or even millions of words. The number of possible output sequences therefore grows exponentially with the length of the output sequence, and searching them all is intractable in practice.

In practice, heuristic search methods are used to return one or more approximate or “good enough” decoded output sequences for a given prediction.

As the size of the search graph is exponential in the source sentence length, we have to use approximations to find a solution efficiently.

— Page 272, Handbook of Natural Language Processing and Machine Translation, 2011.
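
To get a feel for the scale of the problem, the number of candidate sequences can be counted directly. This is a small sketch; count_sequences() is a hypothetical helper and the vocabulary sizes are illustrative.

```python
def count_sequences(vocab_size, length):
    # each of the `length` positions can hold any of `vocab_size` words
    return vocab_size ** length

# a tiny vocabulary of 5 words and an output of 10 words
small = count_sequences(5, 10)
# a modest vocabulary of 10,000 words and an output of 10 words
large = count_sequences(10_000, 10)
print(small)
print(large)
```

Even a 5-word vocabulary gives almost 10 million candidate sequences of length 10, which is why approximate search is used.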

Candidate sequences of words are scored based on their likelihood. It is common to use a greedy search or a beam search to locate candidate sequences of text. We will look at both of these decoding algorithms in this post.

Each individual prediction has an associated score (or probability) and we are interested in output sequence with maximal score (or maximal probability) […] One popular approximate technique is using greedy prediction, taking the highest scoring item at each stage. While this approach is often effective, it is obviously non-optimal. Indeed, using beam search as an approximate search often works far better than the greedy approach.

— Page 227, Neural Network Methods in Natural Language Processing, 2017.

Greedy Search Decoder

A simple approximation is to use a greedy search that selects the most likely word at each step in the output sequence.

This approach has the benefit that it is very fast, but the quality of the final output sequences may be far from optimal.

We can demonstrate the greedy search approach to decoding with a small contrived example in Python.

We can start off with a prediction problem that involves a sequence of 10 words. Each word is predicted as a probability distribution over a vocabulary of 5 words.

# define a sequence of 10 words over a vocab of 5 words
data = [[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1]]
data = array(data)

We will assume that the words have been integer encoded, such that the column index can be used to look-up the associated word in the vocabulary. Therefore, the task of decoding becomes the task of selecting a sequence of integers from the probability distributions.

The argmax() mathematical function can be used to select the index of an array that has the largest value. We can use this function to select the word index that is most likely at each step in the sequence. This function is provided directly in numpy.

The greedy_decoder() function below implements this decoder strategy using the argmax function.

# greedy decoder
def greedy_decoder(data):
	# index for largest probability each row
	return [argmax(s) for s in data]

Putting this all together, the complete example demonstrating the greedy decoder is listed below.

from numpy import array
from numpy import argmax

# greedy decoder
def greedy_decoder(data):
	# index for largest probability each row
	return [argmax(s) for s in data]

# define a sequence of 10 words over a vocab of 5 words
data = [[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1]]
data = array(data)
# decode sequence
result = greedy_decoder(data)
print(result)

Running the example outputs a sequence of integers that could then be mapped back to words in the vocabulary.

[4, 0, 4, 0, 4, 0, 4, 0, 4, 0]
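
The mapping back to words can be sketched as follows, assuming a hypothetical 5-word vocabulary list (vocab) that is not part of the original example. Only the first two rows of the sample data are used to keep the sketch short.

```python
from numpy import array
from numpy import argmax

# hypothetical vocabulary; the list position is the integer encoding
vocab = ["the", "cat", "sat", "on", "mat"]

# first two steps of the contrived data
data = array([[0.1, 0.2, 0.3, 0.4, 0.5],
              [0.5, 0.4, 0.3, 0.2, 0.1]])

# greedy choice at each step, then look up the word for each index
indices = [int(argmax(row)) for row in data]
words = [vocab[i] for i in indices]
print(indices)
print(words)
```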

Beam Search Decoder

Another popular heuristic is the beam search that expands upon the greedy search and returns a list of most likely output sequences.

Instead of greedily choosing the most likely next step as the sequence is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

The local beam search algorithm keeps track of k states rather than just one. It begins with k randomly generated states. At each step, all the successors of all k states are generated. If any one is a goal, the algorithm halts. Otherwise, it selects the k best successors from the complete list and repeats.

— Pages 125-126, Artificial Intelligence: A Modern Approach (3rd Edition), 2009.

We do not need to start with random states; instead, we start with the k most likely words as the first step in the sequence.

Common beam width values are 1, which is equivalent to a greedy search, and 5 or 10 for common benchmark problems in machine translation. Larger beam widths often improve model performance, as keeping multiple candidate sequences increases the likelihood of better matching a target sequence. This improvement comes at the cost of slower decoding.

In NMT, new sentences are translated by a simple beam search decoder that finds a translation that approximately maximizes the conditional probability of a trained NMT model. The beam search strategy generates the translation word by word from left-to-right while keeping a fixed number (beam) of active candidates at each time step. By increasing the beam size, the translation performance can increase at the expense of significantly reducing the decoder speed.

— Beam Search Strategies for Neural Machine Translation, 2017.

The search process can halt for each candidate separately either by reaching a maximum length, by reaching an end-of-sequence token, or by reaching a threshold likelihood.

Let’s make this concrete with an example.

We can define a function to perform the beam search for a given sequence of probabilities and beam width parameter k. At each step, each candidate sequence is expanded with all possible next steps. Each candidate step is scored by multiplying the probabilities together. The k sequences with the most likely probabilities are selected and all other candidates are pruned. The process then repeats until the end of the sequence.

Probabilities are small numbers, and multiplying many small numbers together produces very small numbers. To avoid underflowing the floating point numbers, the natural logarithms of the probabilities are added together instead, which keeps the numbers in a manageable range. Further, it is common to perform the search by minimizing the score, so the negative log of each probability is added to the running score. This final tweak means that we can sort all candidate sequences in ascending order by their score and select the first k as the most likely candidate sequences.
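
The underflow problem can be seen with a quick check. The sketch below assumes a made-up sequence of 400 steps where the chosen word has a probability of 0.1 at every step; the numbers are illustrative only.

```python
from math import log

# hypothetical: 400 steps, each chosen word has probability 0.1
probs = [0.1] * 400

# multiplying the raw probabilities underflows to zero
product = 1.0
for p in probs:
    product *= p

# summing negative log probabilities stays well within float range
neg_log_score = sum(-log(p) for p in probs)

print(product)
print(neg_log_score)
```

Once the product reaches zero, every candidate looks equally bad and the ranking is lost, whereas the summed negative log score still distinguishes between candidates.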

The beam_search_decoder() function below implements the beam search decoder.

# beam search
def beam_search_decoder(data, k):
	sequences = [[list(), 0.0]]
	# walk over each step in sequence
	for row in data:
		all_candidates = list()
		# expand each current candidate
		for i in range(len(sequences)):
			seq, score = sequences[i]
			for j in range(len(row)):
				candidate = [seq + [j], score - log(row[j])]
				all_candidates.append(candidate)
		# order all candidates by score
		ordered = sorted(all_candidates, key=lambda tup:tup[1])
		# select k best
		sequences = ordered[:k]
	return sequences

We can tie this together with the sample data from the previous section and this time return the 3 most likely sequences.

from math import log
from numpy import array

# beam search
def beam_search_decoder(data, k):
	sequences = [[list(), 0.0]]
	# walk over each step in sequence
	for row in data:
		all_candidates = list()
		# expand each current candidate
		for i in range(len(sequences)):
			seq, score = sequences[i]
			for j in range(len(row)):
				candidate = [seq + [j], score - log(row[j])]
				all_candidates.append(candidate)
		# order all candidates by score
		ordered = sorted(all_candidates, key=lambda tup:tup[1])
		# select k best
		sequences = ordered[:k]
	return sequences

# define a sequence of 10 words over a vocab of 5 words
data = [[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1],
		[0.1, 0.2, 0.3, 0.4, 0.5],
		[0.5, 0.4, 0.3, 0.2, 0.1]]
data = array(data)
# decode sequence
result = beam_search_decoder(data, 3)
# print result
for seq in result:
	print(seq)

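Running the example prints the three best integer sequences, each with its score. The score is the negative log likelihood of the sequence, so a lower score means a more likely sequence.

Experiment with different values of k.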

[[4, 0, 4, 0, 4, 0, 4, 0, 4, 0], 6.931471805599453]
[[4, 0, 4, 0, 4, 0, 4, 0, 4, 1], 7.154615356913663]
[[4, 0, 4, 0, 4, 0, 4, 0, 3, 0], 7.154615356913663]
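
The contrived data above uses distributions that do not depend on which words were chosen earlier, so the best beam search result always matches the greedy result. In a real model, the distribution for the next word depends on the words already generated. The sketch below is one possible way to adapt the decoder to that setting, and it is not the implementation used in this tutorial. It assumes a hypothetical toy_next_probs() function standing in for a trained model, a hypothetical end-of-sequence id EOS, and a max_len limit; all three are invented for illustration.

```python
from math import log

EOS = 2  # hypothetical end-of-sequence token id

def toy_next_probs(prefix):
    # hypothetical stand-in for a trained model: P(next token | prefix)
    if not prefix:
        return [0.6, 0.4, 0.0]
    if prefix[-1] == 0:
        return [0.4, 0.3, 0.3]
    return [0.05, 0.05, 0.9]

def conditional_beam_search(next_probs, k, max_len):
    sequences = [([], 0.0)]
    for _ in range(max_len):
        all_candidates = []
        for seq, score in sequences:
            # finished candidates are carried forward unchanged
            if seq and seq[-1] == EOS:
                all_candidates.append((seq, score))
                continue
            # ask the model for a distribution conditioned on this prefix
            for j, p in enumerate(next_probs(seq)):
                if p <= 0.0:
                    continue  # skip impossible tokens; also avoids log(0)
                all_candidates.append((seq + [j], score - log(p)))
        # keep the k candidates with the lowest negative log likelihood
        sequences = sorted(all_candidates, key=lambda c: c[1])[:k]
        # stop early once every surviving candidate has ended
        if all(seq[-1] == EOS for seq, _ in sequences):
            break
    return sequences

greedy = conditional_beam_search(toy_next_probs, k=1, max_len=3)
beam = conditional_beam_search(toy_next_probs, k=2, max_len=3)
print(greedy[0])
print(beam[0])
```

With this toy table, a beam width of 1 follows the locally best first word and returns [0, 0, 0], whereas a beam width of 2 ranks the shorter sequence [1, 2] first because its overall probability (0.4 * 0.9 = 0.36) is higher than 0.6 * 0.4 * 0.4 = 0.096.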

Summary

In this tutorial, you discovered the greedy search and beam search decoding algorithms that can be used on text generation problems.

Specifically, you learned:

The problem of decoding on text generation problems.
The greedy search decoder algorithm and how to implement it in Python.
The beam search decoder algorithm and how to implement it in Python.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

51 Responses to How to Implement a Beam Search Decoder for Natural Language Processing

Noam January 5, 2018 at 11:04 pm #

This python/numpy example is really misleading. It inherently assumes that the generation of words is completely independent, or alternatively – that P(w_{t}|seq_{1:t-1}) = P(w_{t}).

In this scenario, the result of greedy search will *always* be the same as the best result of the beam search.

I suggest to fix the
score = best_score[“i prev”] + -log P(word[i]|next)
to –
score = best_score[“i prev”] +-log P(next|prev) + -log P(word[i]|next)

with a real RNN decoding example

Reply
- Jason Brownlee January 6, 2018 at 5:53 am #
  
  Thanks Noam.
  
  Reply
  - Shea February 18, 2018 at 11:06 am #
    
    I found that confusing too. I’m trying to understand why you wouldn’t always just choose the first candidate of the beam search (ie. just do the greedy search which is faster).
    
    Or perhaps there is some other way that you would determine that the not-best-greedy-search-score-candidate is actually the best candidate?
    
    I’ll have to think a bit on @Noam’s suggestion.
    
    Reply
    - Shea February 18, 2018 at 11:08 am #
      
      Sorry I meant to say:
      
      Or perhaps there is some other way that you would determine that the not-best-BEAM-search-score-candidate is actually the best candidate?
      
      Reply
      - Jason Brownlee February 19, 2018 at 9:01 am #
        
        The idea is to search the n-best paths through the probability sequence.
    - francis March 8, 2023 at 5:59 pm #
      
      What’s the closeness of that in transformer models?
      
      Reply
- Max April 3, 2018 at 6:46 am #
  
  So Noam, under your comments, -log P(next|prev) means the -log of the probability of the next char given the previous char, correct? For example, say you're predicting characters of a word, and predict an ‘r’. The next top 3 predictions after this ‘r’ are ‘r’, ‘a’, and ‘t’. Now, we have P(next|’r’), and most likely P(‘r’|’r’) and P(‘t’|’r’) will both be small so that the beam search correctly chooses P(‘a’|’r’) and outputs ‘a’ after our ‘r’, correct?
  
  Reply
- Lawrence April 10, 2018 at 1:12 pm #
  
  Totally agree with you. I know a simple example is helpful for understanding, but an over-simplified example like this only misleads readers.
  
  Reply
- Stefan October 25, 2018 at 6:38 pm #
  
  So it’s not just me 😀
  
  I was also confused by this as I realized beam_search with k>1 will still deliver always the same as greedy search.
  
  Reply
- RobrechtM February 22, 2019 at 5:03 am #
  
  Noam, is it possible to compute P(next|prev) out of the given probability distributions? I’m trying to implement beam search in an encoder-decoder architecture, but I think this is not possible without modifying the decoder.
  
  Reply
  - Noam November 6, 2020 at 1:10 am #
    
    Sorry just saw it now. I am not sure about implementation, but the idea is that you save not only the last word emitted but also the state (k copies of the state sequence). The state itself embeds the P(next|prev) so for example, predicting w_k from the state_best that w_{k-1} was, is different than predicting w_k from the state that an inferior candidate for the w_{k-1} state.
    
    see https://www.youtube.com/watch?v=RLWuzLLSIgw
    for more info
    
    Reply
Catherine March 6, 2018 at 8:17 pm #

Sir,

Can you please give me suggestions to do research work in machine learning.
I need the problem or specific research current trend in machine learning,

Regards
Catherine.

Reply
- Jason Brownlee March 7, 2018 at 6:11 am #
  
  Sorry, i cannot help you with your research topic.
  
  Reply
Surendra March 14, 2018 at 12:54 am #

How to combine and use Beam Search with ARPA based language model?

Reply
- Jason Brownlee March 14, 2018 at 6:26 am #
  
  What is ARPA?
  
  Reply
Surendra March 15, 2018 at 1:23 am #

ARPA is a format that is used to represent all possible word sequences from a corpus of text data. The ARPA file lists each possible word sequence and its statistically estimated language probability tagged to it.

The following link describes ARPA format in detail
https://cmusphinx.github.io/wiki/arpaformat/

I was wondering if ARPA file would be useful to select the best sequence from the output of beam search?

Reply
- Jason Brownlee March 15, 2018 at 6:31 am #
  
  Thanks.
  
  It might be. I am not across it.
  
  Reply
vikas dixit March 19, 2018 at 8:13 pm #

Sir is there any other heuristic or meta heuristic search algorithm which can replace beam search decoder ??

Reply
- Jason Brownlee March 20, 2018 at 6:15 am #
  
  Sure, you could use other search strategies.
  
  Reply

Phillip Glau April 28, 2018 at 12:44 pm #

I’m not sure I understand why the example multiplies log probabilities. Aren’t log probabilities normally added together to get the equivalent of multiplied real-value scalar probabilities?

0.5 * 0.5 * 0.25 = 0.0625

log(0.5) + log(0.5) + log(0.25) = -2.772588722239781
exp(-2.772588722239781) == 0.0625

whereas:
log(0.5) * log(0.5) * log(0.25) = -0.6660493039778589
exp(-0.6660493039778589) =~ 0.51374 != 0.0625 ??

Linda June 7, 2018 at 3:30 am #

I think Philip Glau is correct. We’re supposed to add log probabilities and not multiply them.

Another issue with the code is the line
” for j in range(len(row))”

You’re iterating over all the data in row. In a typical text generation problem, there is a huge amount of words in a vocabulary, we probably don’t want to iterate over all of them? Rather we would be interested in only top k probabilities.

Reply
TheRightStef July 13, 2018 at 4:41 am #

+1 For Phillip’s comment.

Another problem is that log(1) = 0, so prod( log(p_i) ) ~ 0 for any sequence containing any character that is predicted with probability close to 1.0.

Ultimately, this leading to a pathological degeneracy of solutions during inference. In my results, it manifested itself as long runs of newline characters.

Reply

Chandresh Kumar Maurya December 11, 2019 at 9:10 pm #

The mathematical calculation of phillips is correct. But, if you do the calculation manually on a sample logits say [[1,2,3],[2,1,3],[3,1,2]], you will get correct ans. with score*log(prob.) and not with score+log(prob.) When I say the correct answer, it means that best score path of the beam search is same as greedy. Manual calculation of the above logits yields decoded sequence as [2,2,0] (score=27),[2,0,0,] (score=18), [1,2,0] (score=18). Which you can get via score*log formula. Further note that answer should not change whether you work with probability distribution or just the logits.

Chandresh Kumar Maurya December 11, 2019 at 11:02 pm #

Correction to my comment above: to maximize the log prob, you set score=0 and then score+log(prob). While sorting, set reverse=True in sorted(). Now, it will give the ans. as in my first comment.

Chandresh Kumar Maurya December 11, 2019 at 11:13 pm #

Implementation of the improved beam search

import numpy as np
import tensorflow as tf

def beam_search_decoder(data, k):
    # first convert logits to probabilities so that all numbers are +ve
    data = tf.nn.softmax(data)
    sequences = [[list(), 0.0]]
    # walk over each step in sequence
    for row in data:
        all_candidates = list()
        # expand each current candidate
        for i in range(len(sequences)):
            seq, score = sequences[i]
            # instead of exploring all the labels (for j in range(len(row))),
            # explore only the k best at the current time step
            best_k = np.argsort(row)[-k:]
            # explore k best
            for j in best_k:
                candidate = [seq + [j], score + tf.math.log(row[j])]
                all_candidates.append(candidate)
        # order all candidates by score (highest log probability first)
        ordered = sorted(all_candidates, key=lambda tup:tup[1], reverse=True)
        # select k best
        sequences = ordered[:k]
    return sequences

Ian Derrington July 30, 2018 at 9:22 am #

+1 for Phillips comment Definitely an error. Multiplying probabilities is equivalent to adding log-probabilities.

Reply
- Jason Brownlee July 30, 2018 at 2:15 pm #
  
  Thanks Ian.
  
  Reply
  - Thomas L. Packer May 10, 2020 at 11:29 am #
    
    Might want to fix the text, for those who do not read all the comments: “therefore, the negative log of the probabilities are multiplied”
    
    Reply
Mmed October 24, 2018 at 10:41 pm #

What should be done if one of the probabilities in the data array is a zero? There is a math error because of log(0).

Reply
- Jason Brownlee October 25, 2018 at 7:55 am #
  
  Good question, add a small float to the value before calculating the log, e.g. 1e-9
  
  Reply
Mmed November 7, 2018 at 9:38 am #

Hello Dr. Brownlee,

I am trying this on different data.

For some reason, I get almost identical captions with a probability of -0.0?

Any suggestions? I am combining this with the inference section from your caption generator https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/

Reply
Mmed January 23, 2019 at 10:59 am #

Hello Dr. Brownlee,

Isn’t a higher log likelihood better? If so, then wouldn’t that mean that the 3rd ranked sequence is the best in this example? Because it has the highest log likelihood?

[[4, 0, 4, 0, 4, 0, 4, 0, 4, 0], 0.025600863289563108]
[[4, 0, 4, 0, 4, 0, 4, 0, 4, 1], 0.03384250043584397]
[[4, 0, 4, 0, 4, 0, 4, 0, 3, 0], 0.03384250043584397]

Why is the order in reverse?

Reply
Seth Stewart March 23, 2019 at 10:40 am #

Question: Why do you multiply logarithms instead of adding them, since the relevant property of the logarithm is that the product of two numbers is equivalently expressed by summing their logarithms?

Reply
- Seth Stewart March 23, 2019 at 10:42 am #
  
  Just saw this was already answered; my mind must be tired.
  
  Reply
Jay Urbain April 18, 2019 at 5:01 am #

Thanks. Can you recommend an example Keras seq2seq learning with beam search coding example? Would be very helpful.

Reply
- Jason Brownlee April 18, 2019 at 8:55 am #
  
  No, sorry, I don’t have an example.
  
  Reply
Di June 28, 2019 at 5:54 pm #

Dear Jason,
thank you very much for your tutorial.

I am working on a chatbot system in PyTorch and I would implement beam_search strategy.

During validation and testing, I use a batch size of 1, so my system sees only a sequence at time.

I have an encoder, which receives the source sequence, encodes this to a context vector and returns its internal states. These states are used to initialize the decoder internal states. The decoder first word is my start-of-sequence token.

After looping for a given max length, the decoder returns some results, which are passed through a Dense Layer and then through a log_softmax operation which gives me predictions.

Now, suppose that I have a max length of 10 words, a mini-vocabulary of 50 words, as a result I have: [10, batch size, 50].

I could basically take [10, 50] and pass this to your function, retrieving the best candidates? That means, the system is only run once on the source sequence, then we only search on its results?

Reply
- Jason Brownlee June 29, 2019 at 6:39 am #
  
  Generally, yes. Specifically, perhaps try it?
  
  Reply
Cassandra January 23, 2020 at 5:08 am #

I think it should read “natural logarithm of the probabilities are added together”. Adding log probabilities is equivalent to multiplying probabilities.

Reply
Wenjing Liu March 11, 2020 at 3:42 pm #

Hello Dr. Brownlee. I found without storing hidden states at step t-1 for step t, an RNN decoder won’t generate proper sequences. This is my beam search decoder implementation of a CNN-LSTM model for image captioning tasks. https://colab.research.google.com/drive/1-XV3yQhhslY144A5RHJrfWqILOv2iipv?authuser=2#scrollTo=eKtn21uj_K1z&line=7&uniqifier=1

Reply
- Jason Brownlee March 12, 2020 at 8:38 am #
  
  Well done.
  
  Reply
Jovan93 May 25, 2020 at 9:17 am #

Hi Jason, this has been very informative article. I would like to ask what other heuristic algorithms can be used besides greedy and beam search?

Thanks

Reply
- Jason Brownlee May 25, 2020 at 1:24 pm #
  
  Thanks.
  
  Good question, perhaps check the literature.
  
  Reply
may June 19, 2020 at 3:13 pm #

could you explain about trajectory beam search with example like this?

Reply
- Jason Brownlee June 20, 2020 at 6:06 am #
  
  Thanks for the suggestion.
  
  Reply

Mahmoud Wahdan July 4, 2020 at 10:39 pm #

Hi Jason,

I think there is a better implementation for beam_search_decoder.
Consider that we are talking about decoding output from Machine Translation model, where the input data is n * m where n is number of words generated by the MT model and m is the number of words in the target language vocabulary.
Your algorithm takes O(n m k + n m log(m)) and because m will be very large (e.g. 20,000 words), we can say that your algorithm will take O(n m log(m))

Below is my implementation and it takes only O(n m + n k^2 + n k log(k)) which is O(n m)

import numpy as np
import math
import heapq

def beam_search_decoder(data, k):
    """
    data: (n, m) where n is number of words in sequence.
        and m is number of classes (words in target vocab).
    k: beam search parameter
    """
    sequences = [[[], 0.0]]
    # walk over each step in sequence
    for row in data: # ----> n
        all_candidates = []
        # find the indexes of k largest probabilities in the row
        k_largest = heapq.nlargest(k, range(len(row)), row.take) # -----> m
        # expand each current candidate
        for seq, score in sequences: # ----> k
            for j in k_largest: # -----> k
                s = score - math.log(row[j])
                candidate = [seq + [j], s]
                all_candidates.append(candidate)
        # sort all candidates by score
        ordered = sorted(all_candidates, key=lambda tup:tup[1]) # -----> k log k
        # select best k
        sequences = ordered[:k]
    return sequences

The output of this implementation is the same as the output of your implementation.
And this was asserted using many trials of random generated input of size (100, 20000) and k=3 this implementation is +15x faster.

The idea is that you don’t need to go through the scores/probabilities of each target class every time, instead you need to take the k largest scores only.

Jason Brownlee July 5, 2020 at 7:04 am #

Thanks for sharing.

Reply

Lawrence Xu October 26, 2020 at 8:50 pm #

HI Jason,

The simplest and easiest to understand tutorial on beam search I have ever read.

On line 15 candidate = [seq + [j], score – log(row[j])].

Since log A + Log B = Log (AB)

why we have “score – log(row[j])]” not ” score + log(row[j])].”

Reply
- Jason Brownlee October 27, 2020 at 6:43 am #
  
  Thanks.
  
  Reply
Hadi November 27, 2020 at 6:16 am #

Hello Jason,
As always it was really helpful.
I have a question regarding the beam search. Can we have a manual beam search? For instance, knowing the output should “thank you”, can we say what is its beam score?
Thanks

Reply
- Jason Brownlee November 27, 2020 at 6:45 am #
  
  Sure. As in the sum log probability as the score.
  
  Reply
