# Learn to Add Numbers with an Encoder-Decoder LSTM Recurrent Neural Network

Last Updated on August 27, 2020

Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN) that are capable of learning the relationships between elements in an input sequence.

A good demonstration of LSTMs is to learn how to combine multiple terms using a mathematical operation, such as a sum, and output the result of the calculation.

A common mistake made by beginners is to simply learn the mapping function from input term to the output term. A good demonstration of LSTMs on such a problem involves learning the sequenced input of characters (“50+11”) and predicting the sequence output in characters (“61”). This hard problem can be learned with LSTMs using the sequence-to-sequence, or seq2seq (encoder-decoder), stacked LSTM configuration.

In this tutorial, you will discover how to address the problem of adding sequences of randomly generated integers using LSTMs.

After completing this tutorial, you will know:

• How to learn the naive mapping function of input terms to output terms for addition.
• How to frame the addition problem (and similar problems) and suitably encode inputs and outputs.


Let’s get started.

• Update Aug/2018: Fixed typos in description of model configuration.

How to Learn to Add Numbers with seq2seq Recurrent Neural Networks
Photo by Lima Pix, some rights reserved.

## Tutorial Overview

This tutorial is divided into 3 parts; they are:

1. Adding Numbers
2. Addition as a Mapping Problem (the beginner’s mistake)
3. Addition as a seq2seq Problem

### Environment

This tutorial assumes a Python 2 or Python 3 development environment with SciPy, NumPy, and Pandas installed.

The tutorial also assumes scikit-learn and Keras v2.0+ are installed with either the Theano or TensorFlow backend.

## Adding Numbers

The task is, given a sequence of randomly selected integers, to return the sum of those integers.

For example, given 10 + 5, the model should output 15.

The model is to be both trained and tested on randomly generated examples so that the general problem of adding numbers is learned, rather than memorization of specific cases.


## Addition as a Mapping Problem (the beginner’s mistake)

In this section, we will work through the problem and solve it using an LSTM and show how easy it is to make the beginner’s mistake and not harness the power of recurrent neural networks.

### Data Generation

Let’s start off by defining a function to generate sequences of random integers and their sum as training and test data.

We can use the randint() function to generate random integers between a min and max value, such as between 1 and 100. We can then sum the sequence. This process can be repeated a fixed number of times to create pairs of input sequences of numbers and matching output summed values.

For example, this snippet will create 100 examples of adding 2 numbers between 1 and 100:
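A minimal sketch of such a snippet is shown here, using only the standard library. The variable names (n_examples, n_numbers, largest) and the fixed seed are assumptions made for illustration.

```python
from random import seed, randint

seed(1)  # fix the seed so the run is repeatable
n_examples, n_numbers, largest = 100, 2, 100

X, y = [], []
for _ in range(n_examples):
    # one input pattern is a list of random integers in [1, largest]
    in_pattern = [randint(1, largest) for _ in range(n_numbers)]
    X.append(in_pattern)
    # the matching output is simply their sum
    y.append(sum(in_pattern))

for in_pattern, out_pattern in zip(X, y):
    print(in_pattern, out_pattern)
```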

Running the example will print each input-output pair.

Once we have the patterns, we can convert the lists to NumPy arrays and rescale the values. We must rescale the values to fit within the bounds of the activation used by the LSTM.

For example:
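The sketch below shows one way the rescaling could work. It assumes the divisor is the largest possible sum (largest * n_numbers), so that inputs and outputs both land in the range 0 to 1. Plain lists are used instead of NumPy arrays to keep the sketch dependency-free; the sample values are made up.

```python
largest, n_numbers = 100, 2

# two hand-picked input patterns and their sums (illustrative values only)
X = [[3, 59], [10, 76]]
y = [sum(pattern) for pattern in X]

# assumed scaling factor: the largest sum that can ever occur
scale = float(largest * n_numbers)

X_scaled = [[value / scale for value in pattern] for pattern in X]
y_scaled = [total / scale for total in y]

print(X_scaled)
print(y_scaled)
```

With NumPy, the same idea would be a single division of the array by the scale factor.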

Putting this all together, we can define the function random_sum_pairs() that takes a specified number of examples, a number of integers in each sequence, and the largest integer to generate and return X, y pairs of data for modeling.

We may want to invert the rescaling of numbers later. This will be useful to compare predictions to expected values and get an idea of an error score in the same units as the original data.

The invert() function below inverts the normalization of predicted and expected values passed in.
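As a rough sketch of how random_sum_pairs() and invert() might fit together, the version below uses plain Python lists. The scaling factor (largest * n_numbers) and the rounding in invert() are assumptions, not necessarily the exact choices of the original listing.

```python
from random import seed, randint

def random_sum_pairs(n_examples, n_numbers, largest):
    """Return scaled input patterns X and scaled sums y."""
    X, y = [], []
    for _ in range(n_examples):
        in_pattern = [randint(1, largest) for _ in range(n_numbers)]
        X.append(in_pattern)
        y.append(sum(in_pattern))
    # assumed normalization: divide by the largest possible sum
    scale = float(largest * n_numbers)
    X = [[value / scale for value in pattern] for pattern in X]
    y = [total / scale for total in y]
    return X, y

def invert(value, n_numbers, largest):
    """Map a scaled value back to the original integer units."""
    return int(round(value * float(largest * n_numbers)))

seed(1)
X, y = random_sum_pairs(5, 2, 100)
sums = [invert(total, 2, 100) for total in y]
print(sums)
```

Inverting both the expected and the predicted values this way lets an error score be reported in the units of the original sums.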

### Configure LSTM

We can now define an LSTM to model this problem.

It’s a relatively simple problem, so the model does not need to be very large. The input layer will expect 1 input feature and 2 time steps (in the case of adding two numbers).

Two hidden LSTM layers are defined, the first with 6 units and the second with 2 units, followed by a fully connected output layer that returns a single sum value.

The efficient ADAM optimization algorithm is used to fit the model along with the mean squared error loss function given the real valued output of the network.

The network is fit for 100 epochs. New examples are generated each epoch and weight updates are performed at the end of each batch.
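The configuration described above might be sketched in Keras as follows. This covers the model definition only; the training loop that generates fresh examples each epoch is not shown. The layer sizes follow the description in the text, and the variable name n_numbers is an assumption.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense

n_numbers = 2  # two time steps, one feature per step

model = Sequential()
# first hidden LSTM layer returns the full sequence so it can feed a second LSTM
model.add(LSTM(6, input_shape=(n_numbers, 1), return_sequences=True))
# second hidden LSTM layer returns only its final output
model.add(LSTM(2))
# single real-valued output: the (scaled) sum
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
model.summary()
```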

### LSTM Evaluation

We evaluate the network on 100 new patterns.

These are generated and a sum value is predicted for each. Both the actual and predicted sum values are rescaled to the original range and a Root Mean Squared Error (RMSE) score is calculated that has the same scale as the original values. Finally, 20 examples of expected and predicted values are listed.
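For reference, the RMSE calculation itself is small enough to sketch in plain Python. The expected and predicted values below are hypothetical and are not output from the model.

```python
from math import sqrt

def rmse(expected, predicted):
    """Root mean squared error between two equal-length lists."""
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return sqrt(sum(squared_errors) / float(len(squared_errors)))

# hypothetical values, already inverted back to the original units
expected = [61, 150, 37]
predicted = [60, 152, 37]
print('RMSE: %f' % rmse(expected, predicted))
```

Because the values are inverted first, the score reads directly as "how far off, in integers, a typical predicted sum is".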

### Complete Example

We can tie this all together. The complete code example is listed below.

Running the example prints some loss information each epoch and finishes by printing the RMSE for the run and some example outputs.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

The results are not perfect, but many examples are predicted correctly.

### The Beginner’s Mistake

All done, right?

Wrong.

The problem we have solved had multiple inputs but was technically not a sequence prediction problem.

In fact, you can just as easily solve it using a multilayer Perceptron (MLP). For example:

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example solves the problem perfectly, and in fewer epochs.

The issue is that we encoded so much of the domain into the problem that it turned the problem from a sequence prediction problem into a function mapping problem.

That is, the order of the input no longer matters. We could shuffle it up any way we want and still learn the problem.

MLPs are designed to learn mapping functions and can easily nail the problem of learning how to add numbers.
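To make the "mapping function" point concrete, here is a deliberately tiny stand-in. It is not the Keras MLP referred to above but a single linear unit trained with plain batch gradient descent. The learning rate, iteration count, and scaling by largest are all assumptions chosen for this sketch.

```python
from random import seed, randint

seed(1)
largest = 100

# 100 training pairs scaled to (0, 1]; targets are scaled the same way
X = [[randint(1, largest) / float(largest) for _ in range(2)] for _ in range(100)]
y = [pattern[0] + pattern[1] for pattern in X]

w1, w2, bias = 0.0, 0.0, 0.0
learning_rate = 0.3
n = float(len(X))

for _ in range(2000):
    g1 = g2 = gb = 0.0
    for (x1, x2), target in zip(X, y):
        error = (w1 * x1 + w2 * x2 + bias) - target
        # gradient of the mean squared error
        g1 += 2 * error * x1 / n
        g2 += 2 * error * x2 / n
        gb += 2 * error / n
    w1 -= learning_rate * g1
    w2 -= learning_rate * g2
    bias -= learning_rate * gb

print(w1, w2, bias)
# swapping the two inputs gives (almost exactly) the same prediction
print((w1 * 0.37 + w2 * 0.45 + bias) * largest)
print((w1 * 0.45 + w2 * 0.37 + bias) * largest)
```

The weights settle close to 1 for each input and the bias close to 0, which is just addition, and the prediction does not depend on the order of the two terms.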

On one hand, this is a better way to approach the specific problem of adding numbers because the model is simpler and the results are better. On the other, it is a terrible use of recurrent neural networks.

This is a beginner’s mistake I see replicated in many “introduction to LSTMs” around the web.

## Addition as a Sequence Prediction Problem

There is another way to frame addition that makes it an unambiguous sequence prediction problem, and in turn makes it much harder to solve.

We can frame addition as an input and output string of characters and let the model figure out the meaning of the characters. The entire addition problem can be framed as a string of characters, such as “12+50” with the output “62”, or more specifically:

• Input: [‘1’, ‘2’, ‘+’, ‘5’, ‘0’]
• Output: [‘6’, ‘2’]

The model must learn not only the integer nature of the characters, but also the nature of the mathematical operation to perform.

Notice how sequence is now important, and that randomly shuffling the input will create a nonsense sequence that could not be related to the output sequence.

Also notice how the problem has transformed to have both an input and an output sequence. This is called a sequence-to-sequence prediction problem, or a seq2seq problem.

We can keep things simple with addition of two numbers, but we can see how this may be scaled to a variable number of terms and mathematical operations that could be given as input for the model to learn and generalize.

Note that this formulation and the rest of this example is inspired by the addition seq2seq example in the Keras project, although I re-developed it from the ground up.

### Data Generation

Data generation for the seq2seq definition of the problem is a lot more involved.

We will develop each piece as a standalone function so you can play with them and understand how they work. Hang in there.

The first step is to generate sequences of random integers and their sum, as before, but with no normalization. We can put this in a function named random_sum_pairs(), as follows.
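A minimal sketch of what this unnormalized version might look like, using only the standard library (the seed value is an arbitrary choice):

```python
from random import seed, randint

def random_sum_pairs(n_examples, n_numbers, largest):
    """Generate lists of random integers and their sums, with no scaling."""
    X, y = [], []
    for _ in range(n_examples):
        in_pattern = [randint(1, largest) for _ in range(n_numbers)]
        X.append(in_pattern)
        y.append(sum(in_pattern))
    return X, y

seed(1)
n_samples, n_numbers, largest = 1, 2, 10
X, y = random_sum_pairs(n_samples, n_numbers, largest)
print(X, y)
```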

Running just this function prints a single example of adding two random integers between 1 and 10.

The next step is to convert the integers to strings. The input string will have the format ‘99+99’ and the output string will have the format ‘99’.

Key to this function is the padding of numbers to ensure that each input and output sequence has the same number of characters. The padding character should be different from the data so the model can learn to ignore it. In this case, we use the space character (‘ ’) for padding and pad the string on the left, keeping the information on the far right.

There are other ways to pad, such as padding each term individually. Try it and see if it results in better performance. Report your results in the comments below.

Padding requires that we know how long the longest sequence may be. We can calculate this easily by taking the log10() of the largest integer we can generate and the ceiling of that number to get an idea of how many chars are needed for each number. We add 1 to the largest number to ensure we expect 3 chars instead of 2 chars for the case of a round largest number, like 100, whose log10() is exactly 2. We then need to add the right number of plus symbols, which is one fewer than the number of terms.

A similar process is repeated on the output sequence, without the plus symbols of course.

The example below adds the to_string() function and demonstrates its usage with a single input/output pair.
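One possible shape for to_string() is sketched below. The length formulas follow the description above; the helper name pad_left() is hypothetical. The results of ceil() are wrapped in int() because math.ceil() returns a float under Python 2, and the lengths are used for padding and later for layer shapes, which need integers.

```python
from math import ceil, log10

def pad_left(text, length):
    """Left-pad text with spaces up to the given length."""
    return ' ' * (length - len(text)) + text

def to_string(X, y, n_numbers, largest):
    # chars per term, times the number of terms, plus one '+' between terms
    max_in = int(n_numbers * ceil(log10(largest + 1)) + n_numbers - 1)
    # chars needed for the largest possible sum
    max_out = int(ceil(log10(n_numbers * (largest + 1))))
    Xstr = [pad_left('+'.join(str(n) for n in pattern), max_in) for pattern in X]
    ystr = [pad_left(str(total), max_out) for total in y]
    return Xstr, ystr

# a single hand-written pair, for illustration
X, y = [[3, 10]], [13]
Xstr, ystr = to_string(X, y, n_numbers=2, largest=10)
print(Xstr, ystr)
```

For this pair the input becomes the 5-character string ' 3+10' and the output the 2-character string '13'.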

Running this example first prints the integer sequence and the padded string representation of the same sequence.

Next, we need to encode each character in the string as an integer value. We have to work with numbers in neural networks after all, not characters.

Integer encoding transforms the problem into a classification problem, where each element of the output sequence may be considered a class output with 11 possible values (the ten digits and the padding space). The first 10 class values happen to be integers with an ordinal relationship.

To perform this encoding, we must define the full alphabet of symbols that may appear in the string encoding, as follows:

Integer encoding then becomes a simple process of building a lookup table of character to integer offset and converting each char of each string, one by one.

The example below provides the integer_encode() function for integer encoding and demonstrates how to use it.
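A sketch of the alphabet and of integer_encode() follows. The alphabet ordering used here (the ten digits, then '+', then the space) is an assumption, chosen to be consistent with the space being encoded as 11 and '3' as 3.

```python
# assumed alphabet: digits first, then the plus sign, then the padding space
alphabet = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']

def integer_encode(Xstr, ystr, alphabet):
    """Replace each character by its offset in the alphabet."""
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    Xenc = [[char_to_int[ch] for ch in pattern] for pattern in Xstr]
    yenc = [[char_to_int[ch] for ch in pattern] for pattern in ystr]
    return Xenc, yenc

# a single hand-written, already padded pair
Xenc, yenc = integer_encode([' 3+10'], ['13'], alphabet)
print(Xenc, yenc)
```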

Running the example prints the integer encoded version of each string encoded pattern.

We can see that the space character (‘ ‘) was encoded with 11 and the three character (‘3’) was encoded as 3, and so on.

The next step is to binary encode the integer encoding sequences.

This involves converting each integer to a binary vector with the same length as the alphabet and marking the specific integer with a 1.

For example, a 0 integer represents the ‘0’ character and would be encoded as a binary vector with a 1 in the 0th position of a 12 element vector: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

The example below defines the one_hot_encode() function for binary encoding and demonstrates how to use it.
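The binary encoding could be sketched as below. The helper one_hot() is hypothetical, and max_int is assumed to be the alphabet size (12 for digits, '+', and space).

```python
def one_hot(index, size):
    """Binary vector of the given size with a single 1 at index."""
    vector = [0] * size
    vector[index] = 1
    return vector

def one_hot_encode(X, y, max_int):
    """One hot encode every integer in every input and output sequence."""
    Xenc = [[one_hot(index, max_int) for index in seq] for seq in X]
    yenc = [[one_hot(index, max_int) for index in seq] for seq in y]
    return Xenc, yenc

# integer encoded ' 3+10' and '13' under the assumed 12-symbol alphabet
X, y = [[11, 3, 10, 1, 0]], [[1, 3]]
Xenc, yenc = one_hot_encode(X, y, max_int=12)
for vector in Xenc[0]:
    print(vector)
print()
for vector in yenc[0]:
    print(vector)
```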

Running the example prints the binary encoded sequence for each integer encoding.

I’ve added some new lines to make the input and output binary encodings clearer.

You can see that a single sum pattern becomes a sequence of 5 binary encoded vectors, each with 12 elements. The output or sum becomes a sequence of 2 binary encoded vectors, again each with 12 elements.

We can tie all of these steps together into a function called generate_data(), listed below.
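The following self-contained sketch chains the four steps into generate_data(). It returns nested Python lists; before passing the result to Keras you would wrap each in a NumPy array. The alphabet ordering and the helper names pad_left() and one_hot() are assumptions.

```python
from math import ceil, log10
from random import seed, randint

# assumed alphabet ordering: digits, then '+', then the padding space
ALPHABET = [str(d) for d in range(10)] + ['+', ' ']

def random_sum_pairs(n_examples, n_numbers, largest):
    X = [[randint(1, largest) for _ in range(n_numbers)] for _ in range(n_examples)]
    y = [sum(pattern) for pattern in X]
    return X, y

def pad_left(text, length):
    return ' ' * (length - len(text)) + text

def to_string(X, y, n_numbers, largest):
    max_in = int(n_numbers * ceil(log10(largest + 1)) + n_numbers - 1)
    max_out = int(ceil(log10(n_numbers * (largest + 1))))
    Xstr = [pad_left('+'.join(str(n) for n in p), max_in) for p in X]
    ystr = [pad_left(str(t), max_out) for t in y]
    return Xstr, ystr

def integer_encode(Xstr, ystr, alphabet):
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    Xenc = [[char_to_int[ch] for ch in s] for s in Xstr]
    yenc = [[char_to_int[ch] for ch in s] for s in ystr]
    return Xenc, yenc

def one_hot(index, size):
    vector = [0] * size
    vector[index] = 1
    return vector

def one_hot_encode(X, y, max_int):
    Xenc = [[one_hot(i, max_int) for i in seq] for seq in X]
    yenc = [[one_hot(i, max_int) for i in seq] for seq in y]
    return Xenc, yenc

def generate_data(n_samples, n_numbers, largest, alphabet):
    # integers -> padded strings -> integer codes -> one hot vectors
    X, y = random_sum_pairs(n_samples, n_numbers, largest)
    X, y = to_string(X, y, n_numbers, largest)
    X, y = integer_encode(X, y, alphabet)
    X, y = one_hot_encode(X, y, len(alphabet))
    return X, y

seed(1)
X, y = generate_data(3, 2, 10, ALPHABET)
print(len(X), len(X[0]), len(X[0][0]))  # samples, input time steps, features
print(len(y), len(y[0]), len(y[0][0]))  # samples, output time steps, features
```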

Finally, we need to invert the encoding to convert the output vectors back into numbers so we can compare expected output integers to predicted integers.

The invert() function below performs this operation. Key is first converting the binary encoding back into an integer using the argmax() function, then converting the integer back into a character using a reverse mapping of the integers to chars from the alphabet.
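A dependency-free sketch of this decoding step is shown below. It uses a small pure-Python argmax in place of NumPy's argmax() function, and the alphabet ordering is again an assumption. The probability vectors are made up to resemble softmax output.

```python
# assumed alphabet ordering: digits, then '+', then the padding space
ALPHABET = [str(d) for d in range(10)] + ['+', ' ']

def invert(seq, alphabet):
    """Decode a sequence of one hot (or probability) vectors into a string."""
    int_to_char = dict((i, c) for i, c in enumerate(alphabet))
    chars = []
    for vector in seq:
        # argmax: position of the largest value in the vector
        index = max(range(len(vector)), key=lambda i: vector[i])
        chars.append(int_to_char[index])
    return ''.join(chars)

# two made-up output vectors that peak at '1' and '3'
first = [0.01] * 12
first[1] = 0.89
second = [0.02] * 12
second[3] = 0.78
print(invert([first, second], ALPHABET))
```

Taking the argmax means the same function works on both the exact one hot targets and the softmax probabilities predicted by the network.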

We now have everything we need to prepare data for this example.

Note, these functions were written for this post and I did not write any unit tests nor battle test them with all kinds of inputs. If you see or find an obvious bug, please let me know in the comments below.

### Configure and Fit a seq2seq LSTM Model

We can now fit an LSTM model to this problem.

We can think of the model as being comprised of two key parts: the encoder and the decoder.

First, the input sequence is shown to the network one encoded character at a time. We need an encoder layer to learn the relationship between the steps in the input sequence and develop an internal representation of these relationships.

The input to the network (for the two number case) is a series of 5 encoded characters (2 for each integer and one for the ‘+’) where each vector contains 12 features for the 12 possible characters that each item in the sequence may be.

The encoder will use a single LSTM hidden layer with 100 units.

The decoder must transform the learned internal representation of the input sequence into the correct output sequence. For this, we will use a hidden layer LSTM with 50 units, followed by an output layer.

The problem is defined as requiring two binary output vectors for the two output characters. We will use the same fully connected layer (Dense) to output each binary vector. To use the same layer twice, we will wrap it in a TimeDistributed() wrapper layer.

The output fully connected layer will use a softmax activation function to output values in the range [0,1].

There’s a problem though.

We must connect the encoder to the decoder and they do not fit.

That is, the encoder will produce a 2-dimensional matrix of outputs, [samples, features], with 100 outputs for each input sequence of 5 vectors. The decoder is an LSTM layer that expects a 3D input of [samples, timesteps, features] in order to produce a decoded sequence of 1 sample with 2 timesteps each with 12 features.

If you try to force these pieces together, you get an error like:

Exactly as we would expect.

We can solve this using a RepeatVector layer. This layer simply repeats the provided 2D input n-times to create a 3D output.

The RepeatVector layer can be used like an adapter to fit the encoder and decoder parts of the network together. We can configure the RepeatVector to repeat the input 2 times. This creates a 3D output comprised of two copies of the sequence output from the encoder, that we can decode two times using the same fully connected layer for each of the two desired output vectors.
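To see what this repetition amounts to in terms of shapes, here is a plain-Python imitation of the idea. It is not the Keras layer itself, and repeat_vector() is a hypothetical helper; a 3-feature vector stands in for the encoder's 100 outputs.

```python
def repeat_vector(batch, n):
    """Turn a 2D batch [samples, features] into 3D [samples, n, features]."""
    return [[list(vector) for _ in range(n)] for vector in batch]

# one sample with 3 features (the real encoder would emit 100)
encoded = [[0.1, 0.2, 0.3]]
repeated = repeat_vector(encoded, 2)

print(len(repeated), len(repeated[0]), len(repeated[0][0]))
print(repeated)
```

Each output time step of the decoder therefore receives the same encoded vector as input; only the decoder's internal state differs between steps.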

The problem is framed as a classification problem with 11 classes, therefore we can optimize the log loss (categorical_crossentropy) function and even track accuracy as well as loss on each training epoch.

Putting this together, we have:
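A sketch of the model described above is given here. It assumes a 12-symbol alphabet (ten digits, '+', and the space), so n_chars would need adjusting if your alphabet differs. The sequence lengths are those of the two number case.

```python
from keras.models import Sequential
from keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

n_chars = 12          # assumed alphabet size: ten digits, '+', and the space
n_in_seq_length = 5   # e.g. ' 3+10'
n_out_seq_length = 2  # e.g. '13'

model = Sequential()
# encoder: reads the whole input sequence, emits one 100-element vector
model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))
# adapter: repeat the encoded vector once per output time step
model.add(RepeatVector(n_out_seq_length))
# decoder: returns one output per output time step
model.add(LSTM(50, return_sequences=True))
# the same Dense layer is applied to each output time step
model.add(TimeDistributed(Dense(n_chars, activation='softmax')))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
```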

### Why Use a RepeatVector Layer?

Why not return the sequence output from the encoder as input for the decoder?

That is, one output for each LSTM at each input sequence time step rather than one output for each LSTM for the whole input sequence.

An output for each step of the input sequence gives the decoder access to the intermediate representation of the input sequence each step. This may or may not be useful. Providing the final LSTM output at the end of the input sequence may be more logical as it captures information about the entire input sequence, ready to map to or calculate an output.

Also, this leaves nothing in the network to specify the size of the decoder other than the input, giving one output value for each timestep of the input sequence (5 instead of 2).

You could reframe the output to be a sequence of 5 characters padded with whitespace. The network would be doing more work than is required and may lose some of the compression type capability provided by the encoder-decoder paradigm. Try it and see.

The issue titled “is the Sequence to Sequence learning right?” on the Keras GitHub project provides some good discussions of alternate representations you could play with.

### Evaluate LSTM Model

As before, we can generate a new batch of examples and evaluate the algorithm after it has been fit.

We could calculate an RMSE score on the prediction, although I have left it out for simplicity here.

### Complete Example

Putting it all together, the complete example is listed below.

Running the example nearly perfectly fits the problem. In fact, running for more epochs or increasing weight updates to every sample (batch_size=1) will get you there, but will take 10 times longer to train.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

We can see that the predicted outcome matches the expected outcome on the first 20 examples we look at.

## Extensions

This section lists some natural extensions to this tutorial that you may wish to explore.

• Integer Encoding. Explore whether the model can learn the problem better using an integer encoding alone. The ordinal relationship between most of the inputs may prove very useful.
• Variable Numbers. Change the example to support a variable number of terms on each input sequence. This should be straightforward as long as you perform sufficient padding.
• Variable Mathematical Operations. Change the example to vary the mathematical operation to allow the network to generalize even further.
• Brackets. Allow the use of brackets along with other mathematical operations.

Did you try any of these extensions?
Share your findings in the comments; I’d love to see what you found.


## Summary

In this tutorial, you discovered how to develop an LSTM network to learn how to add random integers together using the seq2seq stacked LSTM paradigm.

Specifically, you learned:

• How to learn the naive mapping function of input terms to output terms for addition.
• How to frame the addition problem (and similar problems) and suitably encode inputs and outputs.

Do you have any questions?


### 71 Responses to Learn to Add Numbers with an Encoder-Decoder LSTM Recurrent Neural Network

1. xi May 20, 2017 at 6:22 pm #

LOVE U

2. Chauhan Hardik May 20, 2017 at 9:56 pm #

How can we write code for decoder network whose input is encoder’s memory plus previous time step’s output ?

• Jason Brownlee May 21, 2017 at 5:59 am #

Good question, I have not done this myself yet. It may require a careful network design.

3. Giuseppe Bonaccorso May 21, 2017 at 1:05 am #

Hi,
very interesting article. I’ve written an extension for coping with more complex expressions (it’s available on this GIST: https://gist.github.com/giuseppebonaccorso/d6e5bee6d50480344493b66f88fc414b)

Unfortunately, there are still many errors, but it’s probably due to the size of the training dataset (which doesn’t contain all possible examples). That’s probably the hardest part of Seq2Seq, I mean creating a model which can also learn semantics, so that can be easily trained with fewer examples and with always better performances.

I’m continuing my experiments!

• Jason Brownlee May 21, 2017 at 6:00 am #

Well done, keep experimenting Giuseppe!

4. max ales May 21, 2017 at 3:27 am #

I’m sorry but i get this error. What’s wrong?

Using Theano backend.
Traceback (most recent call last):
File “test.py”, line 110, in
File “/usr/local/lib/python2.7/dist-packages/keras/models.py”, line 430, in add
layer(x)
File “/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py”, line 257, in __call__
return super(Recurrent, self).__call__(inputs, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/keras/engine/topology.py”, line 578, in __call__
output = self.call(inputs, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py”, line 295, in call
preprocessed_input = self.preprocess_input(inputs, training=None)
File “/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py”, line 1028, in preprocess_input
timesteps, training=training)
File “/usr/local/lib/python2.7/dist-packages/keras/layers/recurrent.py”, line 58, in _time_distributed_dense
x = K.reshape(x, (-1, timesteps, output_dim))
File “/usr/local/lib/python2.7/dist-packages/keras/backend/theano_backend.py”, line 739, in reshape
y = T.reshape(x, shape)
File “/usr/local/lib/python2.7/dist-packages/Theano-0.9.0-py2.7.egg/theano/tensor/basic.py”, line 4910, in reshape
rval = op(x, newshape)
File “/usr/local/lib/python2.7/dist-packages/Theano-0.9.0-py2.7.egg/theano/gof/op.py”, line 615, in __call__
node = self.make_node(*inputs, **kwargs)
File “/usr/local/lib/python2.7/dist-packages/Theano-0.9.0-py2.7.egg/theano/tensor/basic.py”, line 4748, in make_node
raise TypeError(“Shape must be integers”, shp, shp.dtype)
TypeError: (‘Shape must be integers’, TensorConstant{[ -1. 5. 100.]}, ‘float64’)

• Jason Brownlee May 21, 2017 at 6:02 am #

Ensure you have copied the example exactly without any extra white space.

Also ensure you are using Keras 2.0 or higher.

• giorgio borghi May 21, 2017 at 9:27 pm #

i wrote carefully

raise TypeError(“Shape must be integers”, shp, shp.dtype)
TypeError: (‘Shape must be integers’, TensorConstant{[ -1. 5. 100.]}, ‘float64’)

• Jason Brownlee May 22, 2017 at 7:53 am #

I have not seen this error before, sorry.

Do any other examples work on your system?

• Giorgio borghi May 23, 2017 at 9:57 pm #

Yes . Keras is 2.0.4 and your post is too important and awesome ! Great job !

• RealUser404 June 25, 2017 at 9:22 am #

I got almost the same error when just copy/pasting the code :
TypeError: Value passed to parameter ‘shape’ has DataType float32 not in list of allowed values: int32, int64

• Jason Brownlee June 26, 2017 at 6:03 am #

Sorry, I have not seen this error. Ensure you have the latest version of all the libraries installed.

See if you can pull the example apart and narrow down the line that fails. It looks like numpy and the .shape property. Perhaps your numpy is not up to date. shape always returns an int. No idea what this error could be.

• RealUser404 June 26, 2017 at 9:54 am #

Well it was due to the n_out_seq_length = ceil(log10(n_numbers * (largest+1))) lines.

I do not know why it does that, as ceil is supposed to return an integer.
If I change manually the value to n_out_seq_length = 2, then it works correctly.

• Jason Brownlee June 27, 2017 at 8:20 am #

Interesting.

5. Birkey May 21, 2017 at 12:20 pm #

Nice work Jason!
For a seq2seq problem, an RNN is the choice by default. But it seems quite subtle to choose between MLP vs RNN for a seq2vec problem. In the case discussed, if we “encode too much domain knowledge into the problem” we turn it into a function mapping problem; here by “domain knowledge” we mean “split addition equation into added and addon”. But this kind of “data preprocessing” is not uncommon for input sequences.

Q: do you have any principles or guidelines for model choosing for seq2vec problem?

BTW: I have read the keras issue you recommend. Maybe there’s answer

• Jason Brownlee May 22, 2017 at 7:52 am #

Great question.

It really comes down to what performs best on your problem. You may want a seq2seq formulation because of the elegance, but a mapping-based solution (seq2vec) may just perform better.

In practice, I would try a suite of methods and double down on what works best.

6. Kunpeng Zhang May 30, 2017 at 11:47 pm #

Can seq2seq be used for time series prediction? Given you delivered lots of wonderful posts on time series, what if we translate traditional time series prediction into a seq2seq formulation?
Could you give me some advice?

• Jason Brownlee June 2, 2017 at 12:37 pm #

Yes, if you have a different number of output time steps for each one input time step.

• Kunpeng Zhang June 6, 2017 at 1:24 am #

7. Mehrdad May 31, 2017 at 3:00 am #

Hi Jason. Thanks for your great article.
Something is not clear to me. LSTMs are capable of learning series (sequences) with different lengths. Why should we use padding to have sequences with the same length?

• Jason Brownlee June 2, 2017 at 12:39 pm #

Here, the difference is between the lengths of input and output sequences, not differences in the lengths of input sequences themselves.

• mehrdadscomputer June 5, 2017 at 10:07 pm #

Thanks Jason.

I did some searches and it seems that after padding, it’s essential to use masking or to initialize sample_weight to tell the model to ignore white spaces. It doesn’t learn to automatically ignore padded characters (I had a wrong assumption about it).

I did masking for this problem and the result is as follow:

The result without masking for 10 epochs is: loss: 0.6787 – acc: 0.8185
The result with masking for 10 epochs is: loss: 0.6202 – acc: 0.8705

As you see, there is a large improvement. But could you please say how we can apply sample_weight to this problem?

The improvement achieved by changing this part of code:

model = Sequential()

• Jason Brownlee June 6, 2017 at 9:35 am #

Great work! Thanks for reporting your results.

I have a post on masking scheduled that should provide further help.

What do you mean sample weight? Weighting input samples? I guess neural nets do this already. You could also resample your training data to affect the sampling distribution of different classes.

• mehrdad12 June 8, 2017 at 6:47 pm #

Thanks Jason.
I run my model many times and it seems that the improvement is because of randomness and not masking, unfortunately.

Yes, I mean weighting input samples.

I asked about this in Keras’ slack channel and some guys told me that neural network is not capable of learning to ignore padded part of time series and I need to do it by weighting input samples

• Jason Brownlee June 9, 2017 at 6:22 am #

Perhaps you could ask what this person meant.

Off the cuff, I cannot see how weighting inputs would help with masking.

8. RealUser404 June 25, 2017 at 10:17 am #

This was a great tutorial, thanks a lot!

I have a question concerning the size of the LSTM layers. You chose to use a LSTM layer of 100 neurons for the encoder, and 50 for the decoder. How did you choose those values? Are these somehow related to the size of the input and size of the output? Does that mean if I use an input twice as big (numbers of 4 digits instead of 2) I should be doubling the number of neurons too?

• Jason Brownlee June 26, 2017 at 6:04 am #

No, just a little trial and error.

Experiment and see how different values affect model skill.

9. anurag September 4, 2017 at 12:06 am #

Thanks for the nice post, I want to use the encoder – decoder lstm network to extract features for shampoo sales data to use the features for multistep forecasting.

def lstm_autoencoder(train_data, target, timesteps, repeat_vec, batch_size, epochs, ls_cells=[10, 6], lr=0.01):
    train_data = train_data.reshape(train_data.shape[0], timesteps, train_data.shape[1])
    target = target.reshape(target.shape[0], timesteps, target.shape[1])
    model = Sequential()
    print(model.summary())
    # train LSTM
    tr_mse, val_mse = list(), list()
    for i in range(epochs):
        print('Epoch :', str(i))
        history = model.fit(train_data, target, epochs=1, batch_size=1, verbose=2, shuffle=False, validation_split=0.1)
        tr_mse.append(history.history['loss'])
        val_mse.append(history.history['val_loss'])
    return model, history, tr_mse, val_mse

model, history, tr_mse, val_mse = lstm_autoencoder(X_scaled, y_scaled, timesteps=1,
                                                   repeat_vec=1, batch_size=1,
                                                   epochs=80, ls_cells=[20, 16, 7],
                                                   lr=0.002)

After training the network, I am predicting……………..
X_train = X_train.reshape(X_train.shape[0], 1, 1)
result = model.predict(X_train, batch_size=1, verbose=2)

Is this the correct way, as my idea was to do it in an unsupervised way…

• anurag September 4, 2017 at 12:10 am #

I wanted to understand how to use the network in an unsupervised way to reduce the dimension to a fixed length vector(encode) and then reconstruct (decode) , later on this reduced dimension of fixed length , I would like to train a LSTM for multi-step forecasts.

• Jason Brownlee September 4, 2017 at 4:36 am #

Yes, but it would not be unsupervised, it is supervised.

• Jason Brownlee September 4, 2017 at 4:35 am #

You could use the encoder-decoder for time series.

The key is to choose the length of the input sequences and the length of the context vector, experiment with both. I’d love to hear how you go.

This post might give you a cleaner template as a starting point:
https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/

10. BouBou September 27, 2017 at 2:35 am #

Thanks for this detailed article!

I have a question: what is the initial state for both the encoder and the decoder?

It appears that instead of “sending” the last state of the encoder to the decoder as initial state, the last output of the encoder is “send”/fed to the decoder as input, right? But the last output of the encoder corresponds actually to the last time step input data … and not a representation of all the input data … Moreover, the difference between two time steps in the decoder is only the hidden state (because the input of each time step in the decoder == last encoder output) …

A second question: in case you have tested the solution that consists of “sending” the last hidden state of the encoder to the decoder, or using the attention mechanism, where each decoder input is the decoder output from the previous time step, could you clarify how you achieved this in Keras or only TensorFlow?

• Jason Brownlee September 27, 2017 at 5:46 am #

When you refer to state, are you referring to the internal variables within each memory unit, or the output of the encoder or decoder model? I think you may have these elements mixed up.

• BouBou September 27, 2017 at 7:40 pm #

I refer to the hidden state and cell states (for LSTM). They are transferred from an LSTM cell to another through time steps …
At each time step, a LSTM cell:
– has as input: X (batch_size, n_features), the previous hidden state h, the previous cell/memory state c,
– produces: Y (batch_size, n_hidden_units), the hidden state h, the cell/memory state c.

Thanks !

11. BouBou September 28, 2017 at 10:44 pm #

To make my question more precise: the input of the decoder is a constant vector and this vector is the last output of the encoder y(n); but this last output corresponds to the encoder input x(n).
For instance:
number of time steps in the encoder = n
number of time steps in the decoder = m
encoder input = x:{x1, x2, x3 … xn}
encoder outputs = y:{y1,y2,y3 … yn}
encoder states = he:{he0,he1….hen}
decoder input = y’:{yn,yn …yn} (m times)
decoder output = z:{z1,z2 … zm}
decoder states= hd:{hd0,hd1, hd2… hdm}
So, my questions are:
– what is he0? (initial encoder state)
– what is hd0 ? (initial decoder state .. I think that it should be hen == hd0)
– yn = f[he(n-1), xn] is the input of the decoder, where is the information about x1 … x(n-1) to decode properly?

Thanks !!

• Jason Brownlee September 29, 2017 at 5:05 am #

The encoder and decoder initial hidden state values are 0.
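The notation in this thread can be made concrete with a small sketch. It uses a toy scalar tanh recurrence as a stand-in for the LSTM layer, which is an assumption made for brevity; the weights `w_x` and `w_h` and the helper names are hypothetical. It starts from a zero initial state, as stated above, and shows that the last output still depends on the first input, because each state is computed from the previous one. In a Keras LSTM layer the output at each step is the hidden state, so y(n) plays the same role here.

```python
import math

def encode(xs, w_x=0.5, w_h=0.8):
    """Toy scalar recurrent encoder (a stand-in for an LSTM layer)."""
    h = 0.0  # initial state he0 is zero
    for x in xs:
        # Each new state mixes the current input with the previous state,
        # so information about earlier inputs is carried forward.
        h = math.tanh(w_x * x + w_h * h)
    return h  # last output y(n), equal to the last state he(n)

def repeat_vector(v, m):
    """Feed the same encoder output to each of the m decoder time steps."""
    return [v] * m

# Two input sequences that differ only in the first time step.
y_a = encode([1.0, 0.0, 0.0])
y_b = encode([-1.0, 0.0, 0.0])

decoder_input = repeat_vector(y_a, 2)
print(y_a, y_b, decoder_input)
```

The two final outputs differ even though the last two inputs are identical, which is the sense in which y(n) summarizes the whole input sequence rather than only x(n).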

12. Tony October 31, 2017 at 10:49 pm #

Hi, Jason, thank you for your posts. I was wondering about the first solution in this one, where an MLP is used to learn to sum numbers. The values used in the training are such that the trained model “sees” 5,000 examples (n_examples * n_epoch) if I’m not mistaken and the possible values for the two input numbers are between 1 and 100, which makes 5,050 (100 * 99 / 2 + 100) different possible pairs. Of course, the 5,000 examples are not unique but still: doesn’t this turn the MLP more into a hash table of all possible pairs for this particular example?

Still, it looks like if we increase the value of largest without touching the other vars, the approach still works mostly due to the normalization, I suppose. I find that quite amazing.

• Tony October 31, 2017 at 10:58 pm #

Actually, now that I think about it more: the model needs to learn a simple linear function, so with enough training examples it is normal for it to be able to generalise to numbers of any size. Probably there is some minimum number of training examples that is enough for it to generalise to numbers of any size. My experiments show that n_examples = 100, n_epoch = 100 gets the work done.

• Jason Brownlee November 1, 2017 at 5:49 am #

Yes, the model must generalize a mapping function. Well done on seeing that.
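Both observations in this thread can be checked in a few lines. This is a minimal sketch, assuming a single linear neuron trained with plain stochastic gradient descent rather than the tutorial's MLP; the function names, learning rate, and seed are illustrative choices, not values from the tutorial. It counts the unordered pairs between 1 and 100, then trains on normalized values and predicts sums far outside the training range.

```python
import random
from itertools import combinations_with_replacement

# Unordered pairs (a, b) with 1 <= a <= b <= 100.
n_pairs = sum(1 for _ in combinations_with_replacement(range(1, 101), 2))

def train_linear_adder(n_examples=100, n_epoch=100, largest=100, lr=0.1, seed=1):
    """Fit y = w1*x1 + w2*x2 + b to normalized sums with per-example SGD."""
    rng = random.Random(seed)
    w1, w2, b = 0.0, 0.0, 0.0
    scale = float(largest * 2)  # normalize inputs and target by the largest sum
    for _ in range(n_epoch):
        for _ in range(n_examples):  # fresh random examples every epoch
            a, c = rng.randint(1, largest), rng.randint(1, largest)
            x1, x2, y = a / scale, c / scale, (a + c) / scale
            err = (w1 * x1 + w2 * x2 + b) - y
            # Gradient step on the squared error for this one example.
            w1 -= lr * err * x1
            w2 -= lr * err * x2
            b -= lr * err
    return w1, w2, b, scale

def predict(model, a, c):
    """Apply the same normalization used in training, then undo it."""
    w1, w2, b, scale = model
    return (w1 * a / scale + w2 * c / scale + b) * scale

model = train_linear_adder()
print(n_pairs)
print(predict(model, 250, 731))  # operands well outside the 1..100 training range
```

Because the target is exactly linear in the inputs, the learned weights approach 1, 1, and 0, which is why the prediction holds up for numbers the model never saw.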

13. Thabet November 22, 2017 at 3:59 pm #

Thank you Jason!

14. Jay December 4, 2017 at 12:43 am #

I notice that you are training on 30,000 examples (= 30 epochs x 1000 samples per epoch).
Also, the largest numbers being added in both the train and test sets are 10 – which probably means there are a max of 100 unique pairs (10 x 10) of addition examples in the domain.

Since 30,000 training examples is way larger (>>) than the available unique pairs for addition (i.e. 100) – Do you think the model has simply memorized the results ?

At first glance – It seems like the “test data” might NOT really be new data that the model is seeing.

Experimented with increasing “largest” to a value of 100 – which means 10,000 unique addition possibilities. But this time the test accuracy was way lower (~ 58%) and it seemed like the predictions on the test set were all incorrect (though they were pretty close in many cases).

So – Is the LSTM really learning ??

• Jason Brownlee December 4, 2017 at 7:56 am #

It’s a good question that requires careful testing to answer.

I don’t think it is getting enough exposures/or has sufficient capacity to memorize, but you could be right.

A better evaluation of the model would be to train it on a subset of possible pairs and test it on the held out pairs to see if indeed it learns addition.
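The held-out evaluation suggested here could start from a split like the one below. It is a minimal sketch, assuming ordered pairs of integers from 1 to `largest`; the function name `split_pairs`, the 20% test fraction, and the seed are hypothetical choices. Training examples would then be drawn only from `train_pairs`, and accuracy measured only on `test_pairs`.

```python
import random

def split_pairs(largest=10, test_fraction=0.2, seed=1):
    """Enumerate every (a, b) pair once and hold out a fraction for testing."""
    pairs = [(a, b) for a in range(1, largest + 1) for b in range(1, largest + 1)]
    rng = random.Random(seed)
    rng.shuffle(pairs)
    n_test = int(len(pairs) * test_fraction)
    # The first n_test shuffled pairs are never shown during training.
    return pairs[n_test:], pairs[:n_test]

train_pairs, test_pairs = split_pairs()
print(len(train_pairs), len(test_pairs))
```

If the model scores well on pairs it has never seen, memorization alone cannot explain the result.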

15. ming February 6, 2018 at 10:42 am #

Hi, Jason, I have a data input that is a sequence of hierarchical structures, say category, sub_category, subsub_category (for example, book categories), and another input that is shared across each category (say, book publishing year). The output is a feature of the book. I was thinking of using the model from this blog but am not quite sure. What kind of model do you suggest?

• Jason Brownlee February 7, 2018 at 9:19 am #

I recommend brainstorming a suite of different framings of the problem, try each and see what works best.

16. hanban April 9, 2018 at 10:58 am #

Thank you Jason, great post. I am confused about the ‘unit’ problem. You set 100 units for the encoder layer, which means it contains 100 hidden states, right? One unit takes in the last hidden state and some input x, then produces some output; am I right so far? I am confused about where this input x comes from. In my picture, one unit takes in one input, which is a digit or a ‘+’. What mistake am I making?

• Jason Brownlee April 10, 2018 at 6:11 am #

Each unit will take the input. It means all the units in the input layer are doing the same work in parallel.

17. ihm May 30, 2018 at 7:46 pm #

Thank you Jason, great post. I would like to translate this article into Thai; would you give permission?
I am translating it for training purposes.
I’m sorry, my English is not good.

18. Ashley June 5, 2018 at 5:19 am #

Hi. I have an autoencoder with 5 layers LSTM(512) -> LSTM(64) -> LSTM(32) -> LSTM(64) -> LSTM(512) -> Dense(1) and I am getting an error: expected dense_4 to have 3 dimensions, but got array with shape (272, 1), when running without it.

So I try and add this line: model.add(RepeatVector(2)) after my LSTM(32) layer and I get this error: ValueError: Input 0 is incompatible with layer repeat_vector_2: expected ndim=2, found ndim=3.

Basically, I am really struggling with how to get this thing to work.

19. Mohammed June 29, 2018 at 2:57 pm #

Hello Jason,
I have a question. What if I want to map to a very long sequence? Say I want to map a sequence of length 100 to a sequence of length 3000. What would the model look like, in your opinion? Could you please share your thoughts?

• Jason Brownlee June 29, 2018 at 3:28 pm #

Perhaps try encoder-decoder and LSTM autoencoder.

20. Rahim July 12, 2018 at 9:40 pm #

Hello Jason, thank you very much for this post!

I was wondering what the reason is for using one-hot encoding. Is it possible to avoid it?

21. Maryam August 13, 2018 at 2:47 am #

Hi Jason,
that was awesome like the other ones.
We have CuDNNGRU in Keras, but have you seen CuDNNRNN in Keras? We have CuDNNRNN in TensorFlow. I need CuDNNRNN in Keras because it runs much faster than SimpleRNN.

• Jason Brownlee August 13, 2018 at 6:20 am #

Sorry, I don’t have tutorials on CuDNNGRU.

22. Bayram August 28, 2018 at 3:55 pm #

Hi Jason,
>The network is fit for 50 epochs, new examples are generated each epoch and weight updates are performed after every 2 examples.

I can’t see 50 epochs, n_epoch = 100.
Also we fit each epoch, so don’t we update the weights for each example? How do we update after every 2 examples?

• Jason Brownlee August 29, 2018 at 8:06 am #

You’re right, I have updated the description of the model training. Thanks!

23. WALID S SABA December 2, 2018 at 8:05 am #

That’s why NNs are a big joke (and I am, yes, LOL)
Here’s the power of LOGIC…. here’s addition for you

DONE !!!

• Jason Brownlee December 3, 2018 at 6:36 am #

I don’t agree.

Also note that the tutorial is a demonstration of the capability of a learning algorithm, not a case study on the best way to solve the problem of adding numbers (perhaps I should have spelled that out more).

24. Muhammad zubair December 6, 2018 at 3:10 pm #

Hello sir, how can I use the model to present a problem to the user and then check the user’s response?
Example :
Model: 10 + 13
User: 26

• Jason Brownlee December 7, 2018 at 5:17 am #

This sounds like a general programming question, perhaps look-up the Python API for keyboard input?
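For the keyboard-input part, a minimal sketch using Python 3's built-in `input()` might look like the following. The function names `check_response` and `ask` are hypothetical, and the expected answer is computed directly with `a + b`; swapping in a trained model's prediction is left as an assumption about how the tutorial's model would be wired up.

```python
def check_response(a, b, response):
    """Return True if the user's typed response equals a + b."""
    try:
        return int(response.strip()) == a + b
    except ValueError:
        # Anything that is not an integer counts as a wrong answer.
        return False

def ask(a, b, read=input):
    """Prompt for an answer from the keyboard and report whether it is right."""
    reply = read("{} + {} = ".format(a, b))
    if check_response(a, b, reply):
        return "Correct"
    return "Incorrect, the answer is {}".format(a + b)
```

Calling `ask(10, 13)` prompts on the keyboard; a reply of 26, as in the example above, is reported as incorrect.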

25. Miguel August 30, 2019 at 6:31 am #

Great article! Helped me improve my understanding of LSTMs and helped me improve an LSTM I was working on.

I think there may be a small typo? Isn’t it 12 elements/features/classes? 10 numbers plus 2 characters?
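The count can be checked with a short one-hot sketch. It assumes the two extra characters are `+` and a space used for padding; the names `ALPHABET`, `one_hot_encode`, and `one_hot_decode` are illustrative and may not match the tutorial's code exactly.

```python
# Ten digits plus '+' and a padding space (the last two are an assumption).
ALPHABET = list("0123456789") + ["+", " "]
CHAR_TO_INT = {ch: i for i, ch in enumerate(ALPHABET)}

def one_hot_encode(text):
    """Turn each character into a binary vector with a single 1."""
    rows = []
    for ch in text:
        row = [0] * len(ALPHABET)
        row[CHAR_TO_INT[ch]] = 1
        rows.append(row)
    return rows

def one_hot_decode(rows):
    """Map each vector back to the character at its largest entry."""
    return "".join(ALPHABET[row.index(max(row))] for row in rows)

encoded = one_hot_encode("50+11")
print(len(ALPHABET), len(encoded), one_hot_decode(encoded))
```

Each time step is a vector with one entry per symbol, so the number of features equals the alphabet size.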

26. Tushar Seth September 14, 2019 at 5:25 am #

Hi Jason,
Thanks for this tutorial. But don’t you think that sum is linear, so it doesn’t even require any hidden layers?

What about squares of a number? I tried your approach to predict the squares of numbers, but the error was way too high to even consider. Can you please shed some light on this?

• Jason Brownlee September 14, 2019 at 6:24 am #

Sum is linear, but this problem is symbolic/combinatorial.

Yes, I would love you to explore other operations! I would expect the technique to perform just as well.

Let me know how you go.

27. AG September 20, 2019 at 2:43 am #

Hi Jason,

In my model, there are 28 inputs and 2 outputs. All are float values with 3 decimal places. Please guide me on how to create and train the LSTM model. The one_hot_encode works for integers, not for floats. Please advise.

Thank you,
AG