How to Learn to Add Numbers with seq2seq Recurrent Neural Networks

Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN) capable of learning the relationships between elements in an input sequence.

A good demonstration of LSTMs is to learn how to combine multiple terms using a mathematical operation, such as a sum, and output the result of the calculation.

A common mistake made by beginners is to simply learn the mapping function from input term to the output term. A good demonstration of LSTMs on such a problem involves learning the sequenced input of characters (“50+11”) and predicting the sequence output in characters (“61”). This hard problem can be learned with LSTMs using the sequence-to-sequence, or seq2seq (encoder-decoder), stacked LSTM configuration.

In this tutorial, you will discover how to address the problem of adding sequences of randomly generated integers using LSTMs.

After completing this tutorial, you will know:

  • How to learn the naive mapping function of input terms to output terms for addition.
  • How to frame the addition problem (and similar problems) and suitably encode inputs and outputs.
  • How to address the true sequence-prediction addition problem using the seq2seq paradigm.

Let’s get started.

How to Learn to Add Numbers with seq2seq Recurrent Neural Networks
Photo by Lima Pix, some rights reserved.

Tutorial Overview

This tutorial is divided into 3 parts; they are:

  1. Adding Numbers
  2. Addition as a Mapping Problem (the beginner’s mistake)
  3. Addition as a seq2seq Problem


This tutorial assumes a Python 2 or Python 3 development environment with SciPy, NumPy, and Pandas installed.

The tutorial also assumes scikit-learn and Keras v2.0+ are installed with either the Theano or TensorFlow backend.

If you need help with your environment, see the post:

Adding Numbers

The task is, given a sequence of randomly selected integers, to return the sum of those integers.

For example, given 10 + 5, the model should output 15.

The model is to be both trained and tested on randomly generated examples so that the general problem of adding numbers is learned, rather than memorization of specific cases.

Addition as a Mapping Problem (the beginner’s mistake)

In this section, we will work through the problem and solve it using an LSTM and show how easy it is to make the beginner’s mistake and not harness the power of recurrent neural networks.

Data Generation

Let’s start off by defining a function to generate sequences of random integers and their sum as training and test data.

We can use the randint() function to generate random integers between a min and max value, such as between 1 and 100. We can then sum the sequence. This process can be repeated a fixed number of times to create pairs of input sequences of numbers and matching output summed values.

For example, this snippet will create 100 examples of adding 2 numbers between 1 and 100:
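The listing is not reproduced here, so what follows is a minimal sketch of such a snippet. It assumes Python's standard random module. The variable names and the fixed seed are illustrative choices, not the tutorial's own code.

```python
from random import seed, randint

seed(1)  # fixed seed only so repeated runs print the same pairs (illustrative)

n_examples = 100  # how many input-output pairs to create
n_numbers = 2     # how many integers are summed in each pair
largest = 100     # upper bound (inclusive) for each random integer

X, y = [], []
for _ in range(n_examples):
    # draw the input terms, then record their sum as the target
    in_pattern = [randint(1, largest) for _ in range(n_numbers)]
    out_pattern = sum(in_pattern)
    print(in_pattern, out_pattern)
    X.append(in_pattern)
    y.append(out_pattern)
```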

Running the example will print each input-output pair.

Once we have the patterns, we can convert the lists to NumPy arrays and rescale the values. We must rescale the values to fit within the bounds of the activation used by the LSTM.

For example:

Putting this all together, we can define the function random_sum_pairs() that takes a specified number of examples, a number of integers in each sequence, and the largest integer to generate, and returns X, y pairs of data for modeling.
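A hedged sketch of such a function is shown below. To stay dependency-free it returns plain Python lists rather than NumPy arrays, and it assumes the values are rescaled by dividing by the largest possible sum (largest * n_numbers), which keeps every input and output in the range (0, 1]. That scaling factor is an assumption made for illustration.

```python
from random import randint

def random_sum_pairs(n_examples, n_numbers, largest):
    """Return (X, y): scaled input terms and their scaled sums."""
    # assumption: scale by the largest sum that can occur
    scale = float(largest * n_numbers)
    X, y = [], []
    for _ in range(n_examples):
        terms = [randint(1, largest) for _ in range(n_numbers)]
        X.append([t / scale for t in terms])
        y.append(sum(terms) / scale)
    return X, y

X, y = random_sum_pairs(3, 2, 100)
for inputs, target in zip(X, y):
    print(inputs, target)
```

Before feeding X to an LSTM, it would still need to be converted to an array and reshaped to [samples, timesteps, features].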

We may want to invert the rescaling of numbers later. This will be useful to compare predictions to expected values and get an idea of an error score in the same units as the original data.

The invert() function below inverts the normalization of predicted and expected values passed in.
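As a sketch, and assuming the same hypothetical scaling factor of largest * n_numbers used for normalization, the inversion could multiply the value back up and round to the nearest integer:

```python
def invert(value, n_numbers, largest):
    """Map a normalized value back to the original integer scale."""
    # assumption: values were divided by (largest * n_numbers) when scaled
    return round(value * float(largest * n_numbers))

# 61 / 200 = 0.305, so inverting 0.305 should recover 61
print(invert(0.305, n_numbers=2, largest=100))
```

Rounding matters here because a network's real-valued prediction will rarely land exactly on an integer.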

Configure LSTM

We can now define an LSTM to model this problem.

It’s a relatively simple problem, so the model does not need to be very large. The input layer will expect 1 input feature and 2 time steps (in the case of adding two numbers).

Two hidden LSTM layers are defined, the first with 6 units and the second with 2 units, followed by a fully connected output layer that returns a single sum value.

The efficient ADAM optimization algorithm is used to fit the model along with the mean squared error loss function given the real valued output of the network.

The network is fit for 50 epochs. New examples are generated each epoch, and weight updates are performed after every 2 examples.
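The configuration described above might be sketched as follows. The function name build_mapping_lstm() is hypothetical, and the Keras imports are deferred into the function body only so the definition can be read and loaded without Keras installed. The layer sizes, optimizer, and loss follow the description in this section.

```python
def build_mapping_lstm(n_numbers=2):
    """Stacked LSTM that maps n_numbers scaled inputs to one scaled sum."""
    # deferred imports: Keras is only required when the function is called
    from keras.models import Sequential
    from keras.layers import LSTM, Dense

    model = Sequential()
    # 1 feature per time step, n_numbers time steps; return the full
    # sequence so the second LSTM layer receives 3D input
    model.add(LSTM(6, input_shape=(n_numbers, 1), return_sequences=True))
    model.add(LSTM(2))
    # single real-valued output: the (scaled) sum
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
```

To train as described, one would loop 50 times, generate a fresh batch of examples each time, reshape X to [samples, n_numbers, 1], and call model.fit(X, y, epochs=1, batch_size=2).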

LSTM Evaluation

We evaluate the network on 100 new patterns.

These are generated and a sum value is predicted for each. Both the actual and predicted sum values are rescaled to the original range, and a Root Mean Squared Error (RMSE) score is calculated in the same units as the original values.

Finally, 20 examples of expected and predicted values are listed as examples.
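For reference, the RMSE calculation can be sketched in a few lines of plain Python. The expected and predicted values below are made up for illustration and are not results from the model.

```python
from math import sqrt

def rmse(expected, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(e - p) ** 2 for e, p in zip(expected, predicted)]
    return sqrt(sum(squared_errors) / float(len(squared_errors)))

# hypothetical values, already inverted back to the original scale
expected = [61, 150, 87]
predicted = [60, 150, 89]
print('RMSE: %f' % rmse(expected, predicted))

# list expected and predicted values side by side
for e, p in zip(expected, predicted):
    print('Expected=%d, Predicted=%d' % (e, p))
```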

Complete Example

We can tie this all together. The complete code example is listed below.

Running the example prints some loss information each epoch and finishes by printing the RMSE for the run and some example outputs.

The results are not perfect, but many examples are predicted correctly.

Your specific outputs may differ given the stochastic nature of neural networks.

The Beginner’s Mistake

All done, right?


The problem we have solved had multiple inputs but was technically not a sequence prediction problem.

In fact, you can just as easily solve it using a multilayer Perceptron (MLP). For example:
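One possible MLP for this framing is sketched below. The function name and the hidden layer sizes (4 and 2 units) are assumptions chosen for illustration, and the Keras imports are deferred so the definition loads without Keras installed.

```python
def build_mapping_mlp(n_numbers=2):
    """Small MLP that maps n_numbers scaled inputs to one scaled sum."""
    # deferred imports: Keras is only required when the function is called
    from keras.models import Sequential
    from keras.layers import Dense

    model = Sequential()
    # inputs are a flat vector of terms; no time dimension is needed
    model.add(Dense(4, input_shape=(n_numbers,)))
    model.add(Dense(2))
    model.add(Dense(1))
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model
```

With the default linear activations, the whole network is a linear function of its inputs, which is all that addition requires.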

Running the example solves the problem perfectly, and in fewer epochs.

The issue is that we encoded so much of the domain into the problem that it turned the problem from a sequence prediction problem into a function mapping problem.

That is, the order of the input no longer matters. We could shuffle it up any way we want and still learn the problem.

MLPs are designed to learn mapping functions and can easily nail the problem of learning how to add numbers.

On one hand, this is a better way to approach the specific problem of adding numbers because the model is simpler and the results are better. On the other, it is a terrible use of recurrent neural networks.

This is a beginner’s mistake I see replicated in many “introduction to LSTMs” posts around the web.

Addition as a Sequence Prediction Problem

There is another way to frame addition that makes it an unambiguous sequence prediction problem, and in turn makes it much harder to solve.

We can frame addition as an input and output string of characters and let the model figure out the meaning of the characters. The entire addition problem can be framed as a string of characters, such as “12+50” with the output “62”, or more specifically:

  • Input: [‘1’, ‘2’, ‘+’, ‘5’, ‘0’]
  • Output: [‘6’, ‘2’]

The model must learn not only the integer nature of the characters, but also the nature of the mathematical operation to perform.

Notice how sequence is now important, and that randomly shuffling the input will create a nonsense sequence that could not be related to the output sequence.

Also notice how the problem has transformed to have both an input and an output sequence. This is called a sequence-to-sequence prediction problem, or a seq2seq problem.

We can keep things simple with addition of two numbers, but we can see how this may be scaled to a variable number of terms and mathematical operations that could be given as input for the model to learn and generalize.

Note that this formulation and the rest of this example are inspired by the addition seq2seq example in the Keras project, although I re-developed it from the ground up.

Data Generation

Data generation for the seq2seq definition of the problem is a lot more involved.

We will develop each piece as a standalone function so you can play with them and understand how they work. Hang in there.

The first step is to generate sequences of random integers and their sum, as before, but with no normalization. We can put this in a function named random_sum_pairs(), as follows.
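A minimal sketch of this function, using only the standard library, might look like the following. The fixed seed is only there to make repeated runs print the same pair.

```python
from random import seed, randint

def random_sum_pairs(n_examples, n_numbers, largest):
    """Return (X, y): lists of integer terms and their integer sums."""
    X, y = [], []
    for _ in range(n_examples):
        in_pattern = [randint(1, largest) for _ in range(n_numbers)]
        X.append(in_pattern)
        y.append(sum(in_pattern))
    return X, y

seed(1)  # illustrative
n_samples, n_numbers, largest = 1, 2, 10
X, y = random_sum_pairs(n_samples, n_numbers, largest)
print(X, y)
```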

Running just this function prints a single example of adding two random integers between 1 and 10.

The next step is to convert the integers to strings. The input string will have the format ‘99+99’ and the output string will have the format ‘99’.

Key to this function is the padding of numbers to ensure that each input and output sequence has the same number of characters. The padding character should be different from the data so the model can learn to ignore it. In this case, we use the space character (‘ ’) for padding and pad the string on the left, keeping the information on the far right.

There are other ways to pad, such as padding each term individually. Try it and see if it results in better performance. Report your results in the comments below.

Padding requires that we know how long the longest sequence may be. We can calculate this by taking the log10() of the largest integer we can generate and the ceiling of that number, which gives the number of chars needed for each number. We add 1 to the largest number so that we expect 3 chars instead of 2 when the largest number is an exact power of ten, like 100. We then need to add the right number of plus symbols.

A similar process is repeated on the output sequence, without the plus symbols of course.

The example below adds the to_string() function and demonstrates its usage with a single input/output pair.
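A sketch of to_string() under the padding scheme just described is given below. The length formulas follow the text (digits per term, plus one char per plus symbol). The explicit int() conversion is an added safeguard, since math.ceil() returns a float in Python 2 and sequence lengths must be integers.

```python
from math import ceil, log10

def to_string(X, y, n_numbers, largest):
    """Convert integer patterns to left-padded strings like ' 3+10' and '13'."""
    # chars per term, times the number of terms, plus the '+' separators
    max_in = int(n_numbers * ceil(log10(largest + 1)) + n_numbers - 1)
    # chars needed for the largest possible sum
    max_out = int(ceil(log10(n_numbers * (largest + 1))))
    X_str = ['+'.join(str(n) for n in pattern).rjust(max_in) for pattern in X]
    y_str = [str(total).rjust(max_out) for total in y]
    return X_str, y_str

X, y = [[3, 10]], [13]  # a hand-picked pair for illustration
print(X, y)
X_str, y_str = to_string(X, y, n_numbers=2, largest=10)
print(X_str, y_str)
```

With two terms and a largest value of 10, every input string is 5 chars long and every output string is 2 chars long.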

Running this example first prints the integer sequence and the padded string representation of the same sequence.

Next, we need to encode each character in the string as an integer value. We have to work with numbers in neural networks after all, not characters.

Integer encoding transforms the problem into a classification problem, where each item in the output sequence may be considered a class output with 12 possible values, one for each symbol in the alphabet. The first 10 class values just so happen to be integers with an ordinal relationship.

To perform this encoding, we must define the full alphabet of symbols that may appear in the string encoding, as follows:
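One alphabet consistent with the encodings described in this section (each digit maps to itself and the space maps to 11) is sketched below. The exact ordering, with the plus symbol at offset 10, is inferred from those values rather than quoted from the original listing.

```python
# ten digits, the plus symbol, and the space used for padding: 12 symbols
alphabet = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']
print(len(alphabet))
```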

Integer encoding then becomes a simple process of building a lookup table of character to integer offset and converting each char of each string, one by one.

The example below provides the integer_encode() function for integer encoding and demonstrates how to use it.
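A possible implementation is sketched here, assuming the 12-symbol alphabet of digits, plus symbol, and space in that order. The input strings are hand-written examples of padded patterns.

```python
ALPHABET = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '+', ' ']

def integer_encode(X, y, alphabet):
    """Replace each char of each string with its offset in the alphabet."""
    char_to_int = {c: i for i, c in enumerate(alphabet)}  # lookup table
    X_enc = [[char_to_int[ch] for ch in pattern] for pattern in X]
    y_enc = [[char_to_int[ch] for ch in pattern] for pattern in y]
    return X_enc, y_enc

X_str, y_str = [' 3+10'], ['13']
X_enc, y_enc = integer_encode(X_str, y_str, ALPHABET)
print(X_enc, y_enc)
```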

Running the example prints the integer encoded version of each string encoded pattern.

We can see that the space character (‘ ’) was encoded with 11 and the three character (‘3’) was encoded as 3, and so on.

The next step is to binary encode the integer encoding sequences.

This involves converting each integer to a binary vector with the same length as the alphabet and marking the specific integer with a 1.

For example, a 0 integer represents the ‘0’ character and would be encoded as a binary vector with a 1 in the 0th position of a 12-element vector: [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

The example below defines the one_hot_encode() function for binary encoding and demonstrates how to use it.
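A plain-Python sketch of one_hot_encode() follows. It takes the integer-encoded patterns and the alphabet size (max_int, assumed to be 12 here) and is demonstrated on a hand-written encoding of ‘ 3+10’ and ‘13’.

```python
def one_hot_encode(X, y, max_int):
    """Turn each integer into a binary vector of length max_int."""
    def encode(seq):
        vectors = []
        for index in seq:
            vector = [0] * max_int
            vector[index] = 1  # mark the position of this symbol
            vectors.append(vector)
        return vectors
    return [encode(p) for p in X], [encode(p) for p in y]

X_enc, y_enc = [[11, 3, 10, 1, 0]], [[1, 3]]
X_hot, y_hot = one_hot_encode(X_enc, y_enc, max_int=12)
for vector in X_hot[0]:
    print(vector)
print()
for vector in y_hot[0]:
    print(vector)
```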

Running the example prints the binary encoded sequence for each integer encoding.

I’ve added some new lines to make the input and output binary encodings clearer.

You can see that a single sum pattern becomes a sequence of 5 binary encoded vectors, each with 12 elements. The output or sum becomes a sequence of 2 binary encoded vectors, again each with 12 elements.

We can tie all of these steps together into a function called generate_data(), listed below.
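The sketch below folds the steps into a single self-contained function rather than calling the separate helpers, purely to keep the example short. It assumes the 12-symbol alphabet and returns nested lists.

```python
from math import ceil, log10
from random import randint

ALPHABET = list('0123456789+ ')  # digits, plus symbol, padding space

def generate_data(n_samples, n_numbers, largest, alphabet=ALPHABET):
    """Generate one-hot encoded input and output sequences for addition."""
    char_to_int = {c: i for i, c in enumerate(alphabet)}
    in_len = int(n_numbers * ceil(log10(largest + 1)) + n_numbers - 1)
    out_len = int(ceil(log10(n_numbers * (largest + 1))))

    def one_hot(text):
        rows = []
        for ch in text:
            row = [0] * len(alphabet)
            row[char_to_int[ch]] = 1
            rows.append(row)
        return rows

    X, y = [], []
    for _ in range(n_samples):
        # random terms -> padded strings -> one-hot vectors
        terms = [randint(1, largest) for _ in range(n_numbers)]
        X.append(one_hot('+'.join(map(str, terms)).rjust(in_len)))
        y.append(one_hot(str(sum(terms)).rjust(out_len)))
    return X, y

X, y = generate_data(n_samples=1, n_numbers=2, largest=10)
print(len(X[0]), len(X[0][0]), len(y[0]), len(y[0][0]))
```

Before fitting a Keras model, the nested lists would be wrapped in NumPy arrays, giving X the shape [n_samples, 5, 12] and y the shape [n_samples, 2, 12] for this configuration.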

Finally, we need to invert the encoding to convert the output vectors back into numbers so we can compare expected output integers to predicted integers.

The invert() function below performs this operation. Key is first converting the binary encoding back into an integer using the argmax() function, then converting the integer back into a character using a reverse mapping of the integers to chars from the alphabet.
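A dependency-free sketch of this inversion is shown below. The text refers to argmax(), and here the same operation is written with max() over the vector positions so that it works on plain lists, including the probability vectors a softmax layer would output. The example vectors are made up for illustration.

```python
ALPHABET = list('0123456789+ ')  # assumed 12-symbol alphabet

def invert(seq, alphabet):
    """Decode a sequence of one-hot (or probability) vectors to a string."""
    int_to_char = dict(enumerate(alphabet))  # reverse lookup table
    chars = []
    for vector in seq:
        # argmax: position of the largest value in the vector
        index = max(range(len(vector)), key=lambda i: vector[i])
        chars.append(int_to_char[index])
    return ''.join(chars)

# hypothetical softmax-like output for the two chars '1' and '3'
predicted = [
    [0.0, 0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
]
print(invert(predicted, ALPHABET))
```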

We now have everything we need to prepare data for this example.

Note, these functions were written for this post and I did not write any unit tests nor battle test them with all kinds of inputs. If you see or find an obvious bug, please let me know in the comments below.

Configure and Fit a seq2seq LSTM Model

We can now fit an LSTM model to this problem.

We can think of the model as being comprised of two key parts: the encoder and the decoder.

First, the input sequence is shown to the network one encoded character at a time. We need an encoding layer to learn the relationship between the steps in the input sequence and develop an internal representation of these relationships.

The input to the network (for the two number case) is a series of 5 encoded characters (2 for each integer and one for the ‘+’) where each vector contains 12 features for the 12 possible characters that each item in the sequence may be.

The encoder will use a single LSTM hidden layer with 100 units.

The decoder must transform the learned internal representation of the input sequence into the correct output sequence. For this, we will use a hidden layer LSTM with 50 units, followed by an output layer.

The problem is defined as requiring two binary output vectors for the two output characters. We will use the same fully connected layer (Dense) to output each binary vector. To use the same layer twice, we will wrap it in a TimeDistributed() wrapper layer.

The output fully connected layer will use a softmax activation function to output values in the range [0,1].

There’s a problem though.

We must connect the encoder to the decoder and they do not fit.

That is, the encoder will produce a 2-dimensional output of [samples, features]: a single vector of 100 values for each input sequence of 5 vectors. The decoder is an LSTM layer that expects a 3D input of [samples, timesteps, features] in order to produce, for each sample, a decoded sequence of 2 timesteps each with 12 features.

If you try to force these pieces together, you get an error like:

Exactly as we would expect.

We can solve this using a RepeatVector layer. This layer simply repeats the provided 2D input n-times to create a 3D output.

The RepeatVector layer can be used like an adapter to fit the encoder and decoder parts of the network together. We can configure the RepeatVector to repeat the input 2 times. This creates a 3D output comprised of two copies of the encoder’s output vector, which we can decode two times using the same fully connected layer for each of the two desired output vectors.

The problem is framed as a classification problem with 12 classes, therefore we can optimize the log loss (categorical_crossentropy) function and even track accuracy as well as loss on each training epoch.

Putting this together, we have:
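The architecture described in this section could be sketched as below. The function name build_seq2seq_model() and its default arguments are illustrative, the ADAM optimizer is carried over from the earlier section as an assumption, and the Keras imports are deferred so the definition loads without Keras installed.

```python
def build_seq2seq_model(n_in_seq_length=5, n_out_seq_length=2, n_chars=12):
    """Encoder-decoder LSTM for character-level addition."""
    # deferred imports: Keras is only required when the function is called
    from keras.models import Sequential
    from keras.layers import LSTM, Dense, RepeatVector, TimeDistributed

    model = Sequential()
    # encoder: reads the whole input and emits one 100-value vector
    model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))
    # adapter: repeat that vector once per output character (2D -> 3D)
    model.add(RepeatVector(n_out_seq_length))
    # decoder: emits one hidden vector per output time step
    model.add(LSTM(50, return_sequences=True))
    # the same softmax Dense layer is applied at every output time step
    model.add(TimeDistributed(Dense(n_chars, activation='softmax')))
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    return model
```

Calling model.summary() on the result is a quick way to confirm that the output shape is [samples, 2, 12] for the default arguments.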

Why Use a RepeatVector Layer?

Why not return the sequence output from the encoder as input for the decoder?

That is, one output for each LSTM unit at each input sequence time step rather than one output for each LSTM unit for the whole input sequence.

An output for each step of the input sequence gives the decoder access to the intermediate representation of the input sequence each step. This may or may not be useful. Providing the final LSTM output at the end of the input sequence may be more logical as it captures information about the entire input sequence, ready to map to or calculate an output.

Also, this leaves nothing in the network to specify the size of the decoder other than the input, giving one output value for each timestep of the input sequence (5 instead of 2).

You could reframe the output to be a sequence of 5 characters padded with whitespace. The network would be doing more work than is required and may lose some of the compression type capability provided by the encoder-decoder paradigm. Try it and see.

The issue titled “is the Sequence to Sequence learning right?” on the Keras GitHub project provides some good discussions of alternate representations you could play with.

Evaluate LSTM Model

As before, we can generate a new batch of examples and evaluate the algorithm after it has been fit.

We could calculate an RMSE score on the prediction, although I have left it out for simplicity here.

Complete Example

Putting it all together, the complete example is listed below.

Running the example fits the problem nearly perfectly. In fact, running for more epochs or performing weight updates after every example (batch_size=1) will get you there, but will take 10 times longer to train.

We can see that the predicted outcome matches the expected outcome on the first 20 examples we look at.


Extensions

This section lists some natural extensions to this tutorial that you may wish to explore.

  • Integer Encoding. Explore whether the model can learn the problem better using an integer encoding alone. The ordinal relationship between most of the inputs may prove very useful.
  • Variable Numbers. Change the example to support a variable number of terms on each input sequence. This should be straightforward as long as you perform sufficient padding.
  • Variable Mathematical Operations. Change the example to vary the mathematical operation to allow the network to generalize even further.
  • Brackets. Allow the use of brackets along with other mathematical operations.

Did you try any of these extensions?
Share your findings in the comments; I’d love to see what you found.


Summary

In this tutorial, you discovered how to develop an LSTM network to learn how to add random integers together using the seq2seq stacked LSTM paradigm.

Specifically, you learned:

  • How to learn the naive mapping function of input terms to output terms for addition.
  • How to frame the addition problem (and similar problems) and suitably encode inputs and outputs.
  • How to address the true sequence-prediction addition problem using the seq2seq paradigm.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

28 Responses to How to Learn to Add Numbers with seq2seq Recurrent Neural Networks

  1. xi May 20, 2017 at 6:22 pm #

    LOVE U

  2. Chauhan Hardik May 20, 2017 at 9:56 pm #

    How can we write code for decoder network whose input is encoder’s memory plus previous time step’s output ?

    • Jason Brownlee May 21, 2017 at 5:59 am #

      Good question, I have not done this myself yet. It may require a careful network design.

  3. Giuseppe Bonaccorso May 21, 2017 at 1:05 am #

    very interesting article. I’ve written an extension for coping with more complex expressions (it’s available on this GIST:

    Unfortunately, there are still many errors, but it’s probably due to the size of the training dataset (which doesn’t contain all possible examples). That’s probably the hardest part of Seq2Seq, I mean creating a model which can also learn semantics, so that can be easily trained with fewer examples and with always better performances.

    I’m continuing my experiments!

  4. max ales May 21, 2017 at 3:27 am #

    I’m sorry but i get this error. What’s wrong?

    Using Theano backend.
    Traceback (most recent call last):
    File “”, line 110, in
    model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))
    File “/usr/local/lib/python2.7/dist-packages/keras/”, line 430, in add
    File “/usr/local/lib/python2.7/dist-packages/keras/layers/”, line 257, in __call__
    return super(Recurrent, self).__call__(inputs, **kwargs)
    File “/usr/local/lib/python2.7/dist-packages/keras/engine/”, line 578, in __call__
    output =, **kwargs)
    File “/usr/local/lib/python2.7/dist-packages/keras/layers/”, line 295, in call
    preprocessed_input = self.preprocess_input(inputs, training=None)
    File “/usr/local/lib/python2.7/dist-packages/keras/layers/”, line 1028, in preprocess_input
    timesteps, training=training)
    File “/usr/local/lib/python2.7/dist-packages/keras/layers/”, line 58, in _time_distributed_dense
    x = K.reshape(x, (-1, timesteps, output_dim))
    File “/usr/local/lib/python2.7/dist-packages/keras/backend/”, line 739, in reshape
    y = T.reshape(x, shape)
    File “/usr/local/lib/python2.7/dist-packages/Theano-0.9.0-py2.7.egg/theano/tensor/”, line 4910, in reshape
    rval = op(x, newshape)
    File “/usr/local/lib/python2.7/dist-packages/Theano-0.9.0-py2.7.egg/theano/gof/”, line 615, in __call__
    node = self.make_node(*inputs, **kwargs)
    File “/usr/local/lib/python2.7/dist-packages/Theano-0.9.0-py2.7.egg/theano/tensor/”, line 4748, in make_node
    raise TypeError(“Shape must be integers”, shp, shp.dtype)
    TypeError: (‘Shape must be integers’, TensorConstant{[ -1. 5. 100.]}, ‘float64’)

    • Jason Brownlee May 21, 2017 at 6:02 am #

      Ensure you have copied the example exactly without any extra white space.

      Also ensure you are using Keras 2.0 or higher.

      • giorgio borghi May 21, 2017 at 9:27 pm #

        thanx for your help, but
        i wrote carefully

        line model.add(LSTM(100, input_shape=(n_in_seq_length, n_chars)))

        raise TypeError(“Shape must be integers”, shp, shp.dtype)
        TypeError: (‘Shape must be integers’, TensorConstant{[ -1. 5. 100.]}, ‘float64’)

        • Jason Brownlee May 22, 2017 at 7:53 am #

          I have not seen this error before, sorry.

          Do any other examples work on your system?

          • Giorgio borghi May 23, 2017 at 9:57 pm #

            Yes . Keras is 2.0.4 and your post is too important and awesome ! Great job !

          • RealUser404 June 25, 2017 at 9:22 am #

            I got almost the same error when just copy/pasting the code :
            TypeError: Value passed to parameter ‘shape’ has DataType float32 not in list of allowed values: int32, int64

          • Jason Brownlee June 26, 2017 at 6:03 am #

            Sorry, I have not seen this error. Ensure you have the latest version of all the libraries installed.

            See if you can pull the example apart and narrow down the line that fails. It looks like numpy and the .shape property. Perhaps your numpy is not up to date. shape always returns an int. No idea what this error could be.

          • RealUser404 June 26, 2017 at 9:54 am #

            Well it was due to the n_out_seq_length = ceil(log10(n_numbers * (largest+1))) lines.

            I do not know why it does that, as ceil is supposed to return an integer.
            If I change manually the value to n_out_seq_length = 2, then it works correctly.

          • Jason Brownlee June 27, 2017 at 8:20 am #


  5. Birkey May 21, 2017 at 12:20 pm #

    Nice work Jason!
    For seq2seq problem, RNN is the choice by default. But it seems quite subtle to choose between MLP vs RNN for seq2vec problem. In the case discussed, if we “encode too much domain knowledge into the problem” we turned it into a function mapping problem, here by “domain knowledge ” we means “split addition equation into added and addon” . But this kind of “Data preprocessing” is not uncommon for input sequence.

    Q: do you have any principles or guidelines for model choosing for seq2vec problem?

    BTW: I have read the keras issue you recommend. Maybe there’s answer

    • Jason Brownlee May 22, 2017 at 7:52 am #

      Great question.

      It really comes down to what performs best on your problem. You may want a seq2seq formulation because of the elegance, but a mapping-based solution (seq2vec) may just perform better.

      In practice, I would try a suite of methods and double down on what works best.

  6. Kunpeng Zhang May 30, 2017 at 11:47 pm #

    Can seq2seq be used for time series prediction? Given that you have delivered lots of wonderful posts on time series, what if a traditional time series prediction problem were translated to a seq2seq formulation?
    Could you give me some advice?

    • Jason Brownlee June 2, 2017 at 12:37 pm #

      Yes, if you have a different number of output time steps for each one input time step.

      • Kunpeng Zhang June 6, 2017 at 1:24 am #

        Thank you for your response.

  7. Mehrdad May 31, 2017 at 3:00 am #

    Hi Jason. Thanks for your great article.
    Something is not clear for me. LSTMs are capable of learning series (sequences) with different length. Why we should use padding to have sequences with same length?

    • Jason Brownlee June 2, 2017 at 12:39 pm #

      Here, the difference is between the lengths of input and output sequences, not differences in the lengths of input sequences themselves.

      • mehrdadscomputer June 5, 2017 at 10:07 pm #

        Thanks Jason.

        I did some searches and it seems that after padding, it’s essential to use masking or initializing sample_weight to say model to ignore white spaces. It doesn’t learn to automatically ignore padded characters (I had a wrong assumption about it)

        I did masking for this problem and the result is as follow:

        The result without masking for 10 epochs is: loss: 0.6787 – acc: 0.8185
        The result with masking for 10 epochs is: loss: 0.6202 – acc: 0.8705

        As you see, there is too much improvement. But may please say how we can use apply sample_weight to this problem?

        The improvement achieved by changing this part of code:

        model = Sequential()
        model.add(Masking(mask_value = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1], input_shape=(n_in_seq_length, n_chars)))

        Note: remember to add:
        from keras.layers import Masking

        • Jason Brownlee June 6, 2017 at 9:35 am #

          Great work! Thanks for reporting your results.

          I have a post on masking scheduled that should provide further help.

          What do you mean sample weight? Weighting input samples? I guess neural nets do this already. You could also resample your training data to affect the sampling distribution of different classes.

          • mehrdad12 June 8, 2017 at 6:47 pm #

            Thanks Jason.
            I run my model many times and it seems that the improvement is because of randomness and not masking, unfortunately.

            I am really excited about your new post on masking.

            Yes, I mean weighting input samples.

            I asked about this in Keras’ slack channel and some guys told me that neural network is not capable of learning to ignore padded part of time series and I need to do it by weighting input samples

          • Jason Brownlee June 9, 2017 at 6:22 am #

            Perhaps you could ask what this person meant.

            Off the cuff, I cannot see how weighting inputs would help with masking.

  8. RealUser404 June 25, 2017 at 10:17 am #

    This was a great tutorial, thanks a lot!

    I have a question concerning the size of the LSTM layers. You chose to use a LSTM layer of 100 neurons for the encoder, and 50 for the decoder. How did you choose those values? Are these somehow related to the size of the input and size of the output? Does that mean if I use an input twice as big (numbers of 4 digits instead of 2) I should be doubling the number of neurons too?

    Thank you in advance!

    • Jason Brownlee June 26, 2017 at 6:04 am #

      No, just a little trial and error.

      Experiment and see how different values affect model skill.
