Last Updated on August 7, 2022

A powerful and popular recurrent neural network is the long short-term memory network, or LSTM.

It is widely used because the architecture overcomes the vanishing and exploding gradient problems that plague recurrent neural networks, allowing very large and very deep networks to be created.

Like other recurrent neural networks, LSTM networks maintain state, and the specifics of how this is implemented in the Keras framework can be confusing.

In this post, you will discover exactly how state is maintained in LSTM networks by the Keras deep learning library.

After reading this post, you will know:

- How to develop a naive LSTM network for a sequence prediction problem
- How to carefully manage state through batches and features with an LSTM network
- How to manually manage state in an LSTM network for stateful prediction

Let’s get started.

- **Jul/2016**: First published
- **Update Mar/2017**: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0
- **Update Aug/2018**: Updated examples for Python 3, updated stateful example to get 100% accuracy
- **Update Mar/2019**: Fixed typo in the stateful example
- **Update Jul/2022**: Updated for TensorFlow 2.x API

## Problem Description: Learn the Alphabet

In this tutorial, you will develop and contrast a number of different LSTM recurrent neural network models.

The context of these comparisons will be a simple sequence prediction problem of learning the alphabet. That is, given a letter of the alphabet, it will predict the next letter of the alphabet.

This is a simple sequence prediction problem that, once understood, can be generalized to other sequence prediction problems like time series prediction and sequence classification.

Let’s prepare the problem with some Python code you can reuse from example to example.

First, let’s import all of the classes and functions you will use in this tutorial.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.utils import to_categorical
```

Next, you can seed the random number generator to ensure that the results are the same each time the code is executed.

```python
# fix random seed for reproducibility
tf.random.set_seed(7)
```

You can now define your dataset, the alphabet. You define the alphabet in uppercase characters for readability.

Neural networks model numbers, so you need to map the letters of the alphabet to integer values. You can do this easily by creating a dictionary (map) from each character to its index in the alphabet. You can also create a reverse lookup, from index to character, for converting predictions back into characters later.

```python
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
```
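As a quick sanity check, the two dictionaries should invert each other. The sketch below rebuilds them and round-trips a short string; the `encode` and `decode` helpers are hypothetical names introduced here for illustration and are not part of the tutorial code.

```python
# Rebuild the two lookup tables from the tutorial
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

def encode(text):
    """Hypothetical helper: map each character to its integer index."""
    return [char_to_int[c] for c in text]

def decode(indices):
    """Hypothetical helper: map integer indices back to characters."""
    return "".join(int_to_char[i] for i in indices)

encoded = encode("LSTM")
print(encoded)          # [11, 18, 19, 12]
print(decode(encoded))  # LSTM
```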

Now, you need to create the input and output pairs on which to train your neural network. You can do this by defining an input sequence length, then reading sequences from the input alphabet sequence.

For example, use an input length of 1. Starting at the beginning of the raw input data, you can read off the first letter “A” and the next letter as the prediction “B.” You move along one character and repeat until you reach a prediction of “Z.”

```python
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)
```

The code also prints each input-output pair as a sanity check.

Running the code to this point will produce the following output, summarizing input sequences of length 1 and a single output character.

```
A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z
```

You need to reshape the NumPy array into a format expected by the LSTM networks, specifically [*samples, time steps, features*].

```python
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (len(dataX), seq_length, 1))
```
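To make the three dimensions concrete, here is a minimal NumPy-only sketch using just the first three input patterns. The small `data_x` list is an illustrative stand-in for `dataX`.

```python
import numpy as np

# Stand-in for the first three patterns in dataX: A, B, C encoded as integers
data_x = [[0], [1], [2]]
seq_length = 1

# [samples, time steps, features]
X = np.reshape(data_x, (len(data_x), seq_length, 1))
print(X.shape)     # (3, 1, 1)
print(X[1, 0, 0])  # 1 (sample 1, time step 0, feature 0)
```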

Once reshaped, you can then normalize the input integers to the range 0-to-1, the range of the sigmoid activation functions used by the LSTM network.

```python
# normalize
X = X / float(len(alphabet))
```
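The arithmetic can be checked in isolation. Because the inputs are the letters A to Y (codes 0 to 24) and the divisor is 26, the scaled values should span 0 to 24/26. A small NumPy sketch, with `raw`, `scaled`, and `recovered` as illustrative names:

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# Inputs are the letters A..Y, encoded as the integers 0..24
raw = np.arange(len(alphabet) - 1)
scaled = raw / float(len(alphabet))

print(scaled.min())            # 0.0
print(round(scaled.max(), 4))  # 0.9231

# Undo the scaling to recover the original integer codes
recovered = np.rint(scaled * len(alphabet)).astype(int)
```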

Finally, you can think of this problem as a sequence classification task, where each of the 26 letters represents a different class. As such, you can convert the output (y) to a one-hot encoding using the Keras built-in function **to_categorical()**.

```python
# one hot encode the output variable
y = to_categorical(dataY)
```
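For intuition, a one-hot encoding can be sketched in plain NumPy by indexing rows of an identity matrix. This is an illustrative equivalent for this small case, not how Keras implements `to_categorical()`. The `labels` array is a stand-in for `dataY`, and `num_classes` is set explicitly here as an assumption.

```python
import numpy as np

# Stand-in for dataY: the integer codes for B, C, and Z
labels = np.array([1, 2, 25])
num_classes = 26

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]
print(one_hot.shape)           # (3, 26)
print(one_hot[0, :4])          # [0. 1. 0. 0.]
print(one_hot.argmax(axis=1))  # [ 1  2 25]
```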

You are now ready to fit different LSTM models.

## Naive LSTM for Learning One-Char to One-Char Mapping

Let’s start by designing a simple LSTM to learn how to predict the next character in the alphabet, given the context of just one character.

You will frame the problem as a random collection of one-letter input to one-letter output pairs. As you will see, this is a problematic framing of the problem for the LSTM to learn.

Let’s define an LSTM network with 32 units and an output layer with a softmax activation function for making predictions. Because this is a multi-class classification problem, you can use the log loss function (called “**categorical_crossentropy**” in Keras) and optimize the network using the Adam optimization algorithm.

The model is fit over 500 epochs with a batch size of 1.

```python
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)
```

After you fit the model, you can evaluate and summarize its performance on the entire training dataset.

```python
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
```

You can then re-run the training data through the network and generate predictions, converting both the input and output pairs back into their original character format to get a visual idea of how well the network learned the problem.

```python
# demonstrate some model predictions
for pattern in dataX:
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)
```
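The decoding step can be illustrated without a trained model. In this sketch, `prediction` is a made-up probability vector standing in for the output of `model.predict()`, shaped (1, 26) like a single-sample softmax output.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
int_to_char = dict((i, c) for i, c in enumerate(alphabet))

# Made-up stand-in for model.predict(x): most of the mass on class 3 ("D")
prediction = np.full((1, len(alphabet)), 0.01)
prediction[0, 3] = 0.75

# argmax over the flattened array gives the index of the most likely class
index = int(np.argmax(prediction))
result = int_to_char[index]
print(index, result)  # 3 D
```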

The entire code listing is provided below for completeness.

```python
# Naive LSTM to learn one-char to one-char mapping
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.utils import to_categorical
# fix random seed for reproducibility
tf.random.set_seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (len(dataX), seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = to_categorical(dataY)
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
# demonstrate some model predictions
for pattern in dataX:
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example produces the following output.

```
Model Accuracy: 84.00%
['A'] -> B
['B'] -> C
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> W
['V'] -> Y
['W'] -> Z
['X'] -> Z
['Y'] -> Z
```

You can see that this problem is indeed difficult for the network to learn.

The reason is that the poor LSTM units do not have any context to work with. Each input-output pattern is shown to the network in a random order, and the state of the network is reset after each pattern (each batch where each batch contains one pattern).

This is an abuse of the LSTM network architecture, treating it like a standard multilayer perceptron.

Next, let’s try a different framing of the problem to provide more sequence to the network from which to learn.

## Naive LSTM for a Three-Char Feature Window to One-Char Mapping

A popular approach to adding more context to data for multilayer perceptrons is to use the window method.

This is where previous steps in the sequence are provided as additional input features to the network. You can try the same trick to provide more context to the LSTM network.

Here, you will increase the sequence length from 1 to 3, for example:

```python
# prepare the dataset of input to output pairs encoded as integers
seq_length = 3
```

This creates training patterns like this:

```
ABC -> D
BCD -> E
CDE -> F
```

Each element in the sequence is then provided as a new input feature to the network. This requires a modification of how the input sequences are reshaped in the data preparation step:

```python
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (len(dataX), 1, seq_length))
```

It also requires modifying how the sample patterns are reshaped when demonstrating predictions from the model.

```python
x = np.reshape(pattern, (1, 1, len(pattern)))
```

The entire code listing is provided below for completeness.

```python
# Naive LSTM to learn three-char window to one-char mapping
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.utils import to_categorical
# fix random seed for reproducibility
tf.random.set_seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# prepare the dataset of input to output pairs encoded as integers
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (len(dataX), 1, seq_length))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = to_categorical(dataY)
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
# demonstrate some model predictions
for pattern in dataX:
    x = np.reshape(pattern, (1, 1, len(pattern)))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example provides the following output.

```
Model Accuracy: 86.96%
['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> V
['T', 'U', 'V'] -> Y
['U', 'V', 'W'] -> Z
['V', 'W', 'X'] -> Z
['W', 'X', 'Y'] -> Z
```

You can see a slight lift in performance that may or may not be real. This is a simple problem that the LSTM was still not able to learn, even with the window method.

Again, this is a misuse of the LSTM network by a poor framing of the problem. Indeed, the sequences of letters are time steps of one feature rather than one time step of separate features. You have given more context to the network but not more sequence as expected.

In the next section, you will give more context to the network in the form of time steps.

## Naive LSTM for a Three-Char Time Step Window to One-Char Mapping

In Keras, the intended use of LSTMs is to provide context in the form of time steps, rather than windowed features like with other network types.

You can take your first example and simply change the sequence length from 1 to 3.

```python
seq_length = 3
```

Again, this creates input-output pairs that look like this:

```
ABC -> D
BCD -> E
CDE -> F
DEF -> G
```

The difference is that the reshaping of the input data takes the sequence as a time step sequence of one feature rather than a single time step of multiple features.

```python
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (len(dataX), seq_length, 1))
```
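The two framings differ only in where the window length sits in the array shape. A minimal NumPy sketch of the contrast follows, with `windows` as an illustrative stand-in for `dataX`.

```python
import numpy as np

# Stand-in for dataX with seq_length = 3: ABC, BCD
windows = [[0, 1, 2], [1, 2, 3]]
seq_length = 3

# Feature window: one time step carrying three features
as_features = np.reshape(windows, (len(windows), 1, seq_length))
# Time step window: three time steps carrying one feature each
as_time_steps = np.reshape(windows, (len(windows), seq_length, 1))

print(as_features.shape)    # (2, 1, 3)
print(as_time_steps.shape)  # (2, 3, 1)
# The values are identical; only their arrangement differs
print(as_features[0].ravel().tolist())    # [0, 1, 2]
print(as_time_steps[0].ravel().tolist())  # [0, 1, 2]
```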

This is the correct intended use of providing sequence context to your LSTM in Keras. The full code example is provided below for completeness.

```python
# Naive LSTM to learn three-char time steps to one-char mapping
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.utils import to_categorical
# fix random seed for reproducibility
tf.random.set_seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# prepare the dataset of input to output pairs encoded as integers
seq_length = 3
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (len(dataX), seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = to_categorical(dataY)
# create and fit the model
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=1, verbose=2)
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
# demonstrate some model predictions
for pattern in dataX:
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this example provides the following output.

```
Model Accuracy: 100.00%
['A', 'B', 'C'] -> D
['B', 'C', 'D'] -> E
['C', 'D', 'E'] -> F
['D', 'E', 'F'] -> G
['E', 'F', 'G'] -> H
['F', 'G', 'H'] -> I
['G', 'H', 'I'] -> J
['H', 'I', 'J'] -> K
['I', 'J', 'K'] -> L
['J', 'K', 'L'] -> M
['K', 'L', 'M'] -> N
['L', 'M', 'N'] -> O
['M', 'N', 'O'] -> P
['N', 'O', 'P'] -> Q
['O', 'P', 'Q'] -> R
['P', 'Q', 'R'] -> S
['Q', 'R', 'S'] -> T
['R', 'S', 'T'] -> U
['S', 'T', 'U'] -> V
['T', 'U', 'V'] -> W
['U', 'V', 'W'] -> X
['V', 'W', 'X'] -> Y
['W', 'X', 'Y'] -> Z
```

You can see that the model learns the problem perfectly as evidenced by the model evaluation and the example predictions.

But it has learned a simpler problem. Specifically, it has learned to predict the next letter from a sequence of three letters in the alphabet. It can be shown any random sequence of three letters from the alphabet and predict the next letter.

It cannot actually enumerate the alphabet. It’s possible that a large enough multilayer perceptron network might be able to learn the same mapping using the window method.

The LSTM networks are stateful. They should be able to learn the whole alphabet sequence, but by default, the Keras implementation resets the network state after each training batch.

## LSTM State within a Batch

The Keras implementation of LSTMs resets the state of the network after each batch.

This suggests that if you had a batch size large enough to hold all input patterns and if all the input patterns were ordered sequentially, the LSTM could use the context of the sequence within the batch to better learn the sequence.

You can demonstrate this easily by modifying the first example for learning a one-to-one mapping and increasing the batch size from 1 to the size of the training dataset.

Additionally, Keras shuffles the training dataset before each training epoch. To ensure the training data patterns remain sequential, you can disable this shuffling.

```python
model.fit(X, y, epochs=5000, batch_size=len(dataX), verbose=2, shuffle=False)
```

The network will learn the mapping of characters using the within-batch sequence, but this context will not be available to the network when making predictions. You can evaluate both the ability of the network to make predictions randomly and in sequence.

The full code example is provided below for completeness.

```python
# Naive LSTM to learn one-char to one-char mapping with all data in each batch
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
# fix random seed for reproducibility
tf.random.set_seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)
# convert list of lists to array and pad sequences if needed
X = pad_sequences(dataX, maxlen=seq_length, dtype='float32')
# reshape X to be [samples, time steps, features]
# (reshape the padded array X, not the raw dataX list)
X = np.reshape(X, (X.shape[0], seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = to_categorical(dataY)
# create and fit the model
model = Sequential()
model.add(LSTM(16, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=5000, batch_size=len(dataX), verbose=2, shuffle=False)
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
# demonstrate some model predictions
for pattern in dataX:
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)
# demonstrate predicting random patterns
print("Test a Random Pattern:")
for i in range(0,20):
    pattern_index = np.random.randint(len(dataX))
    pattern = dataX[pattern_index]
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example provides the following output.

```
Model Accuracy: 100.00%
['A'] -> B
['B'] -> C
['C'] -> D
['D'] -> E
['E'] -> F
['F'] -> G
['G'] -> H
['H'] -> I
['I'] -> J
['J'] -> K
['K'] -> L
['L'] -> M
['M'] -> N
['N'] -> O
['O'] -> P
['P'] -> Q
['Q'] -> R
['R'] -> S
['S'] -> T
['T'] -> U
['U'] -> V
['V'] -> W
['W'] -> X
['X'] -> Y
['Y'] -> Z
Test a Random Pattern:
['T'] -> U
['V'] -> W
['M'] -> N
['Q'] -> R
['D'] -> E
['V'] -> W
['T'] -> U
['U'] -> V
['J'] -> K
['F'] -> G
['N'] -> O
['B'] -> C
['M'] -> N
['F'] -> G
['F'] -> G
['P'] -> Q
['A'] -> B
['K'] -> L
['W'] -> X
['E'] -> F
```

As expected, the network is able to use the within-sequence context to learn the alphabet, achieving 100% accuracy on the training data.

Importantly, the network can make accurate predictions for the next letter in the alphabet for randomly selected characters. Very impressive.

## Stateful LSTM for a One-Char to One-Char Mapping

You have seen that you can break up the raw data into fixed-size sequences and that the LSTM can learn this representation, but only as a mapping from any three consecutive characters to the next character.

You have also seen that you can stretch the batch size to offer more sequence to the network, but only during training.

Ideally, you want to expose the network to the entire sequence and let it learn the inter-dependencies rather than you defining those dependencies explicitly in the framing of the problem.

You can do this in Keras by making the LSTM layers stateful and manually resetting the state of the network at the end of the epoch, which is also the end of the training sequence.

This is truly how the LSTM networks are intended to be used.

You first need to define your LSTM layer as stateful. In so doing, you must explicitly specify the batch size as a dimension on the input shape. This also means that when you evaluate the network or make predictions, you must specify and adhere to this same batch size. This is not a problem now, as you are using a batch size of 1. It could introduce difficulties when the batch size is not 1, as predictions would then need to be made in batches of that size and in sequence.

```python
batch_size = 1
model.add(LSTM(50, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
```

An important difference in training the stateful LSTM is that you manually train it one epoch at a time and reset the state after each epoch. You can do this in a for loop. Again, do not shuffle the input, preserving the sequence in which the input training data was created.

```python
for i in range(300):
    model.fit(X, y, epochs=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
```

As mentioned, you specify the batch size when evaluating the performance of the network on the entire training dataset.

```python
# summarize performance of the model
scores = model.evaluate(X, y, batch_size=batch_size, verbose=0)
model.reset_states()
print("Model Accuracy: %.2f%%" % (scores[1]*100))
```

Finally, you can demonstrate that the network has indeed learned the entire alphabet. You can seed it with the first letter “A,” request a prediction, feed the prediction back in as an input, and repeat the process all the way to “Z.”

```python
# demonstrate some model predictions
seed = [char_to_int[alphabet[0]]]
for i in range(0, len(alphabet)-1):
    x = np.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    print(int_to_char[seed[0]], "->", int_to_char[index])
    seed = [index]
model.reset_states()
```

You can also see if the network can make predictions starting from an arbitrary letter.

```python
# demonstrate a random starting point
letter = "K"
seed = [char_to_int[letter]]
print("New start: ", letter)
for i in range(0, 5):
    x = np.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    print(int_to_char[seed[0]], "->", int_to_char[index])
    seed = [index]
model.reset_states()
```

The entire code listing is provided below for completeness.

```python
# Stateful LSTM to learn one-char to one-char mapping
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.utils import to_categorical
# fix random seed for reproducibility
tf.random.set_seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# prepare the dataset of input to output pairs encoded as integers
seq_length = 1
dataX = []
dataY = []
for i in range(0, len(alphabet) - seq_length, 1):
    seq_in = alphabet[i:i + seq_length]
    seq_out = alphabet[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
    print(seq_in, '->', seq_out)
# reshape X to be [samples, time steps, features]
X = np.reshape(dataX, (len(dataX), seq_length, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = to_categorical(dataY)
# create and fit the model
batch_size = 1
model = Sequential()
model.add(LSTM(50, batch_input_shape=(batch_size, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
for i in range(300):
    model.fit(X, y, epochs=1, batch_size=batch_size, verbose=2, shuffle=False)
    model.reset_states()
# summarize performance of the model
scores = model.evaluate(X, y, batch_size=batch_size, verbose=0)
model.reset_states()
print("Model Accuracy: %.2f%%" % (scores[1]*100))
# demonstrate some model predictions
seed = [char_to_int[alphabet[0]]]
for i in range(0, len(alphabet)-1):
    x = np.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    print(int_to_char[seed[0]], "->", int_to_char[index])
    seed = [index]
model.reset_states()
# demonstrate a random starting point
letter = "K"
seed = [char_to_int[letter]]
print("New start: ", letter)
for i in range(0, 5):
    x = np.reshape(seed, (1, len(seed), 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    print(int_to_char[seed[0]], "->", int_to_char[index])
    seed = [index]
model.reset_states()
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example provides the following output.

```
Model Accuracy: 100.00%
A -> B
B -> C
C -> D
D -> E
E -> F
F -> G
G -> H
H -> I
I -> J
J -> K
K -> L
L -> M
M -> N
N -> O
O -> P
P -> Q
Q -> R
R -> S
S -> T
T -> U
U -> V
V -> W
W -> X
X -> Y
Y -> Z
New start: K
K -> B
B -> C
C -> D
D -> E
E -> F
```

You can see that the network has memorized the entire alphabet perfectly. It used the context of the samples themselves and learned whatever dependency it needed to predict the next character in the sequence.

You can also see that if you seed the network with the first letter, it can correctly rattle off the rest of the alphabet.

You can also see that it has only learned the full alphabet sequence and from a cold start. When asked to predict the next letter from “K,” it predicts “B” and falls back into regurgitating the entire alphabet.

To truly predict “K,” the state of the network would need to be warmed up and iteratively fed the letters from “A” to “J.” This reveals that you could achieve the same effect with a “stateless” LSTM by preparing training data like this:

```
---a -> b
--ab -> c
-abc -> d
abcd -> e
```

Here, the input sequence is fixed at 25 (a-to-y to predict z), and patterns are prefixed with zero padding.
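One way to build such a dataset is sketched below in plain Python and NumPy. It uses 0 as the padding value and shifts the letter codes to 1-26 so that padding cannot be confused with “A”. Both choices, along with the names `PAD` and `char_to_code`, are assumptions made for this sketch and are not part of the tutorial code.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
max_len = len(alphabet) - 1  # 25: the longest input is A..Y, which predicts Z
PAD = 0                      # assumed padding value for this sketch

# Assumption: codes run 1..26 here so that 0 is free to mean "padding"
char_to_code = {c: i + 1 for i, c in enumerate(alphabet)}

data_x, data_y = [], []
for end in range(1, len(alphabet)):
    prefix = alphabet[:end]   # "A", "AB", "ABC", ...
    target = alphabet[end]    # the letter that follows the prefix
    codes = [char_to_code[c] for c in prefix]
    padded = [PAD] * (max_len - len(codes)) + codes  # left (prefix) padding
    data_x.append(padded)
    data_y.append(char_to_code[target])

# [samples, time steps, features]
X = np.array(data_x).reshape(len(data_x), max_len, 1)
print(X.shape)         # (25, 25, 1)
print(data_x[2][-4:])  # [0, 1, 2, 3], i.e. "-ABC"
print(data_y[2])       # 4, i.e. "D"
```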

Finally, this raises the question of training an LSTM network using variable length input sequences to predict the next character.

## LSTM with Variable-Length Input to One-Char Output

In the previous section, you discovered that the Keras “stateful” LSTM was really only a shortcut to replaying the first n-sequences but didn’t really help us learn a generic model of the alphabet.

In this section, you will explore a variation of the “stateless” LSTM that learns random subsequences of the alphabet, in an effort to build a model that can be given arbitrary letters or subsequences of letters and predict the next letter in the alphabet.

First, you are changing the framing of the problem. To simplify, you will define a maximum input sequence length and set it to a small value like 5 to speed up training. This defines the maximum length of subsequences of the alphabet that will be drawn for training. In extensions, this could just be set to the full alphabet (26) or longer if you allow looping back to the start of the sequence.

You also need to define the number of random sequences to create—in this case, 1000. This, too, could be more or less. It’s likely fewer patterns are actually required.

```python
# prepare the dataset of input to output pairs encoded as integers
num_inputs = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_inputs):
    start = np.random.randint(len(alphabet)-2)
    end = np.random.randint(start, min(start+max_len,len(alphabet)-1))
    sequence_in = alphabet[start:end+1]
    sequence_out = alphabet[end + 1]
    dataX.append([char_to_int[char] for char in sequence_in])
    dataY.append(char_to_int[sequence_out])
    print(sequence_in, '->', sequence_out)
```

Running this code in the broader context will create input patterns that look like the following:

```
PQRST -> U
W -> X
O -> P
OPQ -> R
IJKLM -> N
QRSTU -> V
ABCD -> E
X -> Y
GHIJ -> K
```

The input sequences vary in length between 1 and **max_len** and therefore require zero padding. Here, use left-hand-side (prefix) padding with the Keras built-in **pad_sequences()** function.

```python
X = pad_sequences(dataX, maxlen=max_len, dtype='float32')
```
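To make the effect of prefix padding concrete, here is a small plain-Python sketch that mimics the default behavior of **pad_sequences()** (zeros added on the left, and the last `maxlen` items kept if a sequence is too long). The `left_pad` helper is hypothetical and is not part of Keras. It only illustrates what the padded rows look like.

```python
# Sketch of prefix (left-hand-side) zero padding in plain Python.
# left_pad is a hypothetical helper, not part of Keras.
def left_pad(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# integer-encoded inputs for "RS", "W", and "GHIJK"
dataX = [[17, 18], [22], [6, 7, 8, 9, 10]]
padded = left_pad(dataX, maxlen=5)
for row in padded:
    print(row)
```

Every row now has the same length, which is what allows the samples to be stacked into a single array of shape [samples, time steps, features].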

The trained model is evaluated on randomly selected input patterns. This could just as easily be new randomly generated sequences of characters. This could also be a linear sequence seeded with “A” with outputs fed back in as single character inputs.

The full code listing is provided below for completeness.

```python
# LSTM with Variable Length Input Sequences to One Character Output
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
# fix random seed for reproducibility
np.random.seed(7)
tf.random.set_seed(7)
# define the raw dataset
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
# create mapping of characters to integers (0-25) and the reverse
char_to_int = dict((c, i) for i, c in enumerate(alphabet))
int_to_char = dict((i, c) for i, c in enumerate(alphabet))
# prepare the dataset of input to output pairs encoded as integers
num_inputs = 1000
max_len = 5
dataX = []
dataY = []
for i in range(num_inputs):
    start = np.random.randint(len(alphabet)-2)
    end = np.random.randint(start, min(start+max_len,len(alphabet)-1))
    sequence_in = alphabet[start:end+1]
    sequence_out = alphabet[end + 1]
    dataX.append([char_to_int[char] for char in sequence_in])
    dataY.append(char_to_int[sequence_out])
    print(sequence_in, '->', sequence_out)
# convert list of lists to array and pad sequences if needed
X = pad_sequences(dataX, maxlen=max_len, dtype='float32')
# reshape X to be [samples, time steps, features]
X = np.reshape(X, (X.shape[0], max_len, 1))
# normalize
X = X / float(len(alphabet))
# one hot encode the output variable
y = to_categorical(dataY)
# create and fit the model
batch_size = 1
model = Sequential()
model.add(LSTM(32, input_shape=(X.shape[1], 1)))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X, y, epochs=500, batch_size=batch_size, verbose=2)
# summarize performance of the model
scores = model.evaluate(X, y, verbose=0)
print("Model Accuracy: %.2f%%" % (scores[1]*100))
# demonstrate some model predictions
for i in range(20):
    pattern_index = np.random.randint(len(dataX))
    pattern = dataX[pattern_index]
    x = pad_sequences([pattern], maxlen=max_len, dtype='float32')
    x = np.reshape(x, (1, max_len, 1))
    x = x / float(len(alphabet))
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_char[index]
    seq_in = [int_to_char[value] for value in pattern]
    print(seq_in, "->", result)
```

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running this code produces the following output:

```
Model Accuracy: 98.90%
['Q', 'R'] -> S
['W', 'X'] -> Y
['W', 'X'] -> Y
['C', 'D'] -> E
['E'] -> F
['S', 'T', 'U'] -> V
['G', 'H', 'I', 'J', 'K'] -> L
['O', 'P', 'Q', 'R', 'S'] -> T
['C', 'D'] -> E
['O'] -> P
['N', 'O', 'P'] -> Q
['D', 'E', 'F', 'G', 'H'] -> I
['X'] -> Y
['K'] -> L
['M'] -> N
['R'] -> T
['K'] -> L
['E', 'F', 'G'] -> H
['Q'] -> R
['Q', 'R', 'S'] -> T
```

You can see that although the model did not learn the alphabet perfectly from the randomly generated subsequences, it did very well. The model was not tuned and might require more training, a larger network, or both (an exercise for the reader).

This is a good natural extension to the “*all sequential input examples in each batch*” alphabet model learned above in that it can handle ad hoc queries, but this time of arbitrary sequence length (up to the max length).

## Summary

In this post, you discovered LSTM recurrent neural networks in Keras and how they manage state.

Specifically, you learned:

- How to develop a naive LSTM network for one-character to one-character prediction
- How to configure a naive LSTM to learn a sequence across time steps within a sample
- How to configure an LSTM to learn a sequence across samples by manually managing state

Do you have any questions about managing an LSTM state or this post?

Ask your questions in the comments, and I will do my best to answer.

Great series of posts on LSTM networks recently. Keep up the good work

Thanks Mark.

For anyone, who wants to try out trained LSTMs, there is an interactive chat box: http://www.mlowl.com/post/character-language-model-lstm-tensorflow/

It offers models trained on Wikipedia, Congress speeches, Sherlock Holmes, South Park and Goethe.

I can also recommend the tensorlm package on GitHub: https://github.com/batzner/tensorlm

Thanks for sharing.

I like this post, it gave me many enlightenment.

Thanks Atlant.

I’m probably missing something here but could you please explain why LSTM units are needed in the alphabet example where any output depends directly on the input letter and there is no confusion between different input -> output pairs at all?

It is a demonstration of the algorithm’s ability to learn a sequence. Not just input-output pairs, but input-output pairs over time.

hi, I got a little confused there. What does the LSTM units mean?

Thanks

Hi Randy, the LSTM units are the “memory units” or you can just call them the neurons.

Hi Jason,

Thank you very much for your fantastic tutorials. I learnt a lot through them.

Recurrent networks are quite complex and confusing to use. And like Shanbe, I am a bit confused by the examples, as each output does not seem to depend on the previous inputs but just has a direct link to the current input. I am probably missing something too, but I don’t see how it demonstrates the benefits of using a memory rather than a direct Dense layer.

Actually the 100% accuracy can be achieved by just using the last layer and removing the LSTM layer (but it takes a bit more epoch).

I think the confusion is coming from the fact that the letters are encoded into integers. This would not be the case if the neural network could handle letters directly. In this example, they are encoded in the same order and therefore there is a direct relationship between the numbers. We are just teaching the network that the output equals the input+constant. Which is easy to obtain with a simple regression with one cell. But we still need the 26 outputs for the decoding at the end.

Maybe I am lost, but I think the demonstration would make more sense if the letters were scrambled (to avoid the possibility to have a simple linear combination which is solved by regression with one cell) and if we were trying to predict a previous letter rather than the next one. The use of a memory would be meaningful. But I didn’t try.

Nevertheless, the examples are still very interesting to show different approaches to the problem.

Thank you for sharing these educational examples! Appreciate it if you can elaborate on the “LSTM State Within A Batch” example. The confusing part is on the explanation “that the LSTM could use the context of the sequence within the batch to better learn the sequence.”, which may imply that the states of LSTM are reused within the training of a batch, and this motivated the setting of parameter shuffle as False.

But my understanding of Keras’s implementation is that the LSTM won’t reuse the states within a batch. In fact, the sequences in a batch are kind of like triggering the LSTM “in parallel”: the states of the LSTM should be of shape (nsamples, nout) for both gates, with separate states for each sequence. This is what is described in the [Keras documentation](https://keras.io/getting-started/faq/#how-can-i-use-stateful-rnns): states are reused by the ith instance for successive batches.

This means that even parameter shuffle is set as True, it will still give you the observed performance. This also explains why the predictions on random patterns were also good, which was opposite to the observations in the next example “Stateful LSTM for a One-Char to One-Char Mapping”. The reason why setting a bigger batch size resulted in better performance than the first example, could be the bigger nb_epoch used.

Appreciate your opinions on this! It’s a great article anyway!

You are right, that state from sequence to sequence inside one batch.

I do agree. Better result came with large batch size, because it reaches minima faster and achieve good accuracy in 5000 epochs. If we increase number of epochs when batch_size =1, we can get 100% accuracy.

I think the accuracy increased because there was no resetting of the network as all the samples were fitted in the batch size and as the default resetting of the network takes place after a batch size no of samples whereas in those with batch size=1 the resetting of the network was taking place after each batch size which was actually equal to 1 sample

I do agree with your understanding, that the states within each batch are updated separately rather than can be used as context information. [This discussion](https://stackoverflow.com/a/46331227/7653982) gives more details about the parameter `stateful`.

Thank you for your amazing tutorial.

However, what if I want to predict a sequence of outputs? If I add a dimension to the output it is gonna be like a features window and the model will not consider the outputs as a sequence of outputs. It is like outputs are independent. How can I fix that issue and have a model which for example generates “FGH” when I give it “BCDE”.

Hadi, I use a “one-hot” encoding for my features. This causes the output to be a probability distribution over the features. You may then use a sampling technique to choose “next” features, similar to the “Unreasonable Effectiveness of RNN” article. This method includes a “temperature” parameter that you can tune to produce outputs with more or less adherence to the LSTM’s predictions.
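A minimal sketch of the temperature idea described above, in plain Python. The function name `sample_with_temperature` and the example probabilities are assumptions for illustration and are not taken from the referenced article. A temperature below 1 sharpens the distribution toward the most likely item, and a temperature above 1 flattens it.

```python
import math
import random

def sample_with_temperature(probs, temperature=1.0, rng=random):
    # rescale log-probabilities by the temperature, then renormalize (softmax)
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    scaled = [w / total for w in weights]
    # draw one index according to the rescaled distribution
    index = rng.choices(range(len(scaled)), weights=scaled, k=1)[0]
    return index, scaled

probs = [0.7, 0.2, 0.1]  # hypothetical softmax output over three characters
random.seed(7)
for t in (0.5, 1.0, 2.0):
    index, scaled = sample_with_temperature(probs, t)
    print(t, [round(p, 3) for p in scaled], "->", index)
```

To generate a sequence, the sampled index would be fed back in as the next input, one step at a time.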

One more follow-up to this problem… you may be interested in a different type of network, such as a Generative Adversarial Network (GAN) https://github.com/jacobgil/keras-dcgan

Also for my part, thank you for the tutorial.

However, I have a few related questions (also posted in StackOverflow, http://stackoverflow.com/questions/39457744/backpropagation-through-time-in-stateful-rnns): If I have a stateful RNN with just one timestep per batch, how is backpropagation handled? Will it handle only this one timestep or does it accumulate updates for the entire sequence? I fear that it updates only the timesteps per batch and nothing further back. If so, do you think this is a major drawback? Or do you know a way to overcome this?

Hi Alex,

I believe updates are performed after each batch. This can be a downside of having a small batch. A batch size of 1 will essentially perform online gradient descent (I would guess). Developers more familiar with Keras internals may be able to give you a more concrete answer.

Hi Jason,

thank you very much for your answer.

I posed my question wrongly because I mixed up “batch size” and “time steps”. If I have sequences of shape (nb_samples, n, dims) and I process them one time step after the other with a stateful LSTM (feeding batches of shape (batch_size, 1, dims) to the network), will backpropagation go through the entire sequences as it would if I processed the entire sequence at once?

The answer does not change, updates happen after each batch.

“The Keras implementation of LSTMs resets the state of the network after each batch.”

Could you please explain what you mean by “resets the state”? What happens to the network state after each epoch?

Thanks!

I don’t know if it is clear or just unreliable. But it is gone.

Your network will not behave as you expect.

it means all the memory inside a neuron that it has learned over a batch size will be lost and the new sequence or sample will be trained with fresh neurons or cells which are having no state

I think that is incorrect. The state relates to the internal state of the LSTM cell. The state carries the experience from earlier sequences to the current sequence and “mixes” this previous experience with new data input.

If at the end of a batch, the state is reset, it is this experience from the previous batch(es) that is lost.

On the other hand, the weights are set by back propagation and improve with each training and throughout the epochs. You don’t lose the weights when the state of the LSTM is reset.

Hi Jason,

first, thanks for the amazing tutorial!

I got two quick questions:

1) if I want to train “LSTM with Variable Length Input to One-Char Output” on much simpler sequence (with inserted repetitive pattern):

seq_1 = “aaaabbbccdaaaabbbccdaaaabbbccd….”

or even

seq_2 = “ababababababababababababababab…”

I can’t do any better than 50% of accuracy. What’s wrong?

2) If I want to amend your code to N-Char-Output, how is it possible? So, given a sequence “abcd” -> “ef” ?

Many thanks in advance!

I found my mistake: the char-integer encoding should be:

chars = sorted(list(set(alphabet)))

char_to_int = dict((c, i) for i, c in enumerate(chars))

int_to_char = dict((i, c) for i, c in enumerate(chars))

but, it does not help too much.

Also, I’m very curious to know how to predict N-characters

Hi Arnold,

Nice change. No idea why it’s not learning it. Perhaps you need more nodes/layers or longer training? Maybe stateful or stateless?

Perhaps it’s the framing of the problem?

This is a sequence to sequence problem. I expect you can just change the output node from 1 neuron to the number you would like.

Hi Jason,

Thanks for your suggestions. I will try.

Arnold,

have you achieved that kind of implementation? 🙂

i’m also interested

Hi Jason,

Thanks for the amazing tutorial and other posts!

How to amend your code to predict N-next characters and not only one?

something like: “LSTM with Variable Length Input to N-Char Output”

Hi Jason,

Great tutorial as always. It was fun to run through your code line by line, work with a smaller alphabet (like “ABCDE”), change the sequence length etc just to figure out how the model behaves. I think I’m growing fond of LSTMs these days.

I have a very basic question about the shape of the input tensor. Keras requires our input to have the form [samples, time_steps, features]. Could you tell me what the attribute features exactly means?

Also, consider a scenario for training an LSTM network for binary classification of audio. Suppose I have a collection of 1000 files and for each file, I have extracted 13 features (MFCC values). Also suppose that every file is 500 frames long and I set the time_steps to 1.

What would be my input shape?

1. [Number of files, time_steps, features] = [1000, 1, 13]

or

2. [Number of files * frames_per_file, time_steps, features] = [1000*500, 1, 13]

Any answer is greatly appreciated !! Thanks.

Thanks Madhav.

The features in the input refers to attributes or columns in your data.

Yes, your framing of the problem looks right to me [1000, 1, 13] or [samples, timesteps, features]

Thank you for the enlightening series of articles on LSTMs!

Just a minor detail, the complete code for the final example is missing an import for pad_sequences:

from keras.preprocessing.sequence import pad_sequences

Fixed.

Thanks for pointing that out Rob!

Hello Jason

What if I want to predict a non sequential network.(eg:- the next state of A may be B,C,Z or A itself depending on the previous values before A). Can I use this same logic for that problem?

Yes Panand, LSTMs can learn arbitrary complex sequences, although you may need to scale up memory units and layers to account for problem complexity.

Let me know how you go.

Hi Jason,

Thank you for your tutorials, I learn a lot from them.

However, I still don’t quite understand the meaning of batch size. In the last example you set batch_size to be 1, but the network still learned the next letter in the sequence based on the whole sequence, or was it just based on the last letter every time?

What would have happened if you set batch_size=3 and all the sequences would be at minimal length 3?

Thank you

batch size is the number of samples after which the system weights are updated and the network is reset, in this example, the sample is a single alphabet.

Hi Jason ,

Thank you for this wonderful tutorial.

Motivated by this I got myself trying to generate a sine wave using RNN in theano

After running the code below

the sine curve predicted in green is generated well when I give an input to all the time steps at prediction

but the curve in blue is not predicted well where I give the input to only the first time step while prediction (This kind of prediction is you have used for character sequence generation)

Is there a way I could make that work. Cause I need a model for generating captions.

```python
#Learning Sine wave
import theano
import numpy as np
import matplotlib.pyplot as plt
import theano.tensor as T
import math

theano.config.floatX = 'float64'

## data
step_radians = 0.01
steps_of_history = 200
steps_in_future = 1
index = 0

x = np.sin(np.arange(0, 20*math.pi, step_radians))

seq = []
next_val = []

for i in range(0, len(x)-steps_of_history, steps_of_history):
    seq.append(x[i: i + steps_of_history])
    next_val.append(x[i+1:i + steps_of_history+1])

seq = np.reshape(seq, [-1, steps_of_history, 1])
next_val = np.reshape(next_val, [-1, steps_of_history, 1])

trainX = np.array(seq)
trainY = np.array(next_val)

## model
n = 50
nin = 1
nout = 1

u = T.matrix()
t = T.matrix()
h0 = T.vector()
h_in = np.zeros(n).astype(theano.config.floatX)
lr = T.scalar()

W = theano.shared(np.random.uniform(size=(3,n, n), low=-.01, high=.01).astype(theano.config.floatX))
W_in = theano.shared(np.random.uniform(size=(nin, n), low=-.01, high=.01).astype(theano.config.floatX))
W_out = theano.shared(np.random.uniform(size=(n, nout), low=-.01, high=.01).astype(theano.config.floatX))

def step(u_t, h_tm1, W, W_in, W_out):
    h_t = T.tanh(T.dot(u_t, W_in) + T.dot(h_tm1, W[0]))
    h_t1 = T.tanh(T.dot(h_t, W[1]) + T.dot(h_tm1, W[2]))
    y_t = T.dot(h_t1, W_out)
    return h_t, y_t

[h, y], _ = theano.scan(step,
                        sequences=u,
                        outputs_info=[h0, None],
                        non_sequences=[W, W_in, W_out])

error = ((y - t) ** 2).sum()
prediction = y
gW, gW_in, gW_out = T.grad(error, [W, W_in, W_out])

fn = theano.function([h0, u, t, lr],
                     error,
                     updates={W: W - lr * gW,
                              W_in: W_in - lr * gW_in,
                              W_out: W_out - lr * gW_out})

predict = theano.function([h0, u], prediction)

for e in range(10):
    for i in range(len(trainX)):
        fn(h_in,trainX[i],trainY[i],0.001)

print('End of training')

x = np.sin(np.arange(20*math.pi, 24*math.pi, step_radians))
seq = []
for i in range(0, len(x)-steps_of_history, steps_of_history):
    seq.append(x[i: i + steps_of_history])
seq = np.reshape(seq, [-1, steps_of_history, 1])
testX = np.array(seq)

# Predict the future values
predictY = []
for i in range(len(testX)):
    p = testX[i][0].reshape(1,1)
    for j in range(len(testX[i])):
        p = predict(h_in, p)
        predictY= predictY + p.tolist()
print(predictY)

# Plot the results
plt.plot(x, 'r-', label='Actual')
plt.plot(np.asarray(predictY), 'gx', label='Predicted')
predictY = []
for i in range(len(testX)):
    predictY= predictY + predict(h_in, testX[i]).tolist()
plt.plot(np.asarray(predictY), 'bo', label='Predicted')
plt.show()
```

Many Thanks

Hi,

I think there is a mistake in the paragraph “LSTM State Within A Batch”.

You say “The Keras implementation of LSTMs resets the state of the network after each batch.”.

But the fact is that it resets its state even between the inputs of a batch.

You can see the poor performances (around .5) of this LSTM that tries to remember its last input : http://pastebin.com/u5NnAx9r

As a comparison, here is a stateful LSTM that performs well : http://pastebin.com/qEKBVqJJ

I think you are right. it resets the state after each input. But it holds the state between timesteps within one input.

Thank you very much.

It is the best description I’ve ever seen about LSTM. I got a lot of benefits from your post!

Thanks Mazen.

First of all, thanks for creating this easy but very useful example to learn and better understand how LSTM nets work.

I just want to ask you something about the first part: why do we use 32 units? Was it a random decision, or does it have a theoretical basis?

Thank you in advance for your answer.

An ad hoc choice Fernando. Great question.

Hi Jason,

thanks for the amazing tutorial!

I’m glad you found it useful leila.

Thank you for this post but i am a beginner and i have a question

you said “the state of the network is reset after each pattern” what does that mean?

LSTMs maintain an internal state, it is a benefit of using them.

This internal state can be reset automatically (after each batch) or manually when setting the “stateful” argument.

Great tutorial. Question: Is it possible to do a stateful LSTM with variable input (like the last example in the tutorial)? I am curious why you used a naive stateless LSTM for that.

Yes you can.

Thank you for the amazing tutorial. The walkthrough (starting with simple and building on it) really helped.

You’re welcome Bikram.

I have one question in regards to the LSTM. I asked this question in details here on reddit, and after than came to know about your tutorial here.

I feel that with stateful LSTMs, I am getting the gist of the RNNs, but still has it been ever successfully used for video processing — like the problem I described here https://www.reddit.com/r/AskReddit/comments/681c76/understanding_lstm_stateful_stateless_how_to_go/

Basically, here, with stateful LSTM when the network was fed “K”, it predicted B, which is wrong. Does it mean that it is just memorized sequence and will always predict B irrespective of what is fed. How can we extend it to general video prediction tasks. Please look at the link above.

thanks in advance

The link does not appear to work.

LSTMs require a very careful framing of your sequence problem in order to learn effectively. They also often require a much larger network trained for much longer than intuitions on MLPs may suggest.

Hello.

I have a question about RNN-LSTM network, how can we decide input_shape size? how can we draw shape of RNN about this problem?

Great question.

The LSTMs in keras expect the shape to be [samples, timesteps, features].

Each sample is a separate sequence. The sequence steps are time steps. Each observation measurement is a feature.

If you have one very long sequence, you can break it up into multiple smaller sequences.
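As a rough sketch of that last point, the snippet below cuts one long univariate series into non-overlapping windows arranged as [samples, timesteps, features]. The `split_sequence` helper is hypothetical and uses nested lists so that it runs without any libraries. In practice, the result would be converted to a NumPy array before being passed to Keras.

```python
# Sketch: cut one long univariate series into [samples, timesteps, features].
# split_sequence is a hypothetical helper; any leftover tail is dropped.
def split_sequence(series, timesteps):
    samples = []
    for start in range(0, len(series) - timesteps + 1, timesteps):
        window = series[start:start + timesteps]
        samples.append([[value] for value in window])  # one feature per time step
    return samples

series = list(range(10))  # one sequence of 10 observations
X = split_sequence(series, timesteps=5)
shape = (len(X), len(X[0]), len(X[0][0]))
print(shape)   # (samples, timesteps, features)
print(X[1])    # the second sample
```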

I hope that helps.

Could we use the LSTMS RNN in Keras to create the Chatbot Conversation Model.

Perhaps, I do not have an example sorry.

Sometimes it is helpful to extract LSTM outputs from certain time steps. For example, we may want to get the outputs corresponding to the last 30 words from a text, or the 100th~200th words from a text body. It would be great to write an instruction about it.

Thanks for the suggestion.

Can you give an example of what you mean?

Hi Dear Sir (Jason) .

i want to use LSTM for words instead of alphabets. how can i implement that. more over can i use that part of speech tagging ? part of speech tagging is also a sequential problem because it also dependent on context.

thanks

You can, but I do not have any examples sorry.

Hi Jason,

Thank you for this great tutorial. I really appreciated going through your LSTM code.

I have one question about the “features”: you are mentioning them but I don’t see where you are including them. Do they represent a multivariate problem? How do we handle that?

Many thanks,

Carlos

I do not have a multivariate example at this stage, but you can specify features in the input. The format is [samples, timesteps, features].

Thanks for this great tutorial!

I have a question. I am dealing with kind of the same problem. However, each character in a sequence is a vector of features in my input data.

Instead of [‘A’, ‘B’, ‘C’] -> D I have [[0, 0,1, 1.3], [6,3,1,1.5], [6, 4, 1.4, 4.5]] -> [1, 3, 4]

So considering all sequences, my data is in 3d shape.

Could someone help me how to configure the input for LSTM?

LSTM input must be 3D, the structure is [samples, timesteps, features].

If you one hot encode, then the number of labels (length of binary vectors) is the number of features.

Does that help?

It does! for clarification:

If I have 2000 sentences (sequence of words)

Each sentence is having 3 words,

and each word is converted into a vector of length 5 (5 features):

X1 = [[1, 2, 3, 4, 5], [6, 7, 8, 9, 10], [11, 12,13,14,15]], ….., X2000

Data_X = [X1, …, X2000]

So my X_train is gonna be like this?

X = numpy.reshape(Data_X, (2000, 3, 5))

Yes.

But don’t be afraid to explore other representations that may result in better skill on your problem. e.g. maybe it makes sense to provide n sentences as time steps.

Thanks Jason!

I really appreciate your attempt in these tutorials.

I am working with a huge imbalanced sequential file, do you have any suggestion regarding improving these type of imbalanced files?

I do not have experience with imbalanced sequence data.

Perhaps methods for imbalanced non-sequence data will give you some ideas:

https://machinelearningmastery.mystagingwebsite.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

btw, assuming that the output is not high dimensional:

Y1 = [11, 22, 33]

Hello, In this example you were able to go from ABC -> D (3 to 1)

Is there a way to train a model to go from ABC -> BCD

or perhaps some other pattern like ABC- >CBA( switching first and last)

Yes, you can have 3 neurons in the output layer.

Great tutorial and one that has come closest to ending my confusion. However I still not 100% clear. If I have time-series training data of length 5000, and each row of data consists of two features, can I shape my input data (x) as a single batch of 5000,1,2? And if this is the case — under what circumstances would I ever need to increase the time step to more than 1? I’m struggling to see the value the time step dimension is adding if the LSTM remembers across an entire batch (which in my above scenario would be like a time step of 5000 right?)

Yes.

The number of time steps should be equal to the number of time steps in your input sequence. Setting time steps to 1 is generally not a good idea in practice beyond demonstrations.

Hi Jason,

Thanks for your tutorials, I just have a confusion on stateful LSTM in keras.

I know the hidden state will pass through different timesteps. Will the hidden state pass among one batch？

Or samples in one batch have different initialized hidden states?

The internal state is reset at the end of each batch.

You can carry the state across batches by making the layer stateful and taking control of when the state is reset.

Does that help?

Hi Jason,

Is one hot encoder needed for X? Since they are also categorical.

Yes, the input can be one hot encoded.

It’s an awesome tutorial, but I still have a problem. When we use an LSTM/RNN, we usually initialize the state of the LSTM/RNN in some random way like Orthogonal. Therefore, when we use the LSTM for predicting, the initial state may be or must be different from training, so how can it always get the right prediction? Moreover, when we train an LSTM with stateful=False, the initial state will be reset, which means initialized randomly, so how can it always get a right model?

wait for your answer, thank you!

See this post for advice on how to evaluate neural nets like LSTMs given their stochastic nature:

https://machinelearningmastery.mystagingwebsite.com/evaluate-skill-deep-learning-models/

I have experimented with all the examples. It seems to me that only one example, “Stateful LSTM for a One-Char to One-Char Mapping”, shows the nature of LSTM/RNN. All other examples simply work like ordinary Dense networks.

I’ve tried to work with sequence “ABCDEFABCXYZ”

It learned well on Stateful network.

Model Accuracy: 100.00%

[‘A’] -> B

[‘B’] -> C

[‘C’] -> D

[‘D’] -> E

[‘E’] -> F

[‘F’] -> A

[‘A’] -> B

[‘B’] -> C

[‘C’] -> X

[‘X’] -> Y

[‘Y’] -> Z

But in random test it makes mistakes:

Test a Random Pattern:

[‘Y’] -> B

[‘A’] -> C

[‘C’] -> D

[‘A’] -> E

[‘X’] -> F

[‘Y’] -> A

[‘D’] -> B

[‘B’] -> C

[‘A’] -> X

[‘A’] -> Y

[‘B’] -> Z

[‘C’] -> Z

[‘E’] -> Z

[‘B’] -> Z

[‘C’] -> Z

[‘C’] -> A

[‘B’] -> A

[‘C’] -> A

[‘A’] -> B

[‘F’] -> B

How to fix that?

A more generic model needs to be developed, perhaps one that uses the context for the last n characters (time steps) to predict the next character.

In the final example with variable length inputs, are the vectors generated by the call to pad_sequences() sensible?

For example:

RS -> T

Becomes

[[0], [0], [0], [17], [18]]

which is the equivalent of

AAARS -> T

Is that really what is wanted?

That might be a fault. Thanks!
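The collision described above can be reproduced in a few lines of plain Python. The `left_pad` helper is a hypothetical stand-in for prefix padding, and the shifted encoding at the end is only one possible workaround, stated here as an assumption and not something the tutorial does.

```python
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}
int_to_char = {i: c for i, c in enumerate(alphabet)}

def left_pad(seq, maxlen, value=0):
    # hypothetical helper: prefix padding with a fill value
    return [value] * (maxlen - len(seq)) + list(seq)

# with A encoded as 0, the padding value and the letter A are indistinguishable
encoded = [char_to_int[c] for c in "RS"]
padded = left_pad(encoded, 5)
decoded = "".join(int_to_char[i] for i in padded)
print(padded, "decodes to", decoded)

# One possible alternative (an assumption, not what the tutorial does):
# shift every letter up by one so that 0 is reserved for padding.
shifted_to_int = {c: i + 1 for i, c in enumerate(alphabet)}
shifted = left_pad([shifted_to_int[c] for c in "RS"], 5)
print(shifted)
```

With the shifted encoding, a padded "RS" and a real "AAARS" no longer map to the same vector. Whether this improves accuracy on this problem would need to be tested.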

Yeah, I can’t work out if it is a good thing or a bad thing.

In its favour, it does train (and eventually approaches 100% accuracy if you up the batch size and epochs a bit).

On the other hand, it does seem that the training data is wrong and that it is learning the correct result in spite of it. I have tried an alternative dataset where the padded data is a random selection of letters. It doesn’t work particularly well for sequences of only 1 letter, but longer sequences seem OK.

I have pasted the new version below. Note that I have increased the dataset size, batch size and training epochs.

This LSTM business is a bit more subtle than I initially thought…. thanks for posting this tutorial though, I really like the “this is how to do it wrong” style of the first few examples. Very useful.

    import numpy
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import LSTM
    from keras.utils import np_utils
    from keras.preprocessing.sequence import pad_sequences
    # fix random seed for reproducibility
    numpy.random.seed(7)
    # define the raw dataset
    alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    # create mapping of characters to integers (0-25) and the reverse
    char_to_int = dict((c, i) for i, c in enumerate(alphabet))
    int_to_char = dict((i, c) for i, c in enumerate(alphabet))
    # prepare the dataset of input to output pairs encoded as integers
    num_inputs = 10000
    max_len = 5
    dataX = []
    dataY = []
    for i in range(num_inputs):
        start = numpy.random.randint(len(alphabet)-2)
        end = numpy.random.randint(start, min(start+max_len,len(alphabet)-1))
        sequence_in = alphabet[start:end+1]
        sequence_out = alphabet[end + 1]
        dataX.append([char_to_int[char] for char in sequence_in])
        dataY.append(char_to_int[sequence_out])
        print(sequence_in, '->', sequence_out)
    for n in range(len(dataX)):
        while len(dataX[n]) “, result)
    # test arbitrary patterns
    pattern = [20, 21, 22]
    while len(pattern) “, result)

Did that listing get trimmed?

Hmm, I don’t seem to be able to post the full listing. Sorry.

Consider using pre HTML tags when posting code.

I think the pure solution is to re-map the letters in the range 1-26 and reserve 0 for padding.

Hey Prof Brownlee,

You do great work. I appreciate the diligent and detailed tutorials – thank you very much.

My question pertains to how lstm states might apply to non-“sequential” (but still time dependent) data, a la your air pollution tutorial –

https://machinelearningmastery.mystagingwebsite.com/multivariate-time-series-forecasting-lstms-keras/

1) Does setting an LSTM to [stateful = false] and using a batch size of 1 basically turn the LSTM into a more complicated feedforward net, i.e., a neural net with no memory or knowledge of sequence?

2) After training a “stateful” model above, you reset the states and then make predictions. This means (with respect to the standard LSTM equations, found here: https://en.wikipedia.org/wiki/Long_short-term_memory) that the previous cell state and the hidden state are zero. You can see this by using something like model.get_layer(index = 1).states[0].eval(). This is also true of a non-stateful model, where the states are listed as [none] in Keras. The confusing part is that resetting the state zeroes out the terms involving the forget gate and the “U” weights (as per the equations). Yet, as we see in your tutorial above, you can still make accurate predictions! It makes me wonder why we have a forget gate and U weights at all.

If my questions are confusing in any way, please let me know. Thanks ahead of time for your attention, and for these awesome tutorials.

Hi Dan, great questions.

No, LSTM memory units are still very different to simple neurons in a feed-forward network.

See this tutorial that forces the model to only use the internal state to predict the outcome:

https://machinelearningmastery.mystagingwebsite.com/memory-in-a-long-short-term-memory-network/

Dear Jason;

Thanks for your useful tutorials. I have a question regarding the ‘Accuracy’. As we see in the first example, “Naive LSTM”, the accuracy is 84% but none of the predictions are correct. My question is: how was this accuracy calculated?

Fantastic post, as always Jason. I especially appreciate the way you took time to show downside of alternatives to a stateless model, like piling in extra features. You say you are not a professor, but you think like a teacher, and not many are bothering to do that with such recent technologies.

I have a question about the requirement that we specify the batch size when an LSTM is stateful. In his blogpost on this subject (http://philipperemy.github.io/keras-stateful-lstm/) Philippe Remy says (my paraphrase) that:

* For each observation (row) in the input data Keras saves output_dim states (where output_dim = the number of cells in the LSTM), so that

* After processing a batch, it will have collected an array of state information of size (batch_size, output_dim)

* Since state straddles batch boundaries, we must specify the shape of that state array (batch_size, output_dim) in the model definition.
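
The description paraphrased in the three points above can be pictured with a toy example. This is a minimal NumPy sketch using a plain tanh recurrent cell, not Keras's LSTM implementation; the weights, sizes, and the helper `run_batch` are all invented for illustration. It shows a state array of shape (batch_size, units) in which row i is updated only by row i of each batch and is carried from one batch to the next.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, timesteps, features, units = 4, 3, 1, 2

# Toy recurrent cell (plain tanh RNN, not a real LSTM) with fixed random weights.
W = rng.normal(size=(features, units))
U = rng.normal(size=(units, units))

def run_batch(batch, state):
    """Process one batch of shape (batch_size, timesteps, features).
    Row i of `state` is updated only from row i of `batch`."""
    for t in range(batch.shape[1]):
        state = np.tanh(batch[:, t, :] @ W + state @ U)
    return state

state = np.zeros((batch_size, units))  # one state vector per row of the batch
batch1 = rng.normal(size=(batch_size, timesteps, features))
batch2 = rng.normal(size=(batch_size, timesteps, features))

state = run_batch(batch1, state)  # keep the state after the first batch...
state = run_batch(batch2, state)  # ...and use it to start the second batch
print(state.shape)
```

Under this picture, the batch size has to be known up front simply because it fixes the first dimension of the state array that persists between batches.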

My questions:

* Is that an accurate description?

* May I infer that state information straddles observations (rows of input data, sequences, whatever you want to call it)?

* May I infer that state information straddles observations even in “stateless” LSTMs? (even though it does *not* straddle batches)

* If the LSTM keeps old state information around for the entire sequence, does that mean that it could potentially operate on an old state (say the state held 4, 5 or 50 cells ago)?

* If so, does that make LSTMs resemble autoregressive routines with a *huge* look back, which are thus directly (not just implicitly) dependent on state values that occurred many steps before?

* Is that explained in the original LSTM paper by Hochreiter?

Thanks John.

I cannot speak to that post, I have not read it.

A batch is a group of samples. After a batch the weights are updated and state is reset. One epoch is comprised of 1 or more batches.

LSTMs are poor at autoregression. They are not learning a static function on lag observations, instead, they learn a more complex function with the internal state acting as internal variables to the function. Past observations are not inputs to the function directly, rather they influence the internal state and t-1 output.

Not sure if that helps.

Yes, it helps, because it eliminates the possibility of past states (older than t – 1) directly influencing current state. But then why does the LSTM algorithm keep state information for each and every sample (sequence) in the batch?

And if it is not true that it keeps past state information for each sample, then why does the stateful LSTM need to know the batch size?

The LSTM accumulates state over the exposure to each sample, it does not keep state for each separate sample.

The stateful LSTM only needs to know the batch size for Keras reasons, e.g. efficiency of the implementation when running on different hardware.

Hey Jason,

I tried your code for the Stateful LSTM for a One-Char to One-Char Mapping and the accuracy is surprisingly low during training (around 30%). Is there anything wrong? I can reach 100% by changing the batch size to 25 and increasing the number of iterations to a much larger value (like 3000), but I guess that’s not your purpose with this one.

Also, if we manually reset the states after each epoch, how can the model remember things?

Thanks.

I would recommend playing with the configuration and re-running the examples multiple times to account for the stochastic nature of the algorithm.

We want to reset memory between samples, generally. Learn more about BPTT here:

https://machinelearningmastery.mystagingwebsite.com/gentle-introduction-backpropagation-time/

Hello, Jason.

In the Keras stateful model, I want to use a batch_size of more than 2.

I want to learn a number of independent datasets in one model. To do this, batch_size must be set to the number of independent datasets.

But this is the only approach I can think of, and I do not know how to feed the data into the model.

If you know of a good website or example code, please share it.

I always appreciate your fantastic posts.

The batch size and state reset are decoupled when you make the LSTM stateful. The batch size will only control when weights are updated.

Hi Jason,

I join all the people thanking you for the amazing post. I also have a question related to “more than 1 datasets”.

As you said in the post, stateless LSTMs update the weights and reset the state at the end of every batch, while stateful LSTMs update the weights at the end of each batch but reset the state only at the end of the epoch, where we have control on the reset function, right?

Imagine I have N independent trajectories and I want to be able to generalize to new ones. My input at each time step is the latest (x,y) point and I want to predict the next one, making the LSTM stateful such as to keep memory of previous inputs and, hopefully, improve the prediction as I get more and more points. My question, therefore, is:

– if I reshape my overall dataset as a (len(single_dataset)*N, 1, len(features)), appending one dataset after the other and setting batch_size=len(single_dataset), is there a way to reset the memory between one batch and the other (meaning every time I move to a new dataset)? I think no, because I should do this inside the model.fit() function, but I still want your opinion;

– or should I reshape my overall dataset as before but now set batch_size=N, having a first batch containing the first point of all N datasets, then a second batch containing the second point, and so on? I believe this would be preferable, but I wonder if, at test time, I can use the same model (or define a new one) with batch_size=1 and predict one-to-one in a stateful way for a single trajectory only;

– is there a third way you can think of to solve the problem?

Thanks.

Yes, it’s all code. You can run through your data and reset states any way you wish. Perhaps try experimenting in order to build confidence?

Also, I think timesteps=1 is a code smell. Try an alternate framing.

Hallo Jason,

I’m wondering whether there is a relation between sequence_length and number_units!!

I’d imagine that number_units should be greater than (or at least equal) to sequence_length in order to make LSTM able to handle a sample of length = sequence_length without losing sequential information.

For example, if we have samples, each of length = 30 (from t0 to t29), then number_units=32 can be suitable because first unit handles t0, second unit handles t1 taking the first unit’s output into account, …. and unit 30 handles the last time step t29 taking all elapsed time steps (i.e. t0 to t28) into account. Thus, if we choose hereby e.g. number_units=20, then last ten time steps (i.e. t20 to t29) will not be handled by the LSTM, and thus the model will lose sequential information this way.

Is this correct?

Many thanks for your help,

Best regards,

Mazen

No relation (other than vague notions of network capacity).

Thank you.

Ok, for compilation and running the model there exits no relation. But for having a suitable model that can handle the entire expected sequential data, should num_units >= sequence_length? Because each LSTM-Unit handles a time step and thus if num_units num_units) will not be handled and their sequential information will be lost!!

Is this understanding correct?

This is a correction for mistyping:

…. Because each LSTM-Unit handles a time step and thus if num_units num_units) will not be handled and their sequential information will be lost!!

Is this understanding correct?

I apologize that my comments are always somehow corrupted when I submitted them 🙁

So, what I mean is whether my understanding is accurate or not:

For example, if we have samples, each of length = 30 (from t0 to t29), then number_units=32 can be suitable because first unit handles t0, second unit handles t1 taking the first unit’s output into account, …. and unit 30 handles the last time step t29 taking all elapsed time steps (i.e. t0 to t28) into account. Thus, if we choose hereby e.g. number_units=20, then last ten time steps (i.e. t20 to t29) will not be handled by the LSTM, and thus the model will lose sequential information this way.

Is my understanding correct?

The number of units and sequence length are unrelated.

Hi. Jason.

I want to know why you reshape ‘X’ and ‘x’ like above, when you use the window method.

I mean, When you design the simple LSTM model (one char to one char mapping) you reshape ‘dataX’ to ‘X’. And ‘X’ has an arrangement of ( len(dataX), seq_length, 1).

However, when you design the LSTM model using the window method, you reshape ‘X’ as an array of ( len(dataX), 1, seq_length).

And I wonder what is the difference between these arrays.

I’m sorry, I am a newbie in machine learning and even in Python.

Could you please explain in more detail? Thank you.

This post may help:

https://machinelearningmastery.mystagingwebsite.com/reshape-input-data-long-short-term-memory-networks-keras/
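
As a rough illustration of the two layouts asked about above, the NumPy sketch below reshapes the same three windows both ways. The tiny `dataX` here is invented for the example (A=0, B=1, and so on); only the two reshape calls mirror the ones quoted in the question.

```python
import numpy as np

# Three windows of three consecutive letters, encoded as integers (A=0, B=1, ...).
dataX = [[0, 1, 2], [1, 2, 3], [2, 3, 4]]
seq_length = 3

# Time-step framing: each sample is 3 time steps with 1 feature per step.
X_steps = np.reshape(dataX, (len(dataX), seq_length, 1))

# Window framing: each sample is 1 time step with 3 features.
X_window = np.reshape(dataX, (len(dataX), 1, seq_length))

print(X_steps.shape)         # (3, 3, 1)
print(X_window.shape)        # (3, 1, 3)
print(X_steps[0].tolist())   # [[0], [1], [2]]
print(X_window[0].tolist())  # [[0, 1, 2]]
```

The numbers are identical in both arrays. What changes is whether the LSTM sees the letters one after another as a sequence, or all at once as a single observation with several features.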

Hi Jason,

thanks for the tutorial. I tried your code with several other time series datasets, and I have the feeling that it is working “too well”. I’m not an expert in NNs, but I know ML well enough to say that there “might” be something wrong in my predictions, such as overfitting or similar (e.g. even on the dataset you use, the results I get are definitely better than yours, but this might depend on different versions of Keras, a different split of the data, etc.). So, I’m trying to figure things out.

My question is then about the stateful mode of the LSTM. Does it affect also the prediction? Or better, is the model stateful also in prediction mode?

For example, assume my test set is made of 100 time points (sequential examples), and I give them to the network one by one in the right sequence, as you do. Then, when the model predicts a value for the i-th example, is it storing or taking into account that, so far, the sequence has been of the examples from 1 to i-1? Does it preserve and use this historical information somehow for the i-th prediction?

PS: sorry if this question has already been asked. I tried to search the thread, but it was quite long.

My best advice would be to test and find out whether stateful LSTMs make a difference on your dataset.

Great tutorial. Hi Jason, I have sequences with variable-length input and output, like ABC->DEF, AB->CDEF, C->DEFG, etc. Can you help me with this?

Thank you 🙂

Perhaps you can use the above post as a starting point?

Hi Jason, your posts have been amazing for helping me understand LSTM. I just have a question about stateful.

Lets assume the following setup:

time_steps = 10

batch_size = 5

This means that within each batch, I have 5 samples, each with 10 rows of sequential data (for a total 50 rows).

I know that setting Stateful=True means that the hidden states are transferred from one batch to another, but how about within each batch?

According to the keras documentation, “If True, the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch.”

This sounds like the hidden state for the first sample of batch 1 gets used in the first sample of batch 2, and the hidden states for the 2nd sample of the first batch gets used in the 2nd sample of the second batch.

Does this mean that the hidden states don’t transfer between samples WITHIN a batch? i.e. the hidden states from the first timesteps of my batch don’t transfer to the second sample.

Under this assumption, it seems that my batch size should be 1 when working with time series data.

Thank you!

No, it means that the state is accumulated over samples within the batch and used as the first state on the next batch.

Actually in the “Stateful LSTM for a One-Char to One-Char Mapping” section, the network already shows signs of overfitting.

When you start with the new seed ‘K’, the network predicts the next character as ‘B’ instead of ‘L’. I haven’t tried this network myself yet, but I have the feeling that no matter what seed you feed in, it will always predict the next character as ‘B’.

Any way to fix this problem?

Perhaps less training or a larger training set.

I copied your code for the stateful LSTM that learns the one-char to one-char mapping directly, but it does not produce the result shown in your article. I do not understand where the problem is. The accuracy of the model is very poor.

Model Accuracy: 36.00%

A -> B

B -> C

C -> C

C -> D

D -> E

E -> F

F -> G

G -> H

H -> I

I -> I

I -> L

L -> L

L -> M

M -> O

O -> P

P -> P

P -> R

R -> S

S -> S

S -> U

U -> U

U -> U

U -> Y

Y -> Z

Z -> Z

New start: K

K -> B

B -> C

C -> C

C -> D

D -> D

Perhaps try running the example a few times?

Hi Jason,

First of all, thank you very much for your blog posts. I benefited greatly from them, especially the ones about LSTMs, and I appreciate your contributions.

However I have some difficulties while practicing what I have learned.

I am trying to predict device faults one day ahead using alarm data. Let me summarize the data briefly.

The training data is a 100*1000 array that contains only 0s and 1s. Rows represent the index of the day (100 sequential days).

The columns are the dictionary of alarms, which means there are 1000 different kinds of alarms. If an alarm occurs, I record it as 1, otherwise as 0.

To be clearer, if an alarm (the one corresponding to column index 200) occurs on day i, then X[i,200] = 1.

As Y values (labels), let’s say I have an array of size (100*50). The Y values represent the device faults. There are 50 different kinds of device faults that I am trying to predict.

I keep the distribution of device faults per day like [0, 0, 0, 1, 0, 1, …], similar to the training data.

So I want to predict the device faults that occur on day 101 using the alarm and device fault data from the first 100 days.

I think it is a kind of sequence-to-sequence prediction. For that kind of problem, should I use a stateful LSTM, and additionally return_sequences=True? And how should I shape the input and output data?

Or is designing 50 different LSTMs, each making a binary classification for one device fault, a better approach?

If you give me any advice, I really appreciate it.

Thanks in advance.

Good question.

We cannot know the best model for the problem. I recommend testing a suite of different framings and modeling methods to see what works best.

I would suggest trying one model over multiple models to allow the opportunity to generalize across cases.

If it is a time series classification problem, which it does sound like, perhaps try 1D CNNs. I have found them to be very effective.

Let me know how you go.

Can someone tell me the correct meaning of time_step in numpy.reshape(dataX, (len(dataX), 1, seq_length))?

One series is one sample comprised of multiple time steps where an observation is recorded at each time step.

Perhaps this will help:

https://machinelearningmastery.mystagingwebsite.com/reshape-input-data-long-short-term-memory-networks-keras/

thanks for the reply. thanks for the great tutorial

Hi Jason, I implemented your code for the Stateful LSTM for a One-Char to One-Char Mapping, but I was not able to achieve an accuracy greater than 32%. Can you tell me what I might be doing wrong? Also, when we train the model inside the ‘for’ loop, the statement “model.reset_states()” resets the model’s state after training for the current epoch is done and before training for the next epoch begins. So the current states of the LSTM will not be available for the next epoch, and it will not behave as a stateful LSTM. Can you tell me what I am assuming wrong?

That is correct.

I have some suggestions here that may help:

https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

Simplest introduction to LSTM on the internet. Thanks a lot.

Thanks.

Hi, Jason

Thanks for your tutorial. I have one question about the “time-step”. Assume that I have a dataset with 100 samples in a sequence and 10 features in each sample, so the dataset is 100*11 (10 features, 1 dependent variable). If I set time-step = 1 (as you used in https://machinelearningmastery.mystagingwebsite.com/multivariate-time-series-forecasting-lstms-keras/), does that mean that I only predict today’s info by using yesterday’s info? If that is the case, does that mean that I am not fully utilizing the LSTM’s power, since an LSTM is capable of using long-range information?

Thanks.

You can define the input and output any way that you wish for your specific problem.

Hi, can you help explain what it means that the LSTM can automatically determine the time lag? And what does the time lag mean? Is it different from the timesteps?

Thanks very much!

It will learn what time steps to pay attention to and in what way in order to make a prediction.

In another post I found the statement: “LSTM is well-suited to classify, process and predict time series given time lags of unknown duration.”

Hi Jason, I know you are an expert in this field, so can you help explain what “given time lags of unknown duration” means?

It means that you do not need to analyze and specify the specific lag obs to provide, you just provide sequences of time steps and let the model learn from them.

Generally, LSTMs are poor at time series forecasting though, and you might want to look into using CNNs instead.

The “LSTM State Within A Batch” example is a bit misleading here as the samples within a batch are processed in parallel hence LSTM states for each sample are independent and do not really help from sample to sample in any sequential manner. Using shuffle=True in that example achieves the same result.

The main reason why it achieved 100% accuracy on training data is primarily due to the large number of epochs=5000.

The “LSTM State Within A Batch” example confused me too. To me, it seems to imply that the hidden states and the cell states at the end of one sequence are passed to the start of the next sequence if the next sequence is in the same batch. That doesn’t sound right to me, because sequences are usually independent and not related to each other even if they are in the same batch.

We can choose to reset state at the end of the batch or end of the sample/sequence. We have control.

It can make sense to maintain state across samples/sequences for datasets where there is a ranked order relationship between the sequences.

In the “Stateful LSTM for a One-Char to One-Char Mapping” , when I change the batch size to 5, it gives me an error like:

could not broadcast input array from shape (5,26) into shape (1,26).

Why am I not able to give batch_size=5 or any other number? To what shape am I supposed to change my input array?

You can, but the model will have to be used with the same batch size, even when making a prediction.

I explain this more and the workaround here:

https://machinelearningmastery.mystagingwebsite.com/use-different-batch-sizes-training-predicting-python-keras/

Thanks a lot for the detailed explanation and examples. It’s really helpful for beginners.

You’re very welcome!

Thanks for this good tutorial!

I have one confused part regarding to the normalization as the following.

Once reshaped, we can then normalize the input integers to the range 0-to-1, the range of the sigmoid activation functions used by the LSTM network.

Is the reason you need to normalize the input to [0, 1] that the range of the LSTM’s sigmoid activation function is [0, 1]? And does that mean every kind of LSTM input needs to be normalized to [0, 1] before use?

Thanks in advance!

Also, since the input is normalized for training, would the same normalization also be required for the test input?

Yes.

It is a good practice, unless you change the lstm to use a different activation function, such as relu.
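
A minimal sketch of what applying the same scaling to training and new inputs could look like is shown below. It assumes the scaling is a plain division of the integer codes by the alphabet size; the helper `encode` is a made-up name for this illustration.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}

def encode(chars):
    """Integer-encode letters, reshape to (1, time steps, 1), and scale
    by the alphabet size so that all values fall in the range [0, 1)."""
    ints = np.array([char_to_int[c] for c in chars], dtype=float)
    return ints.reshape(1, len(chars), 1) / float(len(alphabet))

x_train = encode("ABC")
x_new = encode("XYZ")  # new input goes through exactly the same scaling
print(x_train.min() >= 0.0, x_new.max() < 1.0)
```

Routing both training and prediction inputs through one function is a simple way to make sure the model never sees values on a different scale at prediction time.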

Hi Jason,

Thank you so much for this article! It’s helped me learn a great deal about LSTMs and I refer back to it consistently when working with more complex examples. I found a small bug that I wanted to address.

In the last example (variable length input sequences), you create a char to int mapping from 0 to 25 (as in the previous examples), but then you also pad the sequences. The default padding value for keras.preprocessing.sequence.pad_sequences is 0.0, so this means that both the letter “A” and the padded characters have a value of zero. So the sequence [‘A’, ‘B’, ‘C’] would be converted to [0, 0, 0, 1, 2], and the sequence [‘B’, ‘C’] would also convert to [0, 0, 0, 1, 2]. Obviously this is incorrect, since the padded sequences are indistinguishable from sequences that start with repeated values of the letter ‘A’.

A simple fix for this is to create a char to int mapping of the letters of the alphabet from 1 to 26 instead of 0 to 25 (leaving zero to only represent the padding character). The code change looks like this:

# create mapping of characters to integers (1-26) and the reverse

char_to_int = dict((c, i+1) for i, c in enumerate(alphabet))

int_to_char = dict((i+1, c) for i, c in enumerate(alphabet))

A simple change but fixes the issue – it didn’t end up making much of a difference in the model accuracy in this case, but could potentially have a larger effect with other sequences.
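
The collision described above can be reproduced without Keras. In the sketch below, `pre_pad` is a hand-written stand-in for the default left-padding behaviour of pad_sequences (it does not truncate), and `encode` is a made-up helper; neither is part of the tutorial's code.

```python
def pre_pad(seq, maxlen, value=0):
    """Left-pad a list of ints with `value` up to `maxlen`
    (stand-in for pad_sequences' default 'pre' padding; no truncation)."""
    return [value] * (maxlen - len(seq)) + list(seq)

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
zero_based = {c: i for i, c in enumerate(alphabet)}     # A=0 ... Z=25
one_based = {c: i + 1 for i, c in enumerate(alphabet)}  # A=1 ... Z=26, 0 reserved for padding

def encode(text, mapping, maxlen=5):
    return pre_pad([mapping[c] for c in text], maxlen)

print(encode("ABC", zero_based), encode("BC", zero_based))  # identical encodings
print(encode("ABC", one_based), encode("BC", one_based))    # distinct encodings
```

With the 0-based mapping both inputs become [0, 0, 0, 1, 2]. With the 1-based mapping they become [0, 0, 1, 2, 3] and [0, 0, 0, 2, 3], so the padding can no longer be mistaken for the letter “A”.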

Excellent thanks for pointing this out Brian!

Without validation, the LSTM network would memorize all the information.

Yes, it may.

I believe the stateful LSTM is memorizing all the information.

I’ve modified the test case to try to warm up from K, inputting the next outputs in the sequence manually instead of taking the last LSTM output. These are the results which seem to suggest the information is being memorized. (I’ve tried a few others with the same LSTM output, the alphabet in order starting from B)

New start: K

K -> B

L -> C

M -> D

N -> E

O -> F

P -> G

Q -> H

R -> I

S -> J

T -> K

U -> L

V -> M

Do these tests seem reasonable grounds for suggesting that the following statement in the article should be walked back?

“We can also see that if we seed the network with the first letter, that it can correctly rattle off the rest of the alphabet.

We can also see that it has only learned the full alphabet sequence and that from a cold start. When asked to predict the next letter from “K” that it predicts “B” and falls back into regurgitating the entire alphabet.”

The article still is very helpful for understanding LSTM features though! I agree that the statement says how a correctly trained LSTM should be trained. I tried rolling the X and y values between epochs, but the loss was not converging (using the same network structure). Any recommendations on how to change the network structure or parameters to prevent this overfitting and train on this example statefully correctly?

Thanks.

Yes, start here:

https://machinelearningmastery.mystagingwebsite.com/start-here/#better

I’m left with a doubt: which kind of tasks strictly require the use of stateful LSTMs?

Great question!

Those where you want fine grained control over when the internal state of the model is reset.

More specifically, when there is a dependency between samples across batches.

In the “LSTM with Variable Length Input to One-Char Output” example, could you please review the limits defined for the “end” variable?

For example, if the following line

“end = numpy.random.randint(start, min(start+max_len,len(alphabet)-1))”

randomly selects the value

“end = 25” as “len(alphabet)-1 = 25”

then “sequence_out = alphabet[end + 1]” would try to take the character at index 26 of the alphabet, which doesn’t exist.

Sorry, I don’t follow your comment.

Perhaps you can elaborate or restate your question?
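
One detail relevant to the question above: the upper bound of numpy.random.randint is exclusive, so with an upper bound of at most len(alphabet)-1 = 25, the largest value “end” can take is 24, and alphabet[end + 1] is at most “Z”. The loop below is a quick empirical check of that reasoning, reusing the two lines from the example; the variable `largest_end` is added for illustration.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
max_len = 5
np.random.seed(7)

largest_end = -1
for _ in range(10000):
    start = np.random.randint(len(alphabet) - 2)
    end = np.random.randint(start, min(start + max_len, len(alphabet) - 1))
    largest_end = max(largest_end, end)
    _ = alphabet[end + 1]  # would raise IndexError if end could reach 25

# The exclusive upper bound keeps end at or below len(alphabet) - 2.
print(largest_end)
```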

Hi Jason, thank you for the great article.

Why would you say the LSTM learns better when “ABC” is fed into the model as 3 timesteps of 1 feature rather than as 1 timestep of 3 features? Does it have to do with the volume of data? When, in general, does this approach work better? References to any articles on this matter would also be helpful. Thank you in advance.

The framing of a problem that works best depends on the model and dataset.

Hi Jason, Could you please answer below questions

1) A stateless LSTM does indeed have state, and it is carried over from one sample to the subsequent sample within a batch. Is this correct?

2) In “LSTM with Variable-Length Input to One-Char Output”, you reverted to a stateless LSTM with shuffle turned on. Will it carry state across samples?

Correct.

Yes.

Hi Jason, your articles are greatly useful.

But I still have some questions about the input features. I also read your other article: https://machinelearningmastery.mystagingwebsite.com/multivariate-time-series-forecasting-lstms-keras/. It uses 7 features [pollution, dew, temp, press, wnd_spd, snow, rain] at (t-1) to predict the [pollution] at (t).

As far as I know, an LSTM cell concatenates the t-1 output with the current t input. Is it mandatory to concatenate the previous output with the previous input ourselves? Is that why you use the pollution at (t-1) [which is the output of (t-2)] as one of the input features? Or will the Keras LSTM layer do that automatically?

In other words, can we just use [dew, temp, press, wnd_spd, snow, rain] at (t-1) to predict the pollution at t, so that pollution is not included in the input feature list?

Actually, I want to use an LSTM to predict bus arrival times from the inputs [latitude, longitude, travel time, speed]. I want to know whether I need to add the previous arrival time as one of the inputs.

Thank you!

Thanks!

LSTMs take a sequence of time steps as input. You can define how many time steps are taken as input – it is up to you. You can frame the problem any way you like – and it is a good idea to test many approaches and discover what works well.

LSTM input can be confusing, this explains it the best way I know how:

https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/what-is-the-difference-between-samples-timesteps-and-features-for-lstm-input

I tried to do this, but I still get the error “pandas.errors.ParserError: Error tokenizing data.”

what is the problem please?

Thank you

Sorry to hear that, perhaps some of these tips will help:

https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me

Hello Jason,

thank you so much for providing the best tutorials I ever found on this field!

In the chapter above “LSTM State Within A Batch”:

in the full code example block in line 28, what

do you think of using the padded X-sequence for

reshaping the input rather than using the raw dataX ?

Of course, this doesn’t affect this particular case, but may be more general.

Regards

Harald

Thanks!

Perhaps try padding with a masking layer and compare results.

Hello, your posts have greatly improved my knowledge and understanding of LSTMs. I have a quick question, because I was attempting something similar to this while also incorporating the bidirectional approach from your time series forecasting article. A string of letters is the training dataset, and for prediction, if you have “ABCD”, for example, it should predict E, or if there is a gap in the letters, it should predict the next one. This is similar to how, in the time series forecasting post, the model came close to predicting 100 after being given 70, 80, 90. I kept getting an error when attempting to create a prediction function for the demonstration, even after converting X, y from chars to ints: “Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=2. Full shape received: (None, 3)”. I was using x_input = array([‘B’,’C’, ‘D’]) but was unable to get an output, because the prediction function in this code wasn’t able to predict the value if I take a letter away.

I am not sure what your issue is. Perhaps try reshaping your input to 3D [samples, time steps, features] before passing it to the network?
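The error quoted above says the LSTM received a 2D array where it expects 3D input in the form [samples, time steps, features]. Below is a minimal sketch of preparing a single three-letter input, assuming the same letter-to-integer mapping and division by the alphabet size used in the tutorial. `prepare_input` is a hypothetical helper name.

```python
import numpy as np

alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
char_to_int = {c: i for i, c in enumerate(alphabet)}

def prepare_input(letters):
    """Hypothetical helper: map letters to ints, scale, and reshape to 3D."""
    ints = np.array([char_to_int[c] for c in letters], dtype="float32")
    # One sample, len(letters) time steps, one feature per time step.
    return (ints / len(alphabet)).reshape(1, len(letters), 1)

x_input = prepare_input(["B", "C", "D"])
print(x_input.shape)  # (1, 3, 1)
# prediction = model.predict(x_input, verbose=0)  # assumes a model trained on 3 time steps
```

The number of time steps passed at prediction time has to match what the model was built for, unless the shorter input is padded to that length first.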

My apologies, this is the other post I am referring to: https://machinelearningmastery.mystagingwebsite.com/how-to-develop-lstm-models-for-time-series-forecasting/

Thanks for this tutorial on LSTM. It was a good introduction.

I would like to ask if this is considered data leakage? Essentially the test set is the same as the training set.

Can you point out where the test set is the same as the training set? In general, data leakage is when information about the output indirectly appears in the input, e.g., via transformation or scaling.

Hi Jason,

Thanks a lot for the great post. I have a question regarding whether state between samples is maintained inside a batch for a stateless model.

From the “LSTM State Within A Batch” example, you showed that the LSTM is able to learn the sequence within a batch even though it is stateless. But the number of epochs for this example is 5000, which is 10 times larger than in the “Naive LSTM to learn one-char to one-char mapping” example. I tried training the naive LSTM for 5000 epochs, and it also gives me 100% accuracy.

Hi Jiansheng,

You may be working on a regression problem and achieve zero prediction errors.

Alternately, you may be working on a classification problem and achieve 100% accuracy.

This is unusual and there are many possible reasons for this, including:

- You are evaluating model performance on the training set by accident.
- Your hold out dataset (test or validation) is too small or unrepresentative.
- You have introduced a bug into your code and it is doing something different from what you expect.
- Your prediction problem is easy or trivial and may not require machine learning.

The most common reason is that your hold out dataset is too small or not representative of the broader problem.

This can be addressed by:

- Using k-fold cross-validation to estimate model performance instead of a train/test split.
- Gathering more data.
- Using a different split of data for train and test, such as 50/50.
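The k-fold suggestion above can be sketched with NumPy alone. The `k_fold_indices` function below is a hypothetical helper written for illustration (scikit-learn's `KFold` provides the same idea ready-made), and the sample count and number of folds are arbitrary.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Hypothetical helper: yield (train_idx, test_idx) pairs for k folds."""
    rng = np.random.default_rng(seed)
    # Shuffle the sample indices once, then cut them into k roughly equal folds.
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]

splits = list(k_fold_indices(n_samples=20, k=5))
for train_idx, test_idx in splits:
    # Fit a fresh model on the train_idx rows and evaluate it on the test_idx rows,
    # then average the k scores for the final estimate.
    print(len(train_idx), len(test_idx))  # 16 4 for every fold
```

Shuffled folds assume the samples are independent of one another. For ordered sequence data, a split that respects time order is usually the safer choice.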

I follow your channel. It is a really valuable channel for us. I subscribed and recommended it to my friends. You explained the LSTM topics very well, thank you. I have a question. I am working on the estimation of the coordinate viewed on the PC screen using eye images with a webcam. I was able to achieve this prediction thing using CNN. For now, I’m working on how I can do this with LSTM.

For each eye image, I have the coordinates of the corresponding gaze point on the PC screen. I get features for each image from the eye images with a CNN. I created an end-to-end model by passing these features through an LSTM.

————-

First, I pass the eye images in sequences of 40 through a time-distributed CNN and flatten them in the last layer of the CNN. flatten = (40, 5120)

Then I export the flatten result to LSTM and train it end-to-end.

——

My question is,

1- Do I need to shift the target coordinate data by 1 step?

2- When I call Model.predict with only 1 image to estimate the coordinate on the screen, I get a size error. I think this is because 40 images were used in training, so the model still expects 40 images. How can I make a prediction using 1 image? From what I have read online, the solution requires stateful=True so that state information is kept. I would be glad if you could help.

Hi Furkan…based upon your description, you may benefit from CNN+LSTM models.

https://machinelearningmastery.mystagingwebsite.com/cnn-long-short-term-memory-networks/

> Once reshaped, you can then normalize the input integers to the range 0-to-1, the range of the sigmoid activation functions used by the LSTM network.

It appears the default activation for LSTM in Keras is `tanh`, corresponding to a range of -1-to-1. The sigmoid activation, while mentioned above, is not specified when creating the LSTM units. So it would seem that either the sigmoid activation function should be specified when creating the LSTM units, or the data should be scaled to the -1-to-1 range. Unless I am missing something. Thank you for the post.

Hi Corey…The ReLU activation function may prove beneficial as well:

https://machinelearningmastery.mystagingwebsite.com/rectified-linear-activation-function-for-deep-learning-neural-networks/
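On the range question raised above: in Keras, the `LSTM` layer defaults to `activation='tanh'` and `recurrent_activation='sigmoid'`, so both functions are present in the layer. If you want to compare input scalings, a minimal sketch follows. The `scale_to_range` helper is hypothetical, and whether the wider range helps on this problem is something to test rather than assume.

```python
import numpy as np

def scale_to_range(values, low, high, v_min, v_max):
    """Hypothetical helper: linearly map values from [v_min, v_max] to [low, high]."""
    values = np.asarray(values, dtype="float32")
    return low + (values - v_min) * (high - low) / (v_max - v_min)

codes = np.arange(26)                           # integer codes for A..Z
unit = scale_to_range(codes, 0.0, 1.0, 0, 25)    # 0-to-1 range
sym = scale_to_range(codes, -1.0, 1.0, 0, 25)    # -1-to-1 range, matching tanh's output
print(unit.min(), unit.max(), sym.min(), sym.max())
```

Either array can then be reshaped to [samples, time steps, features] and fed to the same network, so the two scalings can be compared under otherwise identical settings.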
