Text Generation With LSTM Recurrent Neural Networks in Python with Keras

Recurrent neural networks can also be used as generative models.

This means that in addition to being used for predictive models (making predictions) they can learn the sequences of a problem and then generate entirely new plausible sequences for the problem domain.

Generative models like this are useful not only to study how well a model has learned a problem, but to learn more about the problem domain itself.

In this post you will discover how to create a generative model for text, character-by-character using LSTM recurrent neural networks in Python with Keras.

After reading this post you will know:

  • Where to download a free corpus of text that you can use to train text generative models.
  • How to frame the problem of text sequences to a recurrent neural network generative model.
  • How to develop an LSTM to generate plausible text sequences for a given problem.

Let’s get started.

Note: LSTM recurrent neural networks can be slow to train and it is highly recommended that you train them on GPU hardware. You can access GPU hardware in the cloud very cheaply using Amazon Web Services; see the tutorial here.

  • Update Oct/2016: Fixed a few minor comment typos in the code.
  • Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0.

Photo by Russ Sanderlin, some rights reserved.

Problem Description: Project Gutenberg

Many of the classical texts are no longer protected under copyright.

This means that you can download all of the text for these books for free and use them in experiments, like creating generative models. Perhaps the best place to get access to free books that are no longer protected by copyright is Project Gutenberg.

In this tutorial we are going to use a favorite book from childhood as the dataset: Alice’s Adventures in Wonderland by Lewis Carroll.

We are going to learn the dependencies between characters and the conditional probabilities of characters in sequences so that we can in turn generate wholly new and original sequences of characters.

This is a lot of fun and I recommend repeating these experiments with other books from Project Gutenberg; here is a list of the most popular books on the site.

These experiments are not limited to text. You can also experiment with other ASCII data, such as computer source code, marked up documents in LaTeX, HTML or Markdown and more.

You can download the complete text in ASCII format (Plain Text UTF-8) for this book for free and place it in your working directory with the filename wonderland.txt.

Now we need to prepare the dataset for modeling.

Project Gutenberg adds a standard header and footer to each book and this is not part of the original text. Open the file in a text editor and delete the header and footer.

The header is obvious and ends with the text:

The footer is all of the text after the line of text that says:

You should be left with a text file that has about 3,330 lines of text.

Develop a Small LSTM Recurrent Neural Network

In this section we will develop a simple LSTM network to learn sequences of characters from Alice in Wonderland. In the next section we will use this model to generate new sequences of characters.

Let’s start off by importing the classes and functions we intend to use to train our model.

Next, we need to load the ASCII text for the book into memory and convert all of the characters to lowercase to reduce the vocabulary that the network must learn.

Now that the book is loaded, we must prepare the data for modeling by the neural network. We cannot model the characters directly, instead we must convert the characters to integers.

We can do this easily by first creating a set of all of the distinct characters in the book, then creating a map of each character to a unique integer.
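
As a minimal sketch of this mapping step, assuming the book text is already held in a string (a short stand-in string is used here instead of the contents of wonderland.txt, and the variable names are illustrative):

```python
# Stand-in for the book text. In practice this would be the contents of
# wonderland.txt read from disk, converted to lowercase.
raw_text = "Alice was beginning to get very tired".lower()

# Sorted list of the distinct characters, so the mapping is reproducible.
chars = sorted(set(raw_text))

# Map each distinct character to a unique integer.
char_to_int = {c: i for i, c in enumerate(chars)}

# Summarize the dataset.
n_chars = len(raw_text)
n_vocab = len(chars)
print("Total Characters:", n_chars)
print("Total Vocab:", n_vocab)
```

Sorting the characters before enumerating them means the same text always produces the same mapping, which matters later when weights saved from one run are loaded in another script.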

For example, the list of unique sorted lowercase characters in the book is as follows:

You can see that there are some characters that we could remove to clean up the dataset further. Doing so would reduce the vocabulary and may improve the modeling process.

Now that the book has been loaded and the mapping prepared, we can summarize the dataset.

Running the code to this point produces the following output.

We can see that the book has just under 150,000 characters and that, once converted to lowercase, there are 47 distinct characters in the vocabulary for the network to learn. That is well above the 26 letters in the alphabet, because punctuation and whitespace characters are counted too.

We now need to define the training data for the network. There is a lot of flexibility in how you choose to break up the text and expose it to the network during training.

In this tutorial we will split the book text up into subsequences with a fixed length of 100 characters, an arbitrary length. We could just as easily split the data up by sentences and pad the shorter sequences and truncate the longer ones.

Each training pattern of the network is comprised of 100 time steps of one character (X) followed by one character output (y). When creating these sequences, we slide this window along the whole book one character at a time, allowing each character a chance to be learned from the 100 characters that preceded it (except the first 100 characters of course).

For example, if the sequence length is 5 (for simplicity) then the first two training patterns would be as follows:

As we split up the book into these sequences, we convert the characters to integers using our lookup table we prepared earlier.
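
The sliding window described above can be sketched as follows. This is an illustration under assumptions: the text is a short stand-in string rather than the whole book, and the sequence length is 5 rather than 100 so the patterns are easy to read.

```python
# Stand-in text; the tutorial uses the full lowercased book.
raw_text = "chapter i. down the rabbit-hole"
chars = sorted(set(raw_text))
char_to_int = {c: i for i, c in enumerate(chars)}

seq_length = 5  # the tutorial uses 100; 5 keeps the example readable
dataX, dataY = [], []
for i in range(len(raw_text) - seq_length):
    seq_in = raw_text[i:i + seq_length]   # input window of characters
    seq_out = raw_text[i + seq_length]    # the single character that follows
    dataX.append([char_to_int[c] for c in seq_in])
    dataY.append(char_to_int[seq_out])

n_patterns = len(dataX)
print("Total Patterns:", n_patterns)

# The first two patterns, shown as characters.
print(repr(raw_text[0:5]), "->", repr(raw_text[5]))
print(repr(raw_text[1:6]), "->", repr(raw_text[6]))
```

The number of patterns is the text length minus the sequence length, because the first seq_length characters have no complete window in front of them.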

Running the code to this point shows us that when we split up the dataset into training data for the network to learn, we have just under 150,000 training patterns. This makes sense because, excluding the first 100 characters, we have one training pattern to predict each of the remaining characters.

Now that we have prepared our training data we need to transform it so that it is suitable for use with Keras.

First we must transform the list of input sequences into the form [samples, time steps, features] expected by an LSTM network.

Next we need to rescale the integers to the range 0-to-1 to make the patterns easier to learn by the LSTM network, whose gates use sigmoid-style activation functions by default.

Finally, we need to convert the output patterns (single characters converted to integers) into a one hot encoding. This is so that we can configure the network to predict the probability of each of the 47 different characters in the vocabulary (an easier representation) rather than trying to force it to predict precisely the next character. Each y value is converted into a sparse vector with a length of 47, full of zeros except with a 1 in the column for the letter (integer) that the pattern represents.

For example, when “n” (integer value 31) is one hot encoded it looks as follows:

We can implement these steps as below.
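
A plain-Python sketch of these three steps is shown here to make the shapes concrete. It uses nested lists instead of NumPy arrays, the integer patterns are made up for illustration, and dividing by the vocabulary size is one assumed way to rescale into the 0-to-1 range.

```python
n_vocab = 47      # vocabulary size reported above
seq_length = 5    # shortened from 100 for illustration

# Made-up integer patterns standing in for the prepared training data.
dataX = [[3, 10, 31, 0, 46], [10, 31, 0, 46, 2]]
dataY = [2, 31]

# Steps 1 and 2: [samples, time steps, features] with a single feature per
# time step, rescaled by dividing each integer by the vocabulary size.
X = [[[value / float(n_vocab)] for value in pattern] for pattern in dataX]

# Step 3: one hot encode each output integer.
def one_hot(index, size):
    vector = [0] * size
    vector[index] = 1
    return vector

y = [one_hot(index, n_vocab) for index in dataY]

print(len(X), len(X[0]), len(X[0][0]))  # samples, time steps, features
print(y[1])  # all zeros except a 1 at position 31
```

With real data you would normally hold X and y in NumPy arrays, since that is what Keras expects as input.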

We can now define our LSTM model. Here we define a single hidden LSTM layer with 256 memory units. The network uses dropout with a probability of 20% (0.2). The output layer is a Dense layer using the softmax activation function to output a probability between 0 and 1 for each of the 47 characters.

The problem is really a single character classification problem with 47 classes and as such is defined as optimizing the log loss (cross entropy), here using the Adam optimization algorithm for speed.
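
To give a feel for what the log loss measures, here is a small worked sketch in plain Python. The helper names and the example scores are illustrative assumptions; the loss for one pattern is the negative natural log of the probability assigned to the correct character.

```python
import math

def softmax(scores):
    # Subtract the maximum score for numerical stability.
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    # Log loss for a single pattern with a one hot encoded target.
    return -math.log(probs[true_index])

n_vocab = 47
true_index = 31

# A model that spreads probability evenly over all 47 characters.
uniform = [1.0 / n_vocab] * n_vocab
print(round(cross_entropy(uniform, true_index), 4))

# A model that favours the correct character (made-up scores).
scores = [0.0] * n_vocab
scores[true_index] = 3.0
probs = softmax(scores)
print(round(cross_entropy(probs, true_index), 4))
```

A network that knows nothing and guesses uniformly scores a loss of ln(47), so any trained model should do better than that, and a loss of 0 would mean every next character is predicted with complete certainty.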

There is no test dataset. We are modeling the entire training dataset to learn the probability of each character in a sequence.

We are not interested in the most accurate (classification accuracy) model of the training dataset. This would be a model that predicts each character in the training dataset perfectly. Instead we are interested in a generalization of the dataset that minimizes the chosen loss function. We are seeking a balance between generalization and overfitting but short of memorization.

The network is slow to train (about 300 seconds per epoch on an Nvidia K520 GPU). Because of the slowness and because of our optimization requirements, we will use model checkpointing to record all of the network weights to file each time an improvement in loss is observed at the end of the epoch. We will use the best set of weights (lowest loss) to instantiate our generative model in the next section.

We can now fit our model to the data. Here we use a modest number of 20 epochs and a large batch size of 128 patterns.

The full code listing is provided below for completeness.

You will see different results because of the stochastic nature of the model, and because it is hard to fix the random seed for LSTM models to get 100% reproducible results. This is not a concern for this generative model.

After running the example, you should have a number of weight checkpoint files in the local directory.

You can delete them all except the one with the smallest loss value. For example, when I ran this example, below was the checkpoint with the smallest loss that I achieved.

The network loss decreased almost every epoch and I expect the network could benefit from training for many more epochs.
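
Picking out the checkpoint with the smallest loss can be automated. The sketch below assumes a hypothetical naming pattern in which each filename ends with the epoch number and the loss value; both the pattern and the loss values are made up for illustration and may not match the files from your run.

```python
import re

# Hypothetical checkpoint filenames of the form "<prefix>-<epoch>-<loss>.hdf5".
filenames = [
    "weights-01-2.9500.hdf5",
    "weights-02-2.4000.hdf5",
    "weights-03-2.1000.hdf5",
    "notes.txt",
]

# Capture the epoch and the loss from the end of the filename.
pattern = re.compile(r"-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(names):
    scored = []
    for name in names:
        match = pattern.search(name)
        if match:
            scored.append((float(match.group(2)), name))
    # The smallest loss sorts first; return None if nothing matched.
    return min(scored)[1] if scored else None

print(best_checkpoint(filenames))
```

In practice the list of names could come from os.listdir() on the working directory.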

In the next section we will look at using this model to generate new text sequences.

Generating Text with an LSTM Network

Generating text using the trained LSTM network is relatively straightforward.

Firstly, we load the data and define the network in exactly the same way, except the network weights are loaded from a checkpoint file and the network does not need to be trained.

Also, when preparing the mapping of unique characters to integers, we must also create a reverse mapping that we can use to convert the integers back to characters so that we can understand the predictions.

Finally, we need to actually make predictions.

The simplest way to use the Keras LSTM model to make predictions is to first start off with a seed sequence as input, generate the next character then update the seed sequence to add the generated character on the end and trim off the first character. This process is repeated for as long as we want to predict new characters (e.g. a sequence of 1,000 characters in length).

We can pick a random input pattern as our seed sequence, then print generated characters as we generate them.
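
The loop described above can be sketched without a trained network. In this illustration the predict() function is a stand-in that simply counts which character most often follows the last character of the pattern; a trained LSTM would take its place. The text, sequence length and seed value are assumptions chosen to keep the example small.

```python
import random

raw_text = "the cat sat on the mat and the cat ran to the hat"
chars = sorted(set(raw_text))
char_to_int = {c: i for i, c in enumerate(chars)}
int_to_char = {i: c for i, c in enumerate(chars)}  # reverse mapping
n_vocab = len(chars)
seq_length = 10

def predict(pattern):
    # Stand-in for the model: one score per character in the vocabulary,
    # based on how often each character follows the last one in the pattern.
    scores = [0] * n_vocab
    last = pattern[-1]
    for current, following in zip(raw_text, raw_text[1:]):
        if char_to_int[current] == last:
            scores[char_to_int[following]] += 1
    return scores

# Pick a random window of the text as the seed sequence.
random.seed(7)
start = random.randint(0, len(raw_text) - seq_length - 1)
pattern = [char_to_int[c] for c in raw_text[start:start + seq_length]]
print("Seed:", repr("".join(int_to_char[v] for v in pattern)))

generated = []
for _ in range(50):
    prediction = predict(pattern)
    # Index of the largest score, i.e. the most likely next character.
    index = max(range(n_vocab), key=lambda i: prediction[i])
    generated.append(int_to_char[index])
    # Slide the window: append the new character, drop the oldest one.
    pattern.append(index)
    pattern = pattern[1:]

print("".join(generated))
```

Always taking the most likely character makes generation deterministic for a given seed, which is one reason generated text can fall into repeating loops.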

The full code example for generating text using the loaded LSTM model is listed below for completeness.

Running this example first outputs the selected random seed, then each character as it is generated.

For example, below are the results from one run of this text generator. The random seed was:

The generated text with the random seed (cleaned up for presentation) was:

We can note some observations about the generated text.

  • It generally conforms to the line format observed in the original text of less than 80 characters before a new line.
  • The characters are separated into word-like groups and most groups are actual English words (e.g. “the”, “little” and “was”), but many are not (e.g. “lott”, “tiie” and “taede”).
  • Some of the words in sequence make sense (e.g. “and the white rabbit”), but many do not (e.g. “wese tilel”).

The fact that this character based model of the book produces output like this is very impressive. It gives you a sense of the learning capabilities of LSTM networks.

The results are not perfect. In the next section we look at improving the quality of results by developing a much larger LSTM network.

Larger LSTM Recurrent Neural Network

We got results, but not excellent results in the previous section. Now, we can try to improve the quality of the generated text by creating a much larger network.

We will keep the number of memory units the same at 256, but add a second layer.

We will also change the filename of the checkpointed weights so that we can tell the difference between weights for this network and the previous (by appending the word “bigger” in the filename).

Finally, we will increase the number of training epochs from 20 to 50 and decrease the batch size from 128 to 64 to give the network more of an opportunity to be updated and learn.
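
A quick calculation shows how much more often the weights are updated with these settings. It assumes one weight update per batch and uses 150,000 as a round stand-in for the "just under 150,000" training patterns.

```python
import math

n_patterns = 150000  # round stand-in for the number of training patterns

def updates(n_patterns, batch_size, epochs):
    # One weight update per batch; the last batch may be partially filled.
    per_epoch = math.ceil(n_patterns / batch_size)
    return per_epoch, per_epoch * epochs

small = updates(n_patterns, batch_size=128, epochs=20)
large = updates(n_patterns, batch_size=64, epochs=50)

print("batch 128, 20 epochs:", small)  # (1172, 23440)
print("batch 64, 50 epochs:", large)   # (2344, 117200)
```

Halving the batch size doubles the updates per epoch, and together with the extra epochs the larger network receives roughly five times as many weight updates in total.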

The full code listing is presented below for completeness.

Running this example takes some time, at least 700 seconds per epoch.

After running this example you may achieve a loss of about 1.2. For example the best result I achieved from running this model was stored in a checkpoint file with the name:

Achieving a loss of 1.2219 at epoch 47.

As in the previous section, we can use this best model from the run to generate text.

The only change we need to make to the text generation script from the previous section is in the specification of the network topology and from which file to seed the network weights.

The full code listing is provided below for completeness.

One example of running this text generation script produces the output below.

The randomly chosen seed text was:

The generated text with the seed (cleaned up for presentation) was:

We can see that generally there are fewer spelling mistakes and the text looks more realistic, but is still quite nonsensical.

For example, the same phrases get repeated again and again, like “said to herself” and “little”. Quotes are opened but not closed.

These are better results but there is still a lot of room for improvement.

10 Extension Ideas to Improve the Model

Below are 10 ideas that may further improve the model and that you could experiment with:

  • Predict fewer than 1,000 characters as output for a given seed.
  • Remove all punctuation from the source text, and therefore from the model’s vocabulary.
  • Try a one hot encoding for the input sequences.
  • Train the model on padded sentences rather than random sequences of characters.
  • Increase the number of training epochs to 100 or many hundreds.
  • Add dropout to the visible input layer and consider tuning the dropout percentage.
  • Tune the batch size, try a batch size of 1 as a (very slow) baseline and larger sizes from there.
  • Add more memory units to the layers and/or more layers.
  • Experiment with scale factors (temperature) when interpreting the prediction probabilities.
  • Change the LSTM layers to be “stateful” to maintain state across batches.
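
The temperature idea from the list above can be sketched in plain Python. This is one common formulation rather than something taken from the tutorial code, and the function names and example probabilities are illustrative assumptions: the predicted probabilities are rescaled in log space and a character index is then sampled instead of always taking the most likely one.

```python
import math
import random

def apply_temperature(probs, temperature):
    # Rescale log-probabilities; a tiny floor avoids log(0).
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, temperature=1.0):
    # Draw one index according to the rescaled probabilities.
    scaled = apply_temperature(probs, temperature)
    return random.choices(range(len(scaled)), weights=scaled, k=1)[0]

# Made-up prediction over a four character vocabulary.
probs = [0.5, 0.3, 0.15, 0.05]

sharp = apply_temperature(probs, 0.5)  # more conservative
flat = apply_temperature(probs, 2.0)   # more adventurous
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(sample(probs, temperature=0.8))
```

Temperatures below 1 concentrate probability on the most likely characters, while temperatures above 1 flatten the distribution and produce more varied, and more error-prone, text.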

Did you try any of these extensions? Share your results in the comments.


This character text model is a popular way for generating text using recurrent neural networks.

Below are some more resources and tutorials on the topic if you are interested in going deeper. Perhaps the most popular is the tutorial by Andrej Karpathy titled “The Unreasonable Effectiveness of Recurrent Neural Networks“.


In this post you discovered how you can develop an LSTM recurrent neural network for text generation in Python with the Keras deep learning library.

After reading this post you know:

  • Where to download the ASCII text for classical books for free that you can use for training.
  • How to train an LSTM network on text sequences and how to use the trained network to generate new sequences.
  • How to develop stacked LSTM networks and lift the performance of the model.

Do you have any questions about text generation with LSTM networks or about this post? Ask your questions in the comments below and I will do my best to answer them.

157 Responses to Text Generation With LSTM Recurrent Neural Networks in Python with Keras

  1. Avi Levy August 12, 2016 at 10:33 am #

    Great post. Thanks!

    • Jason Brownlee August 15, 2016 at 12:29 pm #

      You’re welcome Avi.

    • Shreyas September 2, 2017 at 3:03 am #

      when i try to run the codes, I get an error with the weights file.
      ValueError: Dimension 1 in both shapes must be equal, but are 52 and 44 for Assign_13 with input shapes [256,52], [256,44].
      Can you please let me know what is happening

      • Jason Brownlee September 2, 2017 at 6:15 am #

        Confirm that you have copied and run all of the code and that your environment and libraries are all up to date.

        • Shreyas Becker Lalitha Venkatramanan September 3, 2017 at 10:15 am #

          Hi Jason,
          Just updated all libraries. Still getting the same error.

          • Jason Brownlee September 3, 2017 at 3:43 pm #

            Sorry, it is not clear to me what your fault could be.

          • Shreyas Becker Lalitha Venkatramanan September 3, 2017 at 8:21 pm #

            Thanks. I got it to work now!!! But I am getting random results will work on it further. But i feel like it is a good start. Thanks for the codes!

          • Jason Brownlee September 4, 2017 at 4:30 am #

            Well done on your progress, hang in there!

  2. Max August 16, 2016 at 3:30 pm #

    I’m excited to try combining this with nltk… it’s probably going to be long and frustrating, but I’ll try to let you know my results.

    Thanks for sharing!

    • Jason Brownlee August 17, 2016 at 9:49 am #

      Good luck Max, report back and let us know how you go.

    • Al March 26, 2017 at 7:03 am #

      Hi Max,

      Did you get anywhere with combining NLTK with this?
      Just curious…

  3. Sviat September 10, 2016 at 10:23 pm #

    I really like your article, thank you for sharing. I can launch your code but I have a crash after finalization of 1st epoch:
    Traceback (most recent call last):
    File “train.py”, line 61, in
    model.fit(X, y, nb_epoch=20, batch_size=128, callbacks=callbacks_list)
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/keras/models.py”, line 620, in fit
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/keras/engine/training.py”, line 1104, in fit
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/keras/engine/training.py”, line 842, in _fit_loop
    callbacks.on_epoch_end(epoch, epoch_logs)
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/keras/callbacks.py”, line 40, in on_epoch_end
    callback.on_epoch_end(epoch, logs)
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/keras/callbacks.py”, line 296, in on_epoch_end
    self.model.save(filepath, overwrite=True)
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/keras/engine/topology.py”, line 2427, in save
    save_model(self, filepath, overwrite)
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/keras/models.py”, line 56, in save_model
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/keras/engine/topology.py”, line 2476, in save_weights_to_hdf5_group
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/h5py/_hl/group.py”, line 108, in create_dataset
    self[name] = dset
    File “_objects.pyx”, line 54, in h5py._objects.with_phil.wrapper (/scratch/pip_build_sbilokin/h5py/h5py/_objects.c:2513)
    File “_objects.pyx”, line 55, in h5py._objects.with_phil.wrapper (/scratch/pip_build_sbilokin/h5py/h5py/_objects.c:2466)
    File “/afs/in2p3.fr/home/s/sbilokin/.local/lib/python2.7/site-packages/h5py/_hl/group.py”, line 277, in __setitem__
    h5o.link(obj.id, self.id, name, lcpl=lcpl, lapl=self._lapl)
    File “_objects.pyx”, line 54, in h5py._objects.with_phil.wrapper (/scratch/pip_build_sbilokin/h5py/h5py/_objects.c:2513)
    File “_objects.pyx”, line 55, in h5py._objects.with_phil.wrapper (/scratch/pip_build_sbilokin/h5py/h5py/_objects.c:2466)
    File “h5o.pyx”, line 202, in h5py.h5o.link (/scratch/pip_build_sbilokin/h5py/h5py/h5o.c:3726)

    RuntimeError: Unable to create link (Name already exists)
    The file names are unique, probably there is a collision of names in keras or h5py for me.
    Could you help me please?

    • Sviat September 11, 2016 at 10:15 am #

      Found my mistake, I didn’t edit topology.py in keras correctly to fix another problem.

      This is my second attempt to study NN, but I always have problems with versions, errors, dependencies and this scares me away.
      For example now I have a problem to load the weights, using the example above on python 3 with intermediate weight files:

      Traceback (most recent call last):
      File “bot.py”, line 49, in
      File “.local/lib/python3.3/site-packages/keras/engine/topology.py”, line 2490, in load_weights
      File “.local/lib/python3.3/site-packages/keras/engine/topology.py”, line 2533, in load_weights_from_hdf5_group
      weight_names = [n.decode(‘utf8’) for n in g.attrs[‘weight_names’]]
      File “h5py/_objects.pyx”, line 54, in h5py._objects.with_phil.wrapper (/scratch/pip_build_/h5py/h5py/_objects.c:2691)
      File “h5py/_objects.pyx”, line 55, in h5py._objects.with_phil.wrapper (/scratch/pip_build_/h5py/h5py/_objects.c:2649)
      File “/.local/lib/python3.3/site-packages/h5py/_hl/attrs.py”, line 58, in __getitem__
      attr = h5a.open(self._id, self._e(name))
      File “h5py/_objects.pyx”, line 54, in h5py._objects.with_phil.wrapper (/scratch/pip_build_/h5py/h5py/_objects.c:2691)
      File “h5py/_objects.pyx”, line 55, in h5py._objects.with_phil.wrapper (/scratch/pip_build_/h5py/h5py/_objects.c:2649)
      File “h5py/h5a.pyx”, line 77, in h5py.h5a.open (/scratch/pip_build_/h5py/h5py/h5a.c:2179)
      KeyError: “Can’t open attribute (Can’t locate attribute)”

      • Jason Brownlee September 12, 2016 at 8:26 am #

        I don’t think I can give you good advice if you are modifying the Keras framework files.

        Good luck!

  4. Alex September 14, 2016 at 11:46 pm #

    Hi Jason,

    I followed your tutorial and then built my own sequence-to-sequence model, trained at word level.
    Might soon share results and code, but wanted first to thank you for the great post, helped me a lot getting started with Keras.

    Keep up the amazing work!

    • Jason Brownlee September 15, 2016 at 8:22 am #

      Great, well done Alex. It would be cool if you can post or link to your code.

  5. Nader September 22, 2016 at 2:15 am #

    When I try to get the unique set of characters:
    I get the following:
    [‘\n’, ‘ ‘, ‘!’, ‘”‘, “‘”, ‘(‘, ‘)’, ‘*’, ‘,’, ‘-‘, ‘.’, ‘:’, ‘;’, ‘?’, ‘[‘, ‘]’, ‘_’, ‘a’, ‘b’, ‘c’, ‘d’, ‘e’, ‘f’, ‘g’, ‘h’, ‘i’, ‘j’, ‘k’, ‘l’, ‘m’, ‘n’, ‘o’, ‘p’, ‘q’, ‘r’, ‘s’, ‘t’, ‘u’, ‘v’, ‘w’, ‘x’, ‘y’, ‘z’, ‘\xbb’, ‘\xbf’, ‘\xef’]

    Note, that the ‘\r’ is missing, why ?

    Thank you

    • Jason Brownlee September 22, 2016 at 8:20 am #

      It is not needed on some platforms, like Unix and friends. Only windows uses CRLF.

      • Nader September 22, 2016 at 11:44 am #

        Thank you for the reply.
        I am using windows.
        I am running the fitting as I type.

        • Jason Brownlee September 22, 2016 at 5:29 pm #


          • Nader September 22, 2016 at 11:28 pm #

            Thank you for your reply, Jason.

            Can I generate a text book the size of Alice in Wonderland using the same technique ?

            And if so, how ?

            Do I generate 50,000 characters for example ?

            And how do I use a “SEED” to actually generate such a text ?

            In the example you are using a 100 characters as a way to breakup the text and expose it to the network. DOES increasing the characters help with producing a more meaningful text ?

            And another question ?

            How many epochs should I run the fitting ? 100, 200 ? Because the loss keeps decreasing, but if it gets close to zero is that a good thing ?

            Sorry for so many questions.

            Thank you VERY VERY MUCH 🙂

          • Jason Brownlee September 23, 2016 at 8:27 am #

            Great questions, but these are research questions.

            You would have to experiment and see.

    • Gustavo Führ March 21, 2017 at 5:49 am #

      I did a similar project, and I removed non ASCII characters using:

      ascii_values = ascii_values[np.where(ascii_values < 128)]

  6. Landon September 23, 2016 at 2:57 am #

    Hey, nice article. Can you explain why you are using 256 as your output dimension for LSTM? Does the reasoning for 256 come from other parameters?

    • Jason Brownlee September 23, 2016 at 8:29 am #

      The network only outputs a single character.

      There are 256 nodes in the LSTM layers, I chose a large number to increase the representational capacity of the network. I chose it arbitrarily. You could experiment with different values.

  7. Lino October 24, 2016 at 3:15 pm #

    awesome explanation. Thanks for sharing

  8. Shamit October 27, 2016 at 7:18 am #

    Thanks for the nice article. Why do we have to specify the number of time steps beforehand in LSTM (in input_shape)? What if different data points have different sequence lengths? How can we make Keras LSTM work in that case?

    • Jason Brownlee October 27, 2016 at 7:49 am #

      Hi Shamit,

      The network needs to know the size of data so that it can prepare efficient data structures (backend computation) for the model.

      If you have variable length sequences, you will need to pad them with zeros. Keras offers a padding function for this purpose:

  9. Julian October 27, 2016 at 6:50 pm #

    Hi Jason

    Thanks a lot for the code and the easy explanations.

    Only a short notice: For the model checkpoints you will need the h5py module which was not preinstalled with my python. I obviously only noticed it after the first epoch, when keras tried to import it. Might be a good idea for people to check before they waste time running the code on slow hardware like I did 🙂

  10. Bruce November 13, 2016 at 7:07 pm #

    Hi Jason

    I ran the exact same code of your small LSTM network on the same book(wonderland.txt) for 20 epochs. But the generated text in my case is not making any sense. A snippet of it is as below:

    ” e chin.

    æiÆve a right to think,Æ said alice sharply, for she was beginning to
    feel a little worried ”



    Could you give some insights on why this is happening and how to fix it?

    • Jason Brownlee November 14, 2016 at 7:43 am #

      Ouch, something is going on there Bruce.

      Perhaps confirm Keras 1.1.0 and TensorFlow 0.10.

      It looks like maybe the network was not trained for long enough or the character conversion from ints is not working out.

      Also, confirm that there were no copy-paste errors.

  11. Keerthimanu November 24, 2016 at 10:19 am #

    How to take the seed from user instead of program generating the random text ?

    • Jason Brownlee November 24, 2016 at 10:45 am #

      You can read from sys.stdin, for example:

    • Srce Cde (Chirag) August 30, 2017 at 8:15 pm #

      For user input you can use userinput = stdin.readline()

      For ex the input is:

      userinput = “Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to”

      pattern = [char_to_int[value] for value in userinput.strip().lower()]

      This is how you can deal with user input. Hope this helps.

  12. Victor November 27, 2016 at 6:28 pm #

    Hi Jason,
    Thank you so much for the great post!
    Can you please explain the need to reshape the X ?
    What is wrong with the initial shape of list of lists?
    Thank you!

    • Jason Brownlee November 28, 2016 at 8:42 am #

      Hi Victor, great question.

      All LSTM input must be in the form [samples, timesteps, features]. The data was loaded as [samples, timesteps]. Rather than timesteps, think of sequence – it’s the same thing.

      I hope that helps.

  13. ATM November 30, 2016 at 1:02 pm #

    Using a similar approach: Can one generate numeric sequences from time series data, much like sentences? I don’t see a reason why we can’t. Any input is appreciated, Jason. thanks.

    • Jason Brownlee December 1, 2016 at 7:15 am #

      For sure, change the output from one node to n-nodes.

      For sequence output, you’ll need a lot more data and a lot more training.

      • ATM December 4, 2016 at 2:25 pm #

        Thank you, but why should we change to n-nodes? Considering you generated a sequence of text, can’t I get a numeric sequence with the same 1 node setup?

        Also, I don’t understand “index = numpy.argmax(prediction)
        result = int_to_char[index]” part of the code. Can you please explain why its necessary?

        I’m new to your website….Keep up the great work! Looking to hear from you.

        • Jason Brownlee December 5, 2016 at 6:49 am #

          Hi ATM,

          Yes, there are a few ways to frame a sequence prediction problem. You can use a one-step prediction model many times. Another approach is to predict part or the entire sequence at once, and all the levels in between.

          The line:

          Selects the node with the largest output value. The index of the node is the same as the index of the class to predict.

          I hope that helps.

          • ATM December 7, 2016 at 5:31 pm #

            Thanks for the clarification, Jason: I got the code running with decent predictions for time series data.

            I think you might want to add in the tutorial that prediction usually works well for only somewhat statistically stationary datasets, regardless of training size?

            I’ve tried it on both stationary and non-stationary, and I’ve come to this conclusion (which makes sense). Very often, time series collected from both financial and scientific datasets are not stationary, so LSTM has to be used very conservatively.

          • Jason Brownlee December 8, 2016 at 8:15 am #

            Thanks ATM! I agree, I’ll be going into a lot more details on stationary time series and making non-stationary data stationary in coming blog posts.

  14. Sban December 6, 2016 at 2:16 am #

    Hi jason,
    I have been running this. But the loss, instead of decreasing, is always increasing. So far, I ran it for 20 epochs. Do you have any idea what might be the case? I didn’t change anything in the program, though.


    • Jason Brownlee December 6, 2016 at 9:53 am #

      Maybe your network is too small? Maybe overlearning?

  15. ben December 6, 2016 at 2:59 am #

    how would you handle a much bigger dataset ? I scraped all public declarations of the French goverment for the past few years so my corpus is Total Characters: 465163150
    Total Vocab: 146
    the dataX and dataY will likely crash due to size constraints, how could I circumvent that issue and use the full dataset ?

    • Jason Brownlee December 6, 2016 at 9:54 am #

      Perhaps read the data in batches from disk – I believe Keras has this capability for images (data generator), perhaps for text too? I’m not sure off hand.

  16. Julian January 2, 2017 at 5:10 pm #

    Hi Jason,

    Thanks for the example. I wonder about the loss function: categorical cross-entropy. I tried to find the source code but was not successful with it. Do you know what a loss of 1.2 actually means? Is there a unit to this number?

    From my understanding, the goal is to somehow get the network to learn a probability distribution for the predictions as similar as possible to the one of the training data. But how similar is 1.2?

    Obviously a loss of 0 would mean that the network could accurately predict the target to any given sample with 100% accuracy which is already quite difficult to imagine as the output to this network is not binary but rather a softmax over an array with values for all characters in the vocabulary.

    Thanks again.
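As a rough guide to the question above: Keras computes categorical cross-entropy with the natural logarithm, so the loss is measured in nats per character, and exp(-loss) is the geometric mean of the probability the model assigns to the correct next character. The sketch below works this out for a loss of 1.2. The uniform-guess baseline assumes a vocabulary of 50 characters, a figure borrowed from one of the runs reported in these comments and used only for comparison.

```python
import math

loss = 1.2  # average categorical cross-entropy, in nats per character

# exp(-loss) is the geometric-mean probability given to the true character.
mean_prob = math.exp(-loss)

# The same loss expressed in bits per character.
bits_per_char = loss / math.log(2)

# Loss of a model that guesses uniformly over an assumed 50-character vocabulary.
n_vocab = 50
uniform_loss = math.log(n_vocab)

print("mean probability of correct char: %.3f" % mean_prob)
print("bits per character: %.3f" % bits_per_char)
print("uniform-guess loss: %.3f" % uniform_loss)
```

Read this way, a loss of 1.2 means the model typically gives the right character a probability of roughly 0.3, which is far better than uniform guessing but well short of certainty.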

  17. Eyal January 11, 2017 at 5:59 pm #

    Hey, thanks for the post, will this work on Theano instead of tensorflow? I understand Keras can work with both, and I am having difficulties installing tensorflow on my mac

    • Jason Brownlee January 12, 2017 at 9:25 am #

      Yes, the code will work on either backend.

      • Eyal January 14, 2017 at 1:01 am #

        Thanks, TensorFlow worked after all. I am running the training on a large dataset, like the book, with these messages:
        Total Characters: 237084
        Total Vocab: 50
        Total Patterns: 236984

        On epoch 13, the loss is 1.5. I had to stop it there since it took my MacBook Pro 20 hours to run 13 epochs. Anyway, this is my result; notice it starts to repeat itself at a certain point:

        is all the love and the sain on my baby

        bnd i love you to be a line to be a line

        to be a line to me, i’m she way the lov

        es me down and i do and i do and i do an

        d i do and i do and i do and i do and i

        do and i do and i do and i do and i do a

        nd i do and i do and i do and i do and i

        do and i do and i do and i do and i do

        and i do and i do and i do and i do and

        This is another result from another run of the program:

        oe mine i want you to ban tee the sain o

        f move and i do and i do and i love you

        to be a line to me, i’m she way the way

        the way the way the way the way the way

        the way the way the way the way the way

        the way the way the way the way the way

        the way the way the way the way the way

        the way the way the way the way the way

        This is pretty frustrating (while still incredibly awesome), do you have an idea why this should happen?

        Regardless, this is such a great post that makes RNNs and LSTMs very accessible!

  18. Jack January 15, 2017 at 8:15 am #

    Would we get better results if we trained the network at the word level instead of the character level? We would split the dataset into words, index them the same way we do with the characters, and use them as input. Wouldn’t that at least eliminate the made-up words in the generated text, making it seem more plausible?

    • Jason Brownlee January 16, 2017 at 10:35 am #

      Yep. Try it out Jack! I’d love to see the result.

  19. Fateh January 18, 2017 at 12:51 am #

    Is there any benchmark dataset for this task, to actually evaluate the model ?

  20. Don January 20, 2017 at 10:34 am #

    Hi Jason, amazing post! I have a doubt. I tried the following: instead of training with Alice’s Adventures, I train with a list of barcodes. These are unique barcodes in plain text (for example, 123455, 143223, etc.) and they can be of different lengths. Is this approach still valid? What I want to do is to input a barcode that potentially has errors (maybe a character is missing) and have the LSTM return “the correct one”. Any suggestions? Many thanks in advance!

    • Jason Brownlee January 21, 2017 at 10:22 am #

      Nice idea Don.

      LSTMs can indeed correct sequences, but they would need to memorize all of the barcodes. You would then have to train it to fill in the gaps (0) with the missing values.

      Very nice idea!

      You could also try using an autoencoder to do the same trick.

      I’d love to hear how you go.

      • Don January 21, 2017 at 10:58 pm #

        Thanks Jason! For sure i will come back 🙂

        Do you have any suggested reading for either an LSTM as a “word corrector” or an autoencoder for that task?


        • Jason Brownlee January 22, 2017 at 5:12 am #

          Sorry I don’t Don. Good luck with your project!

  21. Alex Bing January 20, 2017 at 8:56 pm #

    Hi Jason, thank you very much for your post.

    I was thinking about changing the input and output to be a coordinate, which is a 2D vector (x and y position), to predict movement.

    I understood that we can change the input to be a vector instead of scalar. For example using one hot vector like you’re suggesting.

    However, I don’t understand what to do when we want the output to be a 2D vector. Does this mean that I don’t need to use softmax layer as the output?

    Thanks for your help.

    • Jason Brownlee January 21, 2017 at 10:29 am #

      Hi Alex, interesting idea – movement generator.

      You can output a vector by changing the number of neurons in the output layer.

  22. Ashok Kumar January 28, 2017 at 3:34 am #

    Hi Jason,

    The features in the exercise are characters which are mapped to integers. It’s like replacing a nominal vector by a continuous variable. Isn’t that an issue?

    Also, would you consider using Embedding as the first layer in this model – why or why not.

    Thank you, your posts are immensely helpful.

    • Jason Brownlee January 28, 2017 at 7:53 am #

      Hi Ashok,

      The input data is ordinal, either as chars or ints. The neural net expects to work with arrays of numbers instead of chars, so we use ints here.

      An embedding layer would create a projection for the chars into a higher dimensional space. That would not be useful here as we are trying to learn and generalize the way sequences of chars are put together.

      I could be wrong though, try it and see if you can make it work – perhaps with projections of word sequences?

      • Ashok January 28, 2017 at 8:34 am #

        Thanks Jason. This is helpful.

        On the embedding layer, I have another question – how are the projections learnt? Is it a simple method like multiplying with a known matrix or are they learnt iteratively? If iteratively, then are they learnt a) first before the other weights or b) are the embedding projection weights learnt along with the other weights.


      • Ashok January 29, 2017 at 3:45 am #


        The code below for text generation treats chars as nominal variables and hence assigns a separate dimension to each. Since this is a more complex approach, I will check if this leads to an improvement.

        Again, thanks for your response. It only added to my curiosity .


        • Jason Brownlee February 1, 2017 at 10:11 am #

          I’d love to hear how the representational approaches compare in practice Ashok.

      • Konstantin March 16, 2017 at 7:55 am #

        Jason, could you explain please how the input data is ordinal? Instead of one-hot encoding we simply enumerate characters.

        • Pasquinell August 2, 2017 at 9:18 pm #

          I have the same question. Have you found any answer?

          • Jason Brownlee August 3, 2017 at 6:51 am #

            If your input is text, you can use:

            – an integer encoding of the sequence
            – a bag of words encoding (e.g. counts or tf-idf)
            – one hot encoding of integer encoding
            – word embedding (word2vec, glove or learned)

            Does that help?
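To make the first and third options in the list above concrete, here is a minimal sketch using NumPy only. The sample word and the variable names are illustrative. An integer encoding gives each character one number, which implies an ordering between characters, while a one hot encoding gives each character its own dimension and implies none.

```python
import numpy as np

text = "alice"
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}

# Integer encoding: one number per character.
int_encoded = np.array([char_to_int[c] for c in text])

# One hot encoding: one row per character, with a single 1 in the
# column given by the integer code.
one_hot = np.eye(len(chars), dtype=int)[int_encoded]

print(int_encoded)
print(one_hot)
```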

  23. Hendrik March 3, 2017 at 8:24 pm #

    Hello Jason!

    Nice work. But I don’t really understand the point of applying an RNN to this particular task. If one wants to predict the next character of a 100-long sequence, this seems to me achievable with any other common regression approach, where you feed in the training sequences as input and the following single character as output while training. What is the additional benefit of your approach?

    • Jason Brownlee March 6, 2017 at 10:47 am #

      Yes, there are many ways to approach this problem.

      LSTMs are supposed to be particularly good at modeling the underlying PDF of chars or words in a text corpus.

  24. Hendrik March 7, 2017 at 9:06 pm #

    The example doesn’t run anymore under TensorFlow 1.0.0. (previous versions are OK, at least with 0.10.0rc0).

  25. Gustavo Führ March 21, 2017 at 6:01 am #

    Great post Jason, I was trying to do almost the same thing and your post helped a lot. As some of the comments suggest, the RNN seems to fall into a loop quite quickly. I was wondering what you meant by “Add dropout to the visible input layer and consider tuning the dropout percentage.”, since it appears to address this problem.

    • Jason Brownlee March 21, 2017 at 8:44 am #

      You can apply regularization to the LSTMs on recurrent and input connections.

      I have found adding a drop out of 40% or so on input connections for LSTMs very useful on other projects.

      More details here:

      • Gustavo Führ March 23, 2017 at 9:31 am #

        Actually Jason,

        I advanced my experiments and found some interesting things, I will probably do a Medium about it.

        For the loop problem, as mentioned in http://karpathy.github.io/2015/05/21/rnn-effectiveness/, it makes no sense to always use the argmax during generation. Since the network outputs a list of probabilities, it is easy to randomly sample a letter using this distribution:

        def sample_prediction(char_map, prediction):
            rnd_idx = np.random.choice(len(prediction), p=prediction)
            return char_map[rnd_idx]

        That simple change made the network avoid loops and be much more diverse. (Using temperature, as you mentioned, will be even better.)
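The temperature idea mentioned above is not spelled out in this thread, so the following is only one common way to do it, sketched with NumPy. The function name `sample_with_temperature` and the example probabilities are hypothetical. The predicted probabilities are rescaled in log space before sampling: a temperature below 1 moves the choice closer to argmax, and a temperature above 1 makes it more random.

```python
import numpy as np

def sample_with_temperature(prediction, temperature=1.0, rng=None):
    """Sample an index from a probability vector after temperature scaling."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(prediction, dtype=np.float64)
    # Rescale log-probabilities, then renormalize with a stable softmax.
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

prediction = [0.6, 0.3, 0.1]
rng = np.random.default_rng(0)
cold = [sample_with_temperature(prediction, 0.1, rng) for _ in range(1000)]
hot = [sample_with_temperature(prediction, 5.0, rng) for _ in range(1000)]
print("share of index 0 at T=0.1:", cold.count(0) / 1000)
print("share of index 0 at T=5.0:", hot.count(0) / 1000)
```

With a temperature of exactly 1 this reduces to sampling from the network output directly, as in the function shown earlier in this comment.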

        Here is a sample from a one-layer network (same as yours), trained on Cervantes’ Don Quixote for 20 epochs:

        from its beginning to its end, and above all manner; as so able to be almost unable to repeat the, which he had protected him to leave it. As now, for her face because I will be, that he ow her its missing their resckes my eyes, proved
        for the descance in the mouth as saying, to eat for a
        shepherdess and this knight ais, they
        did so that was rockly than to possess the new lose of, that in a sword of now another
        golden without a change on which not this; if shore of the Micomicona his dame, ‘To know some little why as capacquery; I will make any ammorimence is seen in their
        soicour and the adventure of her damsel a pearls that shows and vonquished callshind him, away itself, her fever and evil comrisoness by where they
        will show not in time in which all the chain
        himself of the
        solmings of the hores of this
        your nyiugh and it,
        should have punisented, not to portion for, as it
        just be-with his sort rich as the shaken of the
        sun to
        three lady male love?”

        Am I he will not believe. I will disreel her more so knight, for

        Isn’t that cool? I hope that helps other people.

        • Jason Brownlee March 24, 2017 at 7:51 am #

          Very cool Gustavo, thanks for sharing!

        • Jatin March 31, 2017 at 10:12 am #

          First of all big thanks to Jason for such a valuable write up.

          I have 1 question :

          To Gustavo Führ and Jason Brownlee

          Could you please explain in a little more detail what you did there. i didn’t understand, how you got such a good output on a single layer.

          I ran my training on this book as my data -> http://www.gutenberg.org/cache/epub/5200/pg5200.txt

          See below the output i am getting..

          Seed :
          see it coming. we can’t all work as hard as we have to and then come hometo be tortured like this, we can’t endure it. i can’t endure it anymore.” and she broke out so heavily in tears that they flo
          Prediction :
          led to het hemd and she was so the door and she was so the door and she was so the door and she was so the door and she was so the door and she was so the door and she was so the door and she was so the door and she was so the door and she was so the door and she was so the door and she was so the door and she was so the door and

          As you can see, the text is repeating itself. This is the result of the 20th epoch out of 50 epochs in a dual-layer LSTM with 256 cells and a sequence of 200 characters.

          This is a result pattern i am seeing in all my trainings and i believe i am doing something wrong here. Could you please help out?

  26. Henok S Mengistu April 5, 2017 at 8:32 am #

    what is the need for normalization

    # normalize
    X = X / float(n_vocab)

    • Jason Brownlee April 9, 2017 at 2:32 pm #

      This rescales the integer values to the range 0-1 by dividing them by the vocabulary size (one more than the largest integer value).

      • Eric September 12, 2017 at 4:59 pm #

        Is this because it’s using a relu activation function? Or because generally you need input values to be between 0-1?

        • Jason Brownlee September 13, 2017 at 12:29 pm #

          LSTMs prefer normalized inputs – confirm this with some experiments (I have).

  27. June April 11, 2017 at 11:20 pm #

    Hello~ I am one of your fans.

    I have a question about LSTM settings, as I may apply your text generation model in a different way.

    Your model generates text by adding (updating) one character at a time.

    Do you think it is possible to generate text by adding a word rather than a character?

    If it is possible, would adding a word embedding layer improve the performance of text generation?

    • Jason Brownlee April 12, 2017 at 7:53 am #

      Yes, I expect it may even work better June. Let me know how you go.

      • JUNETAE KIM April 12, 2017 at 12:15 pm #

        Thanks, Jason.

        I will try it~!

        You know, I am a doctoral student majoring in management information systems.

        My research interests include medical informatics and healthcare business analytics.

        Your excellent blog posts have helped me a lot.

        I appreciated it so much!

        • Jason Brownlee April 13, 2017 at 9:54 am #

          I’m really glad to hear that. Thanks for your support and your kind words, I really appreciate it.

  28. Dave May 6, 2017 at 11:43 am #


    Very interesting example – can’t wait to try training on other texts!

    I have a question regarding the training of this, or in fact any neural network, on a GPU – I have a couple of CNNs written but that I can’t execute :(.

    Suppose I am using either Amazon EC2 or Google Cloud, I can successfully log into the instance and run a simple ANN using just a CPU but I am totally confused as to how to get the GPU working. I am accessing the instance from Windows 10. Can you tell me the exact steps I need to do – presumably I need to get CUDA and CUDNN somehow? Then is there any other stuff I need to do or can I just pip install the necessary packages and then execute my code?

    Thank you so much.

  29. FractalProb May 15, 2017 at 2:41 pm #

    nice post!
    two questions,
    1) what is the role of “seq_in = [int_to_char[value] for value in pattern]” in line 63 (doesn’t seem to be used in the loop) and,
    2) could you expand on where and how Gustavo’s fix is needed? (I also get repeated predictions using argmax)

    Does your Deep Learning book expand on (2) in regards to LSTM’s?
    It seems problematic to output the same predictions over different inputs.

    • Jason Brownlee May 16, 2017 at 8:37 am #

      seq_in is the input sequence used in prediction. Agreed it is not needed in the latter part of the example.

      His fix is not needed, it can just reduce the size of the universe of possible values. My book does not go into more detail on this.

      I am working on a new book dedicated to LSTMs that may be of interest when it is released (next month).

  30. FractalProb May 16, 2017 at 3:54 am #

    Here is my implementation of Gustavo’s suggestions so that the predicted output tends to avoid repeated patterns

    def sample_prediction(prediction):
        """Get a random index from preds based on its probability distribution.

        prediction (array (array)): array of length 1 containing an array of probs that sums to 1

        rnd_idx (int): random index from prediction[0]

        Helps to solve the problem of repeated outputs.
        """
        # len(prediction) = 1
        # len(prediction[0]) >> 1
        X = prediction[0]  # sum(X) is approx 1
        rnd_idx = np.random.choice(len(X), p=X)
        return rnd_idx

    for i in range(num_outputs):
        x = np.reshape(pattern, (1, len(pattern), 1))
        x = x / float(n_vocab)
        prediction = model.predict(x, verbose=0)
        # index = np.argmax(prediction)
        # per Gustavo’s suggestion, we should not use argmax here
        index = sample_prediction(prediction)
        result = int_to_char[index]
        # seq_in = [int_to_char[value] for value in pattern]
        # not sure why seq_in was here
        sys.stdout.write(result)  # print the sampled character, as in the tutorial loop
        pattern.append(index)  # slide the window forward, as in the tutorial loop
        pattern = pattern[1:len(pattern)]
    print("\nDone.")

  31. Rathish May 16, 2017 at 5:01 am #

    Hi Jason, great post.

    I wanted to ask you if there is a way to train this system or use the current setup with dynamic input length.

    The reason is that if one were to use this for real text generation, given a random seed of variable length, say “I have a dream” or “To be or not to be”, would it be possible to still generate coherent sentences if we train using dynamic length?

    I tried with pad sequence in the predict stage (not the train stage) to match the input length for shorter sentences, but that doesn’t seem to work.


    • Jason Brownlee May 16, 2017 at 8:51 am #

      Yes, you are on the right track.

      I would recommend zero-padding all sequences to be the same length and seeing how the model fares.

      You could try Masking the input layer and see what impact that has.

      You could also try truncating sequences.

      I have a post scheduled that gives many ways to handle input sequences of different lengths, perhaps in a few weeks.
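The zero-padding and truncation suggestions above can be sketched in plain Python as follows. The helper name `pad_or_truncate` is hypothetical, and using 0 as the padding value is an assumption: in the tutorial's character mapping 0 is also a real character index, so reserving a separate index for padding may be preferable. Keras also offers `pad_sequences` in `keras.preprocessing.sequence` for the same purpose.

```python
def pad_or_truncate(encoded_seed, seq_length, pad_value=0):
    """Return a list of exactly seq_length integers.

    Short seeds are left-padded with pad_value; long seeds keep only
    their last seq_length items, since those are the most recent characters.
    """
    seed = list(encoded_seed)[-seq_length:]
    return [pad_value] * (seq_length - len(seed)) + seed

short_seed = [5, 3, 9]
long_seed = list(range(1, 11))
padded = pad_or_truncate(short_seed, 6)
truncated = pad_or_truncate(long_seed, 6)
print(padded)
print(truncated)
```

Padding on the left keeps the real characters next to the position being predicted. A model trained only on full-length windows has never seen padding, which may be why padding at prediction time alone did not work well.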

      • Rathish May 25, 2017 at 8:34 am #

        That would be much appreciated. Looking forward to it.

  32. Alex May 18, 2017 at 6:48 pm #


    may I ask about these two lines
    seq_in = raw_text[i:i + seq_length]
    seq_out = raw_text[i + seq_length]

    Shouldn’t the model predict the next character, abc -> d, instead of abc -> c?
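A quick check of the two lines quoted above, using a short sample string that is assumed purely for illustration. A Python slice excludes its end index, so `raw_text[i:i + seq_length]` stops one character before `raw_text[i + seq_length]`, and the target is the character that follows the window.

```python
raw_text = "abcdef"
seq_length = 3

pairs = []
for i in range(len(raw_text) - seq_length):
    seq_in = raw_text[i:i + seq_length]   # characters i .. i+seq_length-1
    seq_out = raw_text[i + seq_length]    # the character right after the window
    pairs.append((seq_in, seq_out))
    print(seq_in, "->", seq_out)
```

So the framing is already abc -> d rather than abc -> c.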

  33. abbey May 23, 2017 at 1:40 am #


    I need to thank you for all the good work on Keras, a wonderful and awesome package.

    However, I got a floating point exception (core dumped) running the code. Please, I need your advice to resolve the issue. I have upgraded my Keras from 1.0.8 to 2.0.1 and the issue is still the same.

    88/200 [============>……………..] – ETA: 27s – loss: 11.3334 – acc: 0.0971{‘acc’: array(0.10470587760210037, dtype=float32), ‘loss’: array(9.283703804016113, dtype=float32), ‘batch’: 88, ‘size’: 17}
    89/200 [============>……………..] – ETA: 26s – loss: 11.3103 – acc: 0.0972Floating point exception (core dumped)

    Warm Regards

    • Jason Brownlee May 23, 2017 at 7:54 am #

      I’m sorry to hear that.

      Perhaps you could try a different backend (theano or tensorflow)?

      Perhaps you could try posting to stackoverflow or the keras user group?

      • abbey May 23, 2017 at 6:23 pm #

        Thank you Jason.

  34. abbey May 23, 2017 at 9:20 pm #

    Hi guys,

    With the TensorFlow backend, I got different error messages. After about 89 batches of the second epoch, the loss becomes nan, and so does the accuracy. Any suggestions or advice?

    87/200 [============>……………..] – ETA: 113s – loss: 9.4303 – acc: 0.0947{‘acc’: 0.10666667, ‘loss’: 9.2033749, ‘batch’: 87, ‘size’: 32}
    88/200 [============>……………..] – ETA: 112s – loss: 9.4277 – acc: 0.0949{‘acc’: 0.10862745, ‘loss’: 9.2667055, ‘batch’: 88, ‘size’: 17}
    89/200 [============>……………..] – ETA: 110s – loss: 9.4259 – acc: 0.0950{‘acc’: nan, ‘loss’: nan, ‘batch’: 89, ‘size’: 0}
    90/200 [============>……………..] – ETA: 108s – loss: nan – acc: nan {‘acc’: nan, ‘loss’: nan, ‘batch’: 90, ‘size’: 0}
    91/200 [============>……………..] – ETA: 106s – loss: nan – acc: nan{‘acc’: nan, ‘loss’: nan, ‘batch’: 91, ‘size’: 0}
    92/200 [============>……………..] – ETA: 105s – loss: nan – acc: nan{‘acc’: nan, ‘loss’: nan, ‘batch’: 92, ‘size’: 0}

  35. Piush May 24, 2017 at 11:31 pm #

    Thanks for the example. Piush.

  36. abbey May 26, 2017 at 3:55 pm #

    Hi Jason,

    I tried your suggestion and the problem of nan for loss and accuracy is still the same after the second epoch, while the batch size is 39/50.

    I have also tried all the activation functions and regularization, but the problem is still the same. Too bad!

  37. abbey May 30, 2017 at 4:43 pm #

    Hi Jason,

    The problem of loss and accuracy becoming nan after a few epochs had to do with the batch generator. I have fixed it now.

    Thank you so much.

  38. IanEden June 1, 2017 at 8:53 am #

    Hello Jason,

    I was wondering how I could add more hidden layers, and whether this could help generate text more efficiently.

    I tried using model.add(LSTM(some_number)) again but it failed and gave me the error:

    “Input 0 is incompatible with layer lstm_3: expected ndim=3, found ndim=2”
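The error above is consistent with stacking one LSTM on top of another that only returns its final output. A Keras LSTM layer expects 3D input of the form (samples, time steps, features) and, by default, returns a 2D (samples, units) output. Setting `return_sequences=True` on every LSTM except the last keeps the time axis. The helper below is a hypothetical stand-in that only mimics this shape rule, so it can be run without Keras; it is not part of any library.

```python
def lstm_output_shape(input_shape, units, return_sequences=False):
    """Mimic the output shape rule of a Keras LSTM layer (illustration only).

    input_shape must be (samples, time_steps, features).
    """
    if len(input_shape) != 3:
        raise ValueError("expected ndim=3, found ndim=%d" % len(input_shape))
    samples, time_steps, _ = input_shape
    if return_sequences:
        return (samples, time_steps, units)
    return (samples, units)

inputs = (32, 100, 1)

# First layer keeps the time axis, so a second LSTM can consume it.
# In Keras this would correspond to LSTM(256, return_sequences=True).
hidden = lstm_output_shape(inputs, 256, return_sequences=True)
output = lstm_output_shape(hidden, 256)
print(hidden, output)

# Without return_sequences the first layer emits a 2D shape, and
# stacking a second LSTM on top of it fails.
flat = lstm_output_shape(inputs, 256)
try:
    lstm_output_shape(flat, 256)
except ValueError as err:
    print("stacking failed:", err)
```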

  39. Mo June 8, 2017 at 8:59 am #

    Hello Jason,

    Thanks for your great post, it helped me a lot. Since I finished reading it, I have been thinking about how to implement this at the word level instead of the character level. I am just confused about how to implement it, because with characters we only have a few symbols, but with words we might have, say, 10000 or even more. Would you please share your thoughts? I am really excited to see the results at the word level and make further enhancements.

    • Jason Brownlee June 9, 2017 at 6:16 am #

      Great idea.

      I would recommend ranking all words by frequency, then assigning integers to each word based on rank. New words not in the corpus can then also be assigned integers later.
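The frequency-ranking suggestion above could look something like the following sketch, using only the standard library. The sample sentence is made up, and reserving index 0 for words that never appeared in the corpus is an assumption of this sketch rather than something stated in the reply.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"
words = text.split()

# Rank words by frequency; ties are broken alphabetically so the
# mapping is reproducible.
counts = Counter(words)
ranked = sorted(counts, key=lambda w: (-counts[w], w))

# Index 0 is reserved for unseen words (an assumption of this sketch),
# so known words start at 1.
word_to_int = {w: i + 1 for i, w in enumerate(ranked)}

def encode(sentence):
    return [word_to_int.get(w, 0) for w in sentence.split()]

encoded_sentence = encode("the dog sat on the sofa")
print(word_to_int)
print(encoded_sentence)
```

With a real corpus the vocabulary could also be capped at the most frequent words, mapping everything rarer to the reserved index.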

  40. Adi June 16, 2017 at 12:27 am #

    Hi Jason, thanks for such an awesome post.
    I keep getting the error:
    ImportError: load_weights requires h5py.
    even though I have installed h5py. Did anyone else face this error? Please help.

    • Jason Brownlee June 16, 2017 at 8:04 am #

      Perhaps your environment cannot see the h5py library?

      Confirm you installed it the correct way for your environment.

      Try importing the library itself and refine your environment until it works.

      Let me know how you go.

      • Adi June 18, 2017 at 5:07 pm #

        It was solved, I just restarted python after installing h5py.
        When I predicted using the model, it threw out characters like this. What is up with the 1s, why are they appearing?

        • Jason Brownlee June 19, 2017 at 8:34 am #

          I don’t know. Confirm that you have a copy of the code from the tutorial without modification.

          • Adi June 20, 2017 at 2:10 pm #

            can we use print() in place of sys.stdout.write()
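For reference, a small sketch of how the two calls differ in Python 3. `sys.stdout.write` emits exactly the string it is given, while `print()` appends a newline unless `end=""` is passed. The `emit_both` helper and the buffer are only there to make the comparison easy to inspect.

```python
import io
import sys
from contextlib import redirect_stdout

def emit_both(result):
    # sys.stdout.write prints exactly the string it is given.
    sys.stdout.write(result)
    # print() adds a newline by default; end="" suppresses it, so this
    # line behaves like the call above.
    print(result, end="")

buffer = io.StringIO()
with redirect_stdout(buffer):
    emit_both("a")
captured = buffer.getvalue()
print(repr(captured))
```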

          • Jason Brownlee June 21, 2017 at 8:08 am #


  41. Abbey June 21, 2017 at 9:09 pm #

    Hi Jason,

    Please, I need your help. I tried to run predictions on my test data using the weights already trained by the model, but I got the following error:

    AttributeError Traceback (most recent call last)
    in ()
    1 # load weights into new model
    ----> 2 model_info.load_weights(save_best_weights)
    3 predictions = model.predict([test_q1, test_q2, test_q1, test_q2,test_q1, test_q2], verbose = True)

    AttributeError: 'History' object has no attribute 'load_weights'

    In [ ]:

    Below is a snippet of my code:

    ### Fitting model

    # save the best weights for predicting the test question pairs
    save_best_weights = "weights-pairs1.h5"

    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')

    callbacks = [ModelCheckpoint(save_best_weights, monitor='val_loss', save_best_only=True),
                 EarlyStopping(monitor='val_loss', patience=5, verbose=1, mode='auto')]
    start = time.time()
    model_info = merged_model.fit([x1, x2, x1, x2, x1, x2], y=y, batch_size=64, epochs=3, verbose=True,
                                  validation_split=0.33, shuffle=True, callbacks=callbacks)
    end = time.time()
    # elapsed time is end minus start
    print("Minutes elapsed: %f" % ((end - start) / 60.))

    # load weights into a new model

    predictions = model.predict([test_q1, test_q2, test_q1, test_q2, test_q1, test_q2], verbose=True)


  42. Kunal chakraborty July 2, 2017 at 8:56 pm #

    Hello Jason,

    Great tutorial. I have just one doubt though, in np.reshape command what does feature mean?
    and why is it set to 1?

    • Jason Brownlee July 3, 2017 at 5:32 am #

      The raw data are sequences of integers. There is only one observation (feature) per time step and it is an integer. That is why the first reshape specifies one feature.
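The reshape described above can be sketched on a toy example as follows. The three short patterns are invented for illustration; the target layout is [samples, time steps, features], and the last dimension is 1 because each time step holds a single integer.

```python
import numpy as np

# Three input patterns, each a sequence of four integer-encoded characters.
dataX = [[1, 2, 3, 4],
         [2, 3, 4, 5],
         [3, 4, 5, 6]]
n_patterns, seq_length = len(dataX), len(dataX[0])

# Reshape to [samples, time steps, features]; each time step carries a
# single integer, hence one feature.
X = np.reshape(dataX, (n_patterns, seq_length, 1))
print(X.shape)
```

If the characters were one hot encoded instead, the feature dimension would be the vocabulary size rather than 1.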

  43. Alex July 4, 2017 at 12:35 am #

    Excellent tutorial!

    I have one question. I need to predict some words inside a text, and I currently use an LSTM based on your code with binary coding. Is it good practice to use the n words before and the k words after the word I want to predict?
    Is it good to train the model on data like this, or should we use only the data that comes before our prediction?

    • Jason Brownlee July 6, 2017 at 10:03 am #

      Try a few methods and see what works best on your problem.

  44. Francesca July 11, 2017 at 3:49 pm #

    Thanks, This can make a good reference for me

  45. Joe Melle August 5, 2017 at 6:20 am #

    Why not one hot encode the numbers?

    • Jason Brownlee August 6, 2017 at 7:28 am #

      Sure, try it.

      Also try an encoding layer.

      There are many ways to improve the example Joe, let me know how you go.

  46. Sivasailam September 3, 2017 at 8:32 pm #

    Great post. Thanks much. I am very new to LSTM. I am trying to recreate the codes here. How to increase the number of epochs?

  47. Greg September 5, 2017 at 4:58 pm #

    Hi Jason,

    Thank you so much for these guides and tutorials! I’m finding them to be very helpful.

  48. Don September 7, 2017 at 11:04 am #

    Hi Jason,

    Thanks for your great post, and all the other great posts as a matter of fact. I’m interested in training the model on padded sentences rather than random sequences of characters, but it’s not clear to me how to implement it. Can you please elaborate on it and give an example?

    Many thanks!

  49. Don September 8, 2017 at 3:10 pm #

    Thanks again for the post and all the info! I have another question. What should I do if I want to predict by inserting a string shorter than 100 characters?

    Many thanks again!

  50. Don September 10, 2017 at 2:55 am #


    I have a question regarding the number of parameters in the model and the amount of data. When I look at the summary of the simplest model, I get: Total params: 275,757.0, Trainable params: 275,757.0, and Non-trainable params: 0.0 (for some reason I didn’t succeed in sending a reply with the whole summary).

    The number of characters in the Alice book is about 150,000. Thus, isn’t the number of parameters larger than the number of characters (data)?

    Thanks again!
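The total quoted above can be reproduced with a short calculation. The layer sizes are assumptions: a single LSTM layer with 256 units reading one feature per time step, followed by a dense softmax layer over 45 characters. The figure of 45 is not stated in the comment; it is inferred here because it makes the sum match the reported total.

```python
def lstm_params(units, input_dim):
    # Four weight blocks (input, forget and output gates plus the cell
    # candidate), each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, outputs):
    # One weight per input-output pair plus one bias per output.
    return inputs * outputs + outputs

units, input_dim, n_vocab = 256, 1, 45  # assumed layer sizes
lstm_total = lstm_params(units, input_dim)
dense_total = dense_params(units, n_vocab)
total = lstm_total + dense_total
print(lstm_total, dense_total, total)
```

Almost all of the parameters sit in the recurrent weights of the LSTM, which grow with the square of the number of units.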

    • Jason Brownlee September 11, 2017 at 12:02 pm #


      • Don September 11, 2017 at 2:12 pm #

        Isn’t that over-fitting? I ask because you suggested improving the quality of results by developing a much larger LSTM network. But if the simple LSTM network already has more parameters than data, shouldn’t you simplify the network even more?

        Thanks a lot for all the tips!

        • Jason Brownlee September 13, 2017 at 12:21 pm #

          A simpler model is preferred, but overfitting is only the case when skill on test/validation data is worse than train data.

  51. Manar September 13, 2017 at 4:32 am #

    Thanks Dr.Jason,

    What if I want to output the probability of a sequence under a trained model rather than finding the most probable next character?

    The use case that I have in mind is to feed the model a test sentence and have it print the probability of this sentence being true under the model.

    Many thanks in advance.

    • Jason Brownlee September 13, 2017 at 12:34 pm #

      That would be a different framing of the problem, or perhaps you can collect the probabilities of each char step by step.

      • Manar September 14, 2017 at 2:09 am #

        So when I train, the Xs are similarly the sequence, but what would the Ys be?

        • Jason Brownlee September 15, 2017 at 12:07 pm #

          The next char or word in the sequence.

          • Manar September 15, 2017 at 12:38 pm #

            If we do this, then how will it differ from the original framing of the problem? I mean the resulting probability will measure how likely it is that a character X will be the next one, rather than the probability of the seen sequence under the trained model.

            E.g. suppose we feed the model with “I am” and the next word is either [egg or Sam]. Following the reply above, the output will be the probability of egg or Sam being next. Rather, I need to find the probability of the entire segments “I am Sam” and “I am egg” to be able to tell which one makes more sense.

          • Jason Brownlee September 16, 2017 at 8:35 am #

            Yes, it is the same framing. The difference is how you handle the probabilities. Sorry, I should have been clearer.

            E.g. you can beam search the probabilities across each word/char, or output the probabilities for a specific output sequence, or list the top n most likely output sequences.

            Does that make sense?
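The step-by-step idea above amounts to the chain rule: the probability of a whole sequence is the product of the probability of each item given the items before it, which is usually accumulated as a sum of logs. In the sketch below, `next_char_probs` is a hypothetical stand-in for a trained model's prediction step, and its lookup table is made up for illustration.

```python
import math

# Hypothetical stand-in for a trained model: given the characters so far,
# return a probability for each possible next character.
def next_char_probs(context):
    table = {
        "": {"a": 0.5, "b": 0.5},
        "a": {"a": 0.1, "b": 0.9},
        "b": {"a": 0.7, "b": 0.3},
    }
    return table[context[-1:]]

def sequence_log_prob(sequence):
    """Sum log P(char | preceding chars) over the sequence (chain rule)."""
    total = 0.0
    for t, char in enumerate(sequence):
        total += math.log(next_char_probs(sequence[:t])[char])
    return total

scores = {c: math.exp(sequence_log_prob(c)) for c in ("aba", "aaa")}
for candidate, prob in scores.items():
    print(candidate, prob)
```

Comparing the scores of two candidate sequences of the same length then indicates which one the model finds more plausible.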

  52. Rohit September 18, 2017 at 7:19 pm #

    Hi Jason,
    Thanks for the tutorial. When I use one-hot encoded input, my accuracy and loss stop improving and become stagnant.

    def one_hot_encode(sequences, next_chars, char_to_idx):
        X = np.zeros((n_patterns, seq_length, N_CHARS), dtype=float)
        y = np.zeros((n_patterns, N_CHARS), dtype=float)
        for i, sequence in enumerate(sequences):
            for t, char in enumerate(sequence):
                X[i, t, char] = 1
            y[i, [next_chars[i]]] = 1
        return X, y

    X, y = one_hot_encode(dataX, dataY, char_to_int)

    Is it the layer size or a bug in one_hot_encoding?

  53. Vinicius September 20, 2017 at 1:07 am #

    Hi, thanks for the great tutorial!

    If I begin with some random character and use the trained model to predict the next one, how can the network generate different sentences using the same first character?

    • Jason Brownlee September 20, 2017 at 5:59 am #

      It will predict probabilities across all output characters and you can use a beam search through those probabilities to get multiple different output sequences.
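A minimal beam search along the lines described above might look like this. It is a sketch under stated assumptions: `next_char_probs` is a hypothetical stand-in for the trained model's prediction step, its table of probabilities is invented, and `beam_width` is an illustrative parameter name.

```python
import math

# Hypothetical stand-in for model.predict: next-character probabilities
# that depend only on the last character of the context.
def next_char_probs(context):
    table = {
        "": {"a": 0.6, "b": 0.4},
        "a": {"a": 0.2, "b": 0.8},
        "b": {"a": 0.6, "b": 0.4},
    }
    return table[context[-1:]]

def beam_search(seed, steps, beam_width):
    """Keep the beam_width most probable continuations at every step."""
    beams = [(0.0, seed)]  # (log probability, text so far)
    for _ in range(steps):
        candidates = []
        for log_p, text in beams:
            for char, p in next_char_probs(text).items():
                candidates.append((log_p + math.log(p), text + char))
        # Highest log probability first, then trim to the beam width.
        candidates.sort(reverse=True)
        beams = candidates[:beam_width]
    return beams

results = beam_search("a", steps=2, beam_width=3)
for log_p, text in results:
    print(text, round(math.exp(log_p), 3))
```

Each surviving beam is a different continuation of the same seed, ordered from most to least probable, so the same first character can lead to several distinct outputs.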

  54. Sam September 20, 2017 at 10:15 am #

    I’m attempting to run the last full code example for generating text using the loaded LSTM model.
    However, line 59 simply produces the same number (effectively a space character) each time across the whole 1000 range.
    Not sure what I’m doing wrong ?

    • Jason Brownlee September 20, 2017 at 3:01 pm #

      Sorry to hear that Sam.

      Have you tried to run the example a few times?
      Have you confirmed that your environment is up to date?
