CNN Long Short-Term Memory Networks

Gentle introduction to CNN LSTM recurrent neural networks
with example Python code.

Input with spatial structure, like images, cannot be modeled easily with the standard Vanilla LSTM.

The CNN Long Short-Term Memory Network or CNN LSTM for short is an LSTM architecture specifically designed for sequence prediction problems with spatial inputs, like images or videos.

In this post, you will discover the CNN LSTM architecture for sequence prediction.

After completing this post, you will know:

  • About the development of the CNN LSTM model architecture for sequence prediction.
  • Examples of the types of problems to which the CNN LSTM model is suited.
  • How to implement the CNN LSTM architecture in Python with Keras.

Let’s get started.

Convolutional Neural Network Long Short-Term Memory Networks

Photo by Yair Aronshtam, some rights reserved.

CNN LSTM Architecture

The CNN LSTM architecture involves using Convolutional Neural Network (CNN) layers for feature extraction on input data combined with LSTMs to support sequence prediction.

CNN LSTMs were developed for visual time series prediction problems and the application of generating textual descriptions from sequences of images (e.g. videos). Specifically, the problems of:

  • Activity Recognition: Generating a textual description of an activity demonstrated in a sequence of images.
  • Image Description: Generating a textual description of a single image.
  • Video Description: Generating a textual description of a sequence of images.

[CNN LSTMs are] a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs

Long-term Recurrent Convolutional Networks for Visual Recognition and Description, 2015.

This architecture was originally referred to as a Long-term Recurrent Convolutional Network or LRCN model, although we will use the more generic name “CNN LSTM” to refer to LSTMs that use a CNN as a front end in this lesson.

This architecture is used for the task of generating textual descriptions of images. Key to the approach is the use of a CNN that is pre-trained on a challenging image classification task and that is re-purposed as a feature extractor for the caption generation problem.

… it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences

Show and Tell: A Neural Image Caption Generator, 2015.

This architecture has also been used on speech recognition and natural language processing problems where CNNs are used as feature extractors for the LSTMs on audio and textual input data.

This architecture is appropriate for problems that:

  • Have spatial structure in their input such as the 2D structure of pixels in an image or the 1D structure of words in a sentence, paragraph, or document.
  • Have a temporal structure in their input such as the order of images in a video or words in text, or require the generation of output with temporal structure such as words in a textual description.

Convolutional Neural Network Long Short-Term Memory Network Architecture


Implement CNN LSTM in Keras

We can define a CNN LSTM model to be trained jointly in Keras.

A CNN LSTM can be defined by adding CNN layers on the front end followed by LSTM layers with a Dense layer on the output.

It is helpful to think of this architecture as defining two sub-models: the CNN Model for feature extraction and the LSTM Model for interpreting the features across time steps.

Let’s take a look at both of these sub-models in the context of a sequence of 2D inputs which we will assume are images.

CNN Model

As a refresher, we can define a 2D convolutional network as comprised of Conv2D and MaxPooling2D layers ordered into a stack of the required depth.

The Conv2D will interpret snapshots of the image (e.g. small squares) and the pooling layers will consolidate or abstract the interpretation.

For example, the snippet below expects to read in 10×10 pixel images with 1 channel (e.g. black and white). The Conv2D will read the image in 2×2 snapshots and output one new 10×10 interpretation of the image. The MaxPooling2D will pool the interpretation into 2×2 blocks reducing the output to a 5×5 consolidation. The Flatten layer will take the single 5×5 map and transform it into a 25-element vector ready for some other layer to deal with, such as a Dense for outputting a prediction.
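The layers described above can be sketched with the Keras Sequential API as follows (a minimal illustration rather than a definitive implementation; the single filter and 'same' padding are chosen to match the 10×10 and 5×5 sizes in the description):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten

# CNN for a single 10x10 pixel image with 1 channel
cnn = Sequential()
# read the image in 2x2 snapshots; 'same' padding keeps the 10x10 size
cnn.add(Conv2D(1, (2, 2), activation='relu', padding='same',
               input_shape=(10, 10, 1)))
# consolidate the interpretation into 2x2 blocks: 10x10 -> 5x5
cnn.add(MaxPooling2D(pool_size=(2, 2)))
# transform the single 5x5 map into a 25-element vector
cnn.add(Flatten())
```

A Dense layer could then be added on the end to output a prediction from the 25-element vector.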

This makes sense for image classification and other computer vision tasks.

LSTM Model

The CNN model above is only capable of handling a single image, transforming it from input pixels into an internal matrix or vector representation.

We need to repeat this operation across multiple images and allow the LSTM to build up internal state and update weights using BPTT across a sequence of the internal vector representations of input images.

The CNN could be fixed, as in the case of using an existing pre-trained model like VGG for feature extraction from images. Alternately, the CNN may not be pre-trained, and we may wish to train it by backpropagating error from the LSTM across multiple input images to the CNN model.

In both of these cases, conceptually there is a single CNN model and a sequence of LSTM models, one for each time step. We want to apply the CNN model to each input image and pass on the output of each input image to the LSTM as a single time step.

We can achieve this by wrapping the entire CNN input model (one layer or more) in a TimeDistributed layer. This layer achieves the desired outcome of applying the same layer or layers multiple times. In this case, applying it multiple times to multiple input time steps and in turn providing a sequence of “image interpretations” or “image features” to the LSTM model to work on.
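As a minimal illustration of what TimeDistributed does (a generic Dense example with assumed toy sizes, not the CNN case yet):

```python
from keras.models import Sequential
from keras.layers import Dense, TimeDistributed

# apply the same Dense layer to each of 5 time steps of 10 features
model = Sequential()
model.add(TimeDistributed(Dense(4), input_shape=(5, 10)))
# the same weights are applied once per time step: (batch, 5, 10) -> (batch, 5, 4)
```
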

We now have the two elements of the model; let’s put them together.

CNN LSTM Model

We can define a CNN LSTM model in Keras by first defining the CNN layer or layers, wrapping them in a TimeDistributed layer and then defining the LSTM and output layers.

We have two ways to define the model that are equivalent and only differ as a matter of taste.

You can define the CNN model first, then add it to the LSTM model by wrapping the entire sequence of CNN layers in a TimeDistributed layer, as follows:
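A sketch of this first approach, reusing the small CNN from the refresher above (the LSTM and Dense sizes are illustrative, not prescribed):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten
from keras.layers import LSTM, Dense, TimeDistributed

# define the CNN sub-model for one 10x10x1 image
cnn = Sequential()
cnn.add(Conv2D(1, (2, 2), activation='relu', padding='same',
               input_shape=(10, 10, 1)))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Flatten())

# define the CNN LSTM model: apply the whole CNN to each time step
model = Sequential()
model.add(TimeDistributed(cnn, input_shape=(None, 10, 10, 1)))
model.add(LSTM(10))
model.add(Dense(1))
```
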

An alternate, and perhaps easier to read, approach is to wrap each layer in the CNN model in a TimeDistributed layer when adding it to the main model.

The benefit of this second approach is that all of the layers appear in the model summary and, as such, it is preferred for now.

You can choose the method that you prefer.


Summary

In this post, you discovered the CNN LSTM model architecture.

Specifically, you learned:

  • About the development of the CNN LSTM model architecture for sequence prediction.
  • Examples of the types of problems to which the CNN LSTM model is suited.
  • How to implement the CNN LSTM architecture in Python with Keras.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.



58 Responses to CNN Long Short-Term Memory Networks

  1. Erick August 21, 2017 at 6:07 pm #

    Would this architecture, with some adaptations, also be suitable to do speech recognition, speaker separation, language detection and other natural language processing tasks?

  2. birol August 22, 2017 at 6:27 pm #

    what is difference with ConvLSTM2D layer ?
    https://github.com/fchollet/keras/blob/master/examples/conv_lstm.py

    • Jason Brownlee August 23, 2017 at 6:45 am #

      As far as I know, that layer is not yet supported. I have tried to stay away from it until all the bugs are worked out of it.

    • Dan Lim August 25, 2017 at 2:59 pm #

      ConvLSTM is a variant of LSTM which uses convolution to replace the inner product within the LSTM unit,
      while CNN LSTM is just a stack of layers: a CNN followed by an LSTM.

      • Jason Brownlee August 25, 2017 at 3:58 pm #

        Have you used it on a project Dan?

        • Dan Lim August 30, 2017 at 2:09 pm #

          Not yet, I’m just waiting next tensorflow release since it seems that convlstm would be provided as tf.contrib.rnn.ConvLSTMCell, instead I’ve used cnn + lstm on simple speech recognition experiments and it gives better results than stack of lstm. It really works!

          • Jason Brownlee August 30, 2017 at 4:18 pm #

            Thanks Dan.

            I hope to try some examples myself for the blog soon.

  3. Miles August 25, 2017 at 7:13 am #

    Hi, Jason.
    Do you think the CNNLSTM can solve the regression problem, whose inputs are some time series data and some properties/exogenous data (spatial), not image data? If yes, how to deal with the properties/exogenous data (2D) in CNN. Thank you.

    • Tahir August 25, 2017 at 2:24 pm #

      I m having the same question

    • Jason Brownlee August 25, 2017 at 3:56 pm #

      Perhaps, I have not tried using CNN LSTMs for time series.

      Perhaps each series could be processed by a 1D-CNN.

      It might not make sense given that the LSTM is already interpreting the long term relationships in the data.

      It might be interesting if the CNN can pick out structure that is new/different from the LSTM. Perhaps you could have both a CNN and LSTM interpretation of the series and use another model to integrate and interpret the results.

    • Anna March 21, 2018 at 5:39 pm #

      Hi,Miles.
      I m having the same question. Do you have some research progress on time series using the CNN LSTMs?

  4. Shamane Siriwardana September 28, 2017 at 1:27 pm #

    Hi do you have a github implementation ?

    • Jason Brownlee September 28, 2017 at 4:45 pm #

      I have a full code example in my book on LSTMs.

  5. Elisa October 6, 2017 at 1:10 pm #

    Hi Jason,
    Thank you for the great work and posts.

    I’m starting my studies with deep learning, python and keras.
    I would like knowing how to implement the CNN with ELM (extreme learning machine) architecture in Python with Keras for classification task. Do you have a github implementation?

  6. gana October 12, 2017 at 4:16 pm #

    Thank you for your great examples…

    May i ask you full code of the CNN LSTM you explained above?
    Because,..i am having errors related to dimensions of CNN and LSTM.

    I have followed your previous examples and trying to build VGG-16Net stacked with LSTM.

    My database is just 10 different human motion (10 classes) such as walking and running etc…

    My code is as below:

    # dimensions of our images.
    img_width, img_height = 224, 224

    train_data_dir = 'db/train'
    validation_data_dir = 'db/test'
    nb_train_samples = 400
    nb_validation_samples = 200
    num_timesteps = 10 # length of sequence
    num_class = 10
    epochs = 10
    batch_size = 8

    lstm_input_len = 224 * 224
    input_shape=(224,224,3)
    num_chan = 3

    # VGG16 as CNN
    cnn = Sequential()
    cnn.add(ZeroPadding2D((1,1),input_shape=input_shape))
    cnn.add(Conv2D(64, 3, 3, activation='relu'))
    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(64, 3, 3, activation='relu'))
    cnn.add(MaxPooling2D((2,2), strides=(2,2),dim_ordering="th"))

    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(128, 3, 3, activation='relu'))
    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(128, 3, 3, activation='relu'))
    cnn.add(MaxPooling2D((2,2), strides=(2,2),dim_ordering="th"))

    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(256, 3, 3, activation='relu'))
    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(256, 3, 3, activation='relu'))
    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(256, 3, 3, activation='relu'))
    cnn.add(MaxPooling2D((2,2), strides=(2,2),dim_ordering="th"))

    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(512, 3, 3, activation='relu'))
    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(512, 3, 3, activation='relu'))
    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(512, 3, 3, activation='relu'))
    cnn.add(MaxPooling2D((2,2), strides=(2,2),dim_ordering="th"))

    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(512, 3, 3, activation='relu'))
    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(512, 3, 3, activation='relu'))
    cnn.add(ZeroPadding2D((1,1)))
    cnn.add(Conv2D(512, 3, 3, activation='relu'))
    cnn.add(MaxPooling2D((2,2), strides=(2,2),dim_ordering="th"))

    cnn.add(Flatten())
    cnn.add(Dense(4096, activation='relu'))
    cnn.add(Dropout(0.5))
    cnn.add(Dense(4096, activation='relu'))

    #LSTM
    model = Sequential()
    model.add(TimeDistributed(cnn, input_shape=(num_timesteps, 224, 224,num_chan)))
    model.add(LSTM(num_timesteps))
    model.add(Dropout(.2)) #added
    model.add(Dense(num_class, activation='softmax'))

    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    # this is the augmentation configuration we will use for training
    train_datagen = ImageDataGenerator(rescale=1. / 255)

    # this is the augmentation configuration we will use for testing:
    # only rescaling
    test_datagen = ImageDataGenerator(rescale=1. / 255)

    train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(224, 224),
    batch_size=batch_size,
    class_mode='binary')

    validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(224, 224),
    batch_size=batch_size,
    class_mode='binary')

    model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size)

    • gana October 12, 2017 at 4:18 pm #

      I forgot to put error which is :

      ValueError: Error when checking input: expected time_distributed_9_input to have 5 dimensions, but got array with shape (8, 224, 224, 3)

      • N1k31t4 November 25, 2017 at 2:13 am #

        You also need to specify a batch size in the input dimensions to that layer I guess, to get the fifth dimension. Try using: model.add(TimeDistributed(cnn, input_shape=(None, num_timesteps, 224, 224,num_chan))). The None will then allow variable batch size.

        • Mohamed November 25, 2017 at 2:15 am #

          yes that worked for me. Thanks

      • Yi Li April 16, 2018 at 6:05 am #

        I got the same error, have you solved it? May I ask you the way to solve it?

    • Jason Brownlee October 13, 2017 at 5:43 am #

      Sorry, I cannot debug your code. I list some places to get help with code here:
      https://machinelearningmastery.com/get-help-with-keras/

  7. Long October 15, 2017 at 12:33 pm #

    Hi Jason,

    Assuming there are a data set with time series data (e.g temperature, rainfall) and geographic data(e.g. elevation, slope) for many grid positions, I need to use the data set to predict(regression) future weathers.

    I think of a method with LSTM (for time series data) + auxiliary (geographic data) to be a solution. But the results of forecast is not very good. Do you have other better methods? Or do you have a related lessons?

    Thank you very much.

  8. Ravindra December 7, 2017 at 8:58 pm #

    Hi Jason, Thanks a lot for this. I am having trouble implementing the same architecture of TimeDistributed CNN with LSTM using functional API. It is throwing an error when I pass the TImeDistributed layer to maxpooling step saying the input is not a tensor. Could you please put few lines of code for the Timedistributed CNN output into LSTM using functional API?

  9. Alex December 9, 2017 at 5:27 am #

    Hi Jason,

    How would I implement a CNN-LSTM classification problem with variable input lengths?

  10. Alex December 9, 2017 at 5:57 am #

    With the padding approach, I am worried the LSTM might learn a dependency between sequence length and classification.

    My data is structured such that sequences with more inputs are MUCH more likely to be a certain class than sequences with less inputs. However, I don’t want my model to learn this dependency.

    Is my intuition correct? I remember reading in your earlier article that the LSTM will learn to ignore the padded sequences, but I wasn’t sure to what extent.

  11. Rui December 21, 2017 at 4:05 am #

    How to apply conv operation to the sequence itself instead of features (time sample data) ?

  12. Alex February 18, 2018 at 10:26 am #

    Nice intro, but it’s very incomplete. After reading this I know how to build a CNN LSTM, but I still don’t have any concept of what the input to it looks like, and therefore I don’t know how to train it. What does the input to the network look like, exactly? How do I reconcile the concepts of having a batch size but at the same time my input being a sequence? For someone who has never used RNNs before, this is not at all clear.

    • Jason Brownlee February 19, 2018 at 9:00 am #

      It really depends on the application, e.g. the specifics of the problem to which you wish to apply this method.

  13. Mary March 9, 2018 at 6:06 am #

    what is the difference between using the LSTM you show here and using the encoder decoder LSTM model in case of Video and image description?

  14. Vinay Rajpoot March 9, 2018 at 5:38 pm #

    Can it be used for video summarization. Do you have a code for it?

    • Jason Brownlee March 10, 2018 at 6:23 am #

      Perhaps. I don’t have a worked example for video summarization.

  15. Kanika March 20, 2018 at 8:21 am #

    You say : ” In both of these cases, conceptually there is a single CNN model and a sequence of LSTM models, one for each time step”

    Can you please explain me on how is back propogation working here ? Assuming my sequence length is T, I have confusion as follow :

    First interpretation : If a interpret in a way that for each LSTM unit I have corresponding CNN unit. So if input sequence of length T, I have T LSTM’s and corresponding T CNN’s. Then if I am assuming that I am learning weights by back propagation, then shouldn’t all the CNN’s have different weights ? How could all CNN have weight shared across time ?

    Second interpretation : Only one CNN and T LSTM. Features across T frames extracted using the same CNN and passed on to T LSTM’s with different weights. But then how is this kind of network learning weights for the CNN.

    I have really spent alot of time to understand but I am still confused. Would be really really helpful if you could answer 🙂

  16. Shivali Goel March 21, 2018 at 7:16 am #

    What should the input look like in terms of shape?

    for e.g. for a 45*45 image:
    x_train.shape = (num_images, 45,45,num_channels)

    y_train.shape =???

    • Shivali Goel March 21, 2018 at 7:17 am #

      heres the code & image is actually 56*56*1

      print "building model..."

      model = Sequential()
      # define CNN model
      model.add(TimeDistributed(Conv2D(32, (3, 3), activation = 'relu'),input_shape = (None, 56, 56, 1)))
      model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
      model.add(TimeDistributed(Flatten()))

      # define LSTM model
      model.add(LSTM(256,activation='tanh', return_sequences=True))
      model.add(Dropout(0.1))
      model.add(LSTM(256,activation='tanh', return_sequences=True))
      model.add(Dropout(0.1))
      model.add(Dense(2))
      model.add(Activation('softmax'))

      model.compile(loss='binary_crossentropy',
      optimizer='adam',
      class_mode='binary', metrics=['accuracy'])

      print model.summary()
      batch_size=1
      nb_epoch=100
      print len(final_input)
      print len(final_input1)

      X_train = numpy.array(final_input)
      X_test = numpy.array(final_input1)

      #y_train = numpy.array(y_train)
      #y_test = numpy.array(y_test)

      #y_train = y_train.reshape((10000,1))
      #y_test = y_test.reshape((1000,1))

      print "printing final shapes..."
      print "X_train: ", X_train.shape
      print "y_train: ", y_train.shape
      print "X_test: ", X_test.shape
      print "y_test: ", y_test.shape
      print

      print('Train...')

      model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,
      validation_data=(X_test, y_test))

      print('Evaluate...')
      score, acc = model.evaluate(X_test, y_test, batch_size=batch_size,
      show_accuracy=True)
      print('Test score:', score)
      print('Test accuracy:', acc)
    • Jason Brownlee March 21, 2018 at 3:05 pm #

      shape = num_images, k

      Where k is the number of classes or 1 for binary classification.

  17. Fathi April 9, 2018 at 6:28 am #

    Hi, I’m working on a CNN LSTM Network. When I compile the following code I get the error below. I have an input_shape but I still get an error when I compile the code. Can you please help me.

    Thank you.

    Code :

    # Importing the Keras libraries and packages

    from keras.models import Sequential
    from keras.layers import Conv2D
    from keras.layers import MaxPooling2D
    from keras.layers import Flatten
    from keras.layers import Dense
    from keras.layers import LSTM
    from keras.layers import Dropout
    from keras.layers import TimeDistributed

    # Initialising the CNN

    classifier = Sequential()

    # Step 1 – Convolutionclassifier = Sequential()

    classifier.add(TimeDistributed(Conv2D(32, (3, 3), input_shape = (64, 64, 3), activation = 'relu')))

    Error :

    ValueError: The first layer in a Sequential model must get an input_shape or batch_input_shape argument.

    • Jason Brownlee April 10, 2018 at 6:08 am #

      That is odd, I’m not sure what is going on there.

      • Fathi April 10, 2018 at 7:43 am #

        Do you have some advice for this situation please ?

        • Jason Brownlee April 11, 2018 at 6:28 am #

          Yes, I would recommend carefully debugging your code.

  18. Skye April 11, 2018 at 12:54 pm #

    Hi Jason,

    Thanks for your share!

    And is the convLSTM appropriate to solve the sea surface temperature prediction? I mean that we will input a sequence of grid maps and get the next temperature grid map?

    • Jason Brownlee April 11, 2018 at 4:19 pm #

      Perhaps. Try it and see.

      • Skye April 13, 2018 at 10:23 am #

        OK. Thank you! And do you have any suggestions for how the model should be modified for this problem?
