### A Gentle Introduction to CNN LSTM Recurrent Neural Networks

With example Python code.

Input with spatial structure, like images, cannot be modeled easily with the standard Vanilla LSTM.

The CNN Long Short-Term Memory Network, or CNN LSTM for short, is an LSTM architecture specifically designed for sequence prediction problems with spatial inputs, like images or videos.

In this post, you will discover the CNN LSTM architecture for sequence prediction.

After completing this post, you will know:

- About the development of the CNN LSTM model architecture for sequence prediction.
- Examples of the types of problems to which the CNN LSTM model is suited.
- How to implement the CNN LSTM architecture in Python with Keras.

Let’s get started.

## CNN LSTM Architecture

The CNN LSTM architecture involves using Convolutional Neural Network (CNN) layers for feature extraction on input data combined with LSTMs to support sequence prediction.

CNN LSTMs were developed for visual time series prediction problems and the application of generating textual descriptions from sequences of images (e.g. videos). Specifically, the problems of:

- **Activity Recognition**: Generating a textual description of an activity demonstrated in a sequence of images.
- **Image Description**: Generating a textual description of a single image.
- **Video Description**: Generating a textual description of a sequence of images.

> [CNN LSTMs are] a class of models that is both spatially and temporally deep, and has the flexibility to be applied to a variety of vision tasks involving sequential inputs and outputs

— Long-term Recurrent Convolutional Networks for Visual Recognition and Description, 2015.

This architecture was originally referred to as a Long-term Recurrent Convolutional Network or LRCN model, although we will use the more generic name “CNN LSTM” to refer to LSTMs that use a CNN as a front end in this lesson.

This architecture is used for the task of generating textual descriptions of images. Key to the approach is the use of a CNN that is pre-trained on a challenging image classification task and re-purposed as a feature extractor for the caption generation problem.

> … it is natural to use a CNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences

— Show and Tell: A Neural Image Caption Generator, 2015.

This architecture has also been used on speech recognition and natural language processing problems where CNNs are used as feature extractors for the LSTMs on audio and textual input data.

This architecture is appropriate for problems that:

- Have spatial structure in their input, such as the 2D structure of pixels in an image or the 1D structure of words in a sentence, paragraph, or document.
- Have a temporal structure in their input such as the order of images in a video or words in text, or require the generation of output with temporal structure such as words in a textual description.


## Implement CNN LSTM in Keras

We can define a CNN LSTM model to be trained jointly in Keras.

A CNN LSTM can be defined by adding CNN layers on the front end followed by LSTM layers with a Dense layer on the output.

It is helpful to think of this architecture as defining two sub-models: the CNN Model for feature extraction and the LSTM Model for interpreting the features across time steps.

Let’s take a look at both of these sub models in the context of a sequence of 2D inputs which we will assume are images.

### CNN Model

As a refresher, we can define a 2D convolutional network as comprised of Conv2D and MaxPooling2D layers ordered into a stack of the required depth.

The Conv2D will interpret snapshots of the image (e.g. small squares) and the pooling layers will consolidate or abstract the interpretation.

For example, the snippet below expects to read in 10×10 pixel images with 1 channel (e.g. black and white). The Conv2D will read the image in 2×2 snapshots and output one new 10×10 interpretation of the image. The MaxPooling2D will pool the interpretation into 2×2 blocks reducing the output to a 5×5 consolidation. The Flatten layer will take the single 5×5 map and transform it into a 25-element vector ready for some other layer to deal with, such as a Dense for outputting a prediction.

```python
cnn = Sequential()
cnn.add(Conv2D(1, (2,2), activation='relu', padding='same', input_shape=(10,10,1)))
cnn.add(MaxPooling2D(pool_size=(2, 2)))
cnn.add(Flatten())
```
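To make the shape arithmetic concrete, here is a minimal NumPy sketch (not the Keras implementation) of 2×2 max pooling followed by flattening on a 10×10 single-channel image:

```python
import numpy as np

def max_pool_2x2(img):
    """Pool a (H, W) array into non-overlapping 2x2 blocks, keeping the max of each."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

image = np.arange(100, dtype=float).reshape(10, 10)  # stand-in for a 10x10 image
pooled = max_pool_2x2(image)
flat = pooled.flatten()
print(pooled.shape, flat.shape)  # (5, 5) (25,)
```

The 10×10 map collapses to 5×5 after pooling, and flattening yields the 25-element vector described above.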

This makes sense for image classification and other computer vision tasks.

### LSTM Model

The CNN model above is only capable of handling a single image, transforming it from input pixels into an internal matrix or vector representation.

We need to repeat this operation across multiple images and allow the LSTM to build up internal state and update weights using BPTT across a sequence of the internal vector representations of input images.

The CNN could be fixed, as in the case of using an existing pre-trained model like VGG for feature extraction from images. Alternately, the CNN may be untrained, and we may wish to train it by backpropagating error from the LSTM across multiple input images to the CNN model.

In both of these cases, conceptually there is a single CNN model and a sequence of LSTM models, one for each time step. We want to apply the CNN model to each input image and pass on the output of each input image to the LSTM as a single time step.

We can achieve this by wrapping the entire CNN input model (one layer or more) in a TimeDistributed layer. This layer achieves the desired outcome of applying the same layer or layers multiple times. In this case, applying it multiple times to multiple input time steps and in turn providing a sequence of “image interpretations” or “image features” to the LSTM model to work on.

```python
model.add(TimeDistributed(...))
model.add(LSTM(...))
model.add(Dense(...))
```
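Conceptually, TimeDistributed just maps one sub-model over the time axis with shared weights. In plain Python terms (an illustrative sketch of the semantics, not the Keras internals):

```python
def time_distributed(layer_fn, sequence):
    """Apply the same function to every time step; weights are shared across steps."""
    return [layer_fn(step) for step in sequence]

# toy "CNN": sums the pixels of each 2x2 frame into a single feature
cnn = lambda frame: sum(sum(row) for row in frame)

frames = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # a sequence of two 2x2 "images"
features = time_distributed(cnn, frames)
print(features)  # [10, 26]
```

The LSTM then receives one feature (or feature vector) per time step, exactly as if the sub-model had been copied once per step.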

We now have the two elements of the model; let’s put them together.

### CNN LSTM Model

We can define a CNN LSTM model in Keras by first defining the CNN layer or layers, wrapping them in a TimeDistributed layer and then defining the LSTM and output layers.

We have two ways to define the model that are equivalent and only differ as a matter of taste.

You can define the CNN model first, then add it to the LSTM model by wrapping the entire sequence of CNN layers in a TimeDistributed layer, as follows:

```python
# define CNN model
cnn = Sequential()
cnn.add(Conv2D(...))
cnn.add(MaxPooling2D(...))
cnn.add(Flatten())
# define LSTM model
model = Sequential()
model.add(TimeDistributed(cnn, ...))
model.add(LSTM(...))
model.add(Dense(...))
```

An alternate, and perhaps easier to read, approach is to wrap each layer in the CNN model in a TimeDistributed layer when adding it to the main model.

```python
model = Sequential()
# define CNN model
model.add(TimeDistributed(Conv2D(...)))
model.add(TimeDistributed(MaxPooling2D(...)))
model.add(TimeDistributed(Flatten()))
# define LSTM model
model.add(LSTM(...))
model.add(Dense(...))
```

The benefit of this second approach is that all of the layers appear in the model summary; as such, it is the preferred approach for now.

You can choose the method that you prefer.
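Putting the two pieces together, here is a complete, runnable sketch assuming TensorFlow's bundled Keras and toy dimensions I have chosen for illustration (sequences of 5 frames of 10×10 grayscale images, a binary output); the layer sizes are not a recommendation:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Flatten,
                                     LSTM, Dense, TimeDistributed)

model = Sequential()
# CNN front end, applied identically to each of the 5 time steps
model.add(TimeDistributed(Conv2D(2, (2, 2), activation='relu', padding='same'),
                          input_shape=(5, 10, 10, 1)))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Flatten()))
# LSTM back end interprets the per-frame feature vectors
model.add(LSTM(16))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')

# one random sequence of 5 frames -> one prediction
x = np.random.rand(1, 5, 10, 10, 1)
print(model.predict(x, verbose=0).shape)  # (1, 1)
```

Note the 5D input shape: (samples, time steps, rows, cols, channels). Getting this shape wrong is the most common source of errors with this architecture.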

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Papers on CNN LSTM

- Long-term Recurrent Convolutional Networks for Visual Recognition and Description, 2015.
- Show and Tell: A Neural Image Caption Generator, 2015.
- Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks, 2015.
- Character-Aware Neural Language Models, 2015.
- Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, 2015.

### Posts

- Crash Course in Convolutional Neural Networks for Machine Learning
- Sequence Classification with LSTM Recurrent Neural Networks in Python with Keras

## Summary

In this post, you discovered the CNN LSTM model architecture.

Specifically, you learned:

- About the development of the CNN LSTM model architecture for sequence prediction.
- Examples of the types of problems to which the CNN LSTM model is suited.
- How to implement the CNN LSTM architecture in Python with Keras.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Would this architecture, with some adaptations, also be suitable to do speech recognition, speaker separation, language detection and other natural language processing tasks?

Perhaps.

I have seen it most used for document classification / sentiment analysis in NLP.

Hi Jason, could you please provide a few references for the use of such CNN + LSTM architectures in the text domain (document classification, sentiment analysis, etc.)?

Perhaps search on scholar.google.com.

Hi Jason, I am very distressed and would like to ask you a question. For example, with data in the range 0-500, the magnitude of the data varies quite a lot. When I use an LSTM model to predict, the accuracy is too low, and even data normalization has not helped. May I ask you, how should the data be processed? Thank you so much!

Perhaps normalize the data first?

cnn + lstm architecture used for speech recognition

– https://arxiv.org/pdf/1610.03022.pdf

Thanks for sharing Dan.

I want to use Conv1D to process time series data and then input it into an LSTM. Can you recommend example code to practice with?

what is difference with ConvLSTM2D layer ?

https://github.com/fchollet/keras/blob/master/examples/conv_lstm.py

As far as I know, that layer is not yet supported. I have tried to stay away from it until all the bugs are worked out of it.

ConvLSTM is a variant of LSTM which uses a convolution to replace the inner product within the LSTM unit,

while CNN LSTM is just a stack of layers: a CNN followed by an LSTM.

Have you used it on a project Dan?

Not yet, I’m just waiting next tensorflow release since it seems that convlstm would be provided as tf.contrib.rnn.ConvLSTMCell, instead I’ve used cnn + lstm on simple speech recognition experiments and it gives better results than stack of lstm. It really works!

Thanks Dan.

I hope to try some examples myself for the blog soon.

@Dan Lim can you share me your script in speech recognition and thanks you.

Hi, Jason.

Do you think the CNN LSTM can solve a regression problem whose inputs are some time series data and some properties/exogenous (spatial) data, not image data? If yes, how do we deal with the properties/exogenous data (2D) in the CNN? Thank you.

I'm having the same question.

Perhaps, I have not tried using CNN LSTMs for time series.

Perhaps each series could be processed by a 1D-CNN.

It might not make sense given that the LSTM is already interpreting the long term relationships in the data.

It might be interesting if the CNN can pick out structure that is new/different from the LSTM. Perhaps you could have both a CNN and LSTM interpretation of the series and use another model to integrate and interpret the results.
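For intuition, a 1D convolution over a plain series can be sketched in NumPy (an illustrative hand-picked filter, not a learned one):

```python
import numpy as np

def conv1d(series, kernel):
    """Valid-mode 1D convolution (cross-correlation) of a series with a filter."""
    k = len(kernel)
    return np.array([np.dot(series[i:i + k], kernel)
                     for i in range(len(series) - k + 1)])

series = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
diff = conv1d(series, np.array([-1.0, 1.0]))  # first-difference filter
print(diff)  # [1. 2. 4. 8.]
```

A trained Conv1D layer learns filters like this from data, which is how it might pick out local structure that the LSTM alone would have to discover step by step.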

I tried to use CNN + LSTM for timeseries forecasting, hoping that CNN can uncover some structure in the input signals. So far, it seems to perform worse than a 2-layered LSTM model, even after tuning hyperparameters. I thought I would get your book to look at the details, but sounds like this was not covered in the book? Your previous posting on LSTM model was very helpful. Thank you!

Generally, LSTMs perform worse on every time series problem I have tried them on (20+).

You can learn why here:

https://machinelearningmastery.com/suitability-long-short-term-memory-networks-time-series-forecasting/

I recommend exhausting classical time series methods first, then try sklearn models, and then perhaps try neural nets.

@Jen Liu, would like to see you manage to uncover some of the hidden signals for your implementation. Can you please share some insight on your CNN + LSTM for time series forecasting? thank you.

I have been using the approach recently with great success.

I have posts scheduled on the topic.

Hi, Miles.

I'm having the same question. Have you made any research progress on time series using CNN LSTMs?

Hi do you have a github implementation ?

I have a full code example in my book on LSTMs.

Hi Jason,

Thank you for the great work and posts.

I’m starting my studies with deep learning, python and keras.

I would like knowing how to implement the CNN with ELM (extreme learning machine) architecture in Python with Keras for classification task. Do you have a github implementation?

Sorry, I do not.

Thank you for your great examples…

May I ask for the full code of the CNN LSTM you explained above?

Because I am having errors related to the dimensions of the CNN and LSTM.

I have followed your previous examples and trying to build VGG-16Net stacked with LSTM.

My database is just 10 different human motion (10 classes) such as walking and running etc…

My code is as below:

```python
# dimensions of our images.
img_width, img_height = 224, 224
train_data_dir = 'db/train'
validation_data_dir = 'db/test'
nb_train_samples = 400
nb_validation_samples = 200
num_timesteps = 10  # length of sequence
num_class = 10
epochs = 10
batch_size = 8
lstm_input_len = 224 * 224
input_shape = (224, 224, 3)
num_chan = 3

# VGG16 as CNN
cnn = Sequential()
cnn.add(ZeroPadding2D((1,1), input_shape=input_shape))
cnn.add(Conv2D(64, 3, 3, activation='relu'))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(64, 3, 3, activation='relu'))
cnn.add(MaxPooling2D((2,2), strides=(2,2), dim_ordering="th"))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(128, 3, 3, activation='relu'))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(128, 3, 3, activation='relu'))
cnn.add(MaxPooling2D((2,2), strides=(2,2), dim_ordering="th"))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(256, 3, 3, activation='relu'))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(256, 3, 3, activation='relu'))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(256, 3, 3, activation='relu'))
cnn.add(MaxPooling2D((2,2), strides=(2,2), dim_ordering="th"))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(512, 3, 3, activation='relu'))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(512, 3, 3, activation='relu'))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(512, 3, 3, activation='relu'))
cnn.add(MaxPooling2D((2,2), strides=(2,2), dim_ordering="th"))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(512, 3, 3, activation='relu'))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(512, 3, 3, activation='relu'))
cnn.add(ZeroPadding2D((1,1)))
cnn.add(Conv2D(512, 3, 3, activation='relu'))
cnn.add(MaxPooling2D((2,2), strides=(2,2), dim_ordering="th"))
cnn.add(Flatten())
cnn.add(Dense(4096, activation='relu'))
cnn.add(Dropout(0.5))
cnn.add(Dense(4096, activation='relu'))

# LSTM
model = Sequential()
model.add(TimeDistributed(cnn, input_shape=(num_timesteps, 224, 224, num_chan)))
model.add(LSTM(num_timesteps))
model.add(Dropout(.2))  # added
model.add(Dense(num_class, activation='softmax'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# this is the augmentation configuration we will use for training
train_datagen = ImageDataGenerator(rescale=1. / 255)

# this is the augmentation configuration we will use for testing:
# only rescaling
test_datagen = ImageDataGenerator(rescale=1. / 255)

train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size=(224, 224),
    batch_size=batch_size,
    class_mode='binary')

validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size=(224, 224),
    batch_size=batch_size,
    class_mode='binary')

model.fit_generator(
    train_generator,
    steps_per_epoch=nb_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=nb_validation_samples // batch_size)
```

I forgot to include the error, which is:

ValueError: Error when checking input: expected time_distributed_9_input to have 5 dimensions, but got array with shape (8, 224, 224, 3)

You also need to specify a batch size in the input dimensions to that layer, I guess, to get the fifth dimension. Try using:

`model.add(TimeDistributed(cnn, input_shape=(None, num_timesteps, 224, 224, num_chan)))`

The `None` will then allow a variable batch size.

Yes, that worked for me. Thanks.

I got the same error, have you solved it? May I ask you the way to solve it?

Sorry, I cannot debug your code. I list some places to get help with code here:

https://machinelearningmastery.com/get-help-with-keras/

Hello. How do you feed inputs to your network ? How is it sure that they are fed sequentially?

Hi Jason,

Assuming there is a data set with time series data (e.g. temperature, rainfall) and geographic data (e.g. elevation, slope) for many grid positions, I need to use the data set to predict (regression) future weather.

I think of a method with an LSTM (for the time series data) + auxiliary input (geographic data) as a solution. But the forecast results are not very good. Do you have other, better methods? Or do you have a related lesson?

Thank you very much.

Perhaps a deep MLP with a window over lag obs.

Hi Jason,

Could you please explain it in detail? Thank you very much.

This function will help you reshape your time series data to be a supervised learning problem:

https://machinelearningmastery.com/convert-time-series-supervised-learning-problem-python/

You can then use a neural network model.
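The reshaping that the link describes amounts to sliding a window over the series; a minimal sketch (a simplified version, not the full `series_to_supervised` function from that post):

```python
def to_supervised(series, n_in=1):
    """Frame a univariate series as (input window, next value) pairs."""
    pairs = []
    for i in range(len(series) - n_in):
        pairs.append((series[i:i + n_in], series[i + n_in]))
    return pairs

print(to_supervised([10, 20, 30, 40], n_in=2))
# [([10, 20], 30), ([20, 30], 40)]
```

Each pair becomes one training sample: the window is the input and the following value is the target.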

Hi Jason, thanks a lot for this. I am having trouble implementing the same architecture of TimeDistributed CNN with LSTM using the functional API. It throws an error when I pass the TimeDistributed layer to the max pooling step, saying the input is not a tensor. Could you please post a few lines of code for feeding the TimeDistributed CNN output into an LSTM using the functional API?

Perhaps try posting your code to stackoverflow?

Having the same problem, help would be appreciated!

Perhaps try using the Sequential API instead?

Perhaps try posting to one of these locations:

https://machinelearningmastery.com/get-help-with-keras/

Hi Jason,

How would I implement a CNN-LSTM classification problem with variable input lengths?

Use padding or truncating to make the inputs the same length:

https://machinelearningmastery.com/handle-long-sequences-long-short-term-memory-recurrent-neural-networks/
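As a rough illustration of the padding approach, here is a minimal pure-Python sketch of what Keras's `pad_sequences` does by default (pre-padding with zeros to the longest length):

```python
def pad_sequences(seqs, value=0.0):
    """Left-pad every sequence to the length of the longest one (a minimal sketch
    of keras.preprocessing.sequence.pad_sequences with its default settings)."""
    maxlen = max(len(s) for s in seqs)
    return [[value] * (maxlen - len(s)) + list(s) for s in seqs]

padded = pad_sequences([[1, 2, 3], [4, 5]])
print(padded)  # [[1, 2, 3], [0.0, 4, 5]]
```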

With the padding approach, I am worried the LSTM might learn a dependency between sequence length and classification.

My data is structured such that sequences with more inputs are MUCH more likely to be a certain class than sequences with less inputs. However, I don’t want my model to learn this dependency.

Is my intuition correct? I remember reading in your earlier article that the LSTM will learn to ignore the padded sequences, but I wasn’t sure to what extent.

You can use a Mask to ignore padded values.
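A minimal sketch of that idea, assuming TensorFlow's bundled Keras and toy shapes (sequences of 4 steps with 1 feature):

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

model = Sequential()
# time steps whose features all equal mask_value are skipped by downstream layers
model.add(Masking(mask_value=0.0, input_shape=(4, 1)))
model.add(LSTM(8))
model.add(Dense(1, activation='sigmoid'))

x = np.array([[[0.0], [0.0], [0.5], [0.9]]])  # two padded steps, two real steps
print(model.predict(x, verbose=0).shape)  # (1, 1)
```

With the mask in place, the LSTM's state is only updated on the real time steps, which should reduce the risk of the model exploiting sequence length.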

How to apply conv operation to the sequence itself instead of features (time sample data) ?

What do you mean exactly?

Nice intro, but it’s very incomplete. After reading this I know how to build a CNN LSTM, but I still don’t have any concept of what the input to it looks like, and therefore I don’t know how to train it. What does the input to the network look like, exactly? How do I reconcile the concepts of having a batch size but at the same time my input being a sequence? For someone who has never used RNNs before, this is not at all clear.

It really depends on the application, e.g. the specifics of the problem to which you wish to apply this method.

what is the difference between using the LSTM you show here and using the encoder decoder LSTM model in case of Video and image description?

A difference in architecture.

Can it be used for video summarization. Do you have a code for it?

Perhaps. I don’t have a worked example for video summarization.

You say : ” In both of these cases, conceptually there is a single CNN model and a sequence of LSTM models, one for each time step”

Can you please explain to me how backpropagation works here? Assuming my sequence length is T, my confusion is as follows:

First interpretation: If I interpret it such that for each LSTM unit I have a corresponding CNN unit, then for an input sequence of length T, I have T LSTMs and T corresponding CNNs. If I am learning weights by backpropagation, shouldn't all the CNNs have different weights? How could all the CNNs have weights shared across time?

Second interpretation: Only one CNN and T LSTMs. Features across T frames are extracted using the same CNN and passed on to T LSTMs with different weights. But then how is this kind of network learning weights for the CNN?

I have really spent a lot of time trying to understand but I am still confused. It would be really, really helpful if you could answer 🙂

This post will help understand BPTT:

https://machinelearningmastery.com/gentle-introduction-backpropagation-time/

The LSTM is taking the interpretation of the input from the CNN (e.g. the CNNs output), which provides more perception of the input in the case of images and sometime text data.

Thanks for sharing the BPTT link. I understood how the LSTM weights will be updated. But how about the CNN weights? I have attached an image of a simple network.

https://drive.google.com/file/d/1J6-iLpEbNFL32Du-3il_ztw8jrMIZVSD/view?usp=sharing

If backpropagation works like that, then after end-to-end training, wouldn't all the CNN weights be different? If yes, then how are they the same CNN?

I believe error is propagated back for each time step.

What should the input look like in terms of shape?

for e.g. for a 45*45 image:

x_train.shape = (num_images, 45,45,num_channels)

y_train.shape =???

Here's the code; the image is actually 56*56*1:

```python
print "building model..."
model = Sequential()
# define CNN model
model.add(TimeDistributed(Conv2D(32, (3, 3), activation='relu'), input_shape=(None, 56, 56, 1)))
model.add(TimeDistributed(MaxPooling2D(pool_size=(2, 2))))
model.add(TimeDistributed(Flatten()))
# define LSTM model
model.add(LSTM(256, activation='tanh', return_sequences=True))
model.add(Dropout(0.1))
model.add(LSTM(256, activation='tanh', return_sequences=True))
model.add(Dropout(0.1))
model.add(Dense(2))
model.add(Activation('softmax'))
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              class_mode='binary', metrics=['accuracy'])
print model.summary()

batch_size = 1
nb_epoch = 100
print len(final_input)
print len(final_input1)
X_train = numpy.array(final_input)
X_test = numpy.array(final_input1)
#y_train = numpy.array(y_train)
#y_test = numpy.array(y_test)
#y_train = y_train.reshape((10000,1))
#y_test = y_test.reshape((1000,1))
print "printing final shapes..."
print "X_train: ", X_train.shape
print "y_train: ", y_train.shape
print "X_test: ", X_test.shape
print "y_test: ", y_test.shape
print

print('Train...')
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=nb_epoch,
          validation_data=(X_test, y_test))
print('Evaluate...')
score, acc = model.evaluate(X_test, y_test, batch_size=batch_size,
                            show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)
```

shape = num_images, k

Where k is the number of classes or 1 for binary classification.
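For the multi-class case, one-hot encoding produces that (num_images, k) shape; a minimal sketch (Keras's `to_categorical` utility does the same thing on arrays):

```python
def one_hot(labels, k):
    """Encode integer class labels as a (num_images, k) list of 0/1 rows."""
    return [[1 if j == y else 0 for j in range(k)] for y in labels]

print(one_hot([0, 2, 1], k=3))
# [[1, 0, 0], [0, 0, 1], [0, 1, 0]]
```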

Hi, I’m working on a CNN LSTM Network. When I compile the following code I get the error below. I have an input_shape but I still get an error when I compile the code. Can you please help me.

Thank you.

Code :

```python
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Conv2D
from keras.layers import MaxPooling2D
from keras.layers import Flatten
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Dropout
from keras.layers import TimeDistributed

# Initialising the CNN
classifier = Sequential()

# Step 1 - Convolution
classifier.add(TimeDistributed(Conv2D(32, (3, 3), input_shape=(64, 64, 3), activation='relu')))
```

Error:

```
ValueError: The first layer in a Sequential model must get an `input_shape` or `batch_input_shape` argument.
```

That is odd, I’m not sure what is going on there.

Do you have some advice for this situation please ?

Yes, I would recommend carefully debugging your code.

Hi Fathy, did you get a solution to your problem? I am going through the same trouble. If you found a solution, please let me know.

You need to specify the time dimension: input_shape = (#time steps per sample, 64, 64, 3)

Hi Jason,

Thanks for your share!

And is the ConvLSTM appropriate for sea surface temperature prediction? I mean that we would input a sequence of grid maps and get the next temperature grid map?

Perhaps. Try it and see.

OK. Thank you! And do you have any suggestions for how the model should be modified for this problem?

Hi, did you solve your problem? I am confused about importing these images as input for convLSTM? Can you share the code about data input, please?

Hey there,

Thanks for your informative post… It was very useful!

I want to do a similar task but a bit more complicated. Consider that we want to generalize our network to be usable for different sizes. Therefore we need to look at frames at the patch scale, then the effect of an image's patches on the image result, and then the image results for the video. (Note that resizing is not possible in my case!)

In other words, consider that we want to use videos in the network where each video has a different number of frames, and frames of different videos may have a different number of patches given different frame sizes. Therefore the input dimension should be e.g. [None(for batch), None(for frame), None(for patch), 100, 100, 3]

Actually, I could not program this with Keras or TensorFlow! Would you please help with this?

With videos that have a different number of frames, you could:

– normalize to have the same number of frames

– pad videos to have the same number of frames and maybe use a masking layer

– use a network that is indifferent to sequence length

– … more ideas

With different patch sizes, I think you mean different filter sizes. If so, you can use a multiple input model as described here:

https://machinelearningmastery.com/keras-functional-api-deep-learning/

Does that help?

Hi Jason,

Thanks for your blog! I have some questions about how to apply this integrated model for my data. Now, I have time-series images with multiple bands for crop yield regression, how do I import these data as input for this model? Can you give me any examples or some references I can go to? Thanks so much!

What problem are you having exactly?

You can load images using Python tools, such as PIL.

http://www.pythonware.com/products/pil/

I give a fuller example in the book with contrived images:

https://machinelearningmastery.com/lstms-with-python/

i get an error “ValueError: The first layer in a Sequential model must get an

`input_shape`

or`batch_input_shape`

argument.” when I run the following code of my model:model=Sequential()

K.set_image_dim_ordering(‘th’)

model.add(TimeDistributed(Convolution2D(64, (2,2), border_mode= ‘valid’ , input_shape=(1, 2, 2), activation= ‘relu’)))

model.add(TimeDistributed(Convolution2D(64, (1,1), border_mode= ‘valid’, activation= ‘relu’)))

model.add(TimeDistributed(MaxPooling2D(pool_size=(1,1))))

model.add(TimeDistributed(Convolution2D(64, (1,1), activation= ‘relu’ )))

model.add(TimeDistributed(MaxPooling2D(pool_size=(1,1))))

model.add(TimeDistributed(Dropout(0.0)))

model.add(TimeDistributed(Flatten()))

model.add(TimeDistributed(Dense(16, activation= ‘relu’ )))

model.add(TimeDistributed(Dense(16, activation= ‘relu’ )))

#lstm

m=Sequential()

m.add(LSTM(units = 1, activation=’sigmoid’))

I’m eager to help, but I cannot debug your code for you.

Perhaps post to stackoverflow?

Hey Jason, This example is very enlightening!

I’m currently aiming to do anomaly detection on some radio-astronomic data, which consists of .tiff image files, where the horizontal axis is the time stamp and the vertical axis is frequency. In this case, using the frequency axis as space (since signals come in varied frequencies), do you think it would be better to apply a 1D convolutional layer than just using a normal LSTM layer when encoding the images? I understand there is a spatial dependence in my data, but it’s only 1-dimensional. I would like to know your opinion about this.

Btw, got your machine learning/deep learning/LSTM bundle, You’ve been my mentor these past months!

Yes, try 1D CNNs.

For example, 1d CNNs are useful for sequences of words as input which has parallels with what you’re describing I think.

Hey Jason, your posts are really good! I was reading your book and tried to apply the example to a time series classification problem using a sequence of time series images, like this guy does in his post: http://amunategui.github.io/unconventional-convolutional-networks/index.html

My images number 20000 (each frame adds the next 30-minute price), 50×50, 1 channel.

The problem is that despite using all the regularization that I could, almost all of my architectures sit at about 0.51 accuracy. This is the last one that I made:

```python
model = Sequential()
model.add(TimeDistributed(Conv2D(5, (3,3), kernel_initializer="he_normal", activation='relu', kernel_regularizer=l2(0.0001)),
                          input_shape=(None, img_rows, img_cols, 1)))
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(1,1))))
model.add(TimeDistributed(Dropout(0.75)))
model.add(TimeDistributed(BatchNormalization()))
model.add(TimeDistributed(Conv2D(3, (2,2), kernel_initializer="he_normal", activation='relu', kernel_regularizer=l2(0.0001))))
model.add(TimeDistributed(MaxPooling2D((2, 2), strides=(1, 1))))
model.add(TimeDistributed(Dropout(0.75)))
model.add(TimeDistributed(Flatten()))
model.add(Bidirectional(LSTM(50)))
model.add(Dropout(0.7))
model.add(Dense(num_classes, activation='softmax'))
model.summary()
model.compile(loss='binary_crossentropy', optimizer=keras.optimizers.Adam(lr=1e-6), metrics=['accuracy'])
```

So I wanted to ask you: how could you avoid overfitting in this type of architecture? And do the height and width of the frames affect how the model identifies the patterns? In my problem, I don’t know whether the very small variations between my images (because they are nearly the same) could have an impact on the accuracy and the overfitting.

It will be really nice if you know how to help me!

Thank you, you have a nice books! 🙂

Interesting approach, I would prefer to model the data directly instead of the images. Perhaps with a 1D cnn.

A good approach to stop overfitting with neural nets is early stopping against a validation dataset.

Keras supports this here:

https://keras.io/callbacks/#earlystopping

Hello Jason,

You help me a lot.

I have a problem here. I have a project that uses a CNN-LSTM model. However, when I use a 1D CNN, a Maxpooling layer over the number of filters performs better than a Maxpooling layer over the data size. So I have to resize the data after the CNN layer with a Permute layer. What do you think about this?

If it results in good performance, then go with it.

Alternately, I wonder if you can explore alternate filter sizes in the CNN layer to change the output size?

Thank you for the answer.

Actually, I have already changed the filter sizes multiple times. I know that normally the Maxpooling layer is applied to reduce the data size, not the number of filters. Even Keras only supports Maxpooling in Conv2D over the width or height of the data, so I am a little worried about this.

Greeting Dr.Jason

My thanks to your tutorial. I’ve got some question.

According to your tutorial here https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/

I wonder if I could implement your CNN LSTM idea with that tutorial? If so, what should I change in the code? I am trying to implement it but somehow I am stuck.

Also, does it make sense to use this model for classification work?

I would appreciate it if you could answer, Dr. Jason. Thank you so much.

Yes you could.

I have some tutorials on this scheduled.

Dear Jason

Thanks again for your tutorial

It’s a pity you don’t have a tutorial with a dataset that implements a CNN (Conv2D) and an LSTM together.

I believe I have an example in the LSTM book and I have some examples scheduled for the blog soon.

Hi Jason,

I am using GRUs for sequence learning in a captioning problem. What is meant by training loss in GRU training? My loss starts from 9.### and drops down to 0.29##, but if I keep training it starts to increase again. Any idea what makes the loss increase again?

My loss function is

loss_function = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)

loss_mean = tf.reduce_mean(loss_function)

y_true contains the token sequences of the captions

Training loss is a measure of how well your model fits the training data.

You can learn more about cross entropy here:

https://en.wikipedia.org/wiki/Loss_functions_for_classification

Hellooo Mr Brownlee

I see in the comments that you have mentioned you might investigate the ConvLSTM layer now available in Keras. I first want to thank you for such an immense contribution; your blog has been extremely useful to me in understanding LSTMs. It must take a lot of your time to keep up with all these comments on top of providing the content that you do. However, I have read many of your posts and the knowledge I have still fails me!

I am hoping to take advantage of ConvLSTM2D for a video segmentation problem. It is a very different application from the sequence predictions usually covered on this blog. I have N videos of 500 frames each, and each video corresponds to a single 2D segmentation mask. I think it is a many-to-one problem:

Input: (N, 500, cols, rows, 1)

Output: (N, 1, cols, rows, 1)

As per your post on how to deal with long sequences, I have adjusted my input to contain sequence “fragments”, for example of 50 time steps, so that I now have:

Input: (N, 10, 50, cols, rows, 1)

Output (N, 1, 1, cols, rows, 1)

Which does not work out so well, because the Keras LSTM expects a 5D array, not 6D. My understanding was that I would be able to feed a single sequence at a time into a stateful LSTM (500 images chopped up into fragments of 50), and that I could somehow remember the state across the 500 images in order to make a final prediction before deciding whether to update the gradients or not.

My implementation approach did not work with Input: (10, 50, cols, rows, 1), as here “10” is considered the number of samples, and thus the corresponding output is required to be (10, 1, cols, rows, 1), i.e. a segmentation mask every 50 frames, which is not what I am looking for.

I can duplicate the segmentation 10 times to produce the desired output but I am not sure that is the right way to go.

Or should I wait for the blog posts?

I do have some posts scheduled using the ConvLSTM2D for time series, but not video.

Nevertheless, I’d encourage you to get the model working first by any means, then make it work well.

Include all 500 frames of video as time steps or just the first 200. Each video is then a sample, then you can treat the rows, cols and channels like any image.

Once you have the model working, check if you need all frames, maybe only use every 5th or 20th frame or something. Evaluate the impact on the model skill.
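The frame subsampling suggested above can be done with a simple array slice; the shapes here are made up for illustration:

```python
import numpy as np

# Hypothetical data: 4 videos, 500 frames each, 32x32 grayscale pixels.
X = np.zeros((4, 500, 32, 32, 1))

# Keep every 5th frame to shorten the sequence the LSTM must process.
X_sub = X[:, ::5]

print(X_sub.shape)  # (4, 100, 32, 32, 1)
```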

Okay thanks. I will see what happens

I tried to apply CNN+LSTM+CTC to scanned text line recognition. Would you recommend any sources for better understanding?

Nice work!

Perhaps try some searching on scholar.google.com

Awesome article as always. I would like to clear up a question that came up. Do ConvLSTMs [https://github.com/keras-team/keras/blob/master/examples/conv_lstm.py] mean the same thing as a convolutional neural network followed by an LSTM? I understand you are trying to extract features using the CNN before passing them on to an LSTM, so they should technically be the same?

No, a ConvLSTM is different from a CNN-LSTM.

A ConvLSTM will perform convolutions as part of the inputs to the LSTM unit.

A CNN-LSTM is a model architecture that uses a CNN model on the input and an LSTM model to interpret the sequence of features extracted by the CNN.
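To make the distinction concrete, here is a sketch of both architectures in Keras (layer sizes and input shapes are illustrative): the CNN-LSTM wraps a CNN in TimeDistributed so the same CNN runs on each time step, while the ConvLSTM2D layer performs the convolutions inside the recurrent cell itself.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, ConvLSTM2D, Dense, Flatten,
                                     LSTM, MaxPooling2D, TimeDistributed)

# CNN-LSTM: a CNN extracts features from each frame, then an
# LSTM reads the resulting sequence of feature vectors.
cnn_lstm = Sequential([
    TimeDistributed(Conv2D(16, (3, 3), activation='relu'),
                    input_shape=(10, 64, 64, 1)),
    TimeDistributed(MaxPooling2D((2, 2))),
    TimeDistributed(Flatten()),
    LSTM(32),
    Dense(1, activation='sigmoid'),
])

# ConvLSTM: the convolutions are part of the LSTM unit itself.
conv_lstm = Sequential([
    ConvLSTM2D(16, (3, 3), input_shape=(10, 64, 64, 1)),
    Flatten(),
    Dense(1, activation='sigmoid'),
])
```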

Hi Jason,

Thank you for the article.

I was hoping to get your inputs and advice on the model I’m trying to build.

The goal of the model is to act as a PoS tagger using a combination of CNN and LSTM.

The CNN portion receives as input word vector representations from a GloVe embedding and hopefully learns information about the word/sequence.

BiLSTM will then process the output from CNN.

A TimeDistributed layer is added at the dense layer for prediction.

The model trains without issues but in terms of performance, the metrics are worse than a pure LSTM model.

Am I constructing the model wrongly?

It’s hard for me to say. Develop then evaluate the model, then use that as feedback as to whether the model is constructed well.

Thanks for the reply Jason.

I have a few iterations of the model ranging from 1 CNN layer + 2 BLSTM layers to 3 CNN + 2 BLSTM.

In all cases, just a pure 2 BLSTM model outperforms them.

I’m kind of stuck, not sure if it’s a CNN or LSTM issue.

Perhaps this will help:

http://machinelearningmastery.com/improve-deep-learning-performance/

Thank you for the reply Jason, I’ll take a look at the post.

Thanks for the tutorial. I want to ask about your Keras backend: is it TensorFlow or Theano? Thanks

I currently use and recommend TensorFlow, but sometimes it can be challenging to install on some platforms, in that case I recommend Theano.

How do we feed video frames as input to a CNN+LSTM model? I’m currently working on that and unaware of how it could be done. Could you guide me on this? Basically, I want to know about the input part of the model.

Each image is one step in a sequence of images (e.g. time steps), each sample is one sequence of images.
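As a concrete (hypothetical) example of that shape, the input array for a CNN+LSTM on video is 5D, i.e. (samples, frames, rows, cols, channels):

```python
import numpy as np

# Hypothetical dataset: 8 videos, 20 frames each, 64x64 grayscale pixels.
num_videos, frames, rows, cols, channels = 8, 20, 64, 64, 1
X = np.random.rand(num_videos, frames, rows, cols, channels)

# X[i] is one sample: a sequence of 20 images, each image one time step.
# This matches the input of a TimeDistributed CNN front end, e.g.
# TimeDistributed(Conv2D(...), input_shape=(frames, rows, cols, channels)).
print(X.shape)  # (8, 20, 64, 64, 1)
```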