Long Short-Term Memory networks, or LSTMs, are a popular and powerful type of Recurrent Neural Network, or RNN.

They can be quite difficult to configure and apply to arbitrary sequence prediction problems, even with well defined and “easy to use” interfaces like those provided in the Keras deep learning library in Python.

One reason for this difficulty in Keras is the use of the TimeDistributed wrapper layer and the need for some LSTM layers to return sequences rather than single values.

In this tutorial, you will discover different ways to configure LSTM networks for sequence prediction, the role that the TimeDistributed layer plays, and exactly how to use it.

After completing this tutorial, you will know:

- How to design a one-to-one LSTM for sequence prediction.
- How to design a many-to-one LSTM for sequence prediction without the TimeDistributed Layer.
- How to design a many-to-many LSTM for sequence prediction with the TimeDistributed Layer.

Let’s get started.

## Tutorial Overview

This tutorial is divided into 5 parts; they are:

- TimeDistributed Layer
- Sequence Learning Problem
- One-to-One LSTM for Sequence Prediction
- Many-to-One LSTM for Sequence Prediction (without TimeDistributed)
- Many-to-Many LSTM for Sequence Prediction (with TimeDistributed)

### Environment

This tutorial assumes a Python 2 or Python 3 development environment with SciPy, NumPy, and Pandas installed.

The tutorial also assumes scikit-learn and Keras v2.0+ are installed with either the Theano or TensorFlow backend.

For help setting up your Python environment, see the post:


## TimeDistributed Layer

LSTMs are powerful, but hard to use and hard to configure, especially for beginners.

An added complication is the TimeDistributed Layer (and the former *TimeDistributedDense* layer) that is cryptically described as a layer wrapper:

This wrapper allows us to apply a layer to every temporal slice of an input.

How and when are you supposed to use this wrapper with LSTMs?

The confusion is compounded when you search through discussions about the wrapper layer on the Keras GitHub issues and StackOverflow.

For example, in the issue “When and How to use TimeDistributedDense,” fchollet (Keras’ author) explains:

TimeDistributedDense applies a same Dense (fully-connected) operation to every timestep of a 3D tensor.

This makes perfect sense if you already understand what the TimeDistributed layer is for and when to use it, but is no help at all to a beginner.

This tutorial aims to clear up confusion around using the TimeDistributed wrapper with LSTMs with worked examples that you can inspect, run, and play with to help your concrete understanding.

## Sequence Learning Problem

We will use a simple sequence learning problem to demonstrate the TimeDistributed layer.

In this problem, the sequence [0.0, 0.2, 0.4, 0.6, 0.8] will be given as input one item at a time and must be in turn returned as output, one item at a time.

Think of it as learning a simple echo program. We give 0.0 as input, we expect to see 0.0 as output, repeated for each item in the sequence.

We can generate this sequence directly as follows:

```python
from numpy import array
length = 5
seq = array([i/float(length) for i in range(length)])
print(seq)
```

Running this example prints the generated sequence:

```
[ 0.   0.2  0.4  0.6  0.8]
```

The example is configurable and you can play with longer/shorter sequences yourself later if you like. Let me know about your results in the comments.

## One-to-One LSTM for Sequence Prediction

Before we dive in, it is important to show that this sequence learning problem can be learned piecewise.

That is, we can reframe the problem into a dataset of input-output pairs for each item in the sequence. Given 0, the network should output 0, given 0.2, the network must output 0.2, and so on.

This is the simplest formulation of the problem and requires the sequence to be split into input-output pairs and for the sequence to be predicted one step at a time and gathered outside of the network.

The input-output pairs are as follows:

```
X,   y
0.0, 0.0
0.2, 0.2
0.4, 0.4
0.6, 0.6
0.8, 0.8
```

The input for LSTMs must be three dimensional. We can reshape the 1D sequence into a 3D array with 5 samples, 1 time step, and 1 feature. We will define the output as 5 samples with 1 feature.

```python
X = seq.reshape(5, 1, 1)
y = seq.reshape(5, 1)
```

We will define the network model as having 1 input with 1 time step. The first hidden layer will be an LSTM with 5 units. The output layer will be a fully connected layer with 1 output.

The model will be fit with the efficient Adam optimization algorithm and the mean squared error loss function.

The batch size was set to the number of samples in the epoch to avoid having to make the LSTM stateful and manage state resets manually, although this could just as easily be done in order to update weights after each sample is shown to the network.

The complete code listing is provided below:

```python
from numpy import array
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# prepare sequence
length = 5
seq = array([i/float(length) for i in range(length)])
X = seq.reshape(len(seq), 1, 1)
y = seq.reshape(len(seq), 1)
# define LSTM configuration
n_neurons = length
n_batch = length
n_epoch = 1000
# create LSTM
model = Sequential()
model.add(LSTM(n_neurons, input_shape=(1, 1)))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())
# train LSTM
model.fit(X, y, epochs=n_epoch, batch_size=n_batch, verbose=2)
# evaluate
result = model.predict(X, batch_size=n_batch, verbose=0)
for value in result:
    # each row holds a single prediction, so index it to get a scalar
    print('%.1f' % value[0])
```

Running the example first prints the structure of the configured network.

We can see that the LSTM layer has 140 parameters. This is calculated based on the number of inputs (1) and the number of outputs (5 for the 5 units in the hidden layer), as follows:

```
n = 4 * ((inputs + 1) * outputs + outputs^2)
n = 4 * ((1 + 1) * 5 + 5^2)
n = 4 * 35
n = 140
```

We can also see that the fully connected layer only has 6 parameters for the number of inputs (5 for the 5 inputs from the previous layer), number of outputs (1 for the 1 neuron in the layer), and the bias.

```
n = inputs * outputs + outputs
n = 5 * 1 + 1
n = 6
```

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 5)                 140
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 6
=================================================================
Total params: 146.0
Trainable params: 146
Non-trainable params: 0.0
_________________________________________________________________
```
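The two parameter formulas above can be wrapped in small helper functions to check other configurations quickly. This is only a sketch of the arithmetic; the function names `lstm_params` and `dense_params` are made up for illustration and are not part of Keras.

```python
def lstm_params(n_inputs, n_units):
    """Weights in one LSTM layer: four gates, each with input weights,
    a bias, and recurrent weights."""
    return 4 * ((n_inputs + 1) * n_units + n_units ** 2)


def dense_params(n_inputs, n_outputs):
    """Weights in a fully connected layer: one per connection plus one bias per output."""
    return n_inputs * n_outputs + n_outputs


print(lstm_params(1, 5))   # LSTM with 1 input feature and 5 units -> 140
print(dense_params(5, 1))  # Dense with 5 inputs and 1 output -> 6
```

Calling the helpers with other values, for example `dense_params(5, 5)`, gives the counts for wider output layers.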

The network correctly learns the prediction problem.

```
0.0
0.2
0.4
0.6
0.8
```

## Many-to-One LSTM for Sequence Prediction (without TimeDistributed)

In this section, we develop an LSTM to output the sequence all at once, although without the TimeDistributed wrapper layer.

The input for LSTMs must be three dimensional. We can reshape the 1D sequence into a 3D array with 1 sample, 5 time steps, and 1 feature. We will define the output as 1 sample with 5 features.

```python
X = seq.reshape(1, 5, 1)
y = seq.reshape(1, 5)
```

Immediately, you can see that the problem definition must be slightly adjusted to support a network for sequence prediction without a TimeDistributed wrapper. Specifically, the network must output one vector rather than build up an output sequence one step at a time. The difference may sound subtle, but it is important to understanding the role of the TimeDistributed wrapper.

We will define the model as having one input with 5 time steps. The first hidden layer will be an LSTM with 5 units. The output layer is a fully-connected layer with 5 neurons.

```python
# create LSTM
model = Sequential()
model.add(LSTM(5, input_shape=(5, 1)))
model.add(Dense(length))
model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())
```

Next, we fit the model for only 500 epochs and a batch size of 1 for the single sample in the training dataset.

```python
# train LSTM
model.fit(X, y, epochs=500, batch_size=1, verbose=2)
```

Putting this all together, the complete code listing is provided below.

```python
from numpy import array
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# prepare sequence
length = 5
seq = array([i/float(length) for i in range(length)])
X = seq.reshape(1, length, 1)
y = seq.reshape(1, length)
# define LSTM configuration
n_neurons = length
n_batch = 1
n_epoch = 500
# create LSTM
model = Sequential()
model.add(LSTM(n_neurons, input_shape=(length, 1)))
model.add(Dense(length))
model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())
# train LSTM
model.fit(X, y, epochs=n_epoch, batch_size=n_batch, verbose=2)
# evaluate
result = model.predict(X, batch_size=n_batch, verbose=0)
for value in result[0,:]:
    print('%.1f' % value)
```

Running the example first prints a summary of the configured network.

We can see that the LSTM layer has 140 parameters as in the previous section.

The LSTM units have been crippled and will each output a single value, providing a vector of 5 values as inputs to the fully connected layer. The time dimension or sequence information has been thrown away and collapsed into a vector of 5 values.

We can see that the fully connected output layer has 5 inputs and is expected to output 5 values. We can account for the 30 weights to be learned as follows:

```
n = inputs * outputs + outputs
n = 5 * 5 + 5
n = 30
```

The summary of the network is reported as follows:

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 5)                 140
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 30
=================================================================
Total params: 170.0
Trainable params: 170
Non-trainable params: 0.0
_________________________________________________________________
```

The model is fit, printing loss information before finalizing and printing the predicted sequence.

The sequence is reproduced correctly, but as a single piece rather than stepwise through the input data. We could have used a Dense layer as the first hidden layer instead of an LSTM, as this usage of LSTMs does not take much advantage of their full capability for sequence learning and processing.

```
0.0
0.2
0.4
0.6
0.8
```

## Many-to-Many LSTM for Sequence Prediction (with TimeDistributed)

In this section, we will use the TimeDistributed layer to process the output from the LSTM hidden layer.

There are two key points to remember when using the TimeDistributed wrapper layer:

- **The input must be (at least) 3D**. This often means that you will need to configure your last LSTM layer prior to your TimeDistributed wrapped Dense layer to return sequences (e.g. set the “return_sequences” argument to “True”).
- **The output will be 3D**. This means that if your TimeDistributed wrapped Dense layer is your output layer and you are predicting a sequence, you will need to resize your y array into a 3D vector.

We can define the shape of the output as having 1 sample, 5 time steps, and 1 feature, just like the input sequence, as follows:

```python
y = seq.reshape(1, length, 1)
```
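To compare the three framings used in this tutorial side by side, the short sketch below reshapes the same sequence each way and prints the resulting shapes. The dictionary name `framings` is just an illustrative choice; only NumPy is needed.

```python
import numpy as np

length = 5
seq = np.array([i / float(length) for i in range(length)])

# X is always (samples, time steps, features); the shape of y depends on the framing
framings = {
    'one-to-one': (seq.reshape(length, 1, 1), seq.reshape(length, 1)),
    'many-to-one': (seq.reshape(1, length, 1), seq.reshape(1, length)),
    'many-to-many': (seq.reshape(1, length, 1), seq.reshape(1, length, 1)),
}

for name, (X, y) in framings.items():
    print(name, X.shape, y.shape)
```

Only the many-to-many framing keeps a time dimension in y, which is what a TimeDistributed output layer expects.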

We can define the LSTM hidden layer to return sequences rather than single values by setting the “*return_sequences*” argument to True.

```python
model.add(LSTM(n_neurons, input_shape=(length, 1), return_sequences=True))
```

This has the effect of each LSTM unit returning a sequence of 5 outputs, one for each time step in the input data, instead of a single output value as in the previous example.
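To make the effect of return_sequences concrete, here is a minimal NumPy sketch of an LSTM forward pass. It is not the Keras implementation: the function name `lstm_forward`, the random weights, the gate ordering, and the plain sigmoid gate activation are all assumptions made for illustration. It only shows which hidden states are returned in each mode.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_forward(X, W, U, b, return_sequences=False):
    """Illustrative LSTM forward pass.
    X: (samples, time steps, features), W: (features, 4*units),
    U: (units, 4*units), b: (4*units,)."""
    samples, timesteps, _ = X.shape
    units = U.shape[0]
    h = np.zeros((samples, units))  # hidden state
    c = np.zeros((samples, units))  # cell state
    outputs = []
    for t in range(timesteps):
        z = X[:, t, :] @ W + h @ U + b
        i = sigmoid(z[:, :units])                # input gate
        f = sigmoid(z[:, units:2 * units])       # forget gate
        g = np.tanh(z[:, 2 * units:3 * units])   # candidate cell values
        o = sigmoid(z[:, 3 * units:])            # output gate
        c = f * c + i * g
        h = o * np.tanh(c)
        outputs.append(h)
    if return_sequences:
        return np.stack(outputs, axis=1)  # one output per time step
    return h                              # output of the last time step only


rng = np.random.default_rng(1)
units, features, length = 5, 1, 5
W = rng.normal(size=(features, 4 * units))
U = rng.normal(size=(units, 4 * units))
b = np.zeros(4 * units)
X = np.array([i / float(length) for i in range(length)]).reshape(1, length, features)

last_only = lstm_forward(X, W, U, b)
full_seq = lstm_forward(X, W, U, b, return_sequences=True)
print(last_only.shape)             # (1, 5)
print(full_seq.shape)              # (1, 5, 5)
print(W.size + U.size + b.size)    # 140
```

The weights in this sketch add up to the same 140 parameters reported in the model summaries, and the last time step of the full sequence is exactly the vector returned when return_sequences is off.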

We can also use the TimeDistributed wrapper on the output layer to wrap a fully connected Dense layer with a single output.

```python
model.add(TimeDistributed(Dense(1)))
```

The single output value in the output layer is key. It highlights that we intend to output one time step from the sequence for each time step in the input. It just so happens that we will process 5 time steps of the input sequence at a time.

The TimeDistributed wrapper achieves this trick by applying the same Dense layer (same weights) to the LSTM's outputs one time step at a time. In this way, the output layer only needs one connection to each LSTM unit (plus one bias).
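The weight sharing described above can be sketched in plain NumPy. This is an illustration of the idea rather than the Keras code: `lstm_out` is a random stand-in for the LSTM output, and `W` and `b` play the role of the single Dense layer's weights.

```python
import numpy as np

rng = np.random.default_rng(0)
lstm_out = rng.normal(size=(1, 5, 5))  # stand-in for LSTM output: (samples, time steps, units)
W = rng.normal(size=(5, 1))            # one weight per LSTM unit
b = rng.normal(size=(1,))              # one bias

# apply the same Dense weights to each time step, one step at a time
stepwise = np.stack(
    [lstm_out[:, t, :] @ W + b for t in range(lstm_out.shape[1])],
    axis=1,
)

# the same computation expressed as a single operation over the last axis
vectorized = lstm_out @ W + b

print(stepwise.shape)                     # (1, 5, 1)
print(np.allclose(stepwise, vectorized))  # True
print(W.size + b.size)                    # 6
```

As far as I am aware, a plain Dense layer in Keras 2 also operates on the last axis when given 3D input, which may be why some readers see the same output shape and parameter count with and without the wrapper.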

For this reason, the number of training epochs needs to be increased to account for the smaller network capacity. I doubled it from 500 to 1000 to match the first one-to-one example.

Putting this together, the full code listing is provided below.

```python
from numpy import array
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import TimeDistributed
from keras.layers import LSTM
# prepare sequence
length = 5
seq = array([i/float(length) for i in range(length)])
X = seq.reshape(1, length, 1)
y = seq.reshape(1, length, 1)
# define LSTM configuration
n_neurons = length
n_batch = 1
n_epoch = 1000
# create LSTM
model = Sequential()
model.add(LSTM(n_neurons, input_shape=(length, 1), return_sequences=True))
model.add(TimeDistributed(Dense(1)))
model.compile(loss='mean_squared_error', optimizer='adam')
print(model.summary())
# train LSTM
model.fit(X, y, epochs=n_epoch, batch_size=n_batch, verbose=2)
# evaluate
result = model.predict(X, batch_size=n_batch, verbose=0)
for value in result[0,:,0]:
    print('%.1f' % value)
```

Running the example, we can see the structure of the configured network.

We can see that as in the previous example, we have 140 parameters in the LSTM hidden layer.

The fully connected output layer is a very different story. In fact, it matches the one-to-one example exactly: one neuron that has one weight for each LSTM unit in the previous layer, plus one for the bias input.

This does two important things:

- Allows the problem to be framed and learned as it was defined, that is one input to one output, keeping the internal process for each time step separate.
- Simplifies the network by requiring far fewer weights such that only one time step is processed at a time.

The one simpler fully connected layer is applied to each time step in the sequence provided from the previous layer to build up the output sequence.

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 5, 5)              140
_________________________________________________________________
time_distributed_1 (TimeDist (None, 5, 1)              6
=================================================================
Total params: 146.0
Trainable params: 146
Non-trainable params: 0.0
_________________________________________________________________
```

Again, the network learns the sequence.

```
0.0
0.2
0.4
0.6
0.8
```

We can think of the framing of the problem with time steps and a TimeDistributed layer as a more compact way of implementing the one-to-one network in the first example. It may even be more efficient (space or time wise) at a larger scale.

## Further Reading

Below are some resources and discussions on the TimeDistributed layer you may like to dive into.

- TimeDistributed Layer in the Keras API
- TimeDistributed code on GitHub
- The difference between ‘Dense’ and ‘TimeDistributedDense’ of ‘Keras’ on StackExchange
- When and How to use TimeDistributedDense on GitHub

## Summary

In this tutorial, you discovered how to develop LSTM networks for sequence prediction and the role of the TimeDistributed layer.

Specifically, you learned:

- How to design a one-to-one LSTM for sequence prediction.
- How to design a many-to-one LSTM for sequence prediction without the TimeDistributed Layer.
- How to design a many-to-many LSTM for sequence prediction with the TimeDistributed Layer.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer them.

Hi, Jason, nice article on TimeDistributed layer!

Basically, there’re three configurations for X (and thus y):

1. (5,1,1) – 5 batchs, 1 time step, 1 feature/step – result shape (5,1)

2. (1,5,1) – 1 batch, 5 time steps, 1 feature/step – result shape (1,5)

3. (1,1,5) – 1 batch, 1 time step, 5 features/step

in article, you discussed previous 2 configures.

I did experiment of config 3, result same shape (1, 5) as 2 does, ’cause X input only 1 batch (which contains 1 sample, which has 5 features.) this config surely lost time information.

3 differ from 2 in two ways:

1) how we/model frame the problem: sequence should be framed as multi time steps as 2

2) different number of LSTM params: config 2 has 140, while config 3 has 220! (big input vector)

Q:

in section ‘many to one without TimeDistributed’, with config 2, you said “The time dimension or sequence information has been thrown away and collapsed into a vector of 5 values.” — that surprise me a little bit.

– does that mean, for seq-to-seq problem, we should always use TimeDistributed?

– what situation suites config 2 (samples, multi-time-steps, features)?

I guess for sequence-to-vector problem (predict one target one time step), config 2 is fine. But for sequence-to-sequence problem discussed here, config 2 is not the right choice, go TimeDistributed.

Very nice, yes I agree.

Generally, we must model sequences as time steps. BPTT will use the sequence data to estimate the gradient. LSTMs have memory, but we cannot rely on them to remember everything (e.g. sequence length of 1).

We can configure an MLP or LSTM to output a vector. For an LSTM, if we output a vector of n values for one time step, each output is considered by the LSTM as a feature, not a time step. Thus it is a many-to-one architecture. The vector may contain timesteps, but the LSTM is not outputting time steps, it is outputting features.

This is no more or less valid, it may require more weights and may give better or worse performance.

Does that make sense?

Any recommendations when facing a one-to-many problem?

They often need more training than you think and consider using bidirectional inputs and regularization on input connections.

This post is great! Thanks for being about the only person to actually explain simply what the TimeDistributed wrapper is doing.

I tried it out with audio vocal data to attempt generation of new speech. I’d previously got basic results with a plain Dense layer on the output.

With the TimeDistributed the network of lstms learned fast. But the result was just to return a rough version of the seed data inputted during generation. This appears to be modelling the equality function, when what I expected was something resembling the sequence following the seed.

My X input is an array of batches, timesteps, and vocal properties. Just a longer version of your example. My y output for measuring error is effectively the same data, just one timestamp later for each batch (time sequence).

Since your examples are for equality modelling, it’s hard to tell if I’ve missed a concept. Any thoughts on why this seems to generate equality rather than next timesteps, from my basic description?

By the way, my original project without TimeDistributed is found at http://babble-rnn.consected.com in case you’re interested in extra context.

Perhaps you need to fit for longer or require more training data?

I wondered about that. I think my mistake may be simple…

Imagine the sequence I was trying to learn was 1,2,3,4,5,6,7,8 (which I’d normalise in the range 0:1). In the standard Keras LSTM example without TimeDistributed I’d have:

input X[0] = [0,1,2]

output y[0] = [3]

X[1] = [1,2,3]

y[1] = [4]

…

So in the TimeDistributed setup I reported above, I tried:

X[0] = [0,1,2]

y[0] = [1,2,3]

X[1] = [1,2,3]

y[1] = [2,3,4]

…

In other words, I was offsetting the intended output by just a single timestep for each batch to be learned.

But I’m guessing that I should really offset the output to be learned by the full number of timesteps in each batch:

X[0] = [0,1,2]

y[0] = [3,4,5]

X[1] = [1,2,3]

y[1] = [4,5,6]

Is the latter example what I should be doing? Intuitively, this would explain why I was learning something close to equality in my first run. But from multiple readings of your code in the post it is not clear to me that this is the case.

I’m not sure I follow. There are indeed many ways to frame a sequence prediction problem.

The simplest framing is sequence in => sequence out where either in or out could be one or more time steps.

Keep one sequence as one sample if possible.

Hmm, I think I have missed something big here. Please humour and let me try once more.

In the standard LSTM examples on Keras, if I was to learn a long time sequence (for example integers incrementing in the range 1..100000), I would pick a shorter segment of the total sequence to pass to the LSTM (I split my corpus into sub-batches that represent the number of LSTM timesteps), then the output to learn would be just the next item in the sequence. There is no TimeDistributed output, so I get one result to calculate error against.

input set: 1,2,3

desired output: 4

then repeat with other sub-batches in the same way (and Keras scrambles the order), so the next one may be…

input set: 473, 474, 475

desired output: 476

If that makes sense, then allow me to ask simply what the input and output should be for the TimeDistributed setup. Would it be

(option A)

input set: 1, 2, 3

desired output: 2, 3, 4

(option B)

input set: 1, 2, 3

desired output: 4, 5, 6

(option C)

something else entirely.

Am I making more sense now?

Your example shows input set and desired output being the same, which says to me that the net will just learn the equality function. Again, am I missing something?

Thanks again for your help.

Yes, good question.

Option B.

In your first example you have a many-to-one time step predictive model. In option B you have a many-to-many time step predictive model. The TimeDistributed wrapper would allow you to use the same Dense layer to output each time step in the output sequence, in this case one output time step per input time step.
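As a rough illustration of the Option B framing discussed in this thread, the sketch below splits a series into input windows paired with the window that immediately follows. The helper name `make_windows` is hypothetical and uses plain Python lists; for Keras, the results would still need to be reshaped to (samples, time steps, features).

```python
def make_windows(series, n_steps):
    """Pair each input window of n_steps values with the n_steps values that follow it."""
    X, y = [], []
    for start in range(len(series) - 2 * n_steps + 1):
        X.append(series[start:start + n_steps])
        y.append(series[start + n_steps:start + 2 * n_steps])
    return X, y


X, y = make_windows(list(range(1, 9)), n_steps=3)
for x_i, y_i in zip(X, y):
    print(x_i, '->', y_i)
# [1, 2, 3] -> [4, 5, 6]
# [2, 3, 4] -> [5, 6, 7]
# [3, 4, 5] -> [6, 7, 8]
```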

I hope that helps.

Agreed. Jason’s practically the only person to explain all kinds of things, especially with regard to the tricky subject of data dimensions in the various permutations of different types of NN layers. It has gotten to point where I approach any new Keras subject by including his name in the Google search.

Haha, thanks John!

If you ever need help on these types of topics, post or email a question to me. I’ll whip up a post. Sounds like I’ve found a valuable niche 🙂

Hi Jason, thanks for the great article!

Having looked through this article and forums online, is it correct to say that if we were to do many-to-one prediction (an input vector with an output value), it will be straightforward and faster to just use the Dense layer?

In the case where we want to do many-to-many predictions (multiple input vectors/matrices with an output vector/matrix), TimeDistributed layer should be used instead?

Yes, but in the latter case the dense is wrapped in the timedistributed.

That does, thank you! For some reason I couldn’t get that from your post, so thanks for taking the time to explain in more detail.

Thanks, that cleared up return_sequences for me, but I still don’t fully understand what TimeDistributed does.

In the last example (Many-to-Many): If I change TimeDistributed(Dense(1)) to just Dense(1), neither the output shape nor the number of parameters changes and it works just as well. What is the difference between these two options in this case?

Note the number of weights in the network.

Without the TimeDistributed wrapper, the Dense is connected to the output from each time step. With the wrapper, the same Dense is applied to each time step.

It’s a question of how you want to model the problem. Let the Dense combine the time steps and output a vector or process each time step one at a time.

Does that help?

Thanks for the article. I have the same question though… number of weights are same regardless of Dense is wrapped by TimeDistributed or not. So, what is the difference, and where can I see that?

But we have the same number of weights in a many-to-many model as we did in the one-to-one model.

A better model design for increased model complexity/capability with the same resources.

Does that help?

Hey,

Thanks for amazing tutorial.

This shows simple echo program implementation right ?

I want something like –

Input(For time period 2012-2013) – 1,2,5,3,6,4,7,8,9,5

output(For Time Period 2014) – 1,3,4

The output sequence should be generated based on the input sequence, kindly guide me on that.

Sounds great.

What is the issue exactly? You can use code from the blog directly and adapt it for your problem. Where are you having trouble?

How to create a pyramid of LSTMs, i.e. the input to the first node of the 2nd layer LSTM will be the output at t1 and t2 of the first layer LSTM, similarly the 2nd node of the 2nd layer will use t3 and t4 from the first layer, and so on.

You mean a stacked LSTM?

Hmm, that’s an interesting layer configuration, I would go with Tensorflow module directly instead of Keras to create such a model, Keras doesn’t have that functionality I guess.

Hello! Is the a way to have DIFFERENT length of input and output-timesteps?

Like, I have series with 100 timesteps in the past and will learn next 10 in the future?

TimeDistributed requires equal length.

If I output return_sequence=false in the last LSTM and Dense with 10 neurones, would it be the same?

Thanks You!

Sorry, I’m not sure I follow, can you restate your question?

Generally, different numbers of times steps on the input and output are referred to as seq2seq problems and are perhaps best addressed with an encoder-decoder network.

Is the procedure similar when using SimpleRNN?

What do you mean James?

Hello Jason,

Nice article. I was wondering if TimeDistributed layer in Keras is analogous to sequence-to-sequence learning module in Tensorflow. If not, could you point out the distinction between the two?

Thanks.

Sorry, I cannot draw this comparison for you as I am not deeply familiar with the TF code.

Hey Jason. I hope I’m understanding this correctly.

I was trying to model a certain number of days ahead, and found myself frustrated with the fact that I couldn’t just predict one day ahead, then right away use that as part of the sliding input window prior to weights being adjusted – basically I wanted the sliding window to move n days forward using predicted values and only then have gradient descent update weights.

I think this might be the way to do so, but am unsure if I need to wrap every layer in timedistributed or what exactly to do with that.

You can do this, but you will need to create the sliding window yourself and call your model recursively.

Keras will not do this for you with the TimeDistributed layer.
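A minimal sketch of calling a model recursively over a sliding window might look like the following. The function `recursive_forecast` and the stand-in `toy_model` are hypothetical; with a real Keras model, the `predict_one` callable would reshape the window and call `model.predict`. Note that this covers prediction only, not delaying weight updates during training.

```python
def recursive_forecast(predict_one, history, n_steps, n_ahead):
    """Forecast n_ahead values by feeding each prediction back into the input window."""
    window = list(history[-n_steps:])
    forecasts = []
    for _ in range(n_ahead):
        next_value = predict_one(window)
        forecasts.append(next_value)
        # slide the window forward: drop the oldest value, append the prediction
        window = window[1:] + [next_value]
    return forecasts


def toy_model(window):
    """Stand-in for a trained model (hypothetical): predicts the last value plus one."""
    return window[-1] + 1


print(recursive_forecast(toy_model, [1, 2, 3, 4, 5], n_steps=3, n_ahead=4))
# [6, 7, 8, 9]
```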

Hi Jason,

Thanks for this post,

I have an Input sequence and output sequence shape as follows:

X_shape: (1, 82600, 1)

Y_shape: (1, 82600, 1)

When I try to use your code for this input and output I get following error:

```
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
in ()
     60 # create LSTM
     61 model = Sequential()
---> 62 model.add(LSTM(n_neurons, input_shape=(length, 1), return_sequences=True))
     63 model.add(TimeDistributed(Dense(1)))
```

How can I go around this?

Since my length was 82600, according to the code, nb_neurons = 82600

I just reduced the number of neurons to 8260 and the compilation was successful

Since this model is stateless by default (stateful=True is not specified), do you think reducing the number of neurons was the right choice, or could you suggest some other method?

Note: In my sequence of length 82600, every 10 numbers are dependent on previous 10 numbers.

Yes, that was far too many neurons for the first hidden layer. Nice work.

Perhaps you have too much data to fit into memory.

Perhaps work with a smaller sample?

Perhaps try running on a larger computer like AWS?

Hi Jason, Thanks for responses,

In extension to the question that Phil Ayres asked:

My training data shape is (4096,8) that is 4096 rows and each row has 8 features (8 numeric values).

and the target shape is the same.

Requirement is one entire row is responsible to predict the next row.

example:

input

[1,2,3,4,5,6,7,8]

expected

[100,200,300,400,500,600,700,800]

How can I use time distributed for this kind of data.

Can you please provide an example?

Do I have to call model iteratively, if yes how?

Your problem is seq2seq:

https://machinelearningmastery.com/models-sequence-prediction-recurrent-neural-networks/

You have options:

1. The model can output one output time step for each input time step (e.g. via timedistributed).

2. The model can read the entire input sequence, encode it, then output the entire output sequence (e.g. no timedistributed).

I would recommend trying both approaches and see what works best for your data.

Thank you sir

For the clarification i have a question that i have a bit of confusion on parameters you have explained above.

For example:

3D sequence with 5 samples, 1 time step, and 1 feature. We will define the output as 5 samples with 1 feature.

X = seq.reshape(5, 1, 1)

y = seq.reshape(5, 1)

What are feature and samples in this example?

Let me have one example assuming we are working on images.

Then i have 2 classes: class-1 is running man (a sequence with 100 images) and class-2 is walking man (a sequence with 100 images).

Then:

– sample = 2 means 1 image from class-1 and 1 image from class-2?

or just 2 images from class-1 or class-2?

If batch is set of samples then why we define sample = 5 and again batch = 5 in this example?

If sample = 1 then we can define batch = 1. What are difference?

– time step = 10 means we are taking images (t1-t10) from 100 images of a class for prediction? and next time step we are talking images (t2-t11) and (t3-t12) etc?

– is ‘feature’ image dimension or feature map as an output of Conv layer?

it sounds like a dimension in this example, however cnn says it is output of conv layer while we are defining ‘feature’ for input image.

if my input image has width = 100 and height = 50 then feature = 5000?

I explain the LSTM input format in more detail here:

https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/

Hi Jason, first of all thanks for your wonderful tutorial. However, I found myself a bizarre issue when testing out on your third example, which is Many-to-Many LSTM for Sequence Prediction (with TimeDistributed). When I remove the TimeDistributed wrapping for the dense layer while keeping the return sequence statement true for the LSTM layer, the model’s summary doesn’t seem to change (same param #). I suppose removing the TimeDistributed wrapping for the dense layer implies a huge fully connected layer connecting to all the outputs of all time stamps, whereas wrapping by the TimeDistributed implies a relatively small fully connected layer connecting to the outputs of one time stamp at a time. Any explanations to this problem? Thanks in advance 🙂

Yes – I have noticed in some of my own experiments that it seems that Dense can now support 3D input without the wrapper.

Support is one thing, behavior is another. Are you saying that a “TimeDistributed(Dense(n)) “layer is no different than a plain “Dense(n)” layer?

It is the same layer, but the wrapper allows the weights required for one time step output from the LSTM to be reused for each time step.
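
A small NumPy sketch of why the parameter count need not change. It assumes, as other replies in this thread suggest, that a plain Dense given 3D input acts on the last axis only. The array names and random weights are illustrative and are not taken from Keras.

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(1, 5, 5))  # LSTM output: [samples, time steps, units]
W = rng.normal(size=(5, 1))     # one set of dense weights (5 inputs -> 1 output)
b = rng.normal(size=(1,))       # one bias

# "time distributed": apply the same W and b to each time step separately
per_step = np.stack([H[:, t, :] @ W + b for t in range(H.shape[1])], axis=1)

# a dense map over the last axis of the whole 3D tensor
last_axis = H @ W + b

print(per_step.shape, np.allclose(per_step, last_axis))
print(W.size + b.size)  # number of dense weights in both cases
```

In this sketch both routes share one small weight matrix across all time steps, so neither becomes a large layer connected to every time step at once.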

Thanks for this amazing explanation, Jason! I have already put it to the test by creating a “denoiser”, where an image with noise is given as input and a clean version of that image is returned. This is a problem typically solved with the use of autoencoders, which are a complex matter if you ask me. However, I was able to pull this through using this tutorial and that got me thinking: would it be possible to train many-to-many architectures without autoencoders, just by padding input and output sequences to a fixed length? And if yes, would this model work with one-hot encoded vectors? I am not sure how mean squared error calculates results, but would it work with padded, one-hot encoded timestep sequences?

Yes, I don’t see why not.

Hi Jason,

Thank you for this great post.

I’ve read this post three times and the forum discussions, but I still can’t understand how to apply the techniques to the topic I have been working on recently.

Here is the scenario

Input:

[a1,a2,a3,a4,a5,…,aN]

[b1,b2,b3,b4,b5,…,bN]

[c1,c2,c3,c4,c5,….,cN]

Output:

[yN+1,yN+2,yN+3,yN+4,yN+5,….,yM]

( Imagine we have history stock related data (a ,b,c, 3 features of the input). They are all time-series.

And the task is to make predictions ,say, 10 steps ahead ( y of the output ) )

Is this a many to many sequence prediction ?

If it is, then based on the discussions above, I think I would have to use TimeDistributedDense.

However, according to the heavy discussion in the github link below,

https://github.com/fchollet/keras/issues/1029

“For each sample, the input is a sequence (a1,a2,a3,a4…aN) and the output is a sequence (b1,b2,b3,b4…bN) with the same length. bi could be viewed as the label of ai.

Push a1 into a recurrent nn to get output b1. Than push a2 and the hidden output of a1 to get b2…”

It seems that TimeDistributedDense is for sequence labeling, so it is not suitable for my case. Am I right ?

Perhaps this post will help:

https://machinelearningmastery.com/time-series-forecasting-long-short-term-memory-network-python/

Hi Jason

I have my information in the following way:

X Y

========== =============

t1,t2, t3, t4,t5, t6, t7, t8, t9,t10 t11,t12,t13,t14,t15

10,20,20,30,30,40,50,50,50,60 60, 60,70 ,70 ,70

I have 10 most recent time steps to predict the next 5 steps.

In this case from many to many, I could use that method without TimeDistributed for the dense layer ?. Because I understood that in that case the number of dense layer neurons would give me the value for each time step, in this case if I put 5 neurons would have 5 values representing my 5 time steps.

Or maybe I should use the TimeDistributed (DenseLayer) to produce the prediction of the next 5 steps? I’m confused.

I read your post about encoder-decoder. I do not want to use that configuration, I could convert my problem to many to one, and use the only predicted value t11 to predict the next t12 and so on. Does that idea seem right to you?

PS: I also saw an example using repeatvector to be able to do my problem, but I do not know if it is correct.

Perhaps this post will help:

https://machinelearningmastery.com/models-sequence-prediction-recurrent-neural-networks/

Yes, there are multiple ways to formulate the encoder-decoder model, one way with the repeat vector is easy to code and results in good performance.

Sorry, I’m not sure I follow your data.

If you have 5 outputs, you can have a model that outputs a vector of all 5 values or output one at a time using a distributed dense. Why not try both and see which framing of the problem is easier to learn or results in better skill?
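
As a rough sketch of the two framings, using the 15 values from the question above: the input is shaped the same way in both cases and only the target layout differs. The variable names are made up for illustration, and the per-step layout assumes a model that actually emits 5 time steps (for example via the repeat vector approach mentioned above).

```python
import numpy as np

series = np.array([10, 20, 20, 30, 30, 40, 50, 50, 50, 60,  # t1..t10
                   60, 60, 70, 70, 70], dtype=float)         # t11..t15
n_in, n_out = 10, 5

X = series[:n_in].reshape(1, n_in, 1)            # [samples, time steps, features]
y_vector = series[n_in:].reshape(1, n_out)       # target for a 5-value vector output
y_sequence = series[n_in:].reshape(1, n_out, 1)  # target for one value per output step

print(X.shape, y_vector.shape, y_sequence.shape)
```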

Running examples 1 and 2 (just copying your code) returned loss values of `nan` during training, and correspondingly I got `nan` values for the predictions. After some playing around, I found that simply changing the optimizer to `sgd` fixed the issue. I had first tried different learning rates within an `Adam` class for optimisation, but it always returned `nan`s. I can’t see from the Keras implementation why this might be the case.

I’m glad to hear that.

Hi Jason,

Nice post! Have a question regarding your statement: “The LSTM units have been crippled and will each output a single value, providing a vector of 5 values as inputs to the fully connected layer. The time dimension or sequence information has been thrown away and collapsed into a vector of 5 values.”

What I am understanding from your statement above is that configuration 2 does not give any opportunity for the unrolling. It is almost like an MLP network. Please correct me if I am wrong. Thank you.

Indeed, there is no unrolling when time steps are set to 1.

If my understanding is correct, can you please explain why you have used “LSTM units” and not “LSTM unit”. If all I am doing is taking a 5 length sequence as an input and outputting a 5 length sequence, then why do I need multiple LSTM units? Please explain.

The length of input sequences is unrelated to the number of units in the LSTM layer.

Hello,

I have a silly doubt: how can the one-to-one model be a sequence prediction problem when there is no sequence in the input and no time steps?

Good question.

If we don’t reset state, there can still be memory from prior I/O, just no BPTT going on.

Small typo “afully-connectedd”

Thanks, fixed.

Hi,

I purchased the entire bundle, great stuff! I have a question regarding LSTMs though. I have a time-series multi-label problem that needs to be classified. The problem is somewhat the same as in the paper “LEARNING TO DIAGNOSE WITH LSTM RECURRENT NEURAL NETWORKS” at https://arxiv.org/pdf/1511.03677.pdf.

At the end of each sequence (say 3 diagnostic events per sequence) they calculate losses differently: they calculate log-loss at each time step vs. the multi-label target, combine it with the final output vs. the multi-label target, and then average them over the entire sequence.

How do I implement this in Keras?

Thank you in advance.

Steve

Sorry, I am not familiar with that paper, perhaps contact the authors and ask what type of sequence prediction problem it is.

Hi Jason,

When I tried

# train LSTM

model.fit(X, y, epochs=n_epoch, batch_size=n_batch, verbose=2)

The system create the error:

“TypeError Traceback (most recent call last)

C:\Users\Nguyen Viet Linh\Anaconda3\lib\site-packages\theano\gof\cc.py in compile_args(self)

965 try:

–> 966 ret += x.c_compile_args(c_compiler)

967 except TypeError:

TypeError: c_compile_args() takes 1 positional argument but 2 were given”

Looks like your backend might not be installed correctly.

This tutorial may help:

https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/

Hi Jason

Thanks for your tutorial. That’s amazing!

I have got 2 questions to ask:

* In the last model that uses TimeDistributed layer, the same weights of the dense layer are applied to all the 5 outputs from the LSTM hidden layer. So during training, for the 5 outputs of the dense layer, is the backprop done 5 times from the last output to the first one?

* You said that the number of training epochs needs to be increased to account for the smaller network capacity. Should it be decreased? Because smaller capacity networks need smaller times of training, while bigger capacity networks need bigger times of training?

Thanks a lot!

Yes, the Dense is trained on each output.

Yes. Small is less training, larger is more training.

I have a question for my text generation project!

Can I adopt

y = np_utils.to_categorical(dataY)

to

TimeDistributed model?

The TimeDistributed error says that it needs 3-dimensional input and also output.

But np_utils.to_categorical returns 2D output (total_words, n_vocabulary), so I can’t use the TimeDistributed model.

All your posts are really helpful for my LSTM projects! I really appreciate your sharing.

No, TimeDistributed is used for sequence output.

Hi Jason

So what if we don’t use TimeDistributed wrapper when a sequence is returned?

model.add(LSTM(n_neurons, input_shape=(length, 1), return_sequences=True))

model.add((Dense(1)))

I just want to know what the connections would be between the two layers.

Thanks!

I believe the Dense will now handle 3D input.

Hi Jason,

Thank you for your great post. I have 2 questions and hope you may address:

1. Are there any specific reasons behind constraining values in an input sequence to be between 0 and 1? I simply replaced the code to generate seq by:

seq = array([float(i) for i in range(length)])

and all models perform poorly, cannot predict the output y correctly for the same setting of n_epochs, or even 2*n_epochs.

2. Why do we need the Dense layer for the Many-to-One model?

I personally thought that, as return_sequences=True is not set, the output of LSTM layer is already in a 5D vector. Thus, it is unclear to me the specific role of the added Dense layer which receives an input in 5D and also outputs a 5D vector? Removing it can save us 30 parameters to be trained. (Please correct me if I miss something important here)

Thank you very much.

Normalizing inputs is a standard practice that improves modeling.

The LSTM returns the final output from the end of the sequence by default. We can return the sequence of outputs and have a dense layer interpret them before outputting a final prediction.
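
For the first question, a minimal sketch of rescaling the replacement sequence from the comment into the range 0 to 1 before training. Min-max scaling is only one way to normalize, and the variable names `raw` and `scaled` are made up here.

```python
import numpy as np

length = 5
# the unscaled sequence from the question: 0.0, 1.0, 2.0, 3.0, 4.0
raw = np.array([float(i) for i in range(length)])

# min-max scaling: subtract the minimum, divide by the range
scaled = (raw - raw.min()) / (raw.max() - raw.min())
print(scaled)
```

Predictions made on scaled data would need the inverse transform (multiply by the range, add the minimum) to get back to the original units.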

hi jason

i have for example 10 sequences of data, each with a length of 10 chars.

like a matrix of 10×10

so at first i want to read row by row and in the next layer i want to read column by column.

is it correct to use time distributed(lstm) at the first layer and a simple lstm in the second?

TimeDistributed is really for seq2seq type problems.

in fact i want to know how network deal with data? and how we can determine the way of feeding data to the lstm network.

Ultimately, data is provided one row at a time.

n = 4 * ((inputs + 1) * outputs + outputs^2), where does the 4 come from?

The 4 comes from the four sets of weights in an LSTM unit: one each for the input gate, the forget gate, the output gate, and the candidate cell state.
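
The formula in the question can be checked by hand. The sketch below assumes the standard LSTM layout of three gates plus one candidate cell update, each with its own input weights, bias, and recurrent weights. `lstm_params` is a made-up helper name, not a Keras function.

```python
def lstm_params(inputs, outputs):
    """Count LSTM weights: 4 blocks of (input weights + bias + recurrent weights)."""
    per_block = (inputs + 1) * outputs + outputs ** 2
    return 4 * per_block

# 1 input feature, 5 LSTM units
print(lstm_params(1, 5))  # 140
```

Here each of the four blocks has (1 + 1) * 5 = 10 input and bias weights plus 5 * 5 = 25 recurrent weights, giving 35 per block.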

Can the Timedistributed layer be applied to predict individual class values (like vocabulary items)? This seems to work well with regression problems, where the mean_squared_error is used, but what if we wanted to output a sequence of vectorized words for example. I’ve tried training a translation model using a padded 3d sequence as X and a padded 3d sequence as y using the sparse_categorical_crossentropy loss function, I had the last Dense layer output as many outputs as the maximum vocabulary. It seemed to output rubbish with less than 20 examples and when I went to 250 examples, the model would only output zeros. What am I doing wrong? Also, is it possible to output variable length vectors? Let’s say I have a sequence X of shape(1, 300, 1). Can I train it to a y vector of shape (1, 50, 1)?

Yes. I have a few examples on the blog for NLP with encoder-decoder LSTMs.

Hi Jason,

Thank you for your post!

I have three questions:

1. Regarding to Many-to-One, the output dimension from the last layer is (1, 5), while the input shape to LSTM is (5, 1). To me, it feels like, the input is a one feature with 5 timesteps data while the prediction output has 5 features with 1 time step… I am confused.

2. What is the difference in the performance of forecasting between Many-to-One and Many-to-Many? According to my understanding, Many-to-Many uses all 5 hidden states of the 5 LSTM cells to make prediction, while Many-to-One only uses the final state.

This post will help you better understand the input shape for LSTMs:

https://machinelearningmastery.com/reshape-input-data-long-short-term-memory-networks-keras/

The difference between the models is that of outputting a scalar vs. a vector.

Hi Jason,

Your tutorials are game-changing.

To use a TimeDistributed output layer for classifying n different classes, would the Y shape be (samples, timesteps, n), with each label as an n-dimensional one-hot array? Or would the binarized label still be considered one feature?

To match that, should the output layer also be of size n?

Thanks for all of your work!

No, for multi-class classification a TimeDistributed layer would not be required.

You would use n neurons in the output layer and a softmax activation function, where n is the number of class values (factors).

What I mean is, still using the TimeDistributed wrapper to get one output per time_step, but instead of binary classification, having n categories. My input and output layers are, respectively,

LSTM(32, input_shape=(time_step, num_features), return_sequences=True)

and

TimeDistributed(Dense(num_labels, activation="softmax"))

And it seems to be working well with Y.shape == (samples, timesteps, num_labels)

I see, then each time step would be a one hot encoded vector.
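
A short sketch of what such a target array could look like, assuming integer class labels per time step. The label values and sizes below are invented for illustration; the resulting shape is the (samples, timesteps, num_labels) layout described above.

```python
import numpy as np

num_labels = 3
# hypothetical integer labels: 2 samples, 4 time steps each
labels = np.array([[0, 2, 1, 1],
                   [2, 2, 0, 1]])

# index rows of the identity matrix to one hot encode every time step
Y = np.eye(num_labels)[labels]
print(Y.shape)  # (2, 4, 3)
```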

Thanks again, Jason.

Is it possible to apply sample_weight or class_weight to a model with this input?

Keras requires a 2D sample_weight array:

“In order to use timestep-wise sample weighting, you should pass a 2D sample_weight array.”

Reshaping it to be 2D obviously does not match Y.shape of (samples,timesteps,num_labels)

I did not forget to set sample_weight_mode=”temporal”.

With the class_weight approach:

“ValueError: `class_weight` not supported for 3+ dimensional targets.”

Is it possible to use either of these functions for the timedistributed layer? Do you have to create a custom loss function?

I don’t know sorry, I have not tried.

Thanks Jason, I’ll share what I find!

It works fine.

Thank you very much for a great post, Jason. I have a problem where the input sequence is longer than the output sequence: input data = (samples, 26 time steps, 20 features) and target data = (samples, 7 time steps, 1 target). Keras throws an error saying it expects the target data to have a shape of (26, 1) if I feed it the shorter (7, 1) shape in the last TimeDistributed(Dense(1)) layer.

One simple solution would of course be to just grab the first 7 of the 26 steps of the fit. However, I would think that it would be an easier and faster job to fit only 7 steps.

Is it possible to tell Keras that the input sequence is of a different length than the output sequence?

When the input and output sequences have a different lengths, you can use an encoder-decoder architecture:

https://machinelearningmastery.com/encoder-decoder-long-short-term-memory-networks/

Many thanks. I will give that a go.

Thanks for this article, this is very helpful

I am trying to implement a simple sequence classifier. I create a set containing 10 random integers between 0-100 and if that set consists of a number that is a multiple of 10, then y is 1 and if not, then y is 0.

For ex: If X is [10,11,34,56,78,99, 21, 24, 25, 77]. Then y is 1

For ex: if X is [1,11,34,56,78,99, 21, 24, 25, 77], then y is 0

I reused, your code for this purpose, but I am unable to get the correct results. Below is the code, Can you please tell what is wrong with this code? Appreciate your help on this. Thank you!

from random import random

from numpy import array

from numpy import cumsum

from keras.models import Sequential

from keras.layers import LSTM

from keras.layers import Dense

from keras.layers import TimeDistributed

import numpy as np

import random as rd

# create a sequence classification instance

def get_sequence(n_timesteps):

    # create a sequence of 10 random numbers in the range [0-100]

    X = array([rd.randrange(0, 101, 1) for _ in range(n_timesteps)])

    # If the sequence has a number that is a multiple of 10, then Y is 1

    # If not, Y is 0 by default

    y = 0

    for i in range(n_timesteps):

        if X[i] % 10 == 0:

            y = 1

            break

    # Convert y to a numpy array

    y = np.asarray(y)

    # reshape input and output data to be suitable for LSTMs

    X = X.reshape(1, n_timesteps, 1)

    y = y.reshape(1)

    return X, y

# define problem properties

n_timesteps = 10

# define LSTM

model = Sequential()

model.add(LSTM(200, input_shape=(n_timesteps, 1)))

model.add(Dense(1, activation='sigmoid'))

model.compile(loss='mean_squared_error', optimizer='adam', metrics=['acc'])

# train LSTM

for epoch in range(100):

    # generate new random sequence

    X, y = get_sequence(n_timesteps)

    # fit model for one epoch on this sequence

    model.fit(X, y, epochs=epoch, batch_size=1, verbose=2)

# evaluate LSTM

# Variable to hold correct predictions

cor_pred = 0

for epoch in range(100):

    X, y = get_sequence(n_timesteps)

    yhat = model.predict_classes(X, verbose=0)

    print('Expected:', y, 'Predicted', yhat)

    # Check if the predicted vs actual match, if so increment the count of correct predictions

    if y == yhat:

        cor_pred = cor_pred + 1

print("Correct Prediction:", cor_pred)

Perhaps you need to tune the model for the problem?

Perhaps explore other framings of the problem?

I offer a suite of more ideas here:

http://machinelearningmastery.com/improve-deep-learning-performance/

Hello, this article helped me a lot.

Now, I only have one question: how to deal with sequences that need padding?

For example:

I want to train a model to detect wrong word usage in an article.

I generate training data as below:

corrects = [

    ['How', 'are', 'you', '!'], # [1, 1, 1, 1]

    ['Fine', ',', 'thank', 'you', '.'], # [1, 1, 1, 1, 1]

    ['Do', 'you', 'have', 'meal', '?'], # [1, 1, 1, 1, 1]

    ...

]

wrongs = [

    ['How', 'were', 'you', '!'], # [1, 0, 1, 1]

    ['Find', ',', 'thank', 'you', '.'], # [0, 1, 1, 1, 1]

    ['Did', 'you', 'have', 'meal', '.'], # [0, 1, 1, 1, 0]

    ...

]

I need handle the wrong word in first, and also want last word of a sentence can affect by first few words of latter sentence.

Some articles told me not to backpropagate the errors of those padded words, by masking their loss.

But I can’t understand what they said. If I pad X, I need to change y to the same size (so the padding label is a new value, maybe `-1`?).

Could you give me some tips?

You can use zero-padding to make sequences the same length and a Masking layer to ignore the zeros.

Here are examples of padding:

https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/

Here’s an example of masking:

https://machinelearningmastery.com/handle-missing-timesteps-sequence-prediction-problems-python/
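
A minimal sketch of padding the per-word labels from the question and building a matching mask, using plain NumPy. `pad_with_mask` is a made-up helper, not a Keras function. The assumption is that the 2D mask could then serve as the timestep-wise sample weights discussed elsewhere in this thread, so padded positions would not contribute to the loss.

```python
import numpy as np

def pad_with_mask(sequences, pad_value=0):
    """Right-pad sequences to equal length and mark the real positions with 1."""
    max_len = max(len(s) for s in sequences)
    padded = np.full((len(sequences), max_len), pad_value, dtype=int)
    mask = np.zeros((len(sequences), max_len), dtype=float)
    for i, seq in enumerate(sequences):
        padded[i, :len(seq)] = seq
        mask[i, :len(seq)] = 1.0
    return padded, mask

# per-word labels from the 'wrongs' examples (1 = correct word, 0 = wrong word)
labels = [[1, 0, 1, 1], [0, 1, 1, 1, 1], [0, 1, 1, 1, 0]]
y, mask = pad_with_mask(labels)
print(y)
print(mask)
```

With a mask like this, the padded label can stay 0 without being confused with a real "wrong word" label, because only positions where the mask is 1 would count.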

Thank you, I get it.

May I ask one more question?

I’d like to use `Bidirectional`, from https://machinelearningmastery.com/develop-bidirectional-lstm-sequence-classification-python-keras/

You use `Bidirectional` and `TimeDistributed` together; is that suitable for my wrong word detection task?

You can use them together.

I don’t know about the use on that application, try it and see.

Thank you for the great article Jason. I believe i am dealing with a problem of type that can be solved using “Many-to-Many LSTM for Sequence Prediction (with TimeDistributed)” technique explained above.

In my problem i have more samples and more features compared to the example above but i am trying sequence to sequence with output at every timestep.

One sample in my problem, input(1,5,2)–> output(1,5,1) looks as follows.

[[1,2,3,4,5]

[10,20,30,40,50]] –> [10,40,90,160,250]

Am i correct in trying out the TimeDistributed technique for this problem?

If so, how are batch-size, epochs, number of neurons going to impact the model intuitively?

Is it correct to say that batch-size and epochs together play role in number of iterations which leads to number of times the weights are updated. And number of neurons help the model to remember the sequence?

Do i have to worry about “stateful” in this kind of problem?

You could try with and without it and compare the results.

Model weights are updated after each batch. You can make the model stateful and control when the internal state is reset instead, if you need that control.

More neurons means more capacity to learn, but harder and slower to train.
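
Since weights are updated after each batch, the number of updates follows from the sample count, the batch size, and the number of epochs. A back-of-the-envelope sketch, where `weight_updates` is a made-up helper and the numbers are hypothetical:

```python
import math

def weight_updates(n_samples, batch_size, epochs):
    """Batches per epoch (the last one may be partial) times the number of epochs."""
    return math.ceil(n_samples / batch_size) * epochs

print(weight_updates(5, 1, 1000))  # 5000: one update per sample, per epoch
print(weight_updates(5, 5, 1000))  # 1000: one update per epoch
```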

Thank you.

Very interesting article!

When you write,

“The LSTM units have been crippled and will each output a single value, providing a vector of 5 values as inputs to the fully connected layer.”,

is this vector the output of the last LSTM unit?

No, of the network.

I’m struggling to figure out what data property determines what the TimeDistributed(Dense(1)) expects. It never seems to line up for me.

Thanks

What do you mean exactly Jared?