Keras uses fast symbolic mathematical libraries as a backend, such as TensorFlow and Theano.

A downside of using these libraries is that the shape and size of your data must be defined once up front and held constant regardless of whether you are training your network or making predictions.

On sequence prediction problems, it may be desirable to use a large batch size when training the network and a batch size of 1 when making predictions in order to predict the next step in the sequence.

In this tutorial, you will discover how you can address this problem and even use different batch sizes during training and predicting.

After completing this tutorial, you will know:

- How to design a simple sequence prediction problem and develop an LSTM to learn it.
- How to vary an LSTM configuration for online and batch-based learning and predicting.
- How to vary the batch size used for training from that used for predicting.

Let’s get started.

## Tutorial Overview

This tutorial is divided into 6 parts, as follows:

- On Batch Size
- Sequence Prediction Problem Description
- LSTM Model and Varied Batch Size
- Solution 1: Online Learning (Batch Size = 1)
- Solution 2: Batch Forecasting (Batch Size = N)
- Solution 3: Copy Weights

### Tutorial Environment

A Python 2 or 3 environment is assumed to be installed and working.

This includes SciPy with NumPy and Pandas. Keras version 2.0 or higher must be installed with either the TensorFlow or Theano backend.


## On Batch Size

A benefit of using Keras is that it is built on top of symbolic mathematical libraries such as TensorFlow and Theano for fast and efficient computation. This is needed with large neural networks.

A downside of using these efficient libraries is that you must define the shape of your data up front and keep it fixed. In particular, this applies to the batch size.

The batch size limits the number of samples to be shown to the network before a weight update can be performed. This same limitation is then imposed when making predictions with the fit model.

Specifically, the batch size used when fitting your model controls how many predictions you must make at a time.
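To make the link between batch size and weight updates concrete, here is a minimal sketch that counts how many updates happen in one pass over the training data. The helper name `updates_per_epoch` and the sample count of 9 are illustrative assumptions, not part of Keras.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of weight updates in one pass over the training data."""
    # The ceiling covers the general case where the last batch is smaller.
    return int(math.ceil(n_samples / float(batch_size)))

n_samples = 9  # illustrative training set size
for batch_size in (1, 3, 9):
    print(batch_size, updates_per_epoch(n_samples, batch_size))
```

A batch size of 1 gives one update per sample, while a batch size equal to the dataset size gives a single update per epoch.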

This is often not a problem when you want to make the same number of predictions at a time as the batch size used during training.

This does become a problem when you wish to make fewer predictions than the batch size. For example, you may get the best results with a large batch size, but are required to make predictions for one observation at a time on something like a time series or sequence problem.

This is why it may be desirable to have a different batch size when fitting the network to training data than when making predictions on test data or new input data.

In this tutorial, we will explore different ways to solve this problem.

## Sequence Prediction Problem Description

We will use a simple sequence prediction problem as the context to demonstrate solutions to varying the batch size between training and prediction.

A sequence prediction problem makes a good case for a varied batch size as you may want to have a batch size equal to the training dataset size (batch learning) during training and a batch size of 1 when making predictions for one-step outputs.

The sequence prediction problem involves learning to predict the next step in the following 10-step sequence:

```
[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
```

We can create this sequence in Python as follows:

```python
length = 10
sequence = [i/float(length) for i in range(length)]
print(sequence)
```

Running the example prints our sequence:

```
[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
```

We must convert the sequence to a supervised learning problem. That means when 0.0 is shown as an input pattern, the network must learn to predict the next step as 0.1.

We can do this in Python using the Pandas *shift()* function as follows:

```python
from pandas import concat
from pandas import DataFrame
# create sequence
length = 10
sequence = [i/float(length) for i in range(length)]
# create X/y pairs
df = DataFrame(sequence)
df = concat([df, df.shift(1)], axis=1)
df.dropna(inplace=True)
print(df)
```

Running the example shows all input and output pairs.

```
1  0.1  0.0
2  0.2  0.1
3  0.3  0.2
4  0.4  0.3
5  0.5  0.4
6  0.6  0.5
7  0.7  0.6
8  0.8  0.7
9  0.9  0.8
```
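The supervised framing described earlier, where 0.0 is the input and 0.1 is the value to predict, can also be sketched without Pandas. The variable name `pairs` is an illustrative choice, and each tuple is ordered (input, next value), which is the reverse of the column order in the printed DataFrame.

```python
length = 10
sequence = [i / float(length) for i in range(length)]

# Pair each value with its successor: (input, next step).
pairs = list(zip(sequence[:-1], sequence[1:]))
for x, y in pairs:
    print('input=%.1f -> output=%.1f' % (x, y))
```

The 10-step sequence yields 9 pairs, because the last value has no successor.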

We will be using a recurrent neural network called a long short-term memory network to learn the sequence. As such, we must transform the input patterns from a 2D array (1 column with 9 rows) to a 3D array comprised of [*rows, timesteps, columns*] where timesteps is 1 because we only have one timestep per observation on each row.

We can do this using the NumPy function *reshape()* as follows:

```python
from pandas import concat
from pandas import DataFrame
# create sequence
length = 10
sequence = [i/float(length) for i in range(length)]
# create X/y pairs
df = DataFrame(sequence)
df = concat([df, df.shift(1)], axis=1)
df.dropna(inplace=True)
# convert to LSTM friendly format
values = df.values
X, y = values[:, 0], values[:, 1]
X = X.reshape(len(X), 1, 1)
print(X.shape, y.shape)
```

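Running the example creates *X* and *y* arrays ready for use with an LSTM and prints their shape. Note that, as written, *X* takes the first column (the values 0.1 to 0.9) and *y* takes the shifted column (the values 0.0 to 0.8), so each target is the value one step earlier in the sequence. This is why the expected values printed in later sections run from 0.0 to 0.8.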

```
(9, 1, 1) (9,)
```
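To see what the [*rows, timesteps, columns*] layout looks like, here is a NumPy-free sketch that builds the same structure from nested lists. The helpers `to_3d` and `shape_of` are hypothetical names used only for illustration; in practice *reshape()* does this work.

```python
def to_3d(values, timesteps=1):
    """Arrange a flat list as [samples][timesteps][features] with 1 feature."""
    if len(values) % timesteps != 0:
        raise ValueError('length must be a multiple of timesteps')
    samples = []
    for start in range(0, len(values), timesteps):
        window = values[start:start + timesteps]
        samples.append([[v] for v in window])
    return samples

def shape_of(nested):
    """Return the shape of a regular, non-empty nested list as a tuple."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)

X_flat = [i / 10.0 for i in range(1, 10)]
X_3d = to_3d(X_flat)
print(shape_of(X_flat), shape_of(X_3d))
```

Each of the 9 samples becomes a list holding one timestep, which in turn holds one feature.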

## LSTM Model and Varied Batch Size

In this section, we will design an LSTM network for the problem.

The training batch size will cover the entire training dataset (batch learning) and predictions will be made one at a time (one-step prediction). We will show that although the model learns the problem, making one-step predictions results in an error.

We will use an LSTM network fit for 1000 epochs.

The weights will be updated at the end of each training epoch (batch learning) meaning that the batch size will be equal to the number of training observations (9).

For these experiments, we will require fine-grained control over when the internal state of the LSTM is updated. Normally LSTM state is cleared at the end of each batch in Keras, but we can control it by making the LSTM stateful and calling *model.reset_states()* to manage this state manually. This will be needed in later sections.
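The idea of state that persists between calls until it is reset manually can be sketched with a toy recurrent cell. The class `ToyStatefulCell` is hypothetical and is not how Keras implements an LSTM; it only illustrates carrying state across calls versus starting fresh.

```python
class ToyStatefulCell(object):
    """Hypothetical one-number recurrent cell, for illustration only."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.state = 0.0

    def step(self, x):
        # The new state mixes the carried-over state with the new input.
        self.state = self.decay * self.state + x
        return self.state

    def reset_states(self):
        # Forget everything seen so far.
        self.state = 0.0

cell = ToyStatefulCell()
first = [cell.step(x) for x in (1.0, 1.0)]   # state carries across calls
cell.reset_states()                          # start a fresh sequence
second = [cell.step(x) for x in (1.0, 1.0)]  # same inputs, same outputs
print(first, second)
```

Without the reset, the second pass would start from the state left behind by the first pass and produce different outputs for the same inputs.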

The network has one input, a hidden layer with 10 units, and an output layer with 1 unit. The default tanh activation functions are used in the LSTM units and a linear activation function in the output layer.

A mean squared error loss function is used for this regression problem, together with the efficient ADAM optimization algorithm.

The example below configures and creates the network.

```python
# configure network
n_batch = len(X)
n_epoch = 1000
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
```

We will fit the network to all of the examples each epoch and reset the state of the network at the end of each epoch manually.

```python
# fit network
for i in range(n_epoch):
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
    model.reset_states()
```

Finally, we will forecast each step in the sequence one at a time.

This requires a batch size of 1, which differs from the batch size of 9 used to fit the network, and will result in an error when the example is run.

```python
# online forecast
for i in range(len(X)):
    testX, testy = X[i], y[i]
    testX = testX.reshape(1, 1, 1)
    yhat = model.predict(testX, batch_size=1)
    print('>Expected=%.1f, Predicted=%.1f' % (testy, yhat))
```

Below is the complete code example.

```python
from pandas import DataFrame
from pandas import concat
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# create sequence
length = 10
sequence = [i/float(length) for i in range(length)]
# create X/y pairs
df = DataFrame(sequence)
df = concat([df, df.shift(1)], axis=1)
df.dropna(inplace=True)
# convert to LSTM friendly format
values = df.values
X, y = values[:, 0], values[:, 1]
X = X.reshape(len(X), 1, 1)
# configure network
n_batch = len(X)
n_epoch = 1000
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit network
for i in range(n_epoch):
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
    model.reset_states()
# online forecast
for i in range(len(X)):
    testX, testy = X[i], y[i]
    testX = testX.reshape(1, 1, 1)
    yhat = model.predict(testX, batch_size=1)
    print('>Expected=%.1f, Predicted=%.1f' % (testy, yhat))
```

Running the example fits the model without problems, but results in an error when making a prediction.

The error reported is as follows:

```
ValueError: Cannot feed value of shape (1, 1, 1) for Tensor 'lstm_1_input:0', which has shape '(9, 1, 1)'
```

## Solution 1: Online Learning (Batch Size = 1)

One solution to this problem is to fit the model using online learning.

This is where the batch size is set to a value of 1 and the network weights are updated after each training example.

This can have the effect of faster learning, but also adds instability to the learning process as the weights vary widely with each batch.

Nevertheless, this will allow us to make one-step forecasts on the problem. The only change required is setting *n_batch* to 1 as follows:

```python
n_batch = 1
```

The complete code listing is provided below.

```python
from pandas import DataFrame
from pandas import concat
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# create sequence
length = 10
sequence = [i/float(length) for i in range(length)]
# create X/y pairs
df = DataFrame(sequence)
df = concat([df, df.shift(1)], axis=1)
df.dropna(inplace=True)
# convert to LSTM friendly format
values = df.values
X, y = values[:, 0], values[:, 1]
X = X.reshape(len(X), 1, 1)
# configure network
n_batch = 1
n_epoch = 1000
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit network
for i in range(n_epoch):
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
    model.reset_states()
# online forecast
for i in range(len(X)):
    testX, testy = X[i], y[i]
    testX = testX.reshape(1, 1, 1)
    yhat = model.predict(testX, batch_size=1)
    print('>Expected=%.1f, Predicted=%.1f' % (testy, yhat))
```

Running the example prints the 9 expected outcomes and the correct predictions.

```
>Expected=0.0, Predicted=0.0
>Expected=0.1, Predicted=0.1
>Expected=0.2, Predicted=0.2
>Expected=0.3, Predicted=0.3
>Expected=0.4, Predicted=0.4
>Expected=0.5, Predicted=0.5
>Expected=0.6, Predicted=0.6
>Expected=0.7, Predicted=0.7
>Expected=0.8, Predicted=0.8
```
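To make the difference in update frequency between online and batch learning concrete, here is a small sketch that fits a plain linear model (not an LSTM) to the same input and output values with hand-written gradient descent. The function names, learning rate and epoch count are illustrative assumptions.

```python
X = [i / 10.0 for i in range(1, 10)]
y = [(i - 1) / 10.0 for i in range(1, 10)]

def mse(w, b):
    """Mean squared error of the line y = w*x + b over the whole dataset."""
    return sum((w * xi + b - yi) ** 2 for xi, yi in zip(X, y)) / len(X)

def train(batch_size, n_epoch=200, lr=0.1):
    """Mini-batch gradient descent on y = w*x + b; returns (w, b, n_updates)."""
    w, b, n_updates = 0.0, 0.0, 0
    for _ in range(n_epoch):
        for start in range(0, len(X), batch_size):
            xb = X[start:start + batch_size]
            yb = y[start:start + batch_size]
            errors = [w * xi + b - yi for xi, yi in zip(xb, yb)]
            # Gradients of the mean squared error over this batch.
            grad_w = 2.0 * sum(e * xi for e, xi in zip(errors, xb)) / len(xb)
            grad_b = 2.0 * sum(errors) / len(xb)
            w -= lr * grad_w
            b -= lr * grad_b
            n_updates += 1
    return w, b, n_updates

initial_loss = mse(0.0, 0.0)
for batch_size in (1, len(X)):
    w, b, n_updates = train(batch_size)
    print('batch_size=%d updates=%d loss=%.6f' % (batch_size, n_updates, mse(w, b)))
```

With 9 training samples, the online configuration performs nine times as many weight updates as the full-batch configuration over the same number of epochs.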

## Solution 2: Batch Forecasting (Batch Size = N)

Another solution is to make all predictions at once in a batch.

This would mean that we could be very limited in the way the model is used.

We would have to use all predictions made at once, or only keep the first prediction and discard the rest.

We can adapt the example for batch forecasting by predicting with a batch size equal to the training batch size, then enumerating the batch of predictions, as follows:

```python
# batch forecast
yhat = model.predict(X, batch_size=n_batch)
for i in range(len(y)):
    print('>Expected=%.1f, Predicted=%.1f' % (y[i], yhat[i]))
```

The complete example is listed below.

```python
from pandas import DataFrame
from pandas import concat
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# create sequence
length = 10
sequence = [i/float(length) for i in range(length)]
# create X/y pairs
df = DataFrame(sequence)
df = concat([df, df.shift(1)], axis=1)
df.dropna(inplace=True)
# convert to LSTM friendly format
values = df.values
X, y = values[:, 0], values[:, 1]
X = X.reshape(len(X), 1, 1)
# configure network
n_batch = len(X)
n_epoch = 1000
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit network
for i in range(n_epoch):
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
    model.reset_states()
# batch forecast
yhat = model.predict(X, batch_size=n_batch)
for i in range(len(y)):
    print('>Expected=%.1f, Predicted=%.1f' % (y[i], yhat[i]))
```

Running the example prints the expected and correct predicted values.

```
>Expected=0.0, Predicted=0.0
>Expected=0.1, Predicted=0.1
>Expected=0.2, Predicted=0.2
>Expected=0.3, Predicted=0.3
>Expected=0.4, Predicted=0.4
>Expected=0.5, Predicted=0.5
>Expected=0.6, Predicted=0.6
>Expected=0.7, Predicted=0.7
>Expected=0.8, Predicted=0.8
```
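The option of keeping only the first prediction and discarding the rest can be sketched as padding a single observation up to the fixed batch size. Here `predict_batch` is a hypothetical stand-in for a model built with a fixed batch size, not a Keras function, and the zero padding is an assumption. It relies on rows in a batch being processed independently, so the padding should not change the first row's output.

```python
N_BATCH = 9  # fixed batch size the stand-in model was "built" with

def predict_batch(batch):
    """Hypothetical stand-in for a fixed-batch model: returns input minus 0.1."""
    if len(batch) != N_BATCH:
        raise ValueError('expected %d rows, got %d' % (N_BATCH, len(batch)))
    return [x - 0.1 for x in batch]

def predict_one(x):
    """Pad a single observation up to the batch size, keep the first output."""
    padded = [x] + [0.0] * (N_BATCH - 1)
    return predict_batch(padded)[0]

print('%.1f' % predict_one(0.5))
```

The cost of this approach is that the model computes a full batch of outputs for every single prediction that is kept.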

## Solution 3: Copy Weights

A better solution is to use different batch sizes for training and predicting.

The way to do this is to copy the weights from the fit network and to create a new network with the pre-trained weights.

We can do this easily enough using the *get_weights()* and *set_weights()* functions in the Keras API, as follows:

```python
# re-define the batch size
n_batch = 1
# re-define model
new_model = Sequential()
new_model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
new_model.add(Dense(1))
# copy weights
old_weights = model.get_weights()
new_model.set_weights(old_weights)
```

This creates a new model that is defined with a batch size of 1 and carries the trained weights. We can then use this new model to make one-step predictions:

```python
# online forecast
for i in range(len(X)):
    testX, testy = X[i], y[i]
    testX = testX.reshape(1, 1, 1)
    yhat = new_model.predict(testX, batch_size=n_batch)
    print('>Expected=%.1f, Predicted=%.1f' % (testy, yhat))
```

The complete example is listed below.

```python
from pandas import DataFrame
from pandas import concat
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# create sequence
length = 10
sequence = [i/float(length) for i in range(length)]
# create X/y pairs
df = DataFrame(sequence)
df = concat([df, df.shift(1)], axis=1)
df.dropna(inplace=True)
# convert to LSTM friendly format
values = df.values
X, y = values[:, 0], values[:, 1]
X = X.reshape(len(X), 1, 1)
# configure network
n_batch = 3
n_epoch = 1000
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit network
for i in range(n_epoch):
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
    model.reset_states()
# re-define the batch size
n_batch = 1
# re-define model
new_model = Sequential()
new_model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
new_model.add(Dense(1))
# copy weights
old_weights = model.get_weights()
new_model.set_weights(old_weights)
# compile model
new_model.compile(loss='mean_squared_error', optimizer='adam')
# online forecast
for i in range(len(X)):
    testX, testy = X[i], y[i]
    testX = testX.reshape(1, 1, 1)
    yhat = new_model.predict(testX, batch_size=n_batch)
    print('>Expected=%.1f, Predicted=%.1f' % (testy, yhat))
```

Running the example prints the expected, and again correctly predicted, values.

```
>Expected=0.0, Predicted=0.0
>Expected=0.1, Predicted=0.1
>Expected=0.2, Predicted=0.2
>Expected=0.3, Predicted=0.3
>Expected=0.4, Predicted=0.4
>Expected=0.5, Predicted=0.5
>Expected=0.6, Predicted=0.6
>Expected=0.7, Predicted=0.7
>Expected=0.8, Predicted=0.8
```
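Copying weights between the two models works because none of the weight shapes depend on the batch size. The sketch below lists the shapes we would expect *get_weights()* to return for this network, assuming the standard Keras layout of an LSTM layer with four gates and a bias, followed by a Dense layer. The helper names are hypothetical.

```python
def lstm_dense_weight_shapes(n_features, n_units, n_outputs=1):
    """Shapes in the order Keras get_weights() is assumed to return them."""
    return [
        (n_features, 4 * n_units),  # LSTM input kernel (4 gates side by side)
        (n_units, 4 * n_units),     # LSTM recurrent kernel
        (4 * n_units,),             # LSTM bias
        (n_units, n_outputs),       # Dense kernel
        (n_outputs,),               # Dense bias
    ]

def count_params(shapes):
    """Total number of scalar weights across all arrays."""
    total = 0
    for shape in shapes:
        size = 1
        for dim in shape:
            size *= dim
        total += size
    return total

shapes = lstm_dense_weight_shapes(n_features=1, n_units=10)
print(shapes)
print(count_params(shapes))
```

Note that the batch size does not appear anywhere in these shapes, which is why a model defined with a batch size of 3 and one defined with a batch size of 1 can share the same weights.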

## Summary

In this tutorial, you discovered how you can work around the need to vary the batch size used for training and prediction with the same network.

Specifically, you learned:

- How to design a simple sequence prediction problem and develop an LSTM to learn it.
- How to vary an LSTM configuration for online and batch-based learning and predicting.
- How to vary the batch size used for training from that used for predicting.

Do you have any questions about batch size?

Ask your questions in the comments below and I will do my best to answer.

Good tip. It is also useful to create another model just for evaluation of test dataset to compare RMSE between train/test.

What do you mean exactly Sam?

Could you explain the dimensions of the weight matrix for this model? Just curious and want to know. I am trying to understand how Keras stores weights.

You can print it out after compiling the model as follows:

Hello, could you explain why you redefine the n_batch = 1 to 1? I thought it should be a different value, no?

In which case Logan?

I think he means lines 18 and 31 in the last complete example. Should line 18 be the following instead? n_batch=len(X)

Hi, Dr. Jason Brownlee. Could you tell me the Keras version you use in this example? I tried to copy the code and run it on my Mac, but it doesn’t work.

Keras version 2.0 or higher.

Hi, Jason Brownlee. Could you explain why you define the n_batch=1 in the line 18 of the last example? I think n_batch should be assigned with other values.

I have tried to redefine n_batch=len(X), train the model, and copy the weights to the new model “new_model”. But I did not get the right prediction result. Could you please help to find the reason?

```
>Expected=0.0, Predicted=0.0
>Expected=0.1, Predicted=0.1
>Expected=0.2, Predicted=0.3
>Expected=0.3, Predicted=0.5
>Expected=0.4, Predicted=0.8
>Expected=0.5, Predicted=1.1
>Expected=0.6, Predicted=1.4
>Expected=0.7, Predicted=1.7
>Expected=0.8, Predicted=2.1
```

The following is the code I used, which is same as the last example except the line 18

```python
from pandas import DataFrame
from pandas import concat
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
# create sequence
length = 10
sequence = [i/float(length) for i in range(length)]
# create X/y pairs
df = DataFrame(sequence)
df = concat([df, df.shift(1)], axis=1)
df.dropna(inplace=True)
# convert to LSTM friendly format
values = df.values
X, y = values[:, 0], values[:, 1]
X = X.reshape(len(X), 1, 1)
# configure network
n_batch = len(X)
n_epoch = 1000
n_neurons = 10
# design network
model = Sequential()
model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
# fit network
for i in range(n_epoch):
    model.fit(X, y, epochs=1, batch_size=n_batch, verbose=1, shuffle=False)
    model.reset_states()
# re-define the batch size
n_batch = 1
# re-define model
new_model = Sequential()
new_model.add(LSTM(n_neurons, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
new_model.add(Dense(1))
# copy weights
old_weights = model.get_weights()
new_model.set_weights(old_weights)
# compile model
new_model.compile(loss='mean_squared_error', optimizer='adam')
# online forecast
for i in range(len(X)):
    testX, testy = X[i], y[i]
    testX = testX.reshape(1, 1, 1)
    yhat = new_model.predict(testX, batch_size=n_batch)
    print('>Expected=%.1f, Predicted=%.1f' % (testy, yhat))
```

In testing I found that while a batch size of 1 looked like it was learning a pattern: http://tinyurl.com/ycutvy7h, a larger batch size often converges to a static prediction, much like linear regression (batch Size 5): http://tinyurl.com/yblzfus4

That said, the latter example (batch size 5) actually has a lower RMSE. It leaves me wondering if I actually have something learnable here, or if the flat line indicates no pattern beyond a regression?

Interesting.

Some problems may not benefit from a complex model like an LSTM. That is why we baseline using simple methods – to see if we can add value.

More complex/flexible is not always better.

Technically my problem might be a classification problem in that I really want to know, “Will tomorrow’s move be up or down?” Yet it’s not in the sense that magnitude matters. e.g. the following examples all have the same reward: a) correctly predicting an “up” tomorrow where truth was +6, b) predicting an “up” on 3 days where truth was +2, c) predicting “down” on two days that truth was -3.

Thoughts on the best model for that kind of problem? Considering reinforcement learning next, e.g. https://github.com/matthiasplappert/keras-rl.

Brainstorm all possible framings, then evaluate each.

I’ve seen extremely few examples of practitioners taking the DQN RL concept beyond Gym gaming examples. Do you see any reason why time series forecasting could not be looked at like an Atari game, where the observed game state is replaced with our time series observations and we ask the agent to forecast (play one of two paddle positions) for tomorrow being “up” or “down” as described above? Does that sound like an incremental advance beyond what we’re doing in your more regression-oriented approach taken in this post?

You could try it, I have not thought deeply about the suitability of DRL on time series sorry.

Very good post thank you Jason! I had this problem yesterday and your blog helped me solve it.

I think you should set n_batch to a value other than 1 in the third solution, as brought up by @Zhiyu Wang. Because you redefine it to 1 later in the code (line 31), you didn’t end up with different batch sizes for training and predicting.

Thanks Saad, I see. I’ve updated the example to *actually* use different batch sizes in the final example!