How to Scale Data for Long Short-Term Memory Networks in Python

By Jason Brownlee on August 5, 2019 in Deep Learning for Time Series

The data for your sequence prediction problem probably needs to be scaled when training a neural network, such as a Long Short-Term Memory recurrent neural network.

When a network is fit on unscaled data that has a range of values (e.g. quantities in the 10s to 100s) it is possible for large inputs to slow down the learning and convergence of your network and in some cases prevent the network from effectively learning your problem.

In this tutorial, you will discover how to normalize and standardize your sequence prediction data and how to decide which to use for your input and output variables.

After completing this tutorial, you will know:

How to normalize and standardize sequence data in Python.
How to select the appropriate scaling for input and output variables.
Practical considerations when scaling sequence data.

Let’s get started.

Tutorial Overview

This tutorial is divided into 4 parts; they are:

Scaling Series Data
Scaling Input Variables
Scaling Output Variables
Practical Considerations When Scaling

Scaling Series Data in Python

There are two types of scaling of your series that you may want to consider: normalization and standardization.

These can both be achieved using the scikit-learn library.

Normalize Series Data

Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.

Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data. If your time series is trending up or down, estimating these expected values may be difficult and normalization may not be the best method to use on your problem.

A value is normalized as follows:

y = (x - min) / (max - min)

Where min and max are the minimum and maximum observable values of the variable that x belongs to.

For example, for a dataset, we could guesstimate the min and max observable values as -10 and 30. We can then normalize any value, like 18.8, as follows:

y = (x - min) / (max - min)
y = (18.8 - (-10)) / (30 - (-10))
y = 28.8 / 40
y = 0.72

You can see that if an x value is provided that is outside the bounds of the minimum and maximum values, the resulting value will not be in the range of 0 and 1. You could check for these observations prior to making predictions and either remove them from the dataset or limit them to the pre-defined maximum or minimum values.
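
The formula and the out-of-range handling described above can be sketched in plain Python. The functions normalize() and clip_to_range() are hypothetical helpers written for this illustration, not part of any library, and the bounds of -10 and 30 are the guesstimated values from the worked example.

```python
# A minimal sketch of min-max normalization by hand.
# normalize() and clip_to_range() are illustrative helpers, not library functions.

def normalize(x, minimum, maximum):
    """Rescale x so that minimum maps to 0 and maximum maps to 1."""
    return (x - minimum) / (maximum - minimum)

def clip_to_range(x, minimum, maximum):
    """Limit x to the pre-defined minimum and maximum values."""
    return max(minimum, min(x, maximum))

# guesstimated bounds from the worked example
MINIMUM, MAXIMUM = -10.0, 30.0

# a value inside the bounds lands between 0 and 1
inside = normalize(18.8, MINIMUM, MAXIMUM)
print(round(inside, 2))

# a value above the estimated maximum normalizes to more than 1
outside = normalize(35.0, MINIMUM, MAXIMUM)
print(outside)

# limiting the value first keeps the result within 0 and 1
clipped = normalize(clip_to_range(35.0, MINIMUM, MAXIMUM), MINIMUM, MAXIMUM)
print(clipped)
```

Whether to remove or limit out-of-range observations depends on the problem. Limiting keeps the sample but discards information about how extreme it was.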

You can normalize your dataset using the scikit-learn object MinMaxScaler.

Good practice usage with the MinMaxScaler and other scaling techniques is as follows:

Fit the scaler using available training data. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the fit() function.
Apply the scale to training data. This means you can use the normalized data to train your model. This is done by calling the transform() function.
Apply the scale to data going forward. This means you can prepare new data in the future on which you want to make predictions.

If needed, the transform can be inverted. This is useful for converting predictions back into their original scale for reporting or plotting. This can be done by calling the inverse_transform() function.

Below is an example of normalizing a contrived sequence of 10 quantities.

The scaler object requires data to be provided as a matrix of rows and columns. The contrived series is defined as a Pandas Series, so its values must be reshaped into a matrix with one column before scaling.

from pandas import Series
from sklearn.preprocessing import MinMaxScaler
# define contrived series
data = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
series = Series(data)
print(series)
# prepare data for normalization
values = series.values
values = values.reshape((len(values), 1))
# train the normalization
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(values)
print('Min: %f, Max: %f' % (scaler.data_min_[0], scaler.data_max_[0]))
# normalize the dataset and print
normalized = scaler.transform(values)
print(normalized)
# inverse transform and print
inversed = scaler.inverse_transform(normalized)
print(inversed)

Running the example prints the sequence, prints the min and max values estimated from the sequence, prints the normalized sequence, then prints the values back in their original scale using the inverse transform.

We can also see that the minimum and maximum values of the dataset are 10.0 and 100.0 respectively.

0     10.0
1     20.0
2     30.0
3     40.0
4     50.0
5     60.0
6     70.0
7     80.0
8     90.0
9    100.0

Min: 10.000000, Max: 100.000000

[[ 0.        ]
 [ 0.11111111]
 [ 0.22222222]
 [ 0.33333333]
 [ 0.44444444]
 [ 0.55555556]
 [ 0.66666667]
 [ 0.77777778]
 [ 0.88888889]
 [ 1.        ]]

[[  10.]
 [  20.]
 [  30.]
 [  40.]
 [  50.]
 [  60.]
 [  70.]
 [  80.]
 [  90.]
 [ 100.]]
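
The good practice steps listed earlier (fit on training data, then apply the same scale to data going forward) might look like the sketch below. The split into training and new observations is an assumption made for illustration. The new data deliberately contains a value above the training maximum.

```python
# A minimal sketch: fit the scaler on training data only, then reuse it on new data.
# The train/new split below is an assumption made for illustration.
from numpy import array
from sklearn.preprocessing import MinMaxScaler

# training observations (the same contrived quantities as the example above)
train = array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 100.0]).reshape(-1, 1)
# new observations that arrive later, one of which exceeds the training maximum
new = array([55.0, 100.0, 120.0]).reshape(-1, 1)

# estimate the min and max from the training data only
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)

# apply the same scale to the new data; do not refit
new_scaled = scaler.transform(new)
print(new_scaled)

# values beyond the training range fall outside 0 and 1
print(new_scaled.max() > 1.0)

# invert the transform to recover the original values
new_inversed = scaler.inverse_transform(new_scaled)
print(new_inversed)
```

Refitting the scaler on the new data would change the meaning of the scaled values the model was trained on, which is why the fitted scaler is reused here.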

Standardize Series Data

Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1.

This can be thought of as centering the data by subtracting the mean value, then rescaling it by dividing by the standard deviation.

Like normalization, standardization can be useful, and even required in some machine learning algorithms when your data has input values with differing scales.

Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well behaved mean and standard deviation. You can still standardize your time series data if this expectation is not met, but you may not get reliable results.

Standardization requires that you know or are able to accurately estimate the mean and standard deviation of observable values. You may be able to estimate these values from your training data.

A value is standardized as follows:

y = (x - mean) / standard_deviation

Where the mean is calculated as:

mean = sum(x) / count(x)

And the standard_deviation is calculated as:

standard_deviation = sqrt( sum( (x - mean)^2 ) / count(x))

We can guesstimate a mean of 10 and a standard deviation of about 5. Using these values, we can standardize a value of 20.7 as follows:

y = (x - mean) / standard_deviation
y = (20.7 - 10) / 5
y = (10.7) / 5
y = 2.14
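
The three formulas above can be checked with a few lines of plain Python. This is a sketch only: mean(), standard_deviation() and standardize() are hypothetical helpers written for this illustration, and the small sample is contrived.

```python
# A minimal sketch of standardization by hand, following the formulas above.
# mean(), standard_deviation() and standardize() are illustrative helpers.
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def standard_deviation(values):
    # population standard deviation: divide by count(x), as in the formula above
    mu = mean(values)
    return sqrt(sum((x - mu) ** 2 for x in values) / len(values))

def standardize(x, mu, sigma):
    return (x - mu) / sigma

# the worked example: guesstimated mean of 10 and standard deviation of 5
y = standardize(20.7, 10.0, 5.0)
print(round(y, 2))

# estimate the statistics from a small contrived sample instead
data = [1.0, 5.5, 9.0, 2.6, 8.8, 3.0, 4.1, 7.9, 6.3]
mu = mean(data)
sigma = standard_deviation(data)
print('Mean: %f, StandardDeviation: %f' % (mu, sigma))

# standardized values have a mean of 0 and a standard deviation of 1
standardized = [standardize(x, mu, sigma) for x in data]
print(standardized)
```

Note that the formula divides by count(x) rather than count(x) - 1, so it is the population standard deviation, not the sample standard deviation.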

The mean and standard deviation estimates of a dataset can be more robust to new data than the minimum and maximum.

You can standardize your dataset using the scikit-learn object StandardScaler.

from pandas import Series
from sklearn.preprocessing import StandardScaler
from math import sqrt
# define contrived series
data = [1.0, 5.5, 9.0, 2.6, 8.8, 3.0, 4.1, 7.9, 6.3]
series = Series(data)
print(series)
# prepare data for standardization
values = series.values
values = values.reshape((len(values), 1))
# train the standardization
scaler = StandardScaler()
scaler = scaler.fit(values)
print('Mean: %f, StandardDeviation: %f' % (scaler.mean_[0], sqrt(scaler.var_[0])))
# standardize the dataset and print
standardized = scaler.transform(values)
print(standardized)
# inverse transform and print
inversed = scaler.inverse_transform(standardized)
print(inversed)

Running the example prints the sequence, prints the mean and standard deviation estimated from the sequence, prints the standardized values, then prints the values back in their original scale.

We can see that the estimated mean and standard deviation were about 5.3 and 2.7 respectively.

0    1.0
1    5.5
2    9.0
3    2.6
4    8.8
5    3.0
6    4.1
7    7.9
8    6.3

Mean: 5.355556, StandardDeviation: 2.712568

[[-1.60569456]
 [ 0.05325007]
 [ 1.34354035]
 [-1.01584758]
 [ 1.26980948]
 [-0.86838584]
 [-0.46286604]
 [ 0.93802055]
 [ 0.34817357]]

[[ 1. ]
 [ 5.5]
 [ 9. ]
 [ 2.6]
 [ 8.8]
 [ 3. ]
 [ 4.1]
 [ 7.9]
 [ 6.3]]

Scaling Input Variables

The input variables are those that the network takes on the input or visible layer in order to make a prediction.

A good rule of thumb is that input variables should be small values, probably in the range of 0-1 or standardized with a zero mean and a standard deviation of one.

Whether input variables require scaling depends on the specifics of your problem and of each variable. Let’s look at some examples.

Categorical Inputs

You may have a sequence of categorical inputs, such as letters or statuses.

Generally, categorical inputs are first integer encoded then one hot encoded. That is, a unique integer value is assigned to each distinct possible input, then a binary vector of ones and zeros is used to represent each integer value.

By definition, a one hot encoding will ensure that each input is a small real value, in this case 0.0 or 1.0.
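
The two encoding steps can be sketched with NumPy as follows. The sequence of letters is a made-up example, and the mapping from letters to integers (sorted order) is one arbitrary choice among many.

```python
# A minimal sketch of integer encoding followed by one hot encoding.
# The sequence of letters is a made-up example.
import numpy as np

sequence = ['a', 'c', 'b', 'a', 'd']

# assign a unique integer to each distinct value
alphabet = sorted(set(sequence))
char_to_int = {c: i for i, c in enumerate(alphabet)}
integer_encoded = [char_to_int[c] for c in sequence]
print(integer_encoded)

# represent each integer as a binary vector with a single 1.0
one_hot = np.eye(len(alphabet))[integer_encoded]
print(one_hot)

# invert the encoding by taking the index of the 1.0 in each row
decoded = [alphabet[i] for i in one_hot.argmax(axis=1)]
print(decoded)
```

Every value in the encoded matrix is either 0.0 or 1.0, so no further scaling is needed for this input.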

Real-Valued Inputs

You may have a sequence of quantities as inputs, such as prices or temperatures.

If the distribution of the quantity is normal, then it should be standardized; otherwise, the series should be normalized. This applies whether the range of quantity values is large (10s, 100s, etc.) or small (0.01, 0.0001).

If the quantity values are small (near 0-1) and the distribution is limited (e.g. standard deviation near 1) then perhaps you can get away with no scaling of the series.
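
One way to turn this rule of thumb into code is sketched below. It is an assumption-laden illustration: choose_scaler() is a hypothetical helper, and the Shapiro-Wilk test with a 0.05 threshold is just one possible check for normality, not a rule from this tutorial. A histogram of the data is a good companion to any such test.

```python
# A rough sketch of choosing a scaler from the shape of the data.
# choose_scaler() is a hypothetical helper; the Shapiro-Wilk test and the
# 0.05 threshold are assumptions, not a rule from this tutorial.
import numpy as np
from scipy.stats import shapiro
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def choose_scaler(values, alpha=0.05):
    """Return a StandardScaler if the values look Gaussian, else a MinMaxScaler."""
    statistic, p_value = shapiro(values)
    if p_value > alpha:
        return StandardScaler()
    return MinMaxScaler(feature_range=(0, 1))

rng = np.random.default_rng(1)
# a strongly skewed quantity, standing in for something like waiting times
skewed = rng.exponential(scale=50.0, size=500)

scaler = choose_scaler(skewed)
print(type(scaler).__name__)

# the chosen scaler is then fit and applied as usual
scaled = scaler.fit_transform(skewed.reshape(-1, 1))
print(scaled.min(), scaled.max())
```

For a time series, the observations are often correlated, which weakens any formal normality test, so treat the result as a hint rather than a decision.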

Other Inputs

Problems can be complex and it may not be clear how to best scale input data.

If in doubt, normalize the input sequence. If you have the resources, explore modeling with the raw data, standardized data, and normalized data and see if there is a beneficial difference.

If the input variables are combined linearly, as in an MLP [Multilayer Perceptron], then it is rarely strictly necessary to standardize the inputs, at least in theory. … However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima.

— Should I normalize/standardize/rescale the data? Neural Nets FAQ

Scaling Output Variables

The output variable is the variable predicted by the network.

You must ensure that the scale of your output variable matches the scale of the activation function (transfer function) on the output layer of your network.

If your output activation function has a range of [0,1], then obviously you must ensure that the target values lie within that range. But it is generally better to choose an output activation function suited to the distribution of the targets than to force your data to conform to the output activation function.

— Should I normalize/standardize/rescale the data? Neural Nets FAQ

The following heuristics should cover most sequence prediction problems:

Binary Classification Problem

If your problem is a binary classification problem, then the output will be class values 0 and 1. This is best modeled with a sigmoid activation function on the output layer. Output values will be real values between 0 and 1 that can be snapped to crisp values.

Multi-class Classification Problem

If your problem is a multi-class classification problem, then the output will be a vector of binary class values between 0 and 1, one output per class value. This is best modeled with a softmax activation function on the output layer. Again, output values will be real values between 0 and 1 that can be snapped to crisp values.
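
Snapping real-valued outputs to crisp class values, for both the binary and the multi-class case, might be sketched like this. The arrays are made-up numbers standing in for network predictions, and the 0.5 threshold for the binary case is a common default assumed here.

```python
# A minimal sketch of snapping real-valued outputs to crisp class values.
# The output arrays are made-up numbers standing in for network predictions.
import numpy as np

# binary classification: one sigmoid output per sample
sigmoid_outputs = np.array([0.08, 0.93, 0.41, 0.67])
# assumed threshold of 0.5
binary_classes = (sigmoid_outputs >= 0.5).astype(int)
print(binary_classes)

# multi-class classification: one softmax output per class, per sample
softmax_outputs = np.array([
    [0.70, 0.20, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.45, 0.25],
])
# pick the class with the largest output for each sample
multi_classes = softmax_outputs.argmax(axis=1)
print(multi_classes)
```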

Regression Problem

If your problem is a regression problem, then the output will be a real value. This is best modeled with a linear activation function. If the distribution of the value is normal, then you can standardize the output variable. Otherwise, the output variable can be normalized.
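
For the regression case, the sketch below scales the output variable with its own scaler and converts predictions back to the original units. No network is trained here: the target values and the "predicted" values are made up, and the name y_scaler is just a choice for this illustration.

```python
# A minimal sketch of scaling a regression target and inverting predictions.
# The target values and the "predictions" are made up; no network is trained here.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# raw target values from the training data
y_train = np.array([120.0, 150.0, 180.0, 210.0, 240.0]).reshape(-1, 1)

# fit a dedicated scaler for the output variable
y_scaler = MinMaxScaler(feature_range=(0, 1))
y_train_scaled = y_scaler.fit_transform(y_train)

# stand-in for the scaled values a network with a linear output layer might predict
y_pred_scaled = np.array([0.10, 0.52, 0.98]).reshape(-1, 1)

# convert predictions back to the original units before reporting error
y_pred = y_scaler.inverse_transform(y_pred_scaled)
y_true = np.array([130.0, 185.0, 235.0]).reshape(-1, 1)
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(y_pred.ravel())
print('RMSE: %.3f' % rmse)
```

Keeping a separate scaler for the output makes it easy to invert predictions without touching the input scaling.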

Practical Considerations When Scaling

There are some practical considerations when scaling sequence data.

Estimate Coefficients. You can estimate coefficients (min and max values for normalization or mean and standard deviation for standardization) from the training data. Inspect these first-cut estimates and use domain knowledge or domain experts to help improve these estimates so that they will be usefully correct on all data in the future.
Save Coefficients. You will need to normalize new data in the future in exactly the same way as the data used to train your model. Save the coefficients used to file and load them later when you need to scale new data when making predictions.
Data Analysis. Use data analysis to help you better understand your data. For example, a simple histogram can help you quickly get a feeling for the distribution of quantities to see if standardization would make sense.
Scale Each Series. If your problem has multiple series, treat each as a separate variable and in turn scale them separately.
Scale At The Right Time. It is important to apply any scaling transforms at the right time. For example, if you have a series of quantities that is non-stationary, it may be appropriate to scale after first making your data stationary. It would not be appropriate to scale the series after it has been transformed into a supervised learning problem as each column would be handled differently, which would be incorrect.
Scale if in Doubt. You probably do need to rescale your input and output variables. If in doubt, at least normalize your data.
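
Two of the considerations above, scaling each series separately and saving the coefficients, can be sketched together. The two series, the file name scaler.pkl and the use of pickle are all assumptions made for illustration. Saving the raw min and max values to a text file would work just as well.

```python
# A minimal sketch of scaling several series separately and saving the coefficients.
# The two series, the file name and the use of pickle are assumptions for illustration.
import os
import pickle
import tempfile
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# two parallel series with very different ranges, one per column
train = np.array([
    [10.0, 0.001],
    [20.0, 0.004],
    [30.0, 0.002],
    [40.0, 0.005],
])

# MinMaxScaler estimates a separate min and max for each column
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(train)
print(scaler.data_min_, scaler.data_max_)

# save the fitted scaler (and with it the coefficients) to file
path = os.path.join(tempfile.gettempdir(), 'scaler.pkl')
with open(path, 'wb') as f:
    pickle.dump(scaler, f)

# later: load the scaler and prepare new data in exactly the same way
with open(path, 'rb') as f:
    loaded = pickle.load(f)
new = np.array([[25.0, 0.003]])
new_scaled = loaded.transform(new)
print(new_scaled)
```

Only load pickle files that you created yourself, and ideally with the same version of scikit-learn that wrote them.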

Summary

In this tutorial, you discovered how to scale your sequence prediction data when working with Long Short-Term Memory recurrent neural networks.

Specifically, you learned:

How to normalize and standardize sequence data in Python.
How to select the appropriate scaling for input and output variables.
Practical considerations when scaling sequence data.

Do you have any questions about scaling sequence prediction data?
Ask your question in the comments and I will do my best to answer.

80 Responses to How to Scale Data for Long Short-Term Memory Networks in Python

Jack Sheffield July 7, 2017 at 6:33 am #

Thanks for the post Jason, nice and succinct walk-through on how to scale data. I wanted to share a great course on Experfy that covers Machine Learning, especially supervised learning that I’ve found super helpful in understanding all of this

- Jack Sheffield July 7, 2017 at 6:34 am #
  
  Here’s the link! https://www.experfy.com/training/courses/machine-learning-foundations-supervised-learning
  
- Jason Brownlee July 9, 2017 at 10:32 am #
  
  Glad to hear it.
  
Anthony The Koala July 7, 2017 at 9:35 am #

Dear Dr Jason,
When making predictions using the scaled data, do you have to unscale the data, using the

y (scaled value) = (x - min) / (max - min)
y * (max - min) + min = actual value

OR

y * std_dev + mean = actual value

Thank you

- Jason Brownlee July 9, 2017 at 10:34 am #
  
  After the prediction, yes, in order to make use of it or to have error scores in the correct scale for an apples-to-apples comparison of models.
  
Natallia Lundqvist July 7, 2017 at 10:28 pm #

Hi Jason, thank you once again for sharing your great ideas! I work with seq2seq application on text input of variable length with very large vocabulary (several thousand entries). Obviously, padding and one_hot_encode are necessary in this case. If one uses keras.tokenizer.texts_to_sequences(…) and then keras.tokenizer.sequences_to_matrix(sequence, mode=’binary’), one gets 2D-tensor which can not be fitted directly into LSTMs.

For example:
seq_test = tokenizer.texts_to_sequences(input_text_sequence)

seq_test[:4]
Out[16]: [[1, 2, 110], [23, 5, 150], [1, 3, 17], [8, 2, 218, 332]]

X_test = tokenizer.sequences_to_matrix(seq_test, mode=’binary’)

X_test[:4,:]
Out[18]:
array([[ 0., 1., 1., …, 0., 0., 0.],
[ 0., 0., 0., …, 0., 0., 0.],
[ 0., 1., 0., …, 0., 0., 0.],
[ 0., 0., 1., …, 0., 0., 0.]])

If one tries to pass padded sequence into “sequences_to_matrix”, an error message is generated:

File “C:….\keras\preprocessing\text.py”, line 262, in sequences_to_matrix
if not seq:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Is it so that one has to do “one_hot_encoding” manually in order to make use of LSTMs in an encode-decode manner???

On the other hand, I get very good convergence (>99%) if I don’t do one_hot_encode and use a network architecture similar to https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

The problem arise with predictions since the last Dense layer has activation=’sigmoid’, which generates values between 0 and 1. How to make predictions in the form (Out[19]: [[1, 2, 110], [23, 5, 150], [1, 3, 17], [8, 2, 218]]) without one_hot_encode input_sequence???

The last question. If one use one_hot_encode of a sequence, embedding layer and convolution layer don’t make sense, right???

- Jason Brownlee July 9, 2017 at 10:45 am #
  
  LSTM input is 3D [samples, timesteps, features]. Each sample will be one sequence of your input. Time steps are words or chars and features are the one hot encoded values.
  
Liviu July 12, 2017 at 2:09 am #

Hello and thank you for the tutorials ! Learned a lot from them.

One question regarding scaling (or normalization): how can we make sure that the scaling results remains the same between different data sets? For example:
– step 1: we use some data sets to train a model (with scaling data) and then we save the trained model for future use.
– step 2: we import the model created at step 1 and used it to predict a prediction data set.

But: the prediction data set must also be scaled. And more than that it must be scaled with the same scaling parameters (scaled “the same way”) used to scale the model trained at step 1. Am I wrong ?
Or we somehow have to save the scaling object also and import it again to be used to scale the prediction data set at step 2 ?

- Jason Brownlee July 12, 2017 at 9:49 am #
  
  Correct.
  
  It means you must estimate the scaling parameters carefully and save them for future use.
  
  - Lukas November 16, 2017 at 6:34 am #
    
    Great articles Jason! Thank you so much for your dedicated work.
    
    I am facing a similar problem to Liviu’s.
    
    How to scale the features and the target in the initial training data, supposed in the future additional data will be available and is to be used to incrementally train the model.
    
    If the initially available data is scaled from 0 to 1 with using the maximum value available in the data, a new maximum would shift the whole scale the model is trained on and would therefore falsify the results.
    
    What I already know for sure is the maximum of the target to be higher in the future due to growth, but the final magnitude is absolutely not assessable.
    
    Do you have any suggestions how to solve this problem without rescaling the whole dataset with the new max and respectively retraining the model on the whole dataset?
    
    Thanks in advance and regards,
    Lukas
    
    - Jason Brownlee November 16, 2017 at 10:32 am #
      
      You can use domain knowledge to estimate the extreme min/max values that you are ever expected to see.
      
      Or use the same approach and estimate mean/stdev and standardize the data instead which might be more robust to large changes in scale over time.
      
Emmanuel July 31, 2017 at 9:48 am #

Thanks for the good work

- Jason Brownlee July 31, 2017 at 3:48 pm #
  
  I’m glad you found the post useful Emmanuel.
  
Hai Nguyen January 12, 2018 at 2:52 am #

Thank you for your work. If I would like to scale value from pixels of images, how do i do that?

- Jason Brownlee January 12, 2018 at 5:53 am #
  
  See this post:
  https://machinelearningmastery.com/image-augmentation-deep-learning-keras/
  
Giani February 19, 2018 at 11:51 pm #

Hi Jason, thank you a lot for your work!

I wanted to ask you if there is a major motivation to scale data in LSTM rather than in typical MLP NN.

I was using MLP and I was obtaining good results even without scaling input data while facing the same problem by using LSTMs was giving me very bad results (always constant output values).. but now normalizing inputs and outputs of LSTM I am getting even better results than MLPs.

So I was asking myself if there is a theoretical motivation why normalization is more important in LSTM than in MLP.

Thanks again,

Giani

- Jason Brownlee February 21, 2018 at 6:26 am #
  
  Just an empirical justification, like most of applied machine learning.
  
  Yes, I have seen better results with a rescale between 0-1 than without. Also, making the data stationary can help.
  
  My best advice is to test and see.
  
Abdur Rehman Nadeem March 1, 2018 at 11:08 pm #

Hi Jason, I have a dataset in which some of the features have very small range say 0.1 – 0.001 and some features have large range say 10 – 1000 , should I normalize these features ?

- Jason Brownlee March 2, 2018 at 5:32 am #
  
  I would recommend it and see how the treatment impacts model skill.
  
Fabian Zimmer March 19, 2018 at 6:08 pm #

Hi Jason,

First of all thanks for this wealth of information that is your blog, it really helped me on my way to setting up a working recurrent neural net for a pet project of mine.

Almost all of it is working, but I am having a problem with scaling the input and output right. My time series is some historical financial data and when scaling each window of this series, so input and output together, it learns incredibly well and predictions (input and output are queried, scaled, then only the input is fed to the net) it performs accurately on never seen before data.

In real life conditions each window will not have an output series part, so I only scale the input. As those ranges are somewhat different from the signals when scaling input and output of each window together, real life predictions are really poor.

MinMax scaling the input, and MinMax scaling the output with the min max values of the input sequence (scaled output can sometimes range from 2 to -1, whereas the input is always between 1 and 0) leads to really long learning times and subpar predictions.

Am I doing something completely wrong? How would you tackle such an issue?

Thanks a lot in advance and thanks again for your time and effort you undoubtedly put into this blog.

Cheers,
Fabian

- Jason Brownlee March 20, 2018 at 6:13 am #
  
  Perhaps scale manually and select min/max values that you expect will cover all possible observations in the domain for all time.
  
  - Fabian Zimmer March 22, 2018 at 7:13 pm #
    
    I tried that approach but the algorithm doesn’t really converge there. It’s a simple sequence of 200 observation points and 48 prediction points, using lstm multicells with t2 loss and rmsprop. Might switch to a conv rnn classifier with one hot encoded percent change in 48 timestep target, as that allows me to scale the observation sequence to -1,1.
    
    Still a little bamboozled, is regression with large variance in the signal that big of a problem?
    
    - Jason Brownlee March 23, 2018 at 6:05 am #
      
      LSTM is not really suited for straight regression, it is suited for sequence prediction problems:
      https://machinelearningmastery.com/sequence-prediction/
      
      - Fabian Zimmer March 23, 2018 at 6:30 am #
        
        I am attempting to predict a sequence of 48 time steps from a sequence of 240 time steps, though.
        
        If I scale on the entire window, so over 288 time steps, I get amazing training and validation results. But since I can’t do that, as during inference time I only have access to the input, I need to scale only the inputs during training and validation, too. The resulting predictions are poor at best.
      - Jason Brownlee March 23, 2018 at 8:27 am #
        
        Perhaps you can use domain knowledge to estimate the min/max to be expected across all possible times?
Chris J March 19, 2018 at 9:19 pm #

Hi Jason,

Quick question, why do you scale your data to the range (0, 1) when the article you linked to specifically recommends scaling to (-1, 1), which will give zero mean?

- Jason Brownlee March 20, 2018 at 6:18 am #
  
  I have found scaling to [0,1] results in more skillful LSTM models.
  
- Fabian Zimmer March 24, 2018 at 9:20 pm #
  
  The signal, financial in nature, can be extremely volatile. What I have settled for, and what seems to give very acceptable results, is to fit the minmax scaler to the observation period with a feature range of -0.8 to 0.8. If the output, transformed with the observation fitted scaler, exceeds -1 or 1, I cap it at that. During prediction I replace those outlier values with null, so that most of the predictions in feature range are 48 time steps long, some might be shorter. Seems to work reasonably well and produces reliable and actionable insights. Thanks again for your help, greatly appreciated.
  
Case March 21, 2018 at 2:45 am #

Hi Jason,

This is a great article. I am trying Keras and using some training data I manually generated and labeled. The training data is a permutation of the sequence[0, 1, 2, 3]. The output is 1 or 0. I build a sequential model using Dense layers. What I find out is that if I rescale the training data to, e.g. [0, 0.25, 0.5, 0.75], it gives me worse accuracy. Do you have any idea why this happens?

- Jason Brownlee March 21, 2018 at 6:40 am #
  
  The model hyperparameters might need tuning for the new scaled inputs?
  
Lyndon March 29, 2018 at 2:04 am #

Hi Jason,
I am learning how to use LSTM to predict time series (like stock price prediction). But I have a question about the data scaling.

For training data set, considering min max scaler, we already know the min and max value. But when we apply the model on validation data or test data, how do I scale the data? If we scale them in the same way, I think min and max values contain future information.

Do I misunderstand something?

Thank you!

- Jason Brownlee March 29, 2018 at 6:37 am #
  
  If you can estimate the min/max for the domain. Then scale the data using this domain knowledge.
  
Riccardo April 10, 2018 at 12:35 am #

Hi,

thanks for the post. I still haven’t understood how to deal with both scaling and masking. The two operations clearly do not commute: if I first pad my sequence, say:
[[ 1, 2, 3, 4 ]
[ 1, 2, 3, 0 ]
[ 1, 2, 0, 0 ]
[ 1, 2, 3, 0 ]]
And then rescale, the rescaled array does not have 0s anymore in the positions corresponding to the padded entries. How do I deal with this in Keras?

Cheers,
Riccardo

- Jason Brownlee April 10, 2018 at 6:21 am #
  
  Perhaps scale first, then pad as a final step prior to modeling.
  
maurice June 18, 2018 at 2:56 am #

I am confused a lstm cell is working with sigmoid and tanh, but why do we use standardscaling ?When I use standard scaling, and having the values of range -0.5 to 11, sigmoid and tanh would not work or am I wrong ? how do I choose the activation function for a linear layer ? thank you in advance the master of ML 🙂

- Jason Brownlee June 18, 2018 at 6:43 am #
  
  It is often useful to first standardize then normalize the raw data.
  
Waldemar July 6, 2018 at 7:04 pm #

Hi Jason,

Thanks for useful post. I have a small doubt, though.

Suppose we have an input time series on which we train our model, let’s say the last seven values are [2, 3, 4, 5, 6, 7, 8]. The time series collects new data every hour. When we apply normalization, our data will consist of values between 0 and 1, where 0 is the ‘replacement’ for 2 (minimum value) and 1 is the ‘replacement’ for 8 (maximum value). We train our LSTM model and it works really well on a test set (extracted from the mentioned array). The problem starts when we want to predict a future value based on a few new numbers that have appeared in the meantime. Suppose right now our time series looks like [2, 3, 4, 5, 6, 7, 8, 10, 13, 17]. If we normalize it with the old scaler, the last numbers will be higher than one. How to handle such a situation?

Thanks in advance! 😉

- Jason Brownlee July 7, 2018 at 6:14 am #
  
  In this case, it is a good idea to estimate the range of values in the domain using domain expertise and use them to scale the data.
  
Robin October 5, 2018 at 8:45 pm #

Hey:)
Thanks for your great tutorial. I still have a question about the handling of train data, test data and my new data I use for real predictions.

Lets assume I have a set X which is just an sequence of 1000 time points with 5 features.
I want to have a model that takes a sequence of 10 of those time points and predicts a 1 or 0 for the 11th (binary classification to make it simple).
X would become an array of (100,10,5). Because we have 100 sequences 10 time steps each and 5 features.

Now I’d split up X to training data and test data:

train_x = X[:80]
test_x = X[80:]

How do I normalize it?

Let’s take both options: first the method x = (x - mean(x)) / std(x), second y = (x - min) / (max - min).

For the first option I usually took each feature in train_x and normalized it.
That means that I take all the values for one feature in train_x and normalize them with respect to themselves. I did that for every feature. The same procedure was then used for the test set as well.

Is that the right way? I believe that for RNNs it should be different from the way of normalizing regular neural nets.

And what do I do with completely new data that I don’t want to use for training? Say I have a new sequence and want to run my model on it. How do I normalize it?

Thanks!

Robin

Reply
- Jason Brownlee October 6, 2018 at 5:44 am #
  
  Calculate the min/max based on training data or domain knowledge, then apply the normalization to all samples, train, test, and new data in the future.
  
  Reply
  - Max April 1, 2022 at 3:28 am #
    
    By this do you mean calculating the min/max for each feature (based on training data or domain knowledge), so that you have a scaler for each feature? I.e., I have 13 features, I’ll calculate the min/max for each one, create a scaler for each feature, and then apply the appropriate scaler to its corresponding feature?
    
    I have a very similar problem to Robin.
    
    Reply
    - James Carmichael April 1, 2022 at 9:06 am #
      
      Hi Max…The following may be of interest to you:
      
      https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
      
      Reply
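A minimal sketch of the "fit on training data, apply everywhere" advice for a 3D array shaped like the one in Robin’s question. The data is synthetic and the `standardize` helper is a hypothetical name; the point is only that the per-feature coefficients come from the training split and are then reused unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for X: 100 sequences, 10 time steps, 5 features.
X = rng.normal(loc=50.0, scale=10.0, size=(100, 10, 5))
train_x, test_x = X[:80], X[80:]

# One mean and one standard deviation per feature, computed over all
# training samples and all time steps (axes 0 and 1).
mean = train_x.mean(axis=(0, 1))
std = train_x.std(axis=(0, 1))

def standardize(batch):
    """Apply the training-set coefficients to any batch of sequences."""
    return (batch - mean) / std

train_scaled = standardize(train_x)
test_scaled = standardize(test_x)

# A brand-new sequence at prediction time uses the same coefficients.
new_sequence = rng.normal(loc=50.0, scale=10.0, size=(1, 10, 5))
new_scaled = standardize(new_sequence)
print(mean.shape, new_scaled.shape)
```

The same pattern works for normalization by swapping the mean and standard deviation for the per-feature training minimum and maximum.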
Luv October 23, 2018 at 4:30 pm #

My input values are from 0 to 7. Do I really need to rescale this?

Reply
- Jason Brownlee October 24, 2018 at 6:24 am #
  
  It is a good idea to rescale values to the range 0-1 for neural networks.
  
  Reply
murat January 26, 2019 at 1:05 pm #

Thanks for another great article.

Why not “scaling down” instead of normalizing?
x………………….y
Suppose we have negative values in an array: [5, -7, -3, 6, 7, 9, 10].
After scaling down, the values run from -0.7 to 1.
After normalizing, the values run from 0 to 1.

With scaling down, I keep the weight of -7 (now -0.7) to pull my Fx in the west/negative direction.
With normalizing, -7 becomes 0 and 10 becomes 1.

So -7 will have no weight in my equation (-0.7x vs. 0x), while 10y becomes 1y, pulling all the way east.

Reply
- Jason Brownlee January 27, 2019 at 7:39 am #
  
  Normalization refers to rescaling data to a range [0, 1].
  
  Sorry, I don’t follow. Perhaps you can elaborate?
  
  Reply
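One possible reading of the question above is a comparison between dividing by the largest absolute value ("scaling down") and min-max normalization. The sketch below works through both on the array from the comment; this interpretation is an assumption, not something the thread confirms.

```python
import numpy as np

x = np.array([5, -7, -3, 6, 7, 9, 10], dtype=float)

# "Scaling down": divide by the largest absolute value.
# Signs are preserved and zero stays at zero.
max_abs_scaled = x / np.abs(x).max()

# Normalization: map the observed minimum to 0 and the maximum to 1.
normalized = (x - x.min()) / (x.max() - x.min())

print(max_abs_scaled)
print(normalized)
```

With the first approach -7 becomes -0.7, while with normalization it becomes 0 and the whole series is shifted to be non-negative. scikit-learn’s `MaxAbsScaler` implements the first approach and `MinMaxScaler` the second.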
Daniel February 7, 2019 at 5:15 pm #

Jason,

I just browsed through this post. It looks like you didn’t explain why data scaling matters. I worked on using an LSTM to model stock prices. The input sequences are from different stocks, and the stocks have different scales (for example, some range between 100 and 200, some between 1 and 10). When I use the original prices as input to the LSTM, I get very poor predictions. But by simply applying normalization to each stock, the LSTM model performance jumps instantly. I wonder if you have a good explanation for this.

Thanks,

Daniel

Reply
- Jason Brownlee February 8, 2019 at 7:44 am #
  
  Yes, I explain the need for data scaling in detail in this tutorial:
  https://machinelearningmastery.com/how-to-improve-neural-network-stability-and-modeling-performance-with-data-scaling/
  
  Reply
Charanraj Mohan March 26, 2019 at 10:10 pm #

Hello, How do you scale a weight matrix that is either lower or upper triangular matrix ?

Reply
- Jason Brownlee March 27, 2019 at 8:59 am #
  
  Perhaps this will help:
  https://machinelearningmastery.com/introduction-to-types-of-matrices-in-linear-algebra/
  
  Reply
mfili May 17, 2019 at 4:43 am #

Hi,
a question related to “domain knowledge” and estimating the min and max values: could I set the estimate max and min values to something like max_value=(current_max_value+30000) and min_value=(current_min_value-30000) until I find someone with more domain knowledge?

Reply
- Jason Brownlee May 17, 2019 at 6:01 am #
  
  Perhaps, if 30K bounds makes sense. If it is too large, it would squash away any signal in your data.
  
  Reply
Homi July 22, 2019 at 9:44 pm #

Hey man,
your articles have been a great help. I have a doubt and need some guidance.
My LSTM model has 3 variables that need scaling. Do I use the same scaler object for all of them, or do I make a separate scaler object for each variable?
Appreciate the effort you put in your articles.

Reply
- Jason Brownlee July 23, 2019 at 8:02 am #
  
  I give an example here:
  https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/
  
  Reply
Suraj Pawar August 22, 2019 at 6:16 am #

How can I scale the data for my CNN model? I am using Keras and my input is four-dimensional [a,b,c,d]. a = number of training examples, [b,c] = image dimension, d = number of input features. I think MinMaxScaler is only for two-dimensional matrices.

Thank you. Regards,
Suraj Pawar

Reply
- Jason Brownlee August 22, 2019 at 6:34 am #
  
  Perhaps this will help:
  https://machinelearningmastery.com/how-to-develop-convolutional-neural-network-models-for-time-series-forecasting/
  
  Reply
Rariwa September 1, 2019 at 3:23 pm #

Hi Jason,
I have a problem with my results: some of the predictions are negative, which doesn’t make sense in practice, since the output data has a certain positive range. I have the training data and a hold-out data set (no target column provided). At first, I suspected this was because the input range in the hold-out data set falls outside that of the training data set. However, I followed your suggestion to normalize using domain knowledge to get the min and max values. Unfortunately, I still found that some predictions are negative, and I checked that their inputs are good and fall inside the training input range. Below is the RNN architecture:

model = Sequential()
model.add(LSTM(n_neurons, activation='relu', inner_activation='hard_sigmoid', input_shape=(x_train.shape[1], 4)))
model.add(Dropout(dropout))
model.add(Dense(output_dim=1, activation='linear'))
model.compile(loss="mean_squared_error", optimizer="adam")

I also tried multivariate adaptive spline regression, which resulted in some negative predictions as well.

Please advise.
Thank you

Reply
- Jason Brownlee September 2, 2019 at 5:26 am #
  
  Well done on your progress.
  
  Perhaps you can post-process the prediction?
  Perhaps you can try other modeling algorithms and compare results?
  Perhaps you can explore other data preparation schemes and see if they have an effect?
  
  Let me know how you go.
  
  Reply
Igor November 19, 2019 at 5:35 am #

Hi Jason,
Does the selected scaler feature range (-1 to 1 or 0 to 1) limit the choice of activation function (if the data is scaled to -1:1, can we only use tanh and not sigmoid, and vice versa)?
thanks. Igor.

Reply
- Jason Brownlee November 19, 2019 at 7:51 am #
  
  Neither. The StandardScaler will standardize data, meaning it will have a mean of 0 and a standard deviation of 1.
  
  StandardScaler is good for tanh, MinMaxScaler (normalization) is good for sigmoid, in general. Compare both with/without scaling and also throw relu into the mix.
  
  Reply
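For reference, the range that Igor asks about is set through the `feature_range` argument of scikit-learn’s `MinMaxScaler`. The sketch below uses a small contrived series to show a (-1, 1) range and the inverse transform; it says nothing about which activation function will work best, which still has to be tested.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Contrived series, shaped as a single column as scikit-learn expects.
series = np.array([10.0, 20.0, 30.0, 40.0, 50.0]).reshape(-1, 1)

# Rescale to [-1, 1] instead of the default [0, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(series)

# The same scaler object maps values back to the original units.
restored = scaler.inverse_transform(scaled)
print(scaled.ravel())
print(restored.ravel())
```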
Phil December 20, 2019 at 4:43 am #

Hey Jason,
I have a time series with more than 30 features. Say I want to normalize/standardize 15 of them. Can I use the same scaler for all of those 15 features or do I need a separate scaler for each of those features?

In this article https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/ you have 4 different scalers but what about the situation with 10 features that all should be normalized/standardized?

Thank you very much

Reply
- Jason Brownlee December 20, 2019 at 6:52 am #
  
  If each series is a column, then the scaling is applied by one scaler object to each column separately.
  
  Or you can do it manually – or any way you find easiest in your code.
  
  Reply
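A minimal sketch of the point above: a single scaler object fitted on a 2D array keeps separate coefficients for each column. The data is synthetic and `cols_to_scale` is a hypothetical selection of 15 out of 30 features, following Phil’s example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic data: 200 time steps, 30 features.
data = rng.uniform(0.0, 100.0, size=(200, 30))
cols_to_scale = list(range(15))  # hypothetical choice of features to scale

train, test = data[:150].copy(), data[150:].copy()

# One scaler object; it stores one mean and one std per selected column.
scaler = StandardScaler()
train[:, cols_to_scale] = scaler.fit_transform(train[:, cols_to_scale])
test[:, cols_to_scale] = scaler.transform(test[:, cols_to_scale])

print(scaler.mean_.shape)  # one mean per scaled column
```

The remaining 15 columns are left untouched, and the test split reuses the coefficients learned from the training split.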
Pedro Severn March 20, 2020 at 1:09 pm #

Hi and thank you for your material.

I have some questions regarding the scaling or normalization using LSTM NN. What I have done is this:

1. Split my data in train and validation.
2. With a MinMaxScaler(0,1), did fit_transform on the train dataset and then just transform on validation set.
3. Used the scaler for future data.

The issue I have encountered is that, when I have inputs that are greater or lower than the scaler bounds, I get weird predictions. For example, if the scaled input is negative, the prediction is 0, which is wrong. How can I manage this?

The network I’m using is something like this:

model=Sequential()
model.add(LSTM(300, input_shape = (window,features), return_sequences=True))
model.add(Dropout(0.5))
model.add(LSTM(200, input_shape=(window,features), return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(100,kernel_initializer='uniform',activation='relu'))
model.add(Dense(1,kernel_initializer='uniform',activation='relu'))
model.compile(loss='mse',optimizer='adam')

– Should I change the activation function of dense layers?
– What activation function in the LSTM layers should I use?
– Should I rescale new incoming data live in order to prevent these things?

Thank you in advance!

Reply
- Jason Brownlee March 20, 2020 at 1:21 pm #
  
  Perhaps try a range of model configurations and discover what works best for your specific prediction problem.
  
  Reply
  - Pedro Severin March 20, 2020 at 1:49 pm #
    
    Thank you for your reply. What do you mean by “a range of model”?
    
    Reply
    - Jason Brownlee March 21, 2020 at 8:16 am #
      
      Different algorithms.
      
      Reply
sarah May 20, 2020 at 8:17 pm #

Hi Jason,

Could you tell how to scale/normalize multivariate time series for LSTM model, where the input shape is (212,14,5) ?

I am struggling because MinMaxScaler expects an array with at most 2 dimensions. Is it still valid to reshape the data as follows:

train.reshape(train.shape[0]*train.shape[1]*train.shape[2], train.shape[2]) to fit the scaler, then reshape it back to (212,14,5)?

Do you think this affects the validity/reliability of the results? And how can I check?

Thanks

Reply
- Jason Brownlee May 21, 2020 at 6:16 am #
  
  Scale each feature separately.
  
  This may help:
  https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/
  
  Reply
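A minimal sketch of scaling each feature separately for a 3D input of shape (212, 14, 5), using synthetic data. Note that the 2D view needs `shape[0] * shape[1]` rows and `shape[2]` columns; the reshape written in the question has too many rows for the number of elements and would raise an error.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(2)
# Synthetic stand-in: 212 samples, 14 time steps, 5 features.
train = rng.uniform(-5.0, 5.0, size=(212, 14, 5))
n_samples, n_steps, n_features = train.shape

# Flatten samples and time steps into rows; keep one column per feature.
flat = train.reshape(-1, n_features)  # shape (212 * 14, 5)

scaler = MinMaxScaler()
flat_scaled = scaler.fit_transform(flat)  # each column scaled independently

# Restore the (samples, time steps, features) layout expected by the LSTM.
train_scaled = flat_scaled.reshape(n_samples, n_steps, n_features)
print(train_scaled.shape)
```

One way to check the result is to confirm that each feature’s minimum and maximum over the scaled training array are 0 and 1, and that a single time step transformed on its own matches the corresponding entry in the reshaped array.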
Ahmad July 15, 2020 at 8:58 pm #

Hi Jason,

Thanks for this article, I found it really helpful.

I have a small confusion: Suppose I standardize the input data and use it to train my model. Then I standardize the test data for validation.

My issue is, how do I apply the inverse transform on the predicted data. Do I use the same scaler (for the inverse transform) which I used to standardize the test data?

Reply
- Jason Brownlee July 16, 2020 at 6:35 am #
  
  You’re welcome.
  
  Yes, use the same scaler object to invert the transform as was used to transform in the first place. More here:
  https://machinelearningmastery.com/how-to-save-and-load-models-and-data-preparation-in-scikit-learn-for-later-use/
  
  Reply
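A minimal sketch of inverting a transform on predictions with the same scaler object. The target values are contrived and `yhat_scaled` is a made-up stand-in for model output, since no model is trained here.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Contrived training targets, shaped as a single column.
y_train = np.array([100.0, 110.0, 120.0, 130.0, 140.0]).reshape(-1, 1)

# Fit the scaler once, on the training targets.
y_scaler = StandardScaler().fit(y_train)
y_train_scaled = y_scaler.transform(y_train)

# Stand-in for model.predict(...) output, which is in scaled units.
yhat_scaled = np.array([[-1.0], [0.0], [0.5]])

# Use the SAME scaler object to return predictions to the original units.
yhat = y_scaler.inverse_transform(yhat_scaled)
print(yhat.ravel())
```

A scaled prediction of 0 maps back to the training mean (120 here), which is a quick sanity check that the right scaler was used.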
Randy November 15, 2020 at 7:18 am #

Excellent article. I have a quick question. You stated that when we have multiple series, we treat and standardize them separately. So let’s say I have 7 series in my training dataset and 3 series in my testing dataset. I will first standardize each of the 7 series individually. It is not clear to me how to carry out the transformation on the 3 series in the testing dataset, since I will have 7 means and 7 standard deviations from my training set and I do not know how to choose a mean and a standard deviation to standardize the 3 series in the testing set. Thank you in advance for your help.

Reply
- Jason Brownlee November 15, 2020 at 7:38 am #
  
  Each series is a separate variable that can be scaled separately, requiring its own coefficients (e.g. min and max or mean and standard deviation).
  
  Reply
pouyan November 22, 2020 at 1:54 am #

Hi Jason, the target in my regression data is one specific number 90% of the time, so if no model is used and that number is simply returned every time, the MSE of the prediction is low. But when I train my model using several LSTM and dense layers, it can’t even come close to the accuracy of the “no model” case. Is there a way to improve my model on these kinds of target distributions? I mean some kind of data conversion or model improvement. I’m badly in need of help.
thanks in advance

Reply
- Jason Brownlee November 22, 2020 at 6:57 am #
  
  Perhaps you can use a cascade of models, first level is classification whether it is “the same number” or not, second level is regression for the “not” case and predicts the specific numeric value.
  
  Reply
Lorenzo Rodríguez April 6, 2021 at 5:16 am #

Hello Jason, great post as always!

I have a question about MinMaxScaler.

What should we do when the test dataset has lower values than the training dataset? Fit the scaler on the whole dataset instead of just the training dataset?

I want my feature range to be between 0 and 1, but this situation produces negative values, which are not recommended for LSTM neural networks.

Thank you in advance!

Reply
- Jason Brownlee April 6, 2021 at 5:21 am #
  
  Ideally the training set is representative of all data.
  
  If new data is out of bounds of the training set, you can force values to the known min and max values.
  
  Reply
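A minimal sketch of forcing out-of-bounds values to the known minimum and maximum, using a scaler fitted on training data only. The numbers are contrived so that the test set contains values both below and above the training range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Contrived data: the test set goes below and above the training range.
train = np.array([10.0, 12.0, 15.0, 20.0]).reshape(-1, 1)
test = np.array([8.0, 14.0, 25.0]).reshape(-1, 1)

# Fit on the training data only.
scaler = MinMaxScaler().fit(train)

# Transformed test values can fall outside [0, 1].
test_scaled = scaler.transform(test)

# Force out-of-range values back to the bounds of the training range.
test_clipped = np.clip(test_scaled, 0.0, 1.0)
print(test_scaled.ravel())
print(test_clipped.ravel())
```

Clipping keeps inputs inside the range the network saw during training, at the cost of treating every out-of-range value as equal to the nearest bound.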
Keeed January 13, 2022 at 5:11 pm #

For LSTM, when input data is panel data, for example:
customer 1, spend_on_month_1, bonus_on_month_1, spend_on_month_2, bonus_on_month_2,
customer 2, spend_on_month_1, bonus_on_month_1, spend_on_month_2, bonus_on_month_2,
customer 2, spend_on_month_1, bonus_on_month_1, spend_on_month_2, bonus_on_month_2,

This is a (3, 2*2) shaped tabular data, excl. customer id column, and it will be reshaped to (3,2,2) input to LSTM model in a (N, T ,D) format. (3 rows, 2 months, 2 features)

In this case, how to normalise the feature?
– does the normalisation happen on feature level regardless of time? – each column, for example – normalise(spend_on_month_1)
– or happen on feature level across time? – all columns of a given feature across time, for example – normalise([spend_on_month_1, spend_on_month_2])

Keen to understand your idea about this, thanks!

Reply
- James Carmichael February 27, 2022 at 12:43 pm #
  
  Thanks for asking.
  
  I’m eager to help, but I just don’t have the capacity to debug code for you.
  
  I am happy to make some suggestions:
  
  Consider aggressively cutting the code back to the minimum required. This will help you isolate the problem and focus on it.
  Consider cutting the problem back to just one or a few simple examples.
  Consider finding other similar code examples that do work and slowly modify them to meet your needs. This might expose your misstep.
  Consider posting your question and code to StackOverflow.
  
  Reply
Keeed January 13, 2022 at 5:12 pm #

(Correcting a typo in the example above.)
For LSTM, when input data is panel data, for example:
customer 1, spend_on_month_1, bonus_on_month_1, spend_on_month_2, bonus_on_month_2,
customer 2, spend_on_month_1, bonus_on_month_1, spend_on_month_2, bonus_on_month_2,
customer 3, spend_on_month_1, bonus_on_month_1, spend_on_month_2, bonus_on_month_2,

This is a (3, 2*2) shaped tabular data, excl. customer id column, and it will be reshaped to (3,2,2) input to LSTM model in a (N, T ,D) format. (3 rows, 2 months, 2 features)

In this case, how to normalise the feature?
– does the normalisation happen on feature level regardless of time? – each column, for example – normalise(spend_on_month_1)
– or happen on feature level across time? – all columns of a given feature across time, for example – normalise([spend_on_month_1, spend_on_month_2])

Keen to understand your idea about this, thanks!

Reply
- James Carmichael January 15, 2022 at 11:35 am #
  
  Hello Keeed…You may find the following of interest:
  
  https://machinelearningmastery.com/standardscaler-and-minmaxscaler-transforms-in-python/
  
  Reply
