
Deep learning neural networks learn how to map inputs to outputs from examples in a training dataset.

The weights of the model are initialized to small random values and updated via an optimization algorithm in response to estimates of error on the training dataset.

Given the use of small weights in the model and the use of error between predictions and expected values, the scale of the inputs and outputs used to train the model is an important factor. Unscaled input variables can result in a slow or unstable learning process, whereas unscaled target variables on regression problems can result in exploding gradients, causing the learning process to fail.

Data preparation involves using techniques such as normalization and standardization to rescale input and output variables prior to training a neural network model.

In this tutorial, you will discover how to improve neural network stability and modeling performance by scaling data.

After completing this tutorial, you will know:

- Data scaling is a recommended pre-processing step when working with deep learning neural networks.
- Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
- How to apply standardization and normalization to improve the performance of a Multilayer Perceptron model on a regression predictive modeling problem.


Let’s get started.

## Tutorial Overview

This tutorial is divided into six parts; they are:

- The Scale of Your Data Matters
- Data Scaling Methods
- Regression Predictive Modeling Problem
- Multilayer Perceptron With Unscaled Data
- Multilayer Perceptron With Scaled Output Variables
- Multilayer Perceptron With Scaled Input Variables

## The Scale of Your Data Matters

Deep learning neural network models learn a mapping from input variables to an output variable.

As such, the scale and distribution of the data drawn from the domain may be different for each variable.

Input variables may have different units (e.g. feet, kilometers, and hours) that, in turn, may mean the variables have different scales.

Differences in the scales across input variables may increase the difficulty of the problem being modeled. An example of this is that large input values (e.g. a spread of hundreds or thousands of units) can result in a model that learns large weight values. A model with large weight values is often unstable, meaning that it may suffer from poor performance during learning and sensitivity to input values resulting in higher generalization error.

One of the most common forms of pre-processing consists of a simple linear rescaling of the input variables.

— Page 298, Neural Networks for Pattern Recognition, 1995.

A target variable with a large spread of values, in turn, may result in large error gradient values causing weight values to change dramatically, making the learning process unstable.
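To make this concrete, here is a minimal sketch, assuming a single linear layer with all-zero initial weights and a mean squared error loss (a simplification of the MLP used later in this tutorial). The data and the factor of 100 are made up for illustration. Under these assumptions, the gradient of the loss with respect to the weights grows in direct proportion to the scale of the target.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))                 # hypothetical inputs
y = X @ np.array([1.0, -2.0, 0.5])            # hypothetical targets

def mse_gradient(X, y, w):
    """Gradient of mean((X @ w - y) ** 2) with respect to w."""
    error = X @ w - y
    return 2.0 * X.T @ error / len(y)

w0 = np.zeros(3)
grad_small = mse_gradient(X, y, w0)
grad_large = mse_gradient(X, 100.0 * y, w0)   # same problem, targets 100x larger

# the second norm is 100 times the first
print(np.linalg.norm(grad_small), np.linalg.norm(grad_large))
```

With a fixed learning rate, a gradient that is 100 times larger means a weight update that is 100 times larger, which is one way training can become unstable.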

Scaling input and output variables is a critical step in using neural network models.

In practice it is nearly always advantageous to apply pre-processing transformations to the input data before it is presented to a network. Similarly, the outputs of the network are often post-processed to give the required output values.

— Page 296, Neural Networks for Pattern Recognition, 1995.

### Scaling Input Variables

The input variables are those that the network takes on the input or visible layer in order to make a prediction.

A good rule of thumb is that input variables should be small values, probably in the range of 0-1 or standardized with a zero mean and a standard deviation of one.

Whether input variables require scaling depends on the specifics of your problem and of each variable.

You may have a sequence of quantities as inputs, such as prices or temperatures.

If the distribution of the quantity is normal, then it should be standardized, otherwise the data should be normalized. This applies if the range of quantity values is large (10s, 100s, etc.) or small (0.01, 0.0001).

If the quantity values are small (near 0-1) and the distribution is limited (e.g. standard deviation near 1) then perhaps you can get away with no scaling of the data.

Problems can be complex and it may not be clear how to best scale input data.

If in doubt, normalize the input sequence. If you have the resources, explore modeling with the raw data, standardized data, and normalized data and see if there is a beneficial difference in the performance of the resulting model.

If the input variables are combined linearly, as in an MLP [Multilayer Perceptron], then it is rarely strictly necessary to standardize the inputs, at least in theory. […] However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima.

— Should I normalize/standardize/rescale the data? Neural Nets FAQ

### Scaling Output Variables

The output variable is the variable predicted by the network.

You must ensure that the scale of your output variable matches the scale of the activation function (transfer function) on the output layer of your network.

If your output activation function has a range of [0,1], then obviously you must ensure that the target values lie within that range. But it is generally better to choose an output activation function suited to the distribution of the targets than to force your data to conform to the output activation function.

— Should I normalize/standardize/rescale the data? Neural Nets FAQ
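As a rough illustration of matching targets to the output activation, the sketch below checks whether a set of target values falls inside the output range of a chosen activation function. The function name `targets_fit_activation` and the target values are hypothetical, not part of any library.

```python
import math

# Output ranges of some common activation functions
# (for sigmoid and tanh the bounds are approached, not reached).
ACTIVATION_RANGES = {
    'sigmoid': (0.0, 1.0),
    'tanh': (-1.0, 1.0),
    'linear': (-math.inf, math.inf),
}

def targets_fit_activation(targets, activation):
    """Return True if every target lies inside the activation's output range."""
    low, high = ACTIVATION_RANGES[activation]
    return all(low <= t <= high for t in targets)

targets = [-3.2, 0.4, 7.9]   # made-up regression targets
print(targets_fit_activation(targets, 'sigmoid'))  # False
print(targets_fit_activation(targets, 'linear'))   # True
```

A network with a sigmoid output could never predict -3.2 or 7.9, so these targets would need either rescaling or a different output activation.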

If your problem is a regression problem, then the output will be a real value.

This is best modeled with a linear activation function. If the distribution of the value is normal, then you can standardize the output variable. Otherwise, the output variable can be normalized.


## Data Scaling Methods

There are two types of scaling of your data that you may want to consider: normalization and standardization.

These can both be achieved using the scikit-learn library.

### Data Normalization

Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.

Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data.

A value is normalized as follows:

```
y = (x - min) / (max - min)
```

Where the minimum and maximum values pertain to the value *x* being normalized.

For example, for a dataset, we could guesstimate the min and max observable values as -10 and 30. We can then normalize any value, like 18.8, as follows:

```
y = (x - min) / (max - min)
y = (18.8 - (-10)) / (30 - (-10))
y = 28.8 / 40
y = 0.72
```
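The same arithmetic can be checked with a few lines of Python. This is a minimal sketch; the helper name `normalize` is chosen here for illustration.

```python
def normalize(x, min_value, max_value):
    """Rescale x from [min_value, max_value] to [0, 1]."""
    return (x - min_value) / (max_value - min_value)

y = normalize(18.8, -10.0, 30.0)
print(round(y, 2))  # 0.72
```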

You can see that if an *x* value is provided that is outside the bounds of the minimum and maximum values, the resulting value will not be in the range of 0 and 1. You could check for these observations prior to making predictions and either remove them from the dataset or limit them to the pre-defined maximum or minimum values.
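One way to limit out-of-range values to the pre-defined bounds is to clip them before normalizing. The sketch below assumes NumPy and reuses the bounds of -10 and 30 from the example above. The helper name `normalize_clipped` and the new values are made up for illustration.

```python
import numpy as np

def normalize_clipped(x, min_value, max_value):
    """Normalize to [0, 1], first limiting out-of-range values to the known bounds."""
    x = np.clip(np.asarray(x, dtype=float), min_value, max_value)
    return (x - min_value) / (max_value - min_value)

new_values = [-15.0, 18.8, 42.0]     # the first and last fall outside [-10, 30]
result = normalize_clipped(new_values, -10.0, 30.0)
print(result)
```

The out-of-range values are mapped to exactly 0 and 1. Recent scikit-learn releases offer a similar option through the `clip` argument of `MinMaxScaler`.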

You can normalize your dataset using the scikit-learn object MinMaxScaler.

Good practice usage with the *MinMaxScaler* and other scaling techniques is as follows:

- **Fit the scaler using available training data**. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the *fit()* function.
- **Apply the scale to training data**. This means you can use the normalized data to train your model. This is done by calling the *transform()* function.
- **Apply the scale to data going forward**. This means you can prepare new data in the future on which you want to make predictions.
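The three steps can be sketched as follows on a tiny made-up column of data. The arrays `train` and `new` are hypothetical, and only the training data is used to estimate the minimum and maximum.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[10.0], [20.0], [30.0], [40.0]])   # made-up training column
new = np.array([[25.0], [50.0]])                      # made-up future data

scaler = MinMaxScaler()
scaler.fit(train)                       # 1. estimate min and max from training data only
train_scaled = scaler.transform(train)  # 2. apply the scale to the training data
new_scaled = scaler.transform(new)      # 3. reuse the same scale on data going forward

print(train_scaled.ravel())
print(new_scaled.ravel())
```

The second new value (50) lies above the training maximum of 40, so it maps to a value greater than 1. This is expected when the scale is estimated from training data alone.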

The default scale for the *MinMaxScaler* is to rescale variables into the range [0,1], although a preferred scale can be specified via the “*feature_range*” argument, which takes a tuple containing the min and the max for all variables.

```python
from sklearn.preprocessing import MinMaxScaler
# create scaler
scaler = MinMaxScaler(feature_range=(-1,1))
```
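Rescaling to a range other than [0,1] amounts to normalizing first and then stretching and shifting the result. The plain-Python sketch below shows that arithmetic. The helper name `rescale` and its parameter names are chosen here for illustration and are not part of scikit-learn.

```python
def rescale(x, data_min, data_max, new_min=0.0, new_max=1.0):
    """Min-max rescale x from [data_min, data_max] to [new_min, new_max]."""
    unit = (x - data_min) / (data_max - data_min)   # first map to [0, 1]
    return unit * (new_max - new_min) + new_min     # then stretch and shift

y = rescale(18.8, -10.0, 30.0, new_min=-1.0, new_max=1.0)
print(round(y, 2))  # 0.44
```

The value 18.8, which normalizes to 0.72 in the range [0,1], lands at 0.44 in the range [-1,1].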

If needed, the transform can be inverted. This is useful for converting predictions back into their original scale for reporting or plotting. This can be done by calling the *inverse_transform()* function.

The example below provides a general demonstration for using the *MinMaxScaler* to normalize data.

```python
# demonstrate data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler
# load data
data = ...
# create scaler
scaler = MinMaxScaler()
# fit scaler on data
scaler.fit(data)
# apply transform
normalized = scaler.transform(data)
# inverse transform
inverse = scaler.inverse_transform(normalized)
```

You can also perform the fit and transform in a single step using the *fit_transform()* function; for example:

```python
# demonstrate data normalization with sklearn
from sklearn.preprocessing import MinMaxScaler
# load data
data = ...
# create scaler
scaler = MinMaxScaler()
# fit and transform in one step
normalized = scaler.fit_transform(data)
# inverse transform
inverse = scaler.inverse_transform(normalized)
```

### Data Standardization

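Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1. It is sometimes loosely referred to as “*whitening*”, although whitening in the strict sense also removes correlations between variables.

This can be thought of as centering the data by subtracting the mean value and then rescaling it by dividing by the standard deviation.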

Like normalization, standardization can be useful, and even required in some machine learning algorithms when your data has input values with differing scales.

Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well behaved mean and standard deviation. You can still standardize your data if this expectation is not met, but you may not get reliable results.

Standardization requires that you know or are able to accurately estimate the mean and standard deviation of observable values. You may be able to estimate these values from your training data.

A value is standardized as follows:

```
y = (x - mean) / standard_deviation
```

Where the *mean* is calculated as:

```
mean = sum(x) / count(x)
```

And the *standard_deviation* is calculated as:

```
standard_deviation = sqrt( sum( (x - mean)^2 ) / count(x))
```

For example, for a dataset, we could guesstimate a mean of 10 and a standard deviation of about 5. Using these values, we can standardize a value such as 20.7 as follows:

```
y = (x - mean) / standard_deviation
y = (20.7 - 10) / 5
y = (10.7) / 5
y = 2.14
```
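The formulas above translate directly into Python. This is a minimal sketch. The helper names and the small sample are made up for illustration, and the standard deviation divides by count(x), matching the formula given earlier.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def standard_deviation(values):
    """Population standard deviation: divides by count(x), not count(x) - 1."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))

def standardize(x, mean_value, std_value):
    return (x - mean_value) / std_value

# the worked example, using the guesstimated statistics
print(round(standardize(20.7, 10.0, 5.0), 2))  # 2.14

# estimating the statistics from a small made-up sample instead
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(mean(sample), standard_deviation(sample))  # 5.0 2.0
```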

The mean and standard deviation estimates of a dataset can be more robust to new data than the minimum and maximum.
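One way to see this is with a small simulation. The sketch below assumes Gaussian data with the mean of 10 and standard deviation of 5 used above, and the sample sizes are arbitrary. It estimates statistics from a small training sample and then compares them against a much larger batch of new data.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(loc=10.0, scale=5.0, size=100)      # made-up training sample
new = rng.normal(loc=10.0, scale=5.0, size=10000)      # made-up future data

# fraction of new values falling outside the min/max seen in training
outside = np.mean((new < train.min()) | (new > train.max()))

# how far the training estimates are from the statistics of the new data
mean_gap = abs(train.mean() - new.mean())
std_gap = abs(train.std() - new.std())

print('outside training range: %.3f' % outside)
print('mean gap: %.3f, std gap: %.3f' % (mean_gap, std_gap))
```

Some new values fall outside the training minimum and maximum, whereas the mean and standard deviation estimated from the training sample stay close to those of the new data.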

You can standardize your dataset using the scikit-learn object StandardScaler.

```python
# demonstrate data standardization with sklearn
from sklearn.preprocessing import StandardScaler
# load data
data = ...
# create scaler
scaler = StandardScaler()
# fit scaler on data
scaler.fit(data)
# apply transform
standardized = scaler.transform(data)
# inverse transform
inverse = scaler.inverse_transform(standardized)
```

You can also perform the fit and transform in a single step using the *fit_transform()* function; for example:

```python
# demonstrate data standardization with sklearn
from sklearn.preprocessing import StandardScaler
# load data
data = ...
# create scaler
scaler = StandardScaler()
# fit and transform in one step
standardized = scaler.fit_transform(data)
# inverse transform
inverse = scaler.inverse_transform(standardized)
```

## Regression Predictive Modeling Problem

A regression predictive modeling problem involves predicting a real-valued quantity.

We can use a standard regression problem generator provided by the scikit-learn library in the make_regression() function. This function will generate examples from a simple regression problem with a given number of input variables, statistical noise, and other properties.

We will use this function to define a problem that has 20 input features; 10 of the features will be meaningful and 10 will not be relevant. A total of 1,000 examples will be randomly generated. The pseudorandom number generator will be fixed to ensure that we get the same 1,000 examples each time the code is run.

```python
# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
```

Each input variable has a Gaussian distribution, as does the target variable.

We can demonstrate this by creating histograms of some of the input variables and the output variable.

```python
# regression predictive modeling problem
from sklearn.datasets import make_regression
from matplotlib import pyplot
# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
# histograms of input variables
pyplot.subplot(211)
pyplot.hist(X[:, 0])
pyplot.subplot(212)
pyplot.hist(X[:, 1])
pyplot.show()
# histogram of target variable
pyplot.hist(y)
pyplot.show()
```

Running the example creates two figures.

The first shows histograms of the first two of the twenty input variables, showing that each has a Gaussian data distribution.

The second figure shows a histogram of the target variable, showing a much larger range for the variable as compared to the input variables and, again, a Gaussian data distribution.
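The difference in scale can also be confirmed numerically rather than by eye. The sketch below regenerates the same dataset and prints summary statistics for the inputs (pooled across all 20 columns) and for the target.

```python
from sklearn.datasets import make_regression

# same dataset as above
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# summary statistics pooled over all input columns, and for the target
print('inputs: min=%.3f, max=%.3f, std=%.3f' % (X.min(), X.max(), X.std()))
print('target: min=%.3f, max=%.3f, std=%.3f' % (y.min(), y.max(), y.std()))
```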

Now that we have a regression problem that we can use as the basis for the investigation, we can develop a model to address it.

## Multilayer Perceptron With Unscaled Data

We can develop a Multilayer Perceptron (MLP) model for the regression problem.

A model will be demonstrated on the raw data, without any scaling of the input or output variables. We expect that model performance will be generally poor.

The first step is to split the data into train and test sets so that we can fit and evaluate a model. We will generate 1,000 examples from the domain and split the dataset in half, using 500 examples for each of the train and test datasets.

```python
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
```

Next, we can define an MLP model. The model will expect 20 inputs, one for each of the 20 input variables in the problem.

A single hidden layer will be used with 25 nodes and a rectified linear activation function. The output layer has one node for the single target variable and a linear activation function to predict real values directly.

```python
# define model
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
```
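To make the shape of this network concrete, here is a plain NumPy sketch of its forward pass. It is an illustration of the architecture, not how Keras implements it. The output-layer weights are arbitrary small values chosen for the sketch, and the input rows are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n_inputs, n_hidden = 20, 25

# He uniform initialization draws from U(-limit, limit) with limit = sqrt(6 / fan_in)
limit = np.sqrt(6.0 / n_inputs)
W1 = rng.uniform(-limit, limit, size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.uniform(-0.1, 0.1, size=(n_hidden, 1))   # arbitrary small values for the sketch
b2 = np.zeros(1)

def forward(X):
    hidden = np.maximum(0.0, X @ W1 + b1)   # rectified linear activation
    return hidden @ W2 + b2                 # linear activation: output is not squashed

X = rng.normal(size=(5, n_inputs))          # five made-up input rows
yhat = forward(X)
n_params = W1.size + b1.size + W2.size + b2.size
print(yhat.shape, n_params)  # (5, 1) 551
```

Because the output layer is linear, nothing bounds the predictions, which is why the scale of the target directly affects the size of the errors the network sees.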

The mean squared error loss function will be used to optimize the model and the stochastic gradient descent optimization algorithm will be used with the sensible default configuration of a learning rate of 0.01 and a momentum of 0.9.

```python
# compile model (learning_rate was named lr in older Keras releases)
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))
```
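To make the roles of the learning rate and momentum concrete, the sketch below applies a common formulation of the momentum update to a toy one-dimensional loss. The helper `sgd_momentum_step` and the toy loss are made up for illustration and are not Keras code.

```python
def sgd_momentum_step(w, velocity, grad, learning_rate=0.01, momentum=0.9):
    """One update: the velocity accumulates past gradients, then moves the weight."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

# minimize the toy loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w, velocity = 0.0, 0.0
for _ in range(500):
    grad = 2.0 * (w - 3.0)
    w, velocity = sgd_momentum_step(w, velocity, grad)

print(round(w, 3))  # 3.0
```

Each step is proportional to the gradient, so very large gradients (for example from very large targets) translate into very large weight changes.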

The model will be fit for 100 training epochs and the test set will be used as a validation set, evaluated at the end of each training epoch.

The mean squared error is calculated on the train and test datasets at the end of training to get an idea of how well the model learned the problem.

```python
# evaluate the model
train_mse = model.evaluate(trainX, trainy, verbose=0)
test_mse = model.evaluate(testX, testy, verbose=0)
```

Finally, the mean squared error on the train and test sets at the end of each training epoch is graphed using line plots, providing learning curves that give an idea of the dynamics of the model while learning the problem.

```python
# plot loss during training
pyplot.title('Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
```

Tying these elements together, the complete example is listed below.

```python
# mlp with unscaled data for the regression problem
from sklearn.datasets import make_regression
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from matplotlib import pyplot
# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# define model
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
# compile model (learning_rate was named lr in older Keras releases)
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
# evaluate the model
train_mse = model.evaluate(trainX, trainy, verbose=0)
test_mse = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
# plot loss during training
pyplot.title('Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
```

Running the example fits the model and calculates the mean squared error on the train and test sets.

In this case, the model is unable to learn the problem, resulting in predictions of NaN values. The model weights exploded during training given the very large errors and, in turn, error gradients calculated for weight updates.

```
Train: nan, Test: nan
```

This demonstrates that, at the very least, some data scaling is required for the target variable.

A line plot of training history is created but does not show anything as the model almost immediately results in a NaN mean squared error.

## Multilayer Perceptron With Scaled Output Variables

The MLP model can be updated to scale the target variable.

Reducing the scale of the target variable will, in turn, reduce the size of the gradient used to update the weights and result in a more stable model and training process.

Given the Gaussian distribution of the target variable, a natural method for rescaling the variable would be to standardize the variable. This requires estimating the mean and standard deviation of the variable and using these estimates to perform the rescaling.

It is best practice to estimate the mean and standard deviation from the training dataset and use these estimates to scale both the train and test datasets. This avoids any data leakage during the model evaluation process.

The scikit-learn transformers expect input data to be matrices of rows and columns, therefore the 1D arrays for the target variable will have to be reshaped into 2D arrays prior to the transforms.

```python
# reshape 1d arrays to 2d arrays
trainy = trainy.reshape(len(trainy), 1)
testy = testy.reshape(len(testy), 1)
```
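As an aside, NumPy can infer the number of rows when -1 is passed to `reshape()`, which avoids having to pass the array length explicitly. A minimal sketch on a made-up array:

```python
import numpy as np

y = np.array([1.5, -0.3, 2.2])      # made-up 1D target array
y_2d = y.reshape(-1, 1)              # -1 lets NumPy infer the number of rows
y_1d = y_2d.ravel()                  # back to 1D, e.g. after inverse_transform()

print(y.shape, y_2d.shape, y_1d.shape)  # (3,) (3, 1) (3,)
```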

We can then create and apply the *StandardScaler* to rescale the target variable.

```python
# create scaler
scaler = StandardScaler()
# fit scaler on training dataset
scaler.fit(trainy)
# transform training dataset
trainy = scaler.transform(trainy)
# transform test dataset
testy = scaler.transform(testy)
```

Rescaling the target variable means that estimating the performance of the model and plotting the learning curves will calculate an MSE in squared units of the scaled variable rather than squared units of the original scale. This can make interpreting the error within the context of the domain challenging.

In practice, it may be helpful to estimate the performance of the model by first inverting the transform on the test dataset target variable and on the model predictions and estimating model performance using the root mean squared error on the unscaled data. This is left as an exercise to the reader.
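One possible approach to that exercise is sketched below. It assumes a fitted `StandardScaler` for the target and uses made-up targets with noisy stand-in "predictions", so no model is trained here. The helper name `rmse_original_units` is chosen for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def rmse_original_units(scaler, y_true_scaled, y_pred_scaled):
    """Invert the target scaling, then compute RMSE in the original units."""
    y_true = scaler.inverse_transform(y_true_scaled)
    y_pred = scaler.inverse_transform(y_pred_scaled)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# made-up targets and stand-in predictions (hypothetical data, not model output)
rng = np.random.default_rng(1)
y = rng.normal(loc=0.0, scale=200.0, size=(500, 1))
scaler = StandardScaler().fit(y)
y_scaled = scaler.transform(y)
pred_scaled = y_scaled + rng.normal(scale=0.05, size=y_scaled.shape)

rmse_scaled = float(np.sqrt(np.mean((y_scaled - pred_scaled) ** 2)))
rmse = rmse_original_units(scaler, y_scaled, pred_scaled)
print('RMSE (scaled): %.3f, RMSE (original units): %.3f' % (rmse_scaled, rmse))
```

In the tutorial's setting, the scaled test targets and the output of `model.predict(testX)` would take the place of the made-up arrays. For a standardized target, the RMSE in original units is the scaled RMSE multiplied by the standard deviation estimated from the training targets.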

The complete example of standardizing the target variable for the MLP on the regression problem is listed below.

```python
# mlp with scaled outputs on the regression problem
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from matplotlib import pyplot
# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
# split into train and test
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train], y[n_train:]
# reshape 1d arrays to 2d arrays
trainy = trainy.reshape(len(trainy), 1)
testy = testy.reshape(len(testy), 1)
# create scaler
scaler = StandardScaler()
# fit scaler on training dataset
scaler.fit(trainy)
# transform training dataset
trainy = scaler.transform(trainy)
# transform test dataset
testy = scaler.transform(testy)
# define model
model = Sequential()
model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
model.add(Dense(1, activation='linear'))
# compile model (learning_rate was named lr in older Keras releases)
model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))
# fit model
history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
# evaluate the model
train_mse = model.evaluate(trainX, trainy, verbose=0)
test_mse = model.evaluate(testX, testy, verbose=0)
print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
# plot loss during training
pyplot.title('Loss / Mean Squared Error')
pyplot.plot(history.history['loss'], label='train')
pyplot.plot(history.history['val_loss'], label='test')
pyplot.legend()
pyplot.show()
```

Running the example fits the model and calculates the mean squared error on the train and test sets.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, the model does appear to learn the problem and achieves near-zero mean squared error, at least to three decimal places.

```
Train: 0.003, Test: 0.007
```

A line plot of the mean squared error on the train (blue) and test (orange) dataset over each training epoch is created.

In this case, we can see that the model rapidly learns to effectively map inputs to outputs for the regression problem and achieves good performance on both datasets over the course of the run, neither overfitting nor underfitting the training dataset.

It may be interesting to repeat this experiment and normalize the target variable instead and compare results.

## Multilayer Perceptron With Scaled Input Variables

We have seen that data scaling can stabilize the training process when fitting a model for regression with a target variable that has a wide spread.

It is also possible to improve the stability and performance of the model by scaling the input variables.

In this section, we will design an experiment to compare the performance of different scaling methods for the input variables.

The input variables also have a Gaussian data distribution, like the target variable, therefore we would expect that standardizing the data would be the best approach. This is not always the case.

We can compare the performance of a model fit on unscaled input variables to models fit with either standardized or normalized input variables.

The first step is to define a function to create the same 1,000 data samples, split them into train and test sets, and apply the data scaling methods specified via input arguments. The *get_dataset()* function below implements this. It takes a scaler for the input variables and a scaler for the target variable (either can be None) and returns the train and test datasets split into input and output components, ready to train and evaluate a model.

```python
# prepare dataset with input and output scalers, can be none
def get_dataset(input_scaler, output_scaler):
    # generate dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # scale inputs
    if input_scaler is not None:
        # fit scaler
        input_scaler.fit(trainX)
        # transform training dataset
        trainX = input_scaler.transform(trainX)
        # transform test dataset
        testX = input_scaler.transform(testX)
    if output_scaler is not None:
        # reshape 1d arrays to 2d arrays
        trainy = trainy.reshape(len(trainy), 1)
        testy = testy.reshape(len(testy), 1)
        # fit scaler on training dataset
        output_scaler.fit(trainy)
        # transform training dataset
        trainy = output_scaler.transform(trainy)
        # transform test dataset
        testy = output_scaler.transform(testy)
    return trainX, trainy, testX, testy
```

Next, we can define a function to fit an MLP model on a given dataset and return the mean squared error for the fit model on the test dataset.

The *evaluate_model()* function below implements this behavior.

```python
# fit and evaluate mse of model on test set
def evaluate_model(trainX, trainy, testX, testy):
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='linear'))
    # compile model (learning_rate was named lr in older Keras releases)
    model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))
    # fit model
    model.fit(trainX, trainy, epochs=100, verbose=0)
    # evaluate the model
    test_mse = model.evaluate(testX, testy, verbose=0)
    return test_mse
```

Neural networks are trained using a stochastic learning algorithm. This means that the same model fit on the same data may result in a different performance.

We can address this in our experiment by repeating the evaluation of each model configuration, in this case a choice of data scaling, multiple times and reporting performance as the mean of the error scores across all of the runs. We will repeat each run 30 times to ensure the mean is statistically robust.

The *repeated_evaluation()* function below implements this, taking the scaler for input and output variables as arguments, evaluating a model 30 times with those scalers, printing error scores along the way, and returning a list of the calculated error scores from each run.

```python
# evaluate model multiple times with given input and output scalers
def repeated_evaluation(input_scaler, output_scaler, n_repeats=30):
    # get dataset
    trainX, trainy, testX, testy = get_dataset(input_scaler, output_scaler)
    # repeated evaluation of model
    results = list()
    for _ in range(n_repeats):
        test_mse = evaluate_model(trainX, trainy, testX, testy)
        print('>%.3f' % test_mse)
        results.append(test_mse)
    return results
```
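A list of scores like the one returned by this function can be summarized with a mean and a measure of spread. The sketch below is one way to do so using only the standard library. The scores are illustrative values only, and the standard error is an extra statistic added here for illustration that the tutorial itself does not report.

```python
from math import sqrt
from statistics import mean, stdev

def summarize(scores):
    """Return the mean, sample standard deviation and standard error of the scores."""
    m = mean(scores)
    s = stdev(scores)
    return m, s, s / sqrt(len(scores))

scores = [0.010, 0.012, 0.005, 0.008, 0.008]   # illustrative values only
m, s, se = summarize(scores)
print('mean=%.4f std=%.4f standard error=%.4f' % (m, s, se))
```

The standard error shrinks as the number of repeats grows, which is the motivation for repeating each evaluation many times.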

Finally, we can run the experiment and evaluate the same model on the same dataset three different ways:

- No scaling of inputs, standardized outputs.
- Normalized inputs, standardized outputs.
- Standardized inputs, standardized outputs.

The mean and standard deviation of the error for each configuration are reported, then box and whisker plots are created to summarize the error scores for each configuration.

```python
# unscaled inputs
results_unscaled_inputs = repeated_evaluation(None, StandardScaler())
# normalized inputs
results_normalized_inputs = repeated_evaluation(MinMaxScaler(), StandardScaler())
# standardized inputs
results_standardized_inputs = repeated_evaluation(StandardScaler(), StandardScaler())
# summarize results
print('Unscaled: %.3f (%.3f)' % (mean(results_unscaled_inputs), std(results_unscaled_inputs)))
print('Normalized: %.3f (%.3f)' % (mean(results_normalized_inputs), std(results_normalized_inputs)))
print('Standardized: %.3f (%.3f)' % (mean(results_standardized_inputs), std(results_standardized_inputs)))
# plot results
results = [results_unscaled_inputs, results_normalized_inputs, results_standardized_inputs]
labels = ['unscaled', 'normalized', 'standardized']
pyplot.boxplot(results, labels=labels)
pyplot.show()
```

Tying these elements together, the complete example is listed below.

```python
# compare scaling methods for mlp inputs on regression problem
from sklearn.datasets import make_regression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from keras.layers import Dense
from keras.models import Sequential
from keras.optimizers import SGD
from matplotlib import pyplot
from numpy import mean
from numpy import std

# prepare dataset with input and output scalers, can be none
def get_dataset(input_scaler, output_scaler):
    # generate dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # scale inputs
    if input_scaler is not None:
        # fit scaler
        input_scaler.fit(trainX)
        # transform training dataset
        trainX = input_scaler.transform(trainX)
        # transform test dataset
        testX = input_scaler.transform(testX)
    if output_scaler is not None:
        # reshape 1d arrays to 2d arrays
        trainy = trainy.reshape(len(trainy), 1)
        testy = testy.reshape(len(testy), 1)
        # fit scaler on training dataset
        output_scaler.fit(trainy)
        # transform training dataset
        trainy = output_scaler.transform(trainy)
        # transform test dataset
        testy = output_scaler.transform(testy)
    return trainX, trainy, testX, testy

# fit and evaluate mse of model on test set
def evaluate_model(trainX, trainy, testX, testy):
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='linear'))
    # compile model (learning_rate was named lr in older Keras releases)
    model.compile(loss='mean_squared_error', optimizer=SGD(learning_rate=0.01, momentum=0.9))
    # fit model
    model.fit(trainX, trainy, epochs=100, verbose=0)
    # evaluate the model
    test_mse = model.evaluate(testX, testy, verbose=0)
    return test_mse

# evaluate model multiple times with given input and output scalers
def repeated_evaluation(input_scaler, output_scaler, n_repeats=30):
    # get dataset
    trainX, trainy, testX, testy = get_dataset(input_scaler, output_scaler)
    # repeated evaluation of model
    results = list()
    for _ in range(n_repeats):
        test_mse = evaluate_model(trainX, trainy, testX, testy)
        print('>%.3f' % test_mse)
        results.append(test_mse)
    return results

# unscaled inputs
results_unscaled_inputs = repeated_evaluation(None, StandardScaler())
# normalized inputs
results_normalized_inputs = repeated_evaluation(MinMaxScaler(), StandardScaler())
# standardized inputs
results_standardized_inputs = repeated_evaluation(StandardScaler(), StandardScaler())
# summarize results
print('Unscaled: %.3f (%.3f)' % (mean(results_unscaled_inputs), std(results_unscaled_inputs)))
print('Normalized: %.3f (%.3f)' % (mean(results_normalized_inputs), std(results_normalized_inputs)))
print('Standardized: %.3f (%.3f)' % (mean(results_standardized_inputs), std(results_standardized_inputs)))
# plot results
results = [results_unscaled_inputs, results_normalized_inputs, results_standardized_inputs]
labels = ['unscaled', 'normalized', 'standardized']
pyplot.boxplot(results, labels=labels)
pyplot.show()
```

Running the example prints the mean squared error for each model run along the way.

After each of the three configurations has been evaluated 30 times, the mean error for each is reported.

Your specific results may vary, but the general trend should be the same as is listed below.

In this case, we can see that as we expected, scaling the input variables does result in a model with better performance. Unexpectedly, better performance is seen using normalized inputs instead of standardized inputs. This may be related to the choice of the rectified linear activation function in the first hidden layer.

```
...
>0.010
>0.012
>0.005
>0.008
>0.008
Unscaled: 0.007 (0.004)
Normalized: 0.001 (0.000)
Standardized: 0.008 (0.004)
```

A figure with three box and whisker plots is created summarizing the spread of error scores for each configuration.

The plots show that there was little difference between the distributions of error scores for the unscaled and standardized input variables, and that the normalized input variables result in better performance and a tighter, more stable distribution of error scores.

These results highlight that it is important to actually experiment and confirm the results of data scaling methods rather than assuming that a given data preparation scheme will work best based on the observed distribution of the data.

## Extensions

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Normalize Target Variable**. Update the example and normalize instead of standardize the target variable and compare results.
- **Compare Scaling for Target Variable**. Update the example to compare standardizing and normalizing the target variable using repeated experiments and compare the results.
- **Other Scales**. Update the example to evaluate other min/max scales when normalizing and compare performance, e.g. [-1, 1] and [0.0, 0.5].
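As a possible starting point for the "Other Scales" idea, the sketch below normalizes the same inputs to several target ranges. It assumes scikit-learn's `MinMaxScaler` and its `feature_range` argument; the toy array and the variable names are made up for illustration and are not part of the tutorial's example.

```python
# sketch: normalize the same inputs to different target ranges
from numpy import array
from sklearn.preprocessing import MinMaxScaler

# toy training inputs with a single column (hypothetical values)
trainX = array([[10.0], [20.0], [30.0], [50.0]])

# the default range plus the two ranges suggested in the extension
ranges = [(0.0, 1.0), (-1.0, 1.0), (0.0, 0.5)]
scaled = dict()
for feature_range in ranges:
    # a fresh scaler per range, fit on the training data only
    scaler = MinMaxScaler(feature_range=feature_range)
    scaled[feature_range] = scaler.fit_transform(trainX)
    print(feature_range, scaled[feature_range].ravel())
```

Each configured scaler could then be passed to `repeated_evaluation()` in place of `MinMaxScaler()` to compare the error distributions.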

If you explore any of these extensions, I’d love to know.

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Posts

- How to Scale Data for Long Short-Term Memory Networks in Python
- How to Scale Machine Learning Data From Scratch With Python
- How to Normalize and Standardize Time Series Data in Python
- How to Prepare Your Data for Machine Learning in Python with Scikit-Learn

### Books

- Section 8.2 Input normalization and encoding, Neural Networks for Pattern Recognition, 1995.

### API

- sklearn.datasets.make_regression API
- sklearn.preprocessing.MinMaxScaler API
- sklearn.preprocessing.StandardScaler API

## Summary

In this tutorial, you discovered how to improve neural network stability and modeling performance by scaling data.

Specifically, you learned:

- Data scaling is a recommended pre-processing step when working with deep learning neural networks.
- Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
- How to apply standardization and normalization to improve the performance of a Multilayer Perceptron model on a regression predictive modeling problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Thank you for this helpful post for beginners!

Could you please provide more details about the steps of “using the root mean squared error on the unscaled data” to interpret the performance in a specific domain?

Would it be like this??

———————————————————–

1. Finalize the model (based on the performance being calculated from the scaled output variable)

2. Make predictions on test set

3. Invert the predictions (to convert them back into their original scale)

4. Calculate the metrics (e.g. RMSE, MAPE)

———————————————————–

Waiting for your reply! Cheers mate!

Correct.
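The four steps above might look like the following sketch. It assumes the target was standardized with a scikit-learn `StandardScaler`. The "predictions" are simulated by adding noise to the scaled test targets, standing in for a call like `model.predict(testX)`, and all data is synthetic.

```python
# sketch: report error in the original units of the target variable
from numpy import sqrt, mean
from numpy.random import default_rng
from sklearn.preprocessing import StandardScaler

rng = default_rng(1)
# synthetic targets in their original units (hypothetical data)
trainy = rng.normal(loc=100.0, scale=300.0, size=(500, 1))
testy = rng.normal(loc=100.0, scale=300.0, size=(500, 1))

# fit the scaler on the training targets only, then transform the test targets
scaler = StandardScaler().fit(trainy)
testy_scaled = scaler.transform(testy)

# stand-in for model.predict(testX): scaled targets plus some noise
yhat_scaled = testy_scaled + rng.normal(scale=0.1, size=testy_scaled.shape)

# invert the predictions back to the original scale
yhat = scaler.inverse_transform(yhat_scaled)

# calculate the metric on both scales for comparison
rmse_scaled = sqrt(mean((testy_scaled - yhat_scaled) ** 2))
rmse = sqrt(mean((testy - yhat) ** 2))
print('RMSE (scaled): %.3f, RMSE (original units): %.3f' % (rmse_scaled, rmse))
```

The RMSE in original units is the one to report when interpreting performance in the problem domain.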

Really nice article! I got Some quick questions,

If I have multiple input columns, each has different value range, might be [0, 1000] or even a one-hot-encoded data, should all be scaled with same method, or it can be processed differently?

For example:

– input A is normalized to [0, 1],

– input B is normalized to [-1, 1],

– input C is standardized,

– one-hot-encoded data is not scaled

Yes, typically it is a good idea to scale all columns to have the same range. Perhaps start with [0,1] and compare others to see if they result in an improvement.
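If you do want to compare per-column schemes like the ones in the question, one option is scikit-learn's `ColumnTransformer`. The sketch below is only an illustration: the column layout (A, B, C, then two one-hot columns), the transformer names, and the data are all assumptions.

```python
# sketch: apply a different scaling method to each group of columns
from numpy import array
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# columns: A, B, C, then a two-column one-hot encoding (hypothetical data)
trainX = array([
    [0.0, 5.0, 1.0, 1.0, 0.0],
    [500.0, 10.0, 2.0, 0.0, 1.0],
    [1000.0, 15.0, 3.0, 1.0, 0.0],
])

transformer = ColumnTransformer([
    ('a_01', MinMaxScaler(feature_range=(0, 1)), [0]),   # A normalized to [0, 1]
    ('b_11', MinMaxScaler(feature_range=(-1, 1)), [1]),  # B normalized to [-1, 1]
    ('c_std', StandardScaler(), [2]),                    # C standardized
    ('onehot', 'passthrough', [3, 4]),                   # one-hot columns left as-is
])

# fit on the training data; reuse transformer.transform() for test data
trainX_scaled = transformer.fit_transform(trainX)
print(trainX_scaled)
```

Swapping every entry to the same scaler gives the single-range baseline to compare against.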

We want standardized inputs and no scaling of outputs, but the output values are not in (0,1). Are the predictions inaccurate?

I don’t follow, which predictions are inaccurate?

Hi Jason,

Your experiment is very helpful for me to understand the difference between different methods, actually I have also done similar things. I always standardized the input data. I have compared the results between unstandardized and standardized targets. The plots show that with standardized targets, the network seems to work better. However, here I have a question: suppose the standard deviation of my target is 300, then I think the MSE will be strongly decreased after you fixed the standard deviation to 1. So shall we multiply the original std to the MSE in order to get the MSE in the original target value space?

You can invert the standardization by multiplying by the stdev and then adding the mean.

I also have an example here using sklearn:

https://machinelearningmastery.com/machine-learning-data-transforms-for-time-series-forecasting/
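On the MSE part of the question, the relationship can be worked out directly. The symbols below are introduced here for illustration: $\mu$ and $\sigma$ are the mean and standard deviation of the training targets, and a prime marks a standardized value.

$$
y' = \frac{y - \mu}{\sigma}, \qquad \hat{y}' = \frac{\hat{y} - \mu}{\sigma}
\quad\Rightarrow\quad
y - \hat{y} = \sigma \, (y' - \hat{y}')
$$

$$
\mathrm{MSE}_{\text{original}} = \sigma^{2} \, \mathrm{MSE}_{\text{scaled}}, \qquad
\mathrm{RMSE}_{\text{original}} = \sigma \, \mathrm{RMSE}_{\text{scaled}}
$$

So with $\sigma = 300$, a scaled MSE of 0.01 corresponds to $0.01 \times 300^{2} = 900$ in the original units, or an RMSE of 30. The mean cancels out of the error, which is why only the standard deviation appears.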

Hi Jason,

My data includes categorical and continuous data. Could I transform the categorical data with 1,2,3… into standardized data and put them into the neural network models to make classification? Or do I need to transform the categorical data with one-hot coding (0,1)? I have been confused about it. Thanks

Yes, perhaps try it and compare results?

Hi Jason, I have a specific Question regarding the normalization (min-max scaling) of the output value. Usually you are supposed to use normalization only on the training data set and then apply those stats to the validation and test set. Otherwise you would feed the model at training time certain information about the world it shouldn’t have access to. (The Elements of Statistical Learning: Data Mining, Inference, and Prediction p.247)

But for instance, my output value is a single percentage value ranging [0, 100%] and I am using the ReLU activation function in my output layer. I know for sure that in the “real world” regarding my problem statement, that I will get samples ranging from 60 – 100%. But my training sample size is too small and does not contain enough data points including all possible output values. So here comes my question: Should I stay with my initial statement (normalization only on training data set) or should I apply the maximum possible value of 100% to max()-value of the normalization step? The latter would contradict the literature. Best Regards Bart

Correct.

I would recommend a sigmoid activation in the output.

I would then recommend interpreting the 0-1 scale as 60-100 prior to model evaluation.

Does that help?

I’m not quite sure what you mean by your second recommendation. How would I achieve that?

You can project the scale of 0-1 to anything you want, such as 60-100.

First rescale to a number between 0 and 40 (value * 40) then add the min value (+ 60)

result = value * 40 + 60
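The projection above can be written as a small pair of helper functions. This is a minimal sketch: the function names are hypothetical and the 60 to 100 limits are taken from the question.

```python
# sketch: project a value in [0, 1] onto a known range and back again

def rescale(value, new_min=60.0, new_max=100.0):
    # stretch to the width of the range, then shift by the minimum
    return value * (new_max - new_min) + new_min

def normalize(value, known_min=60.0, known_max=100.0):
    # inverse operation: map a value in the known range back to [0, 1]
    return (value - known_min) / (known_max - known_min)

# a sigmoid output of 0.5 corresponds to 80 percent
print(rescale(0.5))
# and 80 percent maps back to 0.5 for use as a training target
print(normalize(80.0))
```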

Dear Jason, thank you for the great article.

I am wondering if there is any advantage using StandardScaler or MinMaxScaler over scaling manually. I could calculate the mean, std or min, max of my training data and apply them with the corresponding formula for standard or minmax scaling.

Would this approach produce the same results as the StandardScaler or MinMaxScaler or are the sklearn scalers special?

Yes, it is reliable bug free code all wrapped up in a single class – making it harder to introduce new bugs.

Same results as manual, if you coded the manual scaling correctly.
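A quick way to check this yourself is sketched below on synthetic data. One detail to be aware of: `StandardScaler` uses the population standard deviation, which matches NumPy's default `std()` with `ddof=0`. The variable names are illustrative only.

```python
# sketch: manual scaling vs. sklearn scalers on the same data
from numpy import allclose
from numpy.random import default_rng
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = default_rng(1)
# synthetic train and test inputs (hypothetical data)
trainX = rng.normal(50.0, 10.0, size=(100, 3))
testX = rng.normal(50.0, 10.0, size=(20, 3))

# manual standardization using training statistics
mu = trainX.mean(axis=0)
sigma = trainX.std(axis=0)  # ddof=0, the same convention StandardScaler uses
manual_std = (testX - mu) / sigma

# manual normalization using training min and max
lo, hi = trainX.min(axis=0), trainX.max(axis=0)
manual_norm = (testX - lo) / (hi - lo)

# sklearn equivalents, fit on the training data only
sk_std = StandardScaler().fit(trainX).transform(testX)
sk_norm = MinMaxScaler().fit(trainX).transform(testX)

same_std = allclose(manual_std, sk_std)
same_norm = allclose(manual_norm, sk_norm)
print('standardization matches: %s, normalization matches: %s' % (same_std, same_norm))
```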

Dear Jason,

I have a few questions from section “Data normalization”. You mention that we should estimate the max and min values, and use that to normalize the training set to e.g. [-1,1]. But what if the max and min values are in the validation or test set? Then I might get values e.g. [-1.2, 1.3] in the validation set. Do you consider this to be incorrect or not?

Another approach is then to make sure that the min and max values for all parameters are contained in the training set. What are your thoughts on this? Is this the way to do it? Or should we use the max and min values for all data combined (training, validation and test sets) when normalizing the training set?

For the moment I use the MinMaxScaler and fit_transform on the training set and then apply that scaler on the validation and test set using transform. But I realise that some of my max values are in the validation set. I suppose this is also related to network saturation.

Perhaps estimate the min/max using domain knowledge. If new data exceeded the limits, snap to known limits, or not – test and see how the model is impacted.

Regardless, the training set must be representative of the problem.
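A minimal sketch of the "snap to known limits" idea follows. The limits of 0 and 200 are hypothetical stand-ins for values you would choose from domain knowledge, and the helper function is written for illustration rather than taken from a library.

```python
# sketch: normalize with domain-knowledge limits, optionally clipping new data
from numpy import array, clip

# limits chosen from domain knowledge, not estimated from the data (hypothetical)
known_min, known_max = 0.0, 200.0

def normalize(x, lo, hi, snap=True):
    # optionally snap out-of-range values to the known limits first
    if snap:
        x = clip(x, lo, hi)
    return (x - lo) / (hi - lo)

# new data with one value below and one value above the limits
new_data = array([-10.0, 50.0, 250.0])
snapped = normalize(new_data, known_min, known_max, snap=True)
unsnapped = normalize(new_data, known_min, known_max, snap=False)
print(snapped)
print(unsnapped)
```

Evaluating the model with both variants is one way to test how much out-of-range inputs matter for your problem.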

Hello Jason, I am a huge fan of your work! Thank you so much for your insightful tutorials. You are a life saver! I have a small question if i may:

I am trying to fit spectrograms in a cnn in order to do some classification tasks. Unfortunately each spectrogram is around (3000,300) array. Is there a way to reduce the dimensionality without losing so much information?

Ouch, perhaps start with simple downsampling and see what effect that has?
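One simple form of downsampling is block averaging, sketched below with NumPy. The factor of 4 in each dimension is an arbitrary choice for illustration, the helper name is hypothetical, and the random array stands in for a real spectrogram.

```python
# sketch: downsample a 2d array by averaging non-overlapping blocks
from numpy.random import default_rng

def downsample(spec, row_factor, col_factor):
    rows, cols = spec.shape
    # trim any remainder so the shape divides evenly into blocks
    rows -= rows % row_factor
    cols -= cols % col_factor
    trimmed = spec[:rows, :cols]
    # group into blocks, then average within each block
    blocks = trimmed.reshape(rows // row_factor, row_factor, cols // col_factor, col_factor)
    return blocks.mean(axis=(1, 3))

# random stand-in for a (3000, 300) spectrogram
spec = default_rng(1).random((3000, 300))
small = downsample(spec, 4, 4)
print(spec.shape, '->', small.shape)
```

Comparing model skill at a few different factors would show how much resolution the classifier actually needs.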

Hi Jason,

It was always good and informative to go through your blogs and your interaction with comments by different people all across the globe.

I have question regarding the scaling techniques.

As you explained about scaling :

Case1:

# created scaler

scaler = StandardScaler()

# fit scaler on training dataset

scaler.fit(trainy)

# transform training dataset

trainy = scaler.transform(trainy)

# transform test dataset

testy = scaler.transform(testy)

in this case mean and standard deviation for all train and test remain same.

What i approached is:

case2

# created scaler

scaler_train = StandardScaler()

# fit scaler on training dataset

scaler_train.fit(trainy)

# transform training dataset

trainy = scaler_train.transform(trainy)

# created scaler

scaler_test = StandardScaler()

# fit scaler on training dataset

scaler_test.fit(trainy)

# transform test dataset

testy = scaler_test.transform(testy)

Here the mean and standard deviation of the train data and test data are different, so the model may find the test data completely unknown and new. In the first case, where the same mean and standard deviation are used on train and test data, that may lead to providing known test data to the model (known in terms of the same mean and standard deviation treatment).

Jason,can you guide me if my logics is good to go with case2 or shall i consider case1 .

or if logic is wrong you can also say that and explain.

(Also i applied Same for min-max scaling i.e normalization, if i choose this then)

Again thanks Jason for such a nice work !

Happy Learning !!

I recommend fitting the scaler on the training dataset once, then apply it to transform the training dataset and test set.

If you fit the scaler using the test dataset, you will have data leakage and possibly an invalid estimate of model performance.
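The difference between the two cases can be seen on a tiny example. The sketch below uses made-up targets where the test values are much larger than the training values, and it assumes case 2 means fitting a second scaler on the test data.

```python
# sketch: one scaler fit on train (case 1) vs. a second scaler fit on test (case 2)
from numpy import array, allclose
from sklearn.preprocessing import StandardScaler

# hypothetical targets: the test values are shifted well above the training values
trainy = array([[10.0], [20.0], [30.0]])
testy = array([[110.0], [120.0], [130.0]])

# case 1: fit once on the training data, apply to both
scaler = StandardScaler().fit(trainy)
train_scaled = scaler.transform(trainy)
test_case1 = scaler.transform(testy)

# case 2: a separate scaler fit on the test data (not recommended)
test_case2 = StandardScaler().fit(testy).transform(testy)

print(train_scaled.ravel())
print(test_case1.ravel())
print(test_case2.ravel())
```

In case 2 the raw values 10 and 110 end up with the same scaled value, so the scaled numbers no longer mean the same thing to the model. In case 1 the shift in the test data stays visible.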

Hi Jason,

I’m working on sequence2sequence problem. Input’s max and min points are around 500-300, however output’s are 200-0. If I want to normalize them, should I use different scalers? For example:

scx = MinMaxScaler(feature_range = (0, 1))

scy = MinMaxScaler(feature_range = (0, 1))

trainx = scx.fit_transform(trainx)

trainy = scy.fit_transform(trainy)

or should I scale them with same scale like below?

sc = MinMaxScaler(feature_range = (0, 1))

trainx = sc.fit_transform(trainx)

trainy = sc.fit_transform(trainy)

Yes, use a separate transform for inputs and outputs is a good idea. Otherwise have them all as separate columns in the same matrix and use one scaler, but the column order for transform/inverse_transform will always have to be consistent.
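A short sketch of the separate-scaler approach follows, using toy values in the ranges mentioned in the question. The scaled "predictions" are made up and stand in for model output. Note that calling `fit_transform()` twice on one scaler object refits it, so only the statistics from the last call remain available for `inverse_transform()`.

```python
# sketch: separate scalers for inputs and outputs, then invert predictions
from numpy import array
from sklearn.preprocessing import MinMaxScaler

# toy data in the ranges from the question (hypothetical values)
trainx = array([[300.0], [400.0], [500.0]])
trainy = array([[0.0], [100.0], [200.0]])

# one scaler per variable, each fit on its own training data
scx = MinMaxScaler(feature_range=(0, 1))
scy = MinMaxScaler(feature_range=(0, 1))
trainx_scaled = scx.fit_transform(trainx)
trainy_scaled = scy.fit_transform(trainy)

# stand-in for model predictions in the scaled output space
yhat_scaled = array([[0.25], [0.75]])

# use the output scaler to map predictions back to original units
yhat = scy.inverse_transform(yhat_scaled)
print(yhat.ravel())
```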

Hi Jason,

Confused about one aspect, I have a small NN with 8 independent variables and one dichotomous dependent variable. I have standardized the input variables (the output variable was left untouched). I have both trained and created the final model with the same standardized data. However, the question is, if I want to create a user interface to receive manual inputs, those will no longer be in the standardized format, so what is the best way to proceed?

You must maintain the objects used to prepare the data, or the coefficients used by those objects (mean and stdev), so that you can prepare new data in exactly the same way the data was prepared during training.

Does that help?
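One way to maintain the scaler is to serialize it alongside the model. The sketch below uses Python's `pickle` module and keeps everything in memory so it runs as-is; in practice you would write the bytes to a file next to the saved model. The data and the manual input are hypothetical.

```python
# sketch: save a fitted scaler and reuse it on new manual inputs
import pickle
from numpy import array
from sklearn.preprocessing import StandardScaler

# fit the scaler on the training inputs (hypothetical data)
trainX = array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
scaler = StandardScaler().fit(trainX)

# serialize the fitted scaler; this could be written to disk instead
blob = pickle.dumps(scaler)

# later, e.g. inside the user interface code: restore and apply
loaded = pickle.loads(blob)
manual_input = array([[2.0, 400.0]])
prepared = loaded.transform(manual_input)
print(prepared)
```

The `prepared` array is what would be passed to the model's predict function.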

Thank you, that makes perfect sense.

Hi Jason,

I have built an ANN model and scaled my inputs and outputs before feeding to the network. I measure the performance of the model by r2_score. My output variable is height. My r2_score when the output variable is in metres is .98, but when my output variable is in centimetres, my r2_score is .91. I have scaled my output too before feeding to the network, so why is there a difference in r2_score even though the output variable is scaled before feeding to the network?

Thanks in advance

Good question, this is why it is important to test different scaling approaches in order to discover what works best for a given dataset and model combination.

Hi Jason,

I am working on sequence to data prediction problem wherein i am performing normalization on input and output both.

Once model is trained then to get the actual output in real-time, I have to perform the de-normalization and when I will perform the denorm then error will increase by the same factor I have used for normalization.

Lets consider, norm predicted output is 0.1 and error of the model is 0.01 .

denorm predicted output become 0.1*100 = 10 and after de-normalizing the error will be 0.01*100= 1

So, what would be the solution to eliminate this kind of problem in regression?

Thanks

What problem exactly?

The problem is that after de-normalization of the output, the error difference between actual and predicted output is scaled up by the normalization factor (max-min). So, I want to know what can be done to make the error difference the same for both de-normalized and normalized output.

Thanks

I don’t understand, sorry.

Hi Jason,

Do I have to use only one normalization formula for all inputs?

For example: I have 5 inputs [inp1, inp2, inp3, inp4, inp5] where I can estimate max and min only for [inp1, inp2]. So can I use

y = (x – min) / (max – min)

for [inp1, inp2] and

y = x/(1+x)

for [inp3, inp4, inp5]?

Yes, it is applied to each input separately – assuming they have different units.
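The mixed scheme from the question could be sketched as below with NumPy. The bounds for the first two inputs and the data are hypothetical, and the `x / (1 + x)` formula is assumed to be applied to non-negative values, which it maps into the range [0, 1).

```python
# sketch: min-max scaling for some columns, x / (1 + x) for the others
from numpy import array

def minmax(x, lo, hi):
    # normalization with known minimum and maximum values
    return (x - lo) / (hi - lo)

def squash(x):
    # maps non-negative values into [0, 1) without needing a known maximum
    return x / (1.0 + x)

# columns: inp1..inp5 (hypothetical data)
X = array([
    [0.0, 10.0, 0.0, 1.0, 9.0],
    [5.0, 20.0, 3.0, 4.0, 99.0],
])

# known bounds for inp1 and inp2 only (hypothetical)
lo = array([0.0, 10.0])
hi = array([10.0, 30.0])

X_scaled = X.copy()
X_scaled[:, :2] = minmax(X[:, :2], lo, hi)
X_scaled[:, 2:] = squash(X[:, 2:])
print(X_scaled)
```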