Using Normalization Layers to Improve Deep Learning Models

You’ve probably been told to standardize or normalize the inputs to your model to improve performance. But what exactly is normalization, and how can we implement it easily in our deep learning models? Normalizing our inputs aims to put the different features on a similar scale, which we’ll explore more in this article.

In a neural network, the output of each layer serves as the input to the next layer, so a natural question to ask is: if normalizing the inputs to the model helps improve performance, does standardizing the inputs to each layer help too?

The answer most of the time is yes! However, unlike normalizing the inputs to the model as a whole, normalizing the inputs to intermediate layers is slightly more complicated because the activations are constantly changing. As such, it is infeasible, or at least computationally expensive, to recompute statistics over the entire training set at every step. In this article, we’ll explore normalization layers for the inputs to your model, as well as batch normalization, a technique that standardizes the inputs to each layer across batches.

Let’s get started!


Overview

This tutorial is split into 6 parts; they are:

  • What is normalization and why is it helpful?
  • Using Normalization layer in TensorFlow
  • What is batch normalization and why should we use it?
  • Batch normalization: Under the hood
  • Implementing batch normalization in TensorFlow
  • Normalization and Batch Normalization in Action

What is Normalization and Why is It Helpful?

Normalizing a set of data transforms the data so that it lies on a similar scale. For machine learning models, our goal is usually to recenter and rescale our data so that values lie roughly between 0 and 1 or -1 and 1, depending on the data itself. One common way to accomplish this is to calculate the mean $$\mu$$ and the standard deviation $$\sigma$$ of the data and transform each sample $$x$$ by subtracting the mean and dividing by the standard deviation, $$x' = (x - \mu)/\sigma$$. If we assume that the data follows a normal distribution, this standardizes the data and gives us approximately a standard normal distribution.

Normalization can help the training of our neural networks because the different features are on a similar scale, which helps to stabilize the gradient descent steps, allowing us to use larger learning rates or helping models converge faster for a given learning rate.

Using Normalization Layer in TensorFlow

To normalize inputs in TensorFlow, we can use the Normalization layer in Keras. First, let’s define some sample data.
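The exact values don’t matter much; as an illustration, a small one-dimensional array like the one below works (its mean is 2.0 and its standard deviation is about 0.8165, which we will use to check the result later):

```python
import numpy as np
import tensorflow as tf

# a small illustrative dataset of three values
data = np.array([1., 2., 3.])
```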

Then we initialize our Normalization layer.
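```python
# a Normalization layer with the default axis=-1 (the feature dimension)
norm_layer = tf.keras.layers.Normalization()
```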

And then, to get the mean and standard deviation of the dataset and set our Normalization layer to use those parameters, we can call the Normalization.adapt() method on our data.
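```python
# give the 1-D data an explicit feature dimension of size 1, then let
# adapt() compute and store the dataset statistics inside the layer
data_2d = np.expand_dims(data, axis=-1)
norm_layer.adapt(data_2d)
```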

For this case, we used expand_dims to add an extra dimension because the Normalization layer normalizes along the last dimension by default: each index in the last dimension gets its own mean and variance computed on the training set, since that dimension is assumed to be the feature dimension. For RGB images, for example, the features are usually just the different color channels.

And then, to normalize our data, we can call the Normalization layer on that data, as such:
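```python
# apply the adapted layer to the data
normalized_data = norm_layer(data_2d)
print(normalized_data)
```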

which, with the sample data assumed above, gives roughly the following output (values shown to four decimal places):
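```
[[-1.2247]
 [ 0.    ]
 [ 1.2247]]
```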

And we can verify that this is the expected behavior by running np.mean and np.std on our original data, which give us a mean of 2.0 and a standard deviation of 0.8165. For the input value of $$1$$, we have $$(1-2)/0.8165 = -1.2247$$.

Now that we’ve seen how to normalize our inputs, let’s take a look at another normalization method, batch normalization.

What is Batch Normalization and Why Should We Use It?

Figure source: https://arxiv.org/pdf/1803.08494.pdf (the Group Normalization paper)

From the name, you can probably guess that batch normalization must have something to do with batches during training. Simply put, batch normalization standardizes the input of a layer across a single batch.

You might be thinking: why can’t we just calculate the mean and variance at a given layer and normalize with those? The problem is that the parameters change as we train, so the activations in the intermediate layers are constantly changing, and recomputing the mean and variance across the entire training set at every iteration would be time consuming and potentially pointless, since the activations will change again at the next iteration anyway. That’s where batch normalization comes in.

Introduced in “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” by Ioffe and Szegedy, batch normalization looks at standardizing the inputs to a layer in order to reduce the problem of internal covariate shift. In the paper, internal covariate shift is defined as the problem of “the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change.”

The idea that batch normalization fixes the problem of internal covariate shift has been disputed, notably in “How Does Batch Normalization Help Optimization?” by Santurkar, et al., where it was proposed that batch normalization instead helps to smooth the loss landscape over the parameter space. While it might not always be clear exactly how batch normalization helps, it has achieved good empirical results on many different problems and models.

There is also some evidence that batch normalization can contribute significantly to addressing the vanishing gradient problem common with deep learning models. In the original ResNet paper, He, et al. mention in their analysis of ResNet vs plain networks that “backward propagated gradients exhibit healthy norms with BN (batch normalization)” even in plain networks.

It has also been suggested that batch normalization brings other benefits, such as allowing us to use higher learning rates, as batch normalization can help to stabilize parameter growth. It can also help to regularize the model. From the original batch normalization paper:

“When training with Batch Normalization, a training example is seen in conjunction with other examples in the mini-batch, and the training network no longer produces deterministic values for a given training example. In our experiments, we found this effect to be advantageous to the generalization of the network.”

Batch Normalization: Under the Hood

So, what does batch normalization actually do?

First, we need to calculate the batch statistics, in particular the mean and variance of each activation across a batch. Since each layer’s output serves as the input to the next layer in a neural network, by standardizing the output of a layer we are also standardizing the inputs to the next layer in our model. (In practice, the original paper suggested applying batch normalization before the activation function, though there is some debate over whether that placement is best.)

So, for a mini-batch of activations $$x_1, \ldots, x_m$$, we calculate the sample mean and variance on the batch:

$$\hat\mu = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad s^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \hat\mu\right)^2$$

Then, for each of the activation maps, we normalize each value using the respective statistics:

$$\hat{x}_i = \frac{x_i - \hat\mu}{\sqrt{s^2 + \epsilon}}$$

where $$\epsilon$$ is a small constant added for numerical stability.

For Convolutional Neural Networks (CNNs) in particular, we calculate these statistics over all locations of the same channel. Hence, there will be one $$\hat\mu$$ and one $$s^2$$ per channel, which are applied to all pixels of that channel in every sample in the batch. From the original batch normalization paper,

“For convolutional layers, we additionally want the normalization to obey the convolutional property – so that different elements of the same feature map, at different locations, are normalized in the same way”

Now that we’ve seen how to calculate the normalized activation maps, let’s explore how this can be implemented using Numpy arrays.

Suppose we have a batch of activation maps, all of them representing a single channel.
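As a minimal illustration (the specific numbers are arbitrary), take two 2×2 maps, one per sample in a batch of two:

```python
import numpy as np

# two single-channel activation maps, one for each sample in the batch
actmap1 = np.array([[1., 2.],
                    [3., 4.]])
actmap2 = np.array([[5., 6.],
                    [7., 8.]])
```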

Then, we want to standardize each element in the activation map across all locations and across the different samples. To standardize, we compute their mean and standard deviation using
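```python
# batch statistics computed over both samples and all spatial locations
mu = np.mean([actmap1, actmap2])
sigma = np.std([actmap1, actmap2])
print(mu, round(sigma, 4))
```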

which outputs
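```
4.5 2.2913
```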

Then, we can standardize an activation map by doing
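```python
# standardize both maps with the shared batch statistics
out1 = (actmap1 - mu) / sigma
out2 = (actmap2 - mu) / sigma
print(out1)
print(out2)
```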

which gives the outputs
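```
[[-1.52752523 -1.09108945]
 [-0.65465367 -0.21821789]]
[[0.21821789 0.65465367]
 [1.09108945 1.52752523]]
```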

But we hit a snag at inference time: we may not have a batch of examples, and even if we did, we would still prefer the output to be computed deterministically from the input. So, we need a fixed set of parameters to use at inference time. For this purpose, batch normalization keeps a moving average of the batch means and variances during training, and uses these stored statistics at inference time to compute the outputs of the layers.
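Conceptually, and continuing the NumPy example above, the bookkeeping looks something like the sketch below; the momentum value and initialization are illustrative, and the exact details differ between frameworks:

```python
momentum = 0.99  # smoothing factor for the running statistics
eps = 1e-3       # small constant for numerical stability

# typical initialization of the running statistics
moving_mean, moving_var = 0.0, 1.0

# during training, after computing the batch statistics mu and sigma:
moving_mean = momentum * moving_mean + (1 - momentum) * mu
moving_var = momentum * moving_var + (1 - momentum) * sigma ** 2

# at inference time, normalize with the stored statistics instead
out1_inference = (actmap1 - moving_mean) / np.sqrt(moving_var + eps)
```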

Another problem is that simply standardizing the inputs to a layer in this way changes the representational ability of the layer. One example brought up in the batch normalization paper is the sigmoid nonlinearity, where normalizing the inputs would constrain them to the roughly linear regime of the sigmoid function. To address this, another linear transformation is added to scale and recenter the values, with two trainable parameters, a scale $$\gamma$$ and an offset $$\beta$$, that learn the appropriate scale and center:

$$y_i = \gamma \hat{x}_i + \beta$$

Implementing Batch Normalization in TensorFlow

Now that we understand what goes on with batch normalization under the hood, let’s see how we can use Keras’ batch normalization layer as part of our deep learning models.

To implement batch normalization as part of our deep learning models in TensorFlow, we can use the keras.layers.BatchNormalization layer. Using the Numpy arrays from our previous example, we can apply BatchNormalization to them.
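A minimal sketch, stacking the two illustrative maps from above into a (batch, height, width, channels) tensor; passing training=True makes the layer use the statistics of this batch rather than its (still untrained) moving averages:

```python
import numpy as np
import tensorflow as tf

actmap1 = np.array([[1., 2.],
                    [3., 4.]])
actmap2 = np.array([[5., 6.],
                    [7., 8.]])

# stack into shape (batch, height, width, channels) = (2, 2, 2, 1)
data = np.stack([actmap1, actmap2])[..., np.newaxis]

bn = tf.keras.layers.BatchNormalization()
out = bn(data, training=True)
print(out.numpy()[..., 0])
```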

which, for the arrays above, gives values approximately like the following (shown here to four decimal places):
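```
[[-1.5274 -1.0910]
 [-0.6546 -0.2182]]

[[ 0.2182  0.6546]
 [ 1.0910  1.5274]]
```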

By default, the BatchNormalization layer uses a scale ($$\gamma$$) of 1 and a center ($$\beta$$) of 0 for its linear transformation, so these values are close to the ones we computed earlier using Numpy functions; the small difference comes from the $$\epsilon$$ (0.001 by default) that the layer adds to the variance for numerical stability.

Normalization and Batch Normalization in Action

Now that we’ve seen how to implement the normalization and batch normalization layers in TensorFlow, let’s explore a LeNet-5 model that uses these layers and compare it to a model that uses neither.

First, let’s get our dataset; we’ll use CIFAR-10 for this example.
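A sketch of loading it through Keras:

```python
import tensorflow as tf
from tensorflow import keras

# CIFAR-10: 50,000 training and 10,000 test images of shape (32, 32, 3)
(trainX, trainY), (testX, testY) = keras.datasets.cifar10.load_data()
print(trainX.shape, testX.shape)
```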

Using a LeNet-5 model with ReLU activation,
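One possible LeNet-5-style architecture adapted to the 32×32×3 CIFAR-10 inputs; the filter counts and layer widths follow the classic LeNet-5, but details such as the pooling type are a judgment call:

```python
from tensorflow.keras.layers import Conv2D, Dense, Flatten, Input, MaxPool2D
from tensorflow.keras.models import Sequential

model = Sequential([
    Input(shape=(32, 32, 3)),
    Conv2D(6, kernel_size=5, activation="relu"),
    MaxPool2D(pool_size=2),
    Conv2D(16, kernel_size=5, activation="relu"),
    MaxPool2D(pool_size=2),
    Flatten(),
    Dense(120, activation="relu"),
    Dense(84, activation="relu"),
    Dense(10, activation="softmax"),
])
```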

Then, we compile and train the model.
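A possible training setup; the optimizer, loss, batch size, and epoch count below are illustrative choices rather than prescribed settings:

```python
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(trainX, trainY,
                    validation_data=(testX, testY),
                    batch_size=256,
                    epochs=10)
```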

Next, let’s take a look at what happens if we add normalization and batch normalization layers: a Normalization layer for the inputs, adapted on the training set, plus BatchNormalization layers inside the network. Amending our LeNet-5 model,
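One way this could look; here BatchNormalization is placed after each convolutional and dense activation, though exactly where to place it is a design choice (note that tf.keras.layers.Normalization requires TensorFlow 2.6 or newer):

```python
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Flatten, Input, MaxPool2D, Normalization)
from tensorflow.keras.models import Sequential

# adapt the input Normalization layer on the training images
norm_layer = Normalization()
norm_layer.adapt(trainX)

model_norm = Sequential([
    Input(shape=(32, 32, 3)),
    norm_layer,
    Conv2D(6, kernel_size=5, activation="relu"),
    BatchNormalization(),
    MaxPool2D(pool_size=2),
    Conv2D(16, kernel_size=5, activation="relu"),
    BatchNormalization(),
    MaxPool2D(pool_size=2),
    Flatten(),
    Dense(120, activation="relu"),
    BatchNormalization(),
    Dense(84, activation="relu"),
    BatchNormalization(),
    Dense(10, activation="softmax"),
])
```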

And running the training again, this time with the normalization layer added.
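Using the same (illustrative) training setup as before:

```python
model_norm.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])

history_norm = model_norm.fit(trainX, trainY,
                              validation_data=(testX, testY),
                              batch_size=256,
                              epochs=10)
```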

And we see that the model converges faster and gets a higher validation accuracy.

Plotting the train and validation accuracies of both models,
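A sketch of plotting the accuracy curves from the two history objects above:

```python
import matplotlib.pyplot as plt

def plot_history(history, title):
    plt.figure()
    plt.plot(history.history["accuracy"], label="train")
    plt.plot(history.history["val_accuracy"], label="validation")
    plt.title(title)
    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.legend()
    plt.show()

plot_history(history, "Train and validation accuracy of LeNet-5")
plot_history(history_norm,
             "Train and validation accuracy of LeNet-5 with normalization "
             "and batch normalization added")
```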

Train and validation accuracy of LeNet-5

Train and validation accuracy of LeNet-5 with normalization and batch normalization added

A few words of caution when using batch normalization: it is generally not advised to use batch normalization together with dropout, as batch normalization already has a regularizing effect. Also, very small batch sizes can be an issue, because the quality of the statistics (mean and variance) depends on the batch size; in the extreme case of a batch size of one in a simple fully connected network, every activation would be normalized to 0. Consider using layer normalization (more resources in the further reading section below) if you need to train with small batch sizes.

Here’s the complete code for the model with normalization too.
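Putting the pieces above together, one possible end-to-end script looks like this (still a sketch, with illustrative hyperparameters):

```python
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense,
                                     Flatten, Input, MaxPool2D, Normalization)
from tensorflow.keras.models import Sequential

# load CIFAR-10
(trainX, trainY), (testX, testY) = keras.datasets.cifar10.load_data()

# input normalization layer, adapted on the training images
norm_layer = Normalization()
norm_layer.adapt(trainX)

# LeNet-5-style model with normalization and batch normalization
model = Sequential([
    Input(shape=(32, 32, 3)),
    norm_layer,
    Conv2D(6, kernel_size=5, activation="relu"),
    BatchNormalization(),
    MaxPool2D(pool_size=2),
    Conv2D(16, kernel_size=5, activation="relu"),
    BatchNormalization(),
    MaxPool2D(pool_size=2),
    Flatten(),
    Dense(120, activation="relu"),
    BatchNormalization(),
    Dense(84, activation="relu"),
    BatchNormalization(),
    Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(trainX, trainY,
                    validation_data=(testX, testY),
                    batch_size=256,
                    epochs=10)

# plot the train and validation accuracy curves
plt.plot(history.history["accuracy"], label="train")
plt.plot(history.history["val_accuracy"], label="validation")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```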

Further Reading

Papers:
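  • “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” Sergey Ioffe and Christian Szegedy, 2015. https://arxiv.org/abs/1502.03167
  • “How Does Batch Normalization Help Optimization?,” Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry, 2018. https://arxiv.org/abs/1805.11604
  • “Deep Residual Learning for Image Recognition,” Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, 2015. https://arxiv.org/abs/1512.03385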

Here are some of the different types of normalization that you can implement in your model:
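  • “Layer Normalization,” Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, 2016. https://arxiv.org/abs/1607.06450
  • “Group Normalization,” Yuxin Wu and Kaiming He, 2018. https://arxiv.org/abs/1803.08494
  • “Instance Normalization: The Missing Ingredient for Fast Stylization,” Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky, 2016. https://arxiv.org/abs/1607.08022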

TensorFlow layers:
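  • tf.keras.layers.Normalization: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization
  • tf.keras.layers.BatchNormalization: https://www.tensorflow.org/api_docs/python/tf/keras/layers/BatchNormalization
  • tf.keras.layers.LayerNormalization: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization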

Conclusion

In this post, you’ve discovered how normalization and batch normalization work, as well as how to implement them in TensorFlow. You have also seen how using these layers can help to significantly improve the performance of your machine learning models.

Specifically, you’ve learned:

  • What normalization and batch normalization do
  • How to use normalization and batch normalization in TensorFlow
  • Some tips when using batch normalization in your machine learning model

 

 

