The post 8 Tricks for Configuring Backpropagation to Train Better Neural Networks appeared first on Machine Learning Mastery.

Neural network models are trained using stochastic gradient descent and model weights are updated using the backpropagation algorithm.

The optimization problem solved when training a neural network model is very challenging. Although these algorithms are widely used because they perform so well in practice, there is no guarantee that they will converge to a good model in a timely manner.

The challenge of training neural networks really comes down to the challenge of configuring the training algorithms.

In this post, you will discover tips and tricks for getting the most out of the backpropagation algorithm when training neural network models.

After reading this post, you will know:

- The challenge of training a neural network is really the balance between learning the training dataset and generalizing to new examples beyond the training dataset.
- Eight specific tricks that you can use to train better neural network models, faster.
- Second order optimization algorithms that can also be used to train neural networks under certain circumstances.

Let’s get started.

This tutorial is divided into five parts; they are:

- Efficient BackProp Overview
- Learning and Generalization
- 8 Practical Tricks for Backpropagation
- Second Order Optimization Algorithms
- Discussion and Conclusion

The 1998 book titled “Neural Networks: Tricks of the Trade” provides a collection of chapters by academics and neural network practitioners that describe best practices for configuring and using neural network models.

The book was updated at the cusp of the deep learning renaissance and a second edition was released in 2012 including 13 new chapters.

The first chapter in both editions is titled “*Efficient BackProp*” and was written by Yann LeCun, Leon Bottou (both at Facebook AI), Genevieve Orr, and Klaus-Robert Muller (the latter two are also co-editors of the book).

The chapter is also available online for free as a pre-print.

- Efficient BackProp, Preprint, 1998.

The chapter was also summarized in a preface in both editions of the book titled “*Speed Learning*.”

It is an important chapter and document as it provides a near-exhaustive summary of how to best configure backpropagation under stochastic gradient descent as of 1998, and much of the advice is just as relevant today.

In this post, we will focus on this chapter or paper and attempt to distill the most relevant advice for modern deep learning practitioners.

For reference, the chapter is divided into 10 sections; they are:

- 1.1: Introduction
- 1.2: Learning and Generalization
- 1.3: Standard Backpropagation
- 1.4: A Few Practical Tricks
- 1.5: Convergence of Gradient Descent
- 1.6: Classical Second Order Optimization Methods
- 1.7: Tricks to Compute the Hessian Information in Multilayer Networks
- 1.8: Analysis of the Hessian in Multi-layer Networks
- 1.9: Applying Second Order Methods to Multilayer Networks
- 1.10: Discussion and Conclusion

We will focus on the tips and tricks for configuring backpropagation and stochastic gradient descent.

The chapter begins with a description of the general problem of the dual challenge of learning and generalization with neural network models.

The authors motivate the article by highlighting that the backpropagation algorithm is the most widely used algorithm to train neural network models because it works and because it is efficient.

Backpropagation is a very popular neural network learning algorithm because it is conceptually simple, computationally efficient, and because it often works. However, getting it to work well, and sometimes to work at all, can seem more of an art than a science.

The authors also remind us that training neural networks with backpropagation is really hard. Although the algorithm is both effective and efficient, it requires the careful configuration of multiple model properties and model hyperparameters, each of which requires deep knowledge of the algorithm and experience to set correctly.

And yet, there are no rules to follow to “*best*” configure a model and training process.

Designing and training a network using backprop requires making many seemingly arbitrary choices such as the number and types of nodes, layers, learning rates, training and test sets, and so forth. These choices can be critical, yet there is no foolproof recipe for deciding them because they are largely problem and data dependent.

The goal of training a neural network model is most challenging because it requires solving two hard problems at once:

- **Learning** the training dataset in order to best minimize the loss.
- **Generalizing** the model performance in order to make predictions on unseen examples.

There is a trade-off between these concerns, as a model that learns too well will generalize poorly, and a model that generalizes well may be underfit. The goal of training a neural network well is to find a happy balance between these two concerns.

This chapter is focused on strategies for improving the process of minimizing the cost function. However, these strategies must be used in conjunction with methods for maximizing the network’s ability to generalize, that is, to predict the correct targets for patterns the learning system has not previously seen.

Interestingly, the problem of training a neural network model is cast in terms of the bias-variance trade-off, often used to describe machine learning algorithms in general.

When fitting a neural network model, these terms can be defined as:

- **Bias**: A measure of how the network output averaged across all datasets differs from the desired function.
- **Variance**: A measure of how much the network output varies across datasets.

This framing casts defining the capacity of the model as a choice of bias, controlling the range of functions that can be learned. It casts variance as a function of the training process and the balance struck between overfitting the training dataset and generalization error.

This framing can also help in understanding the dynamics of model performance during training. That is, the model moves from large bias and small variance at the beginning of training to lower bias and higher variance at the end of training.

Early in training, the bias is large because the network output is far from the desired function. The variance is very small because the data has had little influence yet. Late in training, the bias is small because the network has learned the underlying function.

These are the normal dynamics of the model, although when training, we must guard against training the model too much and overfitting the training dataset. This makes the model fragile, pushing the bias down, specializing the model to training examples and, in turn, causing much larger variance.

However, if trained too long, the network will also have learned the noise specific to that dataset. This is referred to as overtraining. In such a case, the variance will be large because the noise varies between datasets.

A focus on the backpropagation algorithm means a focus on “*learning*” while temporarily ignoring “*generalization*”, which can be addressed later with the introduction of regularization techniques.

A focus on learning means a focus on minimizing loss both quickly (fast learning) and effectively (learning well).

The idea of this chapter, therefore, is to present minimization strategies (given a cost function) and the tricks associated with increasing the speed and quality of the minimization.

The focus of the chapter is a sequence of practical tricks for backpropagation to better train neural network models.

There are eight tricks; they are:

- 1.4.1: Stochastic Versus Batch Learning
- 1.4.2: Shuffling the Examples
- 1.4.3: Normalizing the Inputs
- 1.4.4: The Sigmoid
- 1.4.5: Choosing Target Values
- 1.4.6: Initializing the Weights
- 1.4.7: Choosing Learning Rates
- 1.4.8: Radial Basis Function vs Sigmoid

The section starts off with a comment that the optimization problem that we are trying to solve with stochastic gradient descent and backpropagation is challenging.

Backpropagation can be very slow particularly for multilayered networks where the cost surface is typically non-quadratic, non-convex, and high dimensional with many local minima and/or flat regions.

The authors go on to highlight that in choosing stochastic gradient descent and the backpropagation algorithm to optimize and update weights, we have no guarantees of performance.

There is no formula to guarantee that (1) the network will converge to a good solution, (2) convergence is swift, or (3) convergence even occurs at all.

These comments provide the context for the tricks that also make no guarantees but instead increase the likelihood of finding a better model, faster.

Let’s take a closer look at each trick in turn.

Many of the tricks are focused on sigmoid (s-shaped) activation functions, which are no longer best practice for use in hidden layers, having been replaced by the rectified linear activation function. As such, we will spend less time on sigmoid-related tricks.

This tip highlights the choice between using either stochastic or batch gradient descent when training your model.

Stochastic gradient descent, also called online gradient descent, refers to a version of the algorithm where the error gradient is estimated from a single randomly selected example from the training dataset and the model parameters (weights) are then updated.

It has the effect of training the model fast, although it can result in large, noisy updates to model weights.

Stochastic learning is generally the preferred method for basic backpropagation for the following three reasons:

1. Stochastic learning is usually much faster than batch learning.

2. Stochastic learning also often results in better solutions.

3. Stochastic learning can be used for tracking changes.

Batch gradient descent involves estimating the error gradient using the average from all examples in the training dataset. It is faster to execute and is better understood from a theoretical perspective, but results in slower learning.

Despite the advantages of stochastic learning, there are still reasons why one might consider using batch learning:

1. Conditions of convergence are well understood.

2. Many acceleration techniques (e.g. conjugate gradient) only operate in batch learning.

3. Theoretical analysis of the weight dynamics and convergence rates are simpler.

Generally, the authors recommend using stochastic gradient descent where possible because it offers faster training of the model.

Despite the advantages of batch updates, stochastic learning is still often the preferred method particularly when dealing with very large data sets because it is simply much faster.

They suggest making use of a learning rate decay schedule in order to counter the noisy effect of the weight updates seen during stochastic gradient descent.

… noise, which is so critical for finding better local minima also prevents full convergence to the minimum. […] So in order to reduce the fluctuations we can either decrease (anneal) the learning rate or have an adaptive batch size.

They also suggest using mini-batches of samples to reduce the noise of the weight updates. This is where the error gradient is estimated across a small subset of samples from the training dataset instead of one sample in the case of stochastic gradient descent or all samples in the case of batch gradient descent.

This variation later became known as Mini-Batch Gradient Descent and is the default when training neural networks.

Another method to remove noise is to use “mini-batches”, that is, start with a small batch size and increase the size as training proceeds.
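To make the difference between the three flavors concrete, here is a minimal sketch in NumPy. It assumes a simple linear model with mean squared error rather than a neural network, and the function name `train_linear` and the `lr0` and `decay` parameters are my own choices for illustration, not something taken from the chapter. The only thing that changes between stochastic, mini-batch, and batch learning is `batch_size`. The learning rate is annealed with a simple decay schedule, which is just one of many possible choices.

```python
import numpy as np

def train_linear(X, y, batch_size, epochs=50, lr0=0.1, decay=0.01, seed=0):
    """Fit y ~ X @ w by gradient descent on the mean squared error.

    batch_size=1       -> stochastic (online) learning
    batch_size=len(X)  -> batch learning
    anything between   -> mini-batch learning
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    step = 0
    for _ in range(epochs):
        order = rng.permutation(len(X))              # new example order each epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # MSE gradient on this batch
            lr = lr0 / (1.0 + decay * step)          # anneal the learning rate
            w -= lr * grad
            step += 1
    return w

# Synthetic regression problem (made up for this example).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

for batch_size in (1, 16, len(X)):
    print(batch_size, np.round(train_linear(X, y, batch_size), 2))
```

On a toy problem like this all three settings recover similar weights; the differences the authors describe show up in how many weight updates you get per pass over the data and how noisy each update is.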

This tip highlights the importance that the order of examples shown to the model during training has on the training process.

Generally, the authors highlight that the learning algorithm performs better when the next example used to update the model is different from the previous example. Ideally, it is the most different or unfamiliar to the model.

Networks learn the fastest from the most unexpected sample. Therefore, it is advisable to choose a sample at each iteration that is the most unfamiliar to the system.

One simple way to implement this trick is to ensure that successive examples used to update the model parameters are from different classes.

… a very simple trick that crudely implements this idea is to simply choose successive examples that are from different classes since training examples belonging to the same class will most likely contain similar information.

This trick can also be implemented by repeatedly showing the model the examples on which it makes the largest prediction errors. This approach can be effective, but can also lead to disaster if the examples that are over-represented during training are outliers.

Choose Examples with Maximum Information Content

1. Shuffle the training set so that successive training examples never (rarely) belong to the same class.

2. Present input examples that produce a large error more frequently than examples that produce a small error
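Both ideas can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: the helper names `interleave_by_class` and `sample_by_error` are hypothetical, the round-robin ordering only fully separates classes when they are roughly balanced, and the error-weighted sampling carries the outlier risk mentioned above.

```python
import numpy as np

def interleave_by_class(labels, seed=0):
    """Return an index order in which neighbouring examples rarely share a class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Shuffle the indices of each class separately.
    pools = [list(rng.permutation(np.flatnonzero(labels == c)))
             for c in np.unique(labels)]
    order = []
    while any(pools):
        # Take one example from every class that still has examples left.
        for pool in pools:
            if pool:
                order.append(pool.pop())
    return np.array(order)

def sample_by_error(errors, size, seed=0):
    """Draw example indices with probability proportional to their current error."""
    rng = np.random.default_rng(seed)
    p = np.asarray(errors, dtype=float)
    p = p / p.sum()
    return rng.choice(len(p), size=size, p=p)

labels = np.array([0] * 5 + [1] * 5 + [2] * 5)
order = interleave_by_class(labels)
print(labels[order])                      # classes alternate along the sequence

errors = np.array([0.05, 0.05, 0.9])
print(sample_by_error(errors, size=10))   # the high-error example dominates
```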

This tip highlights the importance of data preparation prior to training a neural network model.

The authors point out that neural networks often learn faster when each input variable averages to zero over the training dataset. This can be achieved by subtracting the mean value from each input variable, called centering.

Convergence is usually faster if the average of each input variable over the training set is close to zero.

They also comment that centering improves the convergence of the model when it is applied to the inputs to hidden layers from prior layers. This is fascinating as it lays the foundation for the Batch Normalization technique, which was developed and made widely popular more than 15 years later.

Therefore, it is good to shift the inputs so that the average over the training set is close to zero. This heuristic should be applied at all layers which means that we want the average of the outputs of a node to be close to zero because these outputs are the inputs to the next layer

The authors also comment on the need to normalize the spread of the input variables. This can be achieved by dividing the values by their standard deviation. For variables that have a Gaussian distribution, centering and normalizing values in this way means that they will be reduced to a standard Gaussian with a mean of zero and a standard deviation of one.

Scaling speeds learning because it helps to balance out the rate at which the weights connected to the input nodes learn.

Finally, they suggest de-correlating the input variables. This means removing any linear dependence between the input variables and can be achieved using a Principal Component Analysis as a data transform.

Principal component analysis (also known as the Karhunen-Loeve expansion) can be used to remove linear correlations in inputs

This tip on data preparation can be summarized as follows:

Transforming the Inputs

1. The average of each input variable over the training set should be close to zero.

2. Scale input variables so that their covariances are about the same.

3. Input variables should be uncorrelated if possible.

These recommended three steps of data preparation of centering, normalizing, and de-correlating are summarized nicely in a figure, reproduced from the book below:

The centering of input variables may or may not be the best approach when using the more modern ReLU activation functions in the hidden layers of your network, so I’d recommend evaluating both standardization and normalization procedures when preparing data for your model.
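As a rough sketch of the three steps, the snippet below centers the inputs, de-correlates them by rotating onto the principal components, and then scales each component to unit variance. The function name `fit_input_transform` and the synthetic data are assumptions made for this example. Note that the statistics are estimated on the training set only and then reused for any new data.

```python
import numpy as np

def fit_input_transform(X_train, eps=1e-8):
    """Learn a center -> de-correlate (PCA) -> scale transform from training data."""
    mean = X_train.mean(axis=0)
    cov = np.cov(X_train - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # principal directions and variances
    scale = np.sqrt(eigvals + eps)             # eps guards against zero variance

    def transform(X):
        # 1. center, 2. rotate onto principal components, 3. equalize the spread
        return (X - mean) @ eigvecs / scale

    return transform

# Correlated, off-center synthetic inputs (made up for this example).
rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.5]])
X_train = rng.normal(size=(500, 3)) @ mixing + np.array([5.0, -3.0, 10.0])

transform = fit_input_transform(X_train)
Z = transform(X_train)
print(np.round(Z.mean(axis=0), 3))             # each variable is centered
print(np.round(np.cov(Z, rowvar=False), 3))    # close to the identity matrix
```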

This tip recommends the use of sigmoid activation functions in the hidden layers of your network.

Nonlinear activation functions are what give neural networks their nonlinear capabilities. One of the most common forms of activation function is the sigmoid …

Specifically, the authors refer to a sigmoid activation function as any S-shaped function, such as the logistic (referred to as sigmoid) or hyperbolic tangent function (referred to as tanh).

Symmetric sigmoids such as hyperbolic tangent often converge faster than the standard logistic function.

The authors recommend modifying the default functions (if needed) so that the midpoint of the function is at zero.

The use of logistic and tanh activation functions for the hidden layers is no longer a sensible default, as models that use ReLU often converge much faster.
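A quick numerical check shows what "midpoint at zero" means in practice. This is a small sketch using the standard logistic and tanh functions, not the specific scaled sigmoid from the chapter: over a symmetric range of inputs the logistic outputs average about 0.5, the tanh outputs average about zero, and tanh is simply a shifted and rescaled logistic.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-4.0, 4.0, 801)           # inputs symmetric around zero

print(logistic(x).mean())                 # outputs are centered near 0.5
print(np.tanh(x).mean())                  # outputs are centered near 0.0

# tanh is the logistic function rescaled so that its midpoint sits at zero.
print(np.allclose(np.tanh(x), 2.0 * logistic(2.0 * x) - 1.0))
```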

This tip highlights a more careful consideration of the choice of target variables.

In the case of binary classification problems, target variables may be in the set {0, 1} for the limits of the logistic activation function or in the set {-1, 1} for the hyperbolic tangent function when using the cross-entropy or hinge loss functions respectively, even in modern neural networks.

The authors suggest that using values at the extremes of the activation function may make learning the problem more challenging.

Common wisdom might seem to suggest that the target values be set at the value of the sigmoid’s asymptotes. However, this has several drawbacks.

They suggest that achieving values at the point of saturation of the activation function (edges) may require larger and larger weights, which could make the model unstable.

One approach to addressing this is to use target values away from the edge of the output function.

Choose target values at the point of the maximum second derivative on the sigmoid so as to avoid saturating the output units.

I recall that in the 1990s, it was common advice to use target values in the set of {0.1 and 0.9} with the logistic function instead of {0 and 1}.
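To see where the "maximum second derivative" lands, here is a minimal sketch that locates it on a fine grid for the standard tanh and logistic functions. The grid search and the variable names are my own; the chapter works with its own scaled sigmoid, so treat the numbers this prints as illustrative for these two functions only.

```python
import numpy as np

x = np.linspace(-5.0, 5.0, 100001)

# tanh: second derivative is -2 * tanh(x) * (1 - tanh(x)^2)
t = np.tanh(x)
tanh_second = -2.0 * t * (1.0 - t ** 2)
tanh_targets = (t[np.argmax(tanh_second)], t[np.argmin(tanh_second)])

# logistic: second derivative is s * (1 - s) * (1 - 2 * s)
s = 1.0 / (1.0 + np.exp(-x))
logistic_second = s * (1.0 - s) * (1.0 - 2.0 * s)
logistic_targets = (s[np.argmax(logistic_second)], s[np.argmin(logistic_second)])

print(np.round(tanh_targets, 3))       # output values where tanh bends the most
print(np.round(logistic_targets, 3))   # same for the logistic function
```

In both cases the points of strongest curvature sit well inside the output range rather than at the asymptotes, which is the spirit of the advice.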

This tip highlights the importance of the choice of weight initialization scheme and how it is tightly related to the choice of activation function.

In the context of the sigmoid activation function, they suggest that the initial weights for the network should be chosen to activate the function in the linear region (e.g. the line part not the curve part of the S-shape).

The starting values of the weights can have a significant effect on the training process. Weights should be chosen randomly but in such a way that the sigmoid is primarily activated in its linear region.

This advice may also apply to weight initialization for the ReLU, where the linear part of the function is the positive region.

This highlights the important impact that initial weights have on learning, where large weights saturate the activation function, resulting in unstable learning, and small weights result in very small gradients and, in turn, slow learning. Ideally, we seek model weights that are over the linear (non-curvy) part of the activation function.

… weights that range over the sigmoid’s linear region have the advantage that (1) the gradients are large enough that learning can proceed and (2) the network will learn the linear part of the mapping before the more difficult nonlinear part.

The authors suggest a random weight initialization scheme that uses the number of nodes in the previous layer, the so-called fan-in. This is interesting as it is a precursor of what became known as the Xavier weight initialization scheme.
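A fan-in based scheme can be sketched as follows. I am assuming zero-mean Gaussian weights with a standard deviation of one over the square root of the fan-in; the helper name `init_fan_in` is hypothetical. With standardized inputs, this keeps the weighted sums at roughly unit scale, which is where a sigmoid is close to linear.

```python
import numpy as np

def init_fan_in(fan_in, fan_out, rng):
    """Zero-mean random weights with standard deviation 1 / sqrt(fan_in)."""
    return rng.normal(0.0, 1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 256))      # standardized inputs (mean 0, std 1)
W = init_fan_in(256, 64, rng)

pre_activation = X @ W
print(pre_activation.std())           # stays near 1 regardless of the fan-in
```

Try changing 256 to another layer width: the spread of the pre-activations should stay about the same, whereas a fixed standard deviation would make it grow with the number of inputs.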

This tip highlights the importance of choosing the learning rate.

The learning rate controls how much the model weights are updated at each iteration of the algorithm. A small learning rate can cause slower convergence but perhaps a better result, whereas a larger learning rate can result in faster convergence but perhaps to a less optimal result.

The authors suggest decreasing the learning rate when the weight values begin changing back and forth, i.e. oscillating.

Most of those schemes decrease the learning rate when the weight vector “oscillates”, and increase it when the weight vector follows a relatively steady direction.

They comment that this is a hard strategy when using online gradient descent as, by default, the weights will oscillate a lot.

The authors also recommend using one learning rate for each parameter in the model. The goal is to help each part of the model to converge at the same rate.

… it is clear that picking a different learning rate (eta) for each weight can improve the convergence. […] The main philosophy is to make sure that all the weights in the network converge roughly at the same speed.

They refer to this property as “*equalizing the learning speeds*” of each model parameter.

Equalize the Learning Speeds

– give each weight its own learning rate

– learning rates should be proportional to the square root of the number of inputs to the unit

– weights in lower layers should typically be larger than in the higher layers

In addition to using a learning rate per parameter, the authors also recommend using momentum and using adaptive learning rates.

It’s interesting that these recommendations later became enshrined in methods like AdaGrad and Adam that are now popular defaults.
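The effect of per-weight learning rates is easy to see on a toy problem. The sketch below compares plain gradient descent with an AdaGrad-style update on a quadratic loss that is steep in one direction and shallow in the other. This is an illustration of the general idea using an AdaGrad-style rule, not the specific scheme from the chapter, and the loss, function names, and step sizes are assumptions chosen for the example.

```python
import numpy as np

curvature = np.array([100.0, 1.0])     # one steep and one shallow direction

def loss(w):
    return 0.5 * np.sum(curvature * w ** 2)

def grad(w):
    return curvature * w

def plain_gd(steps=50, lr=0.015):
    w = np.array([1.0, 1.0])
    for _ in range(steps):
        w = w - lr * grad(w)           # one global rate, limited by the steep direction
    return w

def adagrad_style(steps=50, lr=0.5, eps=1e-8):
    w = np.array([1.0, 1.0])
    sq_sum = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        sq_sum += g ** 2               # per-weight history of squared gradients
        w = w - lr * g / (np.sqrt(sq_sum) + eps)   # per-weight effective rate
    return w

print(plain_gd(), loss(plain_gd()))
print(adagrad_style(), loss(adagrad_style()))
```

With a single learning rate, the shallow weight lags far behind the steep one. With the per-weight rule, both weights shrink at essentially the same speed, which is exactly the "equalize the learning speeds" goal.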

This final tip is perhaps less relevant today; it recommends trying radial basis function (RBF) units instead of sigmoid activation functions in some cases.

The authors suggest that training RBF units can be faster than training units using a sigmoid activation.

Unlike sigmoidal units which can cover the entire space, a single RBF unit covers only a small local region of the input space. This can be an advantage because learning can be faster.
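The difference between "covering the entire space" and "covering a small local region" can be shown with two toy units. This is a small sketch with a Gaussian RBF, which is one common choice; the unit definitions and the parameter values are assumptions for illustration.

```python
import numpy as np

def sigmoid_unit(x, w, b):
    """Responds to which side of a hyperplane the input falls on."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def rbf_unit(x, center, width):
    """Gaussian RBF: responds only to inputs near its center."""
    sq_dist = np.sum((x - center) ** 2, axis=-1)
    return np.exp(-sq_dist / (2.0 * width ** 2))

points = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
w, b = np.array([1.0, 1.0]), 0.0
center, width = np.array([1.0, 1.0]), 1.0

print(sigmoid_unit(points, w, b))          # keeps responding far from the origin
print(rbf_unit(points, center, width))     # falls to almost zero away from the center
```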

After these tips, the authors go on to provide a theoretical grounding for why many of these tips are a good idea and are expected to result in better or faster convergence when training a neural network model.

Specifically, the tips supported by this analysis are:

- Subtract the means from the input variables
- Normalize the variances of the input variables.
- De-correlate the input variables.
- Use a separate learning rate for each weight.

The remainder of the chapter focuses on the use of second order optimization algorithms for training neural network models.

This may not be everyone’s cup of tea and requires a background and good memory of matrix calculus. You may want to skip it.

You may recall that the first derivative is the slope of a function (how steep it is) and that backpropagation uses the first derivative to update the model weights in proportion to the output error. These methods are referred to as first order optimization algorithms, i.e. optimization algorithms that use the first derivative of the error in the output of the model.

You may also recall from calculus that the second order derivative is the rate of change in the first order derivative, or in this case, the gradient of the error gradient itself. It gives an idea of how curved the loss function is for the current set of weights. Algorithms that use the second derivative are referred to as second order optimization algorithms.
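A tiny example may help make the distinction concrete. The sketch below uses a two-weight quadratic loss, where the matrix of second derivatives is known exactly, and compares ten gradient descent steps with a single Newton step. The loss, the matrix `A`, and the step size are made up for this illustration; real network losses are not quadratic, so a Newton step would not land on the minimum in one go.

```python
import numpy as np

# Quadratic loss: 0.5 * w^T A w - b^T w, whose second derivative matrix is A.
A = np.array([[10.0, 2.0],
              [2.0, 1.0]])
b = np.array([1.0, 1.0])

def grad(w):
    return A @ w - b                     # first derivative (slope)

w_star = np.linalg.solve(A, b)           # exact minimizer, for reference
w0 = np.zeros(2)

# First order: follow the slope with a fixed learning rate.
w_gd = w0.copy()
for _ in range(10):
    w_gd = w_gd - 0.09 * grad(w_gd)

# Second order: rescale the slope by the curvature (one Newton step).
w_newton = w0 - np.linalg.solve(A, grad(w0))

print(np.linalg.norm(w_gd - w_star))     # still some distance from the minimum
print(np.linalg.norm(w_newton - w_star)) # essentially zero on a quadratic
```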

The authors go on to introduce five second order optimization algorithms, specifically:

- Newton
- Conjugate Gradient
- Gauss-Newton
- Levenberg Marquardt
- Quasi-Newton (BFGS)

These algorithms require access to the Hessian matrix or an approximation of the Hessian matrix. You may also recall the Hessian matrix if you covered a theoretical introduction to the backpropagation algorithm. In a hand-wavy way, we use the Hessian to describe the second order derivatives for the model weights.

The authors proceed to outline a number of methods that can be used to approximate the Hessian matrix (for use in second order optimization algorithms), such as: finite difference, square Jacobian approximation, the diagonal of the Hessian, and more.
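The finite difference idea is the simplest of these to sketch: nudge one weight at a time and see how the gradient changes. The function below is a minimal illustration with central differences; the name `finite_difference_hessian`, the step size, and the toy function are my own assumptions. It needs two gradient evaluations per weight, which hints at why the full Hessian quickly becomes expensive for large networks.

```python
import numpy as np

def finite_difference_hessian(grad_fn, w, eps=1e-5):
    """Approximate the Hessian column by column from changes in the gradient."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        # Central difference of the gradient along weight i.
        H[:, i] = (grad_fn(w + step) - grad_fn(w - step)) / (2.0 * eps)
    return H

# Toy function f(w) = w0^2 * w1 + sin(w1), with its gradient written by hand.
def toy_grad(w):
    return np.array([2.0 * w[0] * w[1], w[0] ** 2 + np.cos(w[1])])

w = np.array([1.0, 2.0])
H = finite_difference_hessian(toy_grad, w)
print(np.round(H, 4))
print(np.round(np.diag(H), 4))   # the diagonal alone is a cheaper approximation
```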

They then go on to analyze the Hessian in multilayer neural networks and the effectiveness of second order optimization algorithms.

In summary, they highlight that perhaps second order methods are more appropriate for smaller neural network models trained using batch gradient descent.

Classical second-order methods are impractical in almost all useful cases.

The chapter ends with a very useful summary of tips for getting the most out of backpropagation when training neural network models.

This summary is reproduced below:

– shuffle the examples

– center the input variables by subtracting the mean

– normalize the input variable to a standard deviation of 1

– if possible, de-correlate the input variables.

– pick a network with the sigmoid function shown in figure 1.4

– set the target values within the range of the sigmoid, typically +1 and -1.

– initialize the weights to random values (as prescribed by 1.16).

This section provides more resources on the topic if you are looking to go deeper.

- Neural Networks: Tricks of the Trade, First Edition, 1998.
- Neural Networks: Tricks of the Trade, Second Edition, 2012.
- Efficient BackProp, Preprint, 1998.
- Hessian matrix, Wikipedia.

In this post, you discovered tips and tricks for getting the most out of the backpropagation algorithm when training neural network models.

The post Neural Networks: Tricks of the Trade Review appeared first on Machine Learning Mastery.

Deep learning neural networks are challenging to configure and train.

There are decades of tips and tricks spread across hundreds of research papers, source code, and in the heads of academics and practitioners.

The book “Neural Networks: Tricks of the Trade”, originally published in 1998 and updated in 2012 at the cusp of the deep learning renaissance, ties together the disparate tips and tricks into a single volume. It includes advice that is required reading for all deep learning neural network practitioners.

In this post, you will discover the book “*Neural Networks: Tricks of the Trade*” that provides advice by neural network academics and practitioners on how to get the most out of your models.

After reading this post, you will know:

- The motivation for why the book was written.
- A breakdown of the chapters and topics in the first and second editions.
- A list and summary of the must-read chapters for every neural network practitioner.

Let’s get started.

Neural Networks: Tricks of the Trade is a collection of papers on techniques to get better performance from neural network models.

The first edition was published in 1998 and comprised five parts and 17 chapters. The second edition was published right on the cusp of the new deep learning renaissance in 2012 and includes three more parts and 13 new chapters.

If you are a deep learning practitioner, then it is a must read book.

I own and reference both editions.

The motivation for the book was to collate the empirical and theoretically grounded tips, tricks, and best practices used to get the best performance from neural network models in practice.

The editors’ concern is that many of the useful tips and tricks are tacit knowledge in the field, trapped in people’s heads, code bases, or at the end of conference papers, and that beginners to the field should be aware of them.

It is our belief that researchers and practitioners acquire, through experience and word-of-mouth, techniques and heuristics that help them successfully apply neural networks to difficult real-world problems. […] they are usually hidden in people’s heads or in the back pages of space-constrained conference papers.

The book is an effort to try to group the tricks together, after the success of a workshop at the 1996 NIPS conference with the same name.

This book is an outgrowth of a 1996 NIPS workshop called Tricks of the Trade whose goal was to begin the process of gathering and documenting these tricks. The interest that the workshop generated motivated us to expand our collection and compile it into this book.

— Page 1, Neural Networks: Tricks of the Trade, Second Edition, 2012.

The first edition of the book was edited by Genevieve Orr and Klaus-Robert Muller. It comprised five parts and 17 chapters and was published 20 years ago, in 1998.

Each part includes a useful preface that summarizes what to expect in the upcoming chapters, and each chapter is written by one or more academics in the field.

The breakdown of this first edition was as follows:

- Chapter 1: Efficient BackProp

- Chapter 2: Early Stopping – But When?
- Chapter 3: A Simple Trick for Estimating the Weight Decay Parameter
- Chapter 4: Controlling the Hyperparameter Search on MacKay’s Bayesian Neural Network Framework
- Chapter 5: Adaptive Regularization in Neural Network Modeling
- Chapter 6: Large Ensemble Averaging

- Chapter 7: Square Unit Augmented, Radically Extended, Multilayer Perceptrons
- Chapter 8: A Dozen Tricks with Multitask Learning
- Chapter 9: Solving the Ill-Conditioning on Neural Network Learning
- Chapter 10: Centering Neural Network Gradient Factors
- Chapter 11: Avoiding Roundoff Error in Backpropagating Derivatives

- Chapter 12: Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation
- Chapter 13: Combining Neural Networks and Context-Driven Search for On-Line Printed Handwriting Recognition in the Newton
- Chapter 14: Neural Network Classification and Prior Class Probabilities
- Chapter 15: Applying Divide and Conquer to Large Scale Pattern Recognition Tasks

- Chapter 16: Forecasting the Economy with Neural Nets: A Survey of Challenges and Solutions
- Chapter 17: How to Train Neural Networks

It is an expensive book, and if you can pick-up a cheap second-hand copy of this first edition, then I highly recommend it.

The second edition of the book was released in 2012, seemingly right at the beginning of the large push that became “deep learning.” As such, the book captures the new techniques at the time such as layer-wise pretraining and restricted Boltzmann machines.

It was too early to focus on the ReLU, ImageNet with CNNs, and use of large LSTMs.

Nevertheless, the second edition included three new parts and 13 new chapters.

The breakdown of the additions in the second edition is as follows:

- Chapter 18: Stochastic Gradient Descent Tricks
- Chapter 19: Practical Recommendations for Gradient-Based Training of Deep Architectures
- Chapter 20: Training Deep and Recurrent Networks with Hessian-Free Optimization
- Chapter 21: Implementing Neural Networks Efficiently

- Chapter 22: Learning Feature Representations with K-Means
- Chapter 23: Deep Big Multilayer Perceptrons for Digit Recognition
- Chapter 24: A Practical Guide to Training Restricted Boltzmann Machines
- Chapter 25: Deep Boltzmann Machines and the Centering Trick
- Chapter 26: Deep Learning via Semi-supervised Embedding

- Chapter 27: A Practical Guide to Applying Echo State Networks
- Chapter 28: Forecasting with Recurrent Neural Networks: 12 Tricks
- Chapter 29: Solving Partially Observable Reinforcement Learning Problems with Recurrent Neural Networks
- Chapter 30: 10 Steps and Some Tricks to Set up Neural Reinforcement Controllers

The whole book is a good read, although I don’t recommend reading all of it if you are looking for quick and useful tips that you can use immediately.

This is because many of the chapters focus on the writers’ pet projects, or on highly specialized methods. Instead, I recommend reading four specific chapters, two from the first edition and two from the second.

The second edition of the book is worth purchasing for these four chapters alone, and I highly recommend picking up a copy for yourself, your team, or your office.

Fortunately, there are pre-print PDFs of these chapters available for free online.

The recommended chapters are:

- **Chapter 1**: Efficient BackProp, by Yann LeCun, et al.
- **Chapter 2**: Early Stopping – But When?, by Lutz Prechelt.
- **Chapter 18**: Stochastic Gradient Descent Tricks, by Leon Bottou.
- **Chapter 19**: Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio.

Let’s take a closer look at each of these chapters in turn.

Chapter 1, *Efficient BackProp*, focuses on providing very specific tips to get the most out of the stochastic gradient descent optimization algorithm and the backpropagation weight update algorithm.

Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work.

— Page 9, Neural Networks: Tricks of the Trade, First Edition, 1998.

The chapter proceeds to provide a dense and theoretically supported list of tips for configuring the algorithm, preparing input data, and more.

The chapter is so dense that it is hard to summarize, although a good list of recommendations is provided in the “*Discussion and Conclusion*” section at the end, quoted from the book below:

– shuffle the examples

– center the input variables by subtracting the mean

– normalize the input variable to a standard deviation of 1

– if possible, decorrelate the input variables.

– pick a network with the sigmoid function shown in figure 1.4

– set the target values within the range of the sigmoid, typically +1 and -1.

– initialize the weights to random values as prescribed by 1.16.

The preferred method for training the network should be picked as follows:

– if the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use the stochastic diagonal Levenberg Marquardt method.

– if the training set is not too large, or if the task is regression, use conjugate gradient.

— Pages 47-48, Neural Networks: Tricks of the Trade, First Edition, 1998.
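The input preparation steps in the quoted list (shuffle, center, normalize, decorrelate) can be sketched with NumPy. This is only an illustration on made-up data: the function names `standardize` and `decorrelate` are my own, and decorrelation is done here with a simple rotation onto the principal axes, which is one of several ways to do it.

```python
import numpy as np

def standardize(X):
    """Center each input variable and scale it to a standard deviation of 1."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

def decorrelate(X):
    """Rotate centered inputs onto their principal axes so they are uncorrelated."""
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    return centered @ eigvecs

# made-up correlated inputs with a large offset (assumption for illustration)
rng = np.random.default_rng(1)
raw = rng.normal(size=(200, 2)) @ np.array([[2.0, 1.5], [0.0, 0.5]]) + 10.0
# shuffle the examples (rows) in place
rng.shuffle(raw)
# decorrelate, then center and scale each resulting variable
prepared = standardize(decorrelate(raw))
print(prepared.mean(axis=0), prepared.std(axis=0))
```

In practice, the mean, standard deviation, and rotation would be estimated on the training set only and then reused for any new data.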

The field of applied neural networks has come a long way in the twenty years since this was published (e.g. the comments on sigmoid activation functions are no longer relevant), yet the basics have not changed.

This chapter is required reading for all deep learning practitioners.

Chapter 2, *Early Stopping – But When?*, describes the simple yet powerful regularization method called early stopping that will halt the training of a neural network when the performance of the model begins to degrade on a hold-out validation dataset.

Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting (“early stopping”)

— Page 55, Neural Networks: Tricks of the Trade, First Edition, 1998.

The challenge of early stopping is the choice and configuration of the trigger used to stop the training process, and the systematic configuration of early stopping is the focus of the chapter.

The general early stopping criteria are described as:

- **GL**: stop as soon as the generalization loss exceeds a specified threshold.
- **PQ**: stop as soon as the quotient of generalization loss and progress exceeds a threshold.
- **UP**: stop when the generalization error increases in strips.

Three recommendations are provided, i.e. “*the trick*”:

1. Use fast stopping criteria unless small improvements of network performance (e.g. 4%) are worth large increases of training time (e.g. factor 4).

2. To maximize the probability of finding a “good” solution (as opposed to maximizing the average quality of solutions), use a GL criterion.

3. To maximize the average quality of solutions, use a PQ criterion if the network overfits only very little or an UP criterion otherwise.

— Page 60, Neural Networks: Tricks of the Trade, First Edition, 1998.
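To make the GL criterion more concrete, below is a minimal sketch. It assumes that the generalization loss at an epoch is the percentage by which the current validation error exceeds the lowest validation error seen so far, and that training stops the first time this value exceeds a threshold (called `alpha` here). The function names and the validation errors are made up for illustration.

```python
def generalization_loss(val_errors):
    """Percentage increase of the latest validation error over the best so far."""
    best = min(val_errors)
    return 100.0 * (val_errors[-1] / best - 1.0)

def gl_stopping_epoch(val_errors, alpha):
    """Return the index of the first epoch where GL exceeds alpha, or None."""
    for epoch in range(len(val_errors)):
        if generalization_loss(val_errors[:epoch + 1]) > alpha:
            return epoch
    return None

# made-up validation errors, one value per epoch
val_errors = [0.50, 0.40, 0.35, 0.36, 0.37, 0.40, 0.45]
stop_at = gl_stopping_epoch(val_errors, alpha=10.0)
print('Stop at epoch index:', stop_at)
```

A larger threshold makes the criterion more patient, which matches the trade-off between training time and generalization discussed in the chapter.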

The rules are analyzed empirically over a large number of training runs and test problems. The crux of the finding is that being more patient with the early stopping criteria results in better hold-out performance at the cost of additional computational complexity.

I conclude slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about factor 4 longer on average).

— Page 55, Neural Networks: Tricks of the Trade, First Edition, 1998.

Chapter 18, *Stochastic Gradient Descent Tricks*, focuses on a detailed review of the stochastic gradient descent optimization algorithm and tips to help get the most out of it.

This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.

— Page 421, Neural Networks: Tricks of the Trade, Second Edition, 2012.

There is a lot of overlap with *Chapter 1: Efficient BackProp*, and although the chapter calls out tips along the way with boxes, a useful list of tips is not summarized at the end of the chapter.

Nevertheless, it is a compulsory read for all neural network practitioners.

Below is my own summary of the tips called out in boxes throughout the chapter, mostly quoting directly from the second edition:

- Use stochastic gradient descent (batch=1) when training time is the bottleneck.
- Randomly shuffle the training examples.
- Use preconditioning techniques.
- Monitor both the training cost and the validation error.
- Check the gradients using finite differences.
- Experiment with the learning rates [with] a small sample of the training set.
- Leverage the sparsity of the training examples.
- Use a decaying learning rate.
- Try averaged stochastic gradient (i.e. a specific variant of the algorithm).

Some of these tips are pithy without context; I recommend reading the chapter.
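As one example of adding context, the tip on checking gradients with finite differences can be sketched as follows. This is a minimal illustration on a linear model with a mean squared error loss and made-up data, not code from the chapter. The function names are my own.

```python
import numpy as np

def loss(w, X, y):
    """Half mean squared error of a linear model."""
    residual = X @ w - y
    return 0.5 * np.mean(residual ** 2)

def analytic_gradient(w, X, y):
    """Gradient of the loss derived by hand."""
    return X.T @ (X @ w - y) / len(y)

def numeric_gradient(f, w, eps=1e-6):
    """Central finite difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2.0 * eps)
    return grad

# made-up data and weights
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
# compare the two gradients; a large difference suggests a bug in the derivation
numeric = numeric_gradient(lambda v: loss(v, X, y), w)
max_diff = np.max(np.abs(numeric - analytic_gradient(w, X, y)))
print('Largest difference: %.2e' % max_diff)
```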

Chapter 19, *Practical Recommendations for Gradient-Based Training of Deep Architectures*, focuses on the effective training of neural networks and early deep learning models.

It ties together the classical advice from Chapters 1 and 18 but adds comments on (at the time) recent deep learning developments like greedy layer-wise pretraining, modern hardware like GPUs, modern efficient code libraries like BLAS, and advice from real projects on tuning the training of models, like the order in which to tune hyperparameters.

This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on backpropagated gradient and gradient-based optimization.

— Page 437, Neural Networks: Tricks of the Trade, Second Edition, 2012.

It’s also long, divided into six main sections:

- **Deep Learning Innovations**. Including greedy layer-wise pretraining, denoising autoencoders, and online learning.
- **Gradients**. Including mini-batch gradient descent and automatic differentiation.
- **Hyperparameters**. Including learning rate, mini-batch size, epochs, momentum, nodes, weight regularization, activity regularization, hyperparameter search, and recommendations.
- **Debugging and Analysis**. Including monitoring loss for overfitting, visualization, and statistics.
- **Other Recommendations**. Including GPU hardware and use of efficient linear algebra libraries such as BLAS.
- **Open Questions**. Including the difficulty of training deep models and adaptive learning rates.

There’s far too much for me to summarize; the chapter is dense with useful advice for configuring and tuning neural network models.

Without a doubt, this is required reading and provided the seeds for the recommendations later described in the 2016 book Deep Learning, of which Yoshua Bengio was one of three authors.

The chapter finishes on a strong, optimistic note.

The practice summarized here, coupled with the increase in available computing power, now allows researchers to train neural networks on a scale that is far beyond what was possible at the time of the first edition of this book, helping to move us closer to artificial intelligence.

— Page 473, Neural Networks: Tricks of the Trade, Second Edition, 2012.

- Neural Networks: Tricks of the Trade, First Edition, 1998.
- Neural Networks: Tricks of the Trade, Second Edition, 2012.

- Neural Networks: Tricks of the Trade, Second Edition, 2012. Springer Homepage.
- Neural Networks: Tricks of the Trade, Second Edition, 2012. Google Books

- Efficient BackProp, 1998.
- Early Stopping – But When?, 1998.
- Stochastic Gradient Descent Tricks, 2012.
- Practical Recommendations for Gradient-Based Training of Deep Architectures, 2012.

In this post, you discovered the book “*Neural Networks: Tricks of the Trade*” that provides advice from neural network academics and practitioners on how to get the most out of your models.

Have you read some or all of this book? What do you think of it?

Let me know in the comments below.

The post Neural Networks: Tricks of the Trade Review appeared first on Machine Learning Mastery.

The post How to Get Better Deep Learning Results (7-Day Mini-Course) appeared first on Machine Learning Mastery.

Configuring neural network models is often referred to as a “*dark art*.”

This is because there are no hard and fast rules for configuring a network for a given problem. We cannot analytically calculate the optimal model type or model configuration for a given dataset.

Fortunately, there are techniques that are known to address specific issues when configuring and training a neural network that are available in modern deep learning libraries such as Keras.

In this crash course, you will discover how you can confidently get better performance from your deep learning models in seven days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

Before we get started, let’s make sure you are in the right place.

The list below provides some general guidelines as to who this course was designed for.

You need to know:

- Your way around basic Python and NumPy.
- The basics of Keras for deep learning.

You do NOT need to know:

- How to be a math wiz!
- How to be a deep learning expert!

This crash course will take you from a developer who knows a little deep learning to a developer who can get better performance on your deep learning project.

Note: This crash course assumes you have a working Python 2 or 3 SciPy environment with at least NumPy and Keras 2 installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below are seven lessons that will allow you to confidently improve the performance of your deep learning model:

- **Lesson 01**: Better Deep Learning Framework
- **Lesson 02**: Batch Size
- **Lesson 03**: Learning Rate Schedule
- **Lesson 04**: Batch Normalization
- **Lesson 05**: Weight Regularization
- **Lesson 06**: Adding Noise
- **Lesson 07**: Early Stopping

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to go to look for help (hint, I have all of the answers directly on this blog; use the search box).

I do provide more help in the form of links to related posts because I want you to build up some confidence and inertia.

Post your results in the comments; I’ll cheer you on!

Hang in there; don’t give up.

**Note**: This is just a crash course. For a lot more detail and fleshed out tutorials, see my book on the topic titled “Better Deep Learning.”

In this lesson, you will discover a framework that you can use to systematically improve the performance of your deep learning model.

Modern deep learning libraries such as Keras allow you to define and start fitting a wide range of neural network models in minutes with just a few lines of code.

Nevertheless, it is still challenging to configure a neural network to get good performance on a new predictive modeling problem.

There are three types of problems that are straightforward to diagnose with regard to the poor performance of a deep learning neural network model; they are:

- **Problems with Learning**. Problems with learning manifest in a model that cannot effectively learn a training dataset or shows slow progress or bad performance when learning the training dataset.
- **Problems with Generalization**. Problems with generalization manifest in a model that overfits the training dataset and performs poorly on a holdout dataset.
- **Problems with Predictions**. Problems with predictions manifest as the stochastic training algorithm having a strong influence on the final model, causing a high variance in behavior and performance.

The sequential relationship between the three areas in the proposed breakdown allows the issue of deep learning model performance to be first isolated, then targeted with a specific technique or methodology.

We can summarize techniques that assist with each of these problems as follows:

- **Better Learning**. Techniques that improve or accelerate the adaptation of neural network model weights in response to a training dataset.
- **Better Generalization**. Techniques that improve the performance of a neural network model on a holdout dataset.
- **Better Predictions**. Techniques that reduce the variance in the performance of a final model.

You can use this framework to first diagnose the type of problem that you have and then identify a technique to evaluate to attempt to address your problem.

For this lesson, you must list two techniques or areas of focus that belong to each of the three areas of the framework.

Having trouble? Note that we will be looking at some examples from two of the three areas as part of this mini-course.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to control the speed of learning with the batch size.

In this lesson, you will discover the importance of the batch size when training neural networks.

Neural networks are trained using gradient descent where the estimate of the error used to update the weights is calculated based on a subset of the training dataset.

The number of examples from the training dataset used in the estimate of the error gradient is called the batch size and is an important hyperparameter that influences the dynamics of the learning algorithm.

The choice of batch size controls how quickly the algorithm learns, for example:

- **Batch Gradient Descent**. Batch size is set to the number of examples in the training dataset, more accurate estimate of error but longer time between weight updates.
- **Stochastic Gradient Descent**. Batch size is set to 1, noisy estimate of error but frequent updates to weights.
- **Minibatch Gradient Descent**. Batch size is set to a value more than 1 and less than the number of training examples, trade-off between batch and stochastic gradient descent.
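The trade-off between the three can be made concrete by counting weight updates per epoch. The sketch below is a small illustration that assumes a training set of 500 examples and assumes that a final, smaller batch is still used for an update. The helper name `updates_per_epoch` and the minibatch size of 32 are my own choices.

```python
import math

def updates_per_epoch(n_examples, batch_size):
    """Number of weight updates in one pass over the training dataset."""
    return math.ceil(n_examples / batch_size)

n_examples = 500
# batch size for each type of gradient descent
configs = [('batch', n_examples), ('minibatch', 32), ('stochastic', 1)]
for name, batch_size in configs:
    n_updates = updates_per_epoch(n_examples, batch_size)
    print('%s: batch_size=%d, updates per epoch=%d' % (name, batch_size, n_updates))
```

Fewer updates per epoch means each update uses a better estimate of the error, but the model changes less often.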

Keras allows you to configure the batch size via the *batch_size* argument to the *fit()* function, for example:

    # fit model
    history = model.fit(trainX, trainy, epochs=1000, batch_size=len(trainX))

The example below demonstrates a Multilayer Perceptron with batch gradient descent on a binary classification problem.

    # example of batch gradient descent
    from sklearn.datasets import make_circles
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from matplotlib import pyplot
    # generate dataset
    X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(lr=0.01, momentum=0.9)
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=1000, batch_size=len(trainX), verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot loss learning curves
    pyplot.subplot(211)
    pyplot.title('Cross-Entropy Loss', pad=-40)
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    # plot accuracy learning curves
    pyplot.subplot(212)
    pyplot.title('Accuracy', pad=-40)
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

For this lesson, you must run the code example with each type of gradient descent (batch, minibatch, and stochastic) and describe the effect that it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to fine-tune a model during training with a learning rate schedule.

In this lesson, you will discover how to configure an adaptive learning rate schedule to fine tune the model during the training run.

The amount of change to the model during each step of the training process, or the step size, is called the “*learning rate*” and is perhaps the most important hyperparameter to tune for your neural network in order to achieve good performance on your problem.

Configuring a fixed learning rate is very challenging and requires careful experimentation. An alternative to using a fixed learning rate is to instead vary the learning rate over the training process.

Keras provides the *ReduceLROnPlateau* learning rate schedule that will adjust the learning rate when a plateau in model performance is detected, e.g. no change for a given number of training epochs. For example:

    # define learning rate schedule
    rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_delta=1E-7, verbose=1)

This callback is designed to reduce the learning rate after the model stops improving with the hope of fine-tuning model weights during training.
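To build some intuition for the plateau logic, here is a simplified sketch in plain Python. It is not the Keras implementation and leaves out details such as a cooldown period or a minimum learning rate. It only shows the idea of multiplying the learning rate by a factor after a given number of epochs without improvement. The function name and the validation losses are made up.

```python
def reduce_on_plateau(val_losses, lr, factor=0.1, patience=5, min_delta=1e-7):
    """Return the learning rate in effect after each epoch."""
    best = float('inf')
    wait = 0
    history = []
    for val_loss in val_losses:
        if val_loss < best - min_delta:
            # improvement: remember it and reset the counter
            best = val_loss
            wait = 0
        else:
            # no improvement: count the epoch and reduce the rate if out of patience
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        history.append(lr)
    return history

# made-up validation losses: improvement, a plateau of five epochs, then improvement
losses = [1.0, 0.8, 0.7] + [0.7] * 5 + [0.65]
rates = reduce_on_plateau(losses, lr=0.01)
print(rates)
```

With these values, the learning rate stays at 0.01 until the fifth epoch of the plateau and is then reduced by an order of magnitude.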

The example below demonstrates a Multilayer Perceptron with a learning rate schedule on a binary classification problem, where the learning rate will be reduced by an order of magnitude if no change is detected in validation loss over 5 training epochs.

    # example of a learning rate schedule
    from sklearn.datasets import make_circles
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from keras.callbacks import ReduceLROnPlateau
    from matplotlib import pyplot
    # generate dataset
    X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(lr=0.01, momentum=0.9)
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    # define learning rate schedule
    rlrp = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=5, min_delta=1E-7, verbose=1)
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0, callbacks=[rlrp])
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot loss learning curves
    pyplot.subplot(211)
    pyplot.title('Cross-Entropy Loss', pad=-40)
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    # plot accuracy learning curves
    pyplot.subplot(212)
    pyplot.title('Accuracy', pad=-40)
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

For this lesson, you must run the code example with and without the learning rate schedule and describe the effect that the learning rate schedule has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how you can accelerate the training process with batch normalization.

In this lesson, you will discover how to accelerate the training process of your deep learning neural network using batch normalization.

Batch normalization, or batchnorm for short, is proposed as a technique to help coordinate the update of multiple layers in the model.

The authors of the paper introducing batch normalization refer to change in the distribution of inputs during training as “*internal covariate shift*”. Batch normalization was designed to counter the internal covariate shift by scaling the output of the previous layer, specifically by standardizing the activations of each input variable per mini-batch, such as the activations of a node from the previous layer.
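The standardization step can be sketched with NumPy. This is a simplified, training-time-only illustration: it standardizes each activation over the mini-batch and then applies a scale and shift (called `gamma` and `beta` here), and it leaves out the running statistics used when making predictions. The function name and the batch of activations are made up.

```python
import numpy as np

def batchnorm_forward(activations, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each activation over the mini-batch, then scale and shift."""
    mean = activations.mean(axis=0)
    var = activations.var(axis=0)
    standardized = (activations - mean) / np.sqrt(var + eps)
    return gamma * standardized + beta

# made-up mini-batch: 32 examples, 50 activations from the previous layer
rng = np.random.default_rng(1)
batch = rng.normal(loc=5.0, scale=3.0, size=(32, 50))
normalized = batchnorm_forward(batch)
print('Before: mean=%.3f, std=%.3f' % (batch.mean(), batch.std()))
print('After: mean=%.3f, std=%.3f' % (normalized.mean(), normalized.std()))
```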

Keras supports Batch Normalization via a separate *BatchNormalization* layer that can be added between the hidden layers of your model. For example:

    model.add(BatchNormalization())

The example below demonstrates a Multilayer Perceptron model with batch normalization on a binary classification problem.

    # example of batch normalization
    from sklearn.datasets import make_circles
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.optimizers import SGD
    from keras.layers import BatchNormalization
    from matplotlib import pyplot
    # generate dataset
    X, y = make_circles(n_samples=1000, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(50, input_dim=2, activation='relu'))
    model.add(BatchNormalization())
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    opt = SGD(lr=0.01, momentum=0.9)
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=300, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot loss learning curves
    pyplot.subplot(211)
    pyplot.title('Cross-Entropy Loss', pad=-40)
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    # plot accuracy learning curves
    pyplot.subplot(212)
    pyplot.title('Accuracy', pad=-40)
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

For this lesson, you must run the code example with and without batch normalization and describe the effect that batch normalization has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to reduce overfitting using weight regularization.

In this lesson, you will discover how to reduce overfitting of your deep learning neural network using weight regularization.

A model with large weights is more complex than a model with smaller weights. Large weights are a sign of a network that may be overly specialized to the training data.

The learning algorithm can be updated to encourage the network toward using small weights.

One way to do this is to change the calculation of loss used in the optimization of the network to also consider the size of the weights. This is called weight regularization or weight decay.
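As a rough illustration of the idea, the sketch below adds an L2 penalty (the sum of the squared weights multiplied by a coefficient) to a loss value. The function names, the weight values, and the data loss of 0.5 are made up, and the coefficient of 0.01 is just an example.

```python
import numpy as np

def l2_penalty(weights, lam=0.01):
    """Penalty that grows with the size of the weights."""
    return lam * np.sum(weights ** 2)

def regularized_loss(data_loss, weights, lam=0.01):
    """Loss on the data plus the weight penalty."""
    return data_loss + l2_penalty(weights, lam)

# two made-up weight vectors that achieve the same loss on the data
small = np.array([0.1, -0.2, 0.05])
large = np.array([3.0, -4.0, 2.5])
print('Small weights: %.6f' % regularized_loss(0.5, small))
print('Large weights: %.6f' % regularized_loss(0.5, large))
```

Because the model with large weights has a higher total loss, the optimization is nudged toward the model with small weights.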

Keras supports weight regularization via the *kernel_regularizer* argument on a layer, which can be configured to use the L1 or L2 vector norm, for example:

    model.add(Dense(500, input_dim=2, activation='relu', kernel_regularizer=l2(0.01)))

The example below demonstrates a Multilayer Perceptron model with weight decay on a binary classification problem.

    # example of weight decay
    from sklearn.datasets import make_circles
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.regularizers import l2
    from matplotlib import pyplot
    # generate dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu', kernel_regularizer=l2(0.01)))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot loss learning curves
    pyplot.subplot(211)
    pyplot.title('Cross-Entropy Loss', pad=-40)
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    # plot accuracy learning curves
    pyplot.subplot(212)
    pyplot.title('Accuracy', pad=-40)
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

For this lesson, you must run the code example with and without weight regularization and describe the effect that it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to reduce overfitting by adding noise to your model.

In this lesson, you will discover that adding noise to a neural network during training can improve the robustness of the network, resulting in better generalization and faster learning.

Training a neural network with a small dataset can cause the network to memorize all training examples, in turn leading to poor performance on a holdout dataset.

One approach to making the input space smoother and easier to learn is to add noise to inputs during training.

The addition of noise during the training of a neural network model has a regularization effect and, in turn, improves the robustness of the model.
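The idea can be sketched with NumPy as follows. This is a minimal illustration that assumes zero-mean Gaussian noise with a chosen standard deviation is added to the inputs while training, and that the inputs are left untouched when making predictions. The function name and data are made up.

```python
import numpy as np

def add_gaussian_noise(X, stddev, rng, training=True):
    """Add zero-mean Gaussian noise to the inputs, but only during training."""
    if not training:
        return X
    return X + rng.normal(loc=0.0, scale=stddev, size=X.shape)

rng = np.random.default_rng(1)
# made-up inputs; zeros make the added noise easy to inspect
X = np.zeros((1000, 2))
noisy = add_gaussian_noise(X, 0.1, rng)
clean = add_gaussian_noise(X, 0.1, rng, training=False)
print('Noise standard deviation: %.3f' % noisy.std())
```

Because new noise is drawn each time an example is used, the model rarely sees exactly the same input twice, which makes it harder to memorize a small training dataset.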

Noise can be added to your model in Keras via the *GaussianNoise* layer. For example:

    model.add(GaussianNoise(0.1))

Noise can be added to a model at the input layer or between hidden layers.

The example below demonstrates a Multilayer Perceptron model with added noise between the hidden layers on a binary classification problem.

    # example of adding noise
    from sklearn.datasets import make_circles
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.layers import GaussianNoise
    from matplotlib import pyplot
    # generate dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(GaussianNoise(0.1))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0)
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot loss learning curves
    pyplot.subplot(211)
    pyplot.title('Cross-Entropy Loss', pad=-40)
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    # plot accuracy learning curves
    pyplot.subplot(212)
    pyplot.title('Accuracy', pad=-40)
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

For this lesson, you must run the code example with and without the addition of noise and describe the effect that it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

In the next lesson, you will discover how to reduce overfitting using early stopping.

In this lesson, you will discover that stopping the training of a neural network early before it has overfit the training dataset can reduce overfitting and improve the generalization of deep neural networks.

A major challenge in training neural networks is how long to train them.

Too little training will mean that the model will underfit the train and the test sets. Too much training will mean that the model will overfit the training dataset and have poor performance on the test set.

A compromise is to train on the training dataset but to stop training at the point when performance on a validation dataset starts to degrade. This simple, effective, and widely used approach to training neural networks is called early stopping.
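The stopping rule can be sketched in plain Python. This is a simplified illustration, not the code of any library: it assumes training stops once the validation loss has failed to improve for a chosen number of epochs (the `patience`), and it reports the epoch with the best validation loss. The function name and loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (epoch where training stops, epoch with the best validation loss)."""
    best = float('inf')
    best_epoch = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            # improvement: remember the best loss and when it happened
            best = val_loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # no improvement for `patience` epochs: stop training
            return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# made-up validation losses that improve and then degrade
val_losses = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.7, 0.8]
stopped, best_epoch = early_stopping_epoch(val_losses, patience=3)
print('Stopped at epoch %d, best epoch was %d' % (stopped, best_epoch))
```

Note that the model at the epoch where training stops is not the best model seen, which is why it is common to also save the weights from the best epoch.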

Keras supports early stopping via the *EarlyStopping* callback that allows you to specify the metric to monitor during training.

    # patient early stopping
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)

The example below demonstrates a Multilayer Perceptron with early stopping on a binary classification problem that will stop when the validation loss has not improved for 200 training epochs.

    # example of early stopping
    from sklearn.datasets import make_circles
    from keras.models import Sequential
    from keras.layers import Dense
    from keras.callbacks import EarlyStopping
    from matplotlib import pyplot
    # generate dataset
    X, y = make_circles(n_samples=100, noise=0.1, random_state=1)
    # split into train and test
    n_train = 30
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(500, input_dim=2, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # patient early stopping
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=200)
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=4000, verbose=0, callbacks=[es])
    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    # plot loss learning curves
    pyplot.subplot(211)
    pyplot.title('Cross-Entropy Loss', pad=-40)
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    # plot accuracy learning curves
    pyplot.subplot(212)
    pyplot.title('Accuracy', pad=-40)
    pyplot.plot(history.history['acc'], label='train')
    pyplot.plot(history.history['val_acc'], label='test')
    pyplot.legend()
    pyplot.show()

For this lesson, you must run the code example with and without early stopping and describe the effect it has on the learning curves during training.

Post your answer in the comments below. I would love to see what you discover.

This was your final lesson.

You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- A framework that you can use to systematically diagnose and improve the performance of your deep learning model.
- Batch size can be used to control the precision of the estimated error and the speed of learning during training.
- Learning rate schedule can be used to fine tune the model weights during training.
- Batch normalization can be used to dramatically accelerate the training process of neural network models.
- Weight regularization will penalize models based on the size of the weights and reduce overfitting.
- Adding noise will make the model more robust to differences in input and reduce overfitting.
- Early stopping will halt the training process at the right time and reduce overfitting.

This is just the beginning of your journey with deep learning performance improvement. Keep practicing and developing your skills.

Take the next step and check out my book on getting better performance with deep learning.

How did you do with the mini-course?

Did you enjoy this crash course?

Do you have any questions? Were there any sticking points?

Let me know. Leave a comment below.

The post How to Get Better Deep Learning Results (7-Day Mini-Course) appeared first on Machine Learning Mastery.

The post A Gentle Introduction to the Challenge of Training Deep Learning Neural Network Models appeared first on Machine Learning Mastery.

Deep learning neural networks learn a mapping function from inputs to outputs.

This is achieved by updating the weights of the network in response to the errors the model makes on the training dataset. Updates are made to continually reduce this error until either a good enough model is found or the learning process gets stuck and stops.

The process of training neural networks is the most challenging part of using the technique in general and is by far the most time consuming, both in terms of effort required to configure the process and computational complexity required to execute the process.

In this post, you will discover the challenge of finding model parameters for deep learning neural networks.

After reading this post, you will know:

- Neural networks learn a mapping function from inputs to outputs that can be summarized as solving the problem of function approximation.
- Unlike other machine learning algorithms, the parameters of a neural network must be found by solving a non-convex optimization problem with many good solutions and many misleadingly good solutions.
- The stochastic gradient descent algorithm is used to solve the optimization problem where model parameters are updated each iteration using the backpropagation algorithm.

Let’s get started.

This tutorial is divided into four parts; they are:

- Neural Nets Learn a Mapping Function
- Learning Network Weights Is Hard
- Navigating the Error Surface
- Components of the Learning Algorithm

Deep learning neural networks learn a mapping function.

Developing a model requires historical data from the domain that is used as training data. This data is comprised of observations or examples from the domain with input elements that describe the conditions and an output element that captures what the observation means.

For example, a problem where the output is a quantity would be described generally as a regression predictive modeling problem. Whereas a problem where the output is a label would be described generally as a classification predictive modeling problem.

A neural network model uses the examples to learn how to map specific sets of input variables to the output variable. It must do this in such a way that this mapping works well for the training dataset, but also works well on new examples not seen by the model during training. This ability to work well on specific examples and new examples is called the ability of the model to generalize.

A multilayer perceptron is just a mathematical function mapping some set of input values to output values.

— Page 5, Deep Learning, 2016.

We can describe the relationship between the input variables and the output variables as a complex mathematical function. For a given model problem, we must believe that a true mapping function exists to best map input variables to output variables and that a neural network model can do a reasonable job at approximating the true unknown underlying mapping function.

A feedforward network defines a mapping and learns the value of the parameters that result in the best function approximation.

— Page 168, Deep Learning, 2016.

As such, we can describe the broader problem that neural networks solve as “*function approximation*.” They learn to approximate an unknown underlying mapping function given a training dataset. They do this by learning the weights, that is, the model parameters, given a specific network structure that we design.

It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.

— Page 169, Deep Learning, 2016.

Finding the parameters for neural networks in general is hard.

For many simpler machine learning algorithms, we can calculate an optimal model given the training dataset.

For example, we can use linear algebra to calculate the specific coefficients of a linear regression model that minimize the squared error on a training dataset.
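As a small illustration of what calculating the coefficients directly looks like, the sketch below fits a line with a single input variable using the closed-form least squares solution. The function name *fit_line()* and the toy data are assumptions made for this example.

```python
# Closed-form least squares for a line y = slope * x + intercept.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # covariance of x and y, and variance of x (both unnormalized)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print('slope=%.3f, intercept=%.3f' % (slope, intercept))
```

There is no search and no iteration here: the best coefficients are computed in one pass over the data.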

Similarly, we can use optimization algorithms that offer convergence guarantees when finding an optimal set of model parameters for algorithms such as logistic regression or support vector machines.

Finding parameters for many machine learning algorithms involves solving a convex optimization problem: that is an error surface that is shaped like a bowl with a single best solution.

This is not the case for deep learning neural networks.

We can neither directly compute the optimal set of weights for a model, nor can we get global convergence guarantees to find an optimal set of weights.

Stochastic gradient descent applied to non-convex loss functions has no […] convergence guarantee, and is sensitive to the values of the initial parameters.

— Page 177, Deep Learning, 2016.

In fact, training a neural network is the most challenging part of using the technique.

It is quite common to invest days to months of time on hundreds of machines in order to solve even a single instance of the neural network training problem.

— Page 274, Deep Learning, 2016.

The use of nonlinear activation functions in the neural network means that the optimization problem that we must solve in order to find model parameters is not convex.

It is not a simple bowl shape with a single best set of weights that we are guaranteed to find. Instead, there is a landscape of peaks and valleys with many good and many misleadingly good sets of parameters that we may discover.

Solving this optimization is challenging, not least because the error surface contains many local optima, flat spots, and cliffs.

An iterative process must be used to navigate the non-convex error surface of the model. A naive algorithm that navigates the error is likely to become misled, lost, and ultimately stuck, resulting in a poorly performing model.

Neural network models can be thought to learn by navigating a non-convex error surface.

A model with a specific set of weights can be evaluated on the training dataset and the average error over all examples in the training dataset can be thought of as the error of the model. A change to the model weights will result in a change to the model error. Therefore, we seek a set of weights that result in a model with a small error.

This involves repeating the steps of evaluating the model and updating the model parameters in order to step down the error surface. This process is repeated until a set of parameters is found that is good enough or the search process gets stuck.

This is a search or an optimization process and we refer to optimization algorithms that operate in this way as gradient optimization algorithms, as they naively follow along the error gradient. They are computationally expensive, slow, and their empirical behavior means that using them in practice is more art than science.
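To make the idea of stepping down a non-convex error surface concrete, here is a minimal sketch with a single weight. The error function w^4 - 3w^2 + w was chosen purely for illustration because it has two valleys; it does not come from a real model, and the function names are made up for the example.

```python
# Gradient descent on a one-dimensional, non-convex "error surface".
def loss(w):
    return w ** 4 - 3 * w ** 2 + w

def gradient(w):
    # derivative of the loss with respect to w
    return 4 * w ** 3 - 6 * w + 1

def descend(w, learning_rate=0.01, steps=1000):
    for _ in range(steps):
        # step in the direction that reduces the loss
        w -= learning_rate * gradient(w)
    return w

# the same procedure from two different starting points
w_left = descend(-2.0)
w_right = descend(2.0)
print('start=-2.0: w=%.3f, loss=%.3f' % (w_left, loss(w_left)))
print('start=+2.0: w=%.3f, loss=%.3f' % (w_right, loss(w_right)))
```

The two runs settle in different valleys, and only the run started at -2.0 reaches the lower one. This is the sensitivity to the starting point described above, in miniature.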

The algorithm that is most commonly used to navigate the error surface is called stochastic gradient descent, or SGD for short.

Nearly all of deep learning is powered by one very important algorithm: stochastic gradient descent or SGD.

— Page 151, Deep Learning, 2016.

Global optimization algorithms designed for non-convex optimization problems, such as a genetic algorithm, could be used instead, but stochastic gradient descent is more efficient because it uses gradient information, calculated via an algorithm called backpropagation, to update the model weights.

[Backpropagation] describes a method to calculate the derivatives of the network training error with respect to the weights by a clever application of the derivative chain-rule.

— Page 49, Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

Backpropagation refers to a technique from calculus to calculate the derivative (e.g. the slope or the gradient) of the model error for specific model parameters, allowing model weights to be updated to move down the gradient. As such, the algorithm used to train neural networks is also often referred to as simply backpropagation.

Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient.

— Page 204, Deep Learning, 2016.
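The chain rule at the heart of backpropagation can be sketched for the smallest possible network: one input, one weight, one bias, a sigmoid activation, and a squared error. The names and numbers below are assumptions for illustration. The analytical gradient is compared against a finite-difference estimate as a sanity check.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    # squared error of a single sigmoid neuron on one example
    return (sigmoid(w * x + b) - y) ** 2

def gradients(w, b, x, y):
    # forward pass
    z = w * x + b
    y_hat = sigmoid(z)
    # backward pass: apply the chain rule one step at a time
    dloss_dyhat = 2.0 * (y_hat - y)
    dyhat_dz = y_hat * (1.0 - y_hat)
    dloss_dz = dloss_dyhat * dyhat_dz
    # dz/dw = x and dz/db = 1
    return dloss_dz * x, dloss_dz

w, b, x, y = 0.5, -0.2, 1.5, 1.0
dw, db = gradients(w, b, x, y)

# finite-difference estimates of the same derivatives
eps = 1e-6
dw_numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
db_numeric = (loss(w, b + eps, x, y) - loss(w, b - eps, x, y)) / (2 * eps)
print('dw: %.6f vs %.6f' % (dw, dw_numeric))
print('db: %.6f vs %.6f' % (db, db_numeric))
```

Note that the function only computes the gradient. A separate update step, such as stochastic gradient descent, is what actually changes the weights.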

Stochastic gradient descent can be used to find the parameters for other machine learning algorithms, such as linear regression, and it is used when working with very large datasets, although if there are sufficient resources, then convex-based optimization algorithms are significantly more efficient.

Training a deep learning neural network model using stochastic gradient descent with backpropagation involves choosing a number of components and hyperparameters. In this section, we’ll take a look at each in turn.

An error function must be chosen, often called the objective function, cost function, or the loss function. Typically, a specific probabilistic framework for inference is chosen called Maximum Likelihood. Under this framework, the commonly chosen loss functions are cross entropy for classification problems and mean squared error for regression problems.

**Loss Function**. The function used to estimate the performance of a model with a specific set of weights on examples from the training dataset.
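As a rough sketch, the two commonly chosen loss functions can be written in a few lines of plain Python. The function names are made up for this example, and the cross entropy shown is the binary (two-class) form; deep learning libraries provide their own tested implementations.

```python
import math

def mean_squared_error(y_true, y_pred):
    # average squared difference, used for regression
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # average negative log likelihood, used for two-class classification
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # clip probabilities to avoid log(0)
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))
print(binary_cross_entropy([1, 0], [0.9, 0.1]))
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

Confident and correct predictions give a lower cross entropy than uncertain ones, which is the signal the optimization process follows.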

The search or optimization process requires a starting point from which to begin model updates. The starting point is defined by the initial model parameters or weights. Because the error surface is non-convex, the optimization algorithm is sensitive to the initial starting point. As such, small random values are chosen as the initial model weights, although different techniques can be used to select the scale and distribution of these values. These techniques are referred to as “*weight initialization*” methods.

**Weight Initialization**. The procedure by which the initial small random values are assigned to model weights at the beginning of the training process.

When updating the model, a number of examples from the training dataset must be used to calculate the model error, often referred to simply as “*loss*.” All examples in the training dataset may be used, which may be appropriate for smaller datasets. Alternately, a single example may be used, which may be appropriate for problems where examples are streamed or where the data changes often. A hybrid approach may be used where a number of examples from the training dataset is chosen and used to estimate the error gradient. The choice of the number of examples is referred to as the batch size.

**Batch Size**. The number of examples used to estimate the error gradient before updating the model parameters.

Once an error gradient has been estimated, the derivative of the error with respect to each parameter can be used to update that parameter. There may be statistical noise in the training dataset and in the estimate of the error gradient. Also, the depth of the model (number of layers) and the fact that model parameters are updated separately mean that it is hard to calculate exactly how much to change each model parameter to best move the whole model down the error gradient.

Instead, a small portion of the update to the weights is performed each iteration. A hyperparameter called the “*learning rate*” controls how much to update model weights and, in turn, controls how fast a model learns on the training dataset.

**Learning Rate**. The amount that each model parameter is updated per cycle of the learning algorithm.

The training process must be repeated many times until a good or good enough set of model parameters is discovered. The total number of iterations of the process is bounded by the number of complete passes through the training dataset after which the training process is terminated. This is referred to as the number of training “*epochs*.”

**Epochs**. The number of complete passes through the training dataset before the training process is terminated.

There are many extensions to the learning algorithm, although these five hyperparameters generally control the learning algorithm for deep learning neural networks.
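To see how the five components fit together, the sketch below trains the simplest possible model, a line with one weight and one bias, using mini-batch stochastic gradient descent. It is an illustration under stated assumptions rather than a practical implementation: the function name *train_sgd()*, the toy dataset, and all hyperparameter values are made up for the example.

```python
import random

def train_sgd(xs, ys, batch_size=4, learning_rate=0.05, epochs=200, seed=1):
    rng = random.Random(seed)
    # weight initialization: small random values
    w = rng.uniform(-0.1, 0.1)
    b = rng.uniform(-0.1, 0.1)
    indices = list(range(len(xs)))
    # epochs: complete passes through the training dataset
    for _ in range(epochs):
        rng.shuffle(indices)
        # batch size: number of examples used per gradient estimate
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # loss function: gradient of mean squared error over the batch
            grad_w = grad_b = 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * error * xs[i] / len(batch)
                grad_b += 2.0 * error / len(batch)
            # learning rate: scales the size of each update
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

# toy dataset generated from y = 3x - 1
xs = [i / 10 for i in range(20)]
ys = [3.0 * x - 1.0 for x in xs]
w, b = train_sgd(xs, ys)
print('w=%.3f, b=%.3f' % (w, b))
```

Because this toy problem is convex and noise free, the search recovers the true coefficients. A neural network swaps in a non-convex model and uses backpropagation to compute the gradients, but the surrounding loop has the same shape.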

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.
- Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
- Neural Networks for Pattern Recognition, 1995.

In this post, you discovered the challenge of finding model parameters for deep learning neural networks.

Specifically, you learned:

- Neural networks learn a mapping function from inputs to outputs that can be summarized as solving the problem of function approximation.
- Unlike other machine learning algorithms, the parameters of a neural network must be found by solving a non-convex optimization problem with many good solutions and many misleadingly good solutions.
- The stochastic gradient descent algorithm is used to solve the optimization problem where model parameters are updated each iteration using the backpropagation algorithm.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to the Challenge of Training Deep Learning Neural Network Models appeared first on Machine Learning Mastery.

The post How to Control Neural Network Model Capacity With Nodes and Layers appeared first on Machine Learning Mastery.

The capacity of a deep learning neural network model controls the scope of the types of mapping functions that it is able to learn.

A model with too little capacity cannot learn the training dataset meaning it will underfit, whereas a model with too much capacity may memorize the training dataset, meaning it will overfit or may get stuck or lost during the optimization process.

The capacity of a neural network model is defined by configuring the number of nodes and the number of layers.

In this tutorial, you will discover how to control the capacity of a neural network model and how capacity impacts what a model is capable of learning.

After completing this tutorial, you will know:

- Neural network model capacity is controlled both by the number of nodes and the number of layers in the model.
- A model with a single hidden layer and sufficient number of nodes has the capability of learning any mapping function, but the chosen learning algorithm may or may not be able to realize this capability.
- Increasing the number of layers provides a short-cut to increasing the capacity of the model with fewer resources, and modern techniques allow learning algorithms to successfully train deep models.

Let’s get started.

This tutorial is divided into five parts; they are:

- Controlling Neural Network Model Capacity
- Configure Nodes and Layers in Keras
- Multi-Class Classification Problem
- Change Model Capacity With Nodes
- Change Model Capacity With Layers

The goal of a neural network is to learn how to map input examples to output examples.

Neural networks learn mapping functions. The capacity of a network refers to the range or scope of the types of functions that the model can approximate.

Informally, a model’s capacity is its ability to fit a wide variety of functions.

— Pages 111-112, Deep Learning, 2016.

A model with less capacity may not be able to sufficiently learn the training dataset. A model with more capacity can model more different types of functions and may be able to learn a function to sufficiently map inputs to outputs in the training dataset. Whereas a model with too much capacity may memorize the training dataset and fail to generalize or get lost or stuck in the search for a suitable mapping function.

Generally, we can think of model capacity as a control over whether the model is likely to underfit or overfit a training dataset.

We can control whether a model is more likely to overfit or underfit by altering its capacity.

— Page 111, Deep Learning, 2016.

The capacity of a neural network can be controlled by two aspects of the model:

- Number of Nodes.
- Number of Layers.

A model with more nodes or more layers has a greater capacity and, in turn, is potentially capable of learning a larger set of mapping functions.

A model with more layers and more hidden units per layer has higher representational capacity — it is capable of representing more complicated functions.

— Page 428, Deep Learning, 2016.
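One rough way to see how nodes and layers change the size of a model is to count its trainable parameters. The helper below is a sketch for fully connected networks only; the function name and the layer sizes are assumptions for illustration, and the parameter count is a crude proxy for capacity, not a measure of it.

```python
def count_mlp_parameters(n_inputs, hidden_layers, n_outputs):
    # sizes of consecutive layers, from input to output
    sizes = [n_inputs] + list(hidden_layers) + [n_outputs]
    total = 0
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # one weight per connection plus one bias per node
        total += fan_in * fan_out + fan_out
    return total

# illustrative network with 100 inputs and 20 outputs
print(count_mlp_parameters(100, [5], 20))       # 625
print(count_mlp_parameters(100, [10], 20))      # 1230
print(count_mlp_parameters(100, [10, 10], 20))  # 1340
```

Both widening a layer and adding a layer give the model more weights to fit, although how those weights are arranged matters as much as how many there are.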

The number of nodes in a layer is referred to as the **width**.

Developing wide networks with one layer and many nodes was relatively straightforward. In theory, a network with enough nodes in the single hidden layer can learn to approximate any mapping function, although in practice, we don’t know how many nodes are sufficient or how to train such a model.

The number of layers in a model is referred to as its **depth**.

Increasing the depth increases the capacity of the model. Training deep models, e.g. those with many hidden layers, can be computationally more efficient than training a single layer network with a vast number of nodes.

Modern deep learning provides a very powerful framework for supervised learning. By adding more layers and more units within a layer, a deep network can represent functions of increasing complexity.

— Page 167, Deep Learning, 2016.

Traditionally, it has been challenging to train neural network models with more than a few layers due to problems such as vanishing gradients. More recently, modern methods have allowed the training of deep network models, allowing the development of models of surprising depth that are capable of achieving impressive performance on challenging problems in a wide range of domains.

Keras allows you to easily add nodes and layers to your model.

The first argument of the layer specifies the number of nodes used in the layer.

Fully connected layers for the Multilayer Perceptron, or MLP, model are added via the Dense layer.

For example, we can create one fully-connected layer with 32 nodes as follows:

    ...
    layer = Dense(32)

Similarly, the number of nodes can be specified for recurrent neural network layers in the same way.

For example, we can create one LSTM layer with 32 nodes (or units) as follows:

    ...
    layer = LSTM(32)

Convolutional neural networks, or CNNs, don’t have nodes; instead, you specify the number of filter maps and their shape. The number and size of filter maps define the capacity of the layer.

We can define a two-dimensional CNN with 32 filter maps, each with a size of 3 by 3, as follows:

    ...
    layer = Conv2D(32, (3,3))
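To get a feel for how the number and size of filter maps relate to the size of a convolutional layer, we can count its weights by hand. The sketch below assumes a standard two-dimensional convolution with one bias per filter and an input with three channels (e.g. an RGB image); the function name is made up for the example.

```python
def conv2d_parameters(filters, kernel_size, in_channels, use_bias=True):
    kernel_h, kernel_w = kernel_size
    # each filter has one weight per kernel position per input channel
    weights = filters * kernel_h * kernel_w * in_channels
    biases = filters if use_bias else 0
    return weights + biases

# assumed input with 3 channels
print(conv2d_parameters(32, (3, 3), in_channels=3))  # 896
print(conv2d_parameters(64, (3, 3), in_channels=3))  # 1792
print(conv2d_parameters(32, (5, 5), in_channels=3))  # 2432
```

Doubling the number of filter maps doubles the parameter count, whereas growing the filter from 3 by 3 to 5 by 5 increases it by nearly a factor of three.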

Layers are added to a sequential model via calls to the add() function and passing in the layer.

Fully connected layers for the MLP can be added via repeated calls to add passing in the configured Dense layers; for example:

    ...
    model = Sequential()
    model.add(Dense(32))
    model.add(Dense(64))

Similarly, the number of layers for a recurrent network can be added in the same way to give a stacked recurrent model.

An important difference is that recurrent layers expect a three-dimensional input, therefore the prior recurrent layer must return the full sequence of outputs rather than the single output for each node at the end of the input sequence.

This can be achieved by setting the “*return_sequences*” argument to “*True*”. For example:

    ...
    model = Sequential()
    model.add(LSTM(32, return_sequences=True))
    model.add(LSTM(32))

Convolutional layers can be stacked directly, and it is common to stack one or two convolutional layers together followed by a pooling layer, then repeat this pattern of layers; for example:

    ...
    model = Sequential()
    model.add(Conv2D(16, (3,3)))
    model.add(Conv2D(16, (3,3)))
    model.add(MaxPooling2D((2,2)))
    model.add(Conv2D(32, (3,3)))
    model.add(Conv2D(32, (3,3)))
    model.add(MaxPooling2D((2,2)))

Now that we know how to configure the number of nodes and layers for models in Keras, we can look at how the capacity affects model performance on a multi-class classification problem.

We will use a standard multi-class classification problem as the basis to demonstrate the effect of model capacity on model performance.

The scikit-learn library provides the make_blobs() function that can be used to create a multi-class classification problem with the prescribed number of samples, input variables, classes, and variance of samples within a class.

We can configure the problem to have a specific number of input variables via the “*n_features*” argument, and a specific number of classes or centers via the “*centers*” argument. The “*random_state*” can be used to seed the pseudorandom number generator to ensure that we always get the same samples each time the function is called.

For example, the call below generates 1,000 examples for a three class problem with two input variables.

    ...
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)

The results are the input and output elements of a dataset that we can model.

In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.

The complete example is listed below.

    # scatter plot of blobs dataset
    from sklearn.datasets import make_blobs
    from matplotlib import pyplot
    from numpy import where
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
    # scatter plot for each class value
    for class_value in range(3):
        # select indices of points with the class label
        row_ix = where(y == class_value)
        # scatter plot for points with a different color
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
    # show plot
    pyplot.show()

Running the example creates a scatter plot of the entire dataset. We can see that the chosen standard deviation of 2.0 means that the classes are not linearly separable (separable by a line), causing many ambiguous points.

This is desirable as it means that the problem is non-trivial and will allow a neural network model to find many different “*good enough*” candidate solutions.

In order to explore model capacity, we need more complexity in the problem than three classes and two variables.

For the purposes of the following experiments, we will use 100 input features and 20 classes; for example:

    ...
    # generate multi-class classification dataset (100 features, 20 classes)
    X, y = make_blobs(n_samples=1000, centers=20, n_features=100, cluster_std=2, random_state=2)

In this section, we will develop a Multilayer Perceptron model, or MLP, for the blobs multi-class classification problem and demonstrate the effect that the number of nodes has on the ability of the model to learn.

We can start off by developing a function to prepare the dataset.

The input and output elements of the dataset can be created using the *make_blobs()* function as described in the previous section.

Next, the target variable must be one hot encoded. This is so that the model can learn to predict the probability of an input example belonging to each of the 20 classes.

We can use the to_categorical() Keras utility function to do this, for example:

    # one hot encode output variable
    y = to_categorical(y)
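If one hot encoding is new to you, the transform can be sketched in plain Python. This is only an illustration of the idea and not how the Keras utility is implemented; the function name *one_hot()* is made up for the example.

```python
def one_hot(labels, n_classes):
    rows = []
    for label in labels:
        # a row of zeros with a single 1.0 at the index of the class
        row = [0.0] * n_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([0, 2, 1], n_classes=3)
print(encoded)  # [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
```

Each integer class label becomes a vector with one element per class, which lines up with one output node per class in the model.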

Next, we can split the 1,000 examples in half and use 500 examples as the training dataset and 500 to evaluate the model.

    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]

The *create_dataset()* function below ties these elements together and returns the train and test sets in terms of the input and output elements.

    # prepare multi-class classification dataset
    def create_dataset():
        # generate multi-class classification dataset
        X, y = make_blobs(n_samples=1000, centers=20, n_features=100, cluster_std=2, random_state=2)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, trainy, testX, testy

We can call this function to prepare the dataset.

    # prepare dataset
    trainX, trainy, testX, testy = create_dataset()

Next, we can define a function that will create the model, fit it on the training dataset, and then evaluate it on the test dataset.

The model needs to know the number of input variables in order to configure the input layer and the number of target classes in order to configure the output layer. These properties can be extracted from the training dataset directly.

    # configure the model based on the data
    n_input, n_classes = trainX.shape[1], testy.shape[1]

We will define an MLP model with a single hidden layer that uses the rectified linear activation function and the He random weight initialization method.

The output layer will use the softmax activation function in order to predict a probability for each target class. The number of nodes in the hidden layer will be provided via an argument called “*n_nodes*”.

    # define model
    model = Sequential()
    model.add(Dense(n_nodes, input_dim=n_input, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(n_classes, activation='softmax'))
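As a reminder of what the softmax activation does to the raw outputs of the final layer, here is a small plain-Python sketch. The function is written for illustration and the input scores are made up.

```python
import math

def softmax(scores):
    # subtract the largest score for numerical stability
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    # normalize so the outputs sum to one
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(['%.3f' % p for p in probs])
print(sum(probs))
```

The outputs are all positive, sum to one, and preserve the ordering of the raw scores, so they can be read as a probability for each class.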

The model will be optimized using stochastic gradient descent with a modest learning rate of 0.01 with a high momentum of 0.9, and a categorical cross entropy loss function will be used, suitable for multi-class classification.

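    # compile model
    # (the 'learning_rate' argument replaces the older 'lr' name in recent versions of Keras)
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])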
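To illustrate what the learning rate and momentum values control, the classical momentum update can be sketched for a single weight. This is a simplified illustration and not the library code; the function name and the toy error function (w - 3)^2 are assumptions for the example.

```python
def sgd_momentum(gradient, w, learning_rate=0.01, momentum=0.9, steps=500):
    velocity = 0.0
    for _ in range(steps):
        # keep a fraction of the previous update and add the new gradient step
        velocity = momentum * velocity - learning_rate * gradient(w)
        w = w + velocity
    return w

# toy error function (w - 3)^2 with gradient 2 * (w - 3)
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print('w=%.6f' % w_final)
```

With momentum, each update is a running blend of recent gradients, which tends to smooth the path and speed up progress along consistent directions.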

The model will be fit for 100 training epochs, then the model will be evaluated on the test dataset.

    # fit model on train set
    history = model.fit(trainX, trainy, epochs=100, verbose=0)
    # evaluate model on test set
    _, test_acc = model.evaluate(testX, testy, verbose=0)

Tying these elements together, the *evaluate_model()* function below takes the number of nodes and dataset as arguments and returns the history of the training loss at the end of each epoch and the accuracy of the final model on the test dataset.

    # fit model with given number of nodes, returns training history and test set accuracy
    def evaluate_model(n_nodes, trainX, trainy, testX, testy):
        # configure the model based on the data
        n_input, n_classes = trainX.shape[1], testy.shape[1]
        # define model
        model = Sequential()
        model.add(Dense(n_nodes, input_dim=n_input, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(n_classes, activation='softmax'))
        # compile model
        opt = SGD(learning_rate=0.01, momentum=0.9)
        model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
        # fit model on train set
        history = model.fit(trainX, trainy, epochs=100, verbose=0)
        # evaluate model on test set
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        return history, test_acc

We can call this function with different numbers of nodes to use in the hidden layer.

The problem is relatively simple; therefore, we will review the performance of the model with 1 to 7 nodes.

We would expect that increasing the number of nodes would increase the capacity of the model and allow it to better learn the training dataset, at least up to a point limited by the chosen configuration for the learning algorithm (e.g. learning rate, batch size, and epochs).

The test accuracy for each configuration will be printed and the learning curves of training loss for each configuration will be plotted.

    # evaluate model and plot learning curve with given number of nodes
    num_nodes = [1, 2, 3, 4, 5, 6, 7]
    for n_nodes in num_nodes:
        # evaluate model with a given number of nodes
        history, result = evaluate_model(n_nodes, trainX, trainy, testX, testy)
        # summarize final test set accuracy
        print('nodes=%d: %.3f' % (n_nodes, result))
        # plot learning curve
        pyplot.plot(history.history['loss'], label=str(n_nodes))
    # show the plot
    pyplot.legend()
    pyplot.show()

The full code listing is provided below for completeness.

    # study of mlp learning curves given different number of nodes for multi-class classification
    from sklearn.datasets import make_blobs
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from keras.utils import to_categorical
    from matplotlib import pyplot
    # prepare multi-class classification dataset
    def create_dataset():
        # generate multi-class classification dataset
        X, y = make_blobs(n_samples=1000, centers=20, n_features=100, cluster_std=2, random_state=2)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, trainy, testX, testy
    # fit model with given number of nodes, returns training history and test set accuracy
    def evaluate_model(n_nodes, trainX, trainy, testX, testy):
        # configure the model based on the data
        n_input, n_classes = trainX.shape[1], testy.shape[1]
        # define model
        model = Sequential()
        model.add(Dense(n_nodes, input_dim=n_input, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(n_classes, activation='softmax'))
        # compile model
        opt = SGD(learning_rate=0.01, momentum=0.9)
        model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
        # fit model on train set
        history = model.fit(trainX, trainy, epochs=100, verbose=0)
        # evaluate model on test set
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        return history, test_acc
    # prepare dataset
    trainX, trainy, testX, testy = create_dataset()
    # evaluate model and plot learning curve with given number of nodes
    num_nodes = [1, 2, 3, 4, 5, 6, 7]
    for n_nodes in num_nodes:
        # evaluate model with a given number of nodes
        history, result = evaluate_model(n_nodes, trainX, trainy, testX, testy)
        # summarize final test set accuracy
        print('nodes=%d: %.3f' % (n_nodes, result))
        # plot learning curve
        pyplot.plot(history.history['loss'], label=str(n_nodes))
    # show the plot
    pyplot.legend()
    pyplot.show()

Running the example first prints the test accuracy for each model configuration.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that as the number of nodes is increased, the capacity of the model to learn the problem is increased. This results in a general lowering of the generalization error of the model on the test dataset (with a small dip in accuracy at 5 nodes in this run) until 6 and 7 nodes, when the model learns the problem perfectly.

```
nodes=1: 0.138
nodes=2: 0.380
nodes=3: 0.582
nodes=4: 0.890
nodes=5: 0.844
nodes=6: 1.000
nodes=7: 1.000
```

A line plot is also created showing cross entropy loss on the training dataset for each model configuration (1 to 7 nodes in the hidden layer) over the 100 training epochs.

We can see that as the number of nodes is increased, the model is able to better decrease the loss, i.e. to better learn the training dataset. This plot shows the direct relationship between model capacity, as defined by the number of nodes in the hidden layer, and the model’s ability to learn.

The number of nodes can be increased to the point (e.g. 1,000 nodes) where the learning algorithm is no longer able to sufficiently learn the mapping function.

We can perform a similar analysis and evaluate how the number of layers impacts the ability of the model to learn the mapping function.

Increasing the number of layers can often greatly increase the capacity of the model, acting like a computational and learning shortcut to modeling a problem. For example, a model with one hidden layer of 10 nodes is not equivalent to a model with two hidden layers with five nodes each. The latter has a much greater capacity.

The danger is that a model with more capacity than is required is likely to overfit the training data, and as with a model that has too many nodes, a model with too many layers will likely be unable to learn the training dataset, getting lost or stuck during the optimization process.

First, we can update the *evaluate_model()* function to fit an MLP model with a given number of layers.

We know from the previous section that an MLP with about seven or more nodes fit for 100 epochs will learn the problem perfectly. We will, therefore, use 10 nodes in each layer to ensure the model has enough capacity in just one layer to learn the problem.

The updated function is listed below, taking the number of layers and dataset as arguments and returning the training history and test accuracy of the model.

```python
# fit model with given number of layers, returns training history and test set accuracy
def evaluate_model(n_layers, trainX, trainy, testX, testy):
    # configure the model based on the data
    n_input, n_classes = trainX.shape[1], testy.shape[1]
    # define model
    model = Sequential()
    model.add(Dense(10, input_dim=n_input, activation='relu', kernel_initializer='he_uniform'))
    for _ in range(1, n_layers):
        model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(n_classes, activation='softmax'))
    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, epochs=100, verbose=0)
    # evaluate model on test set
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    return history, test_acc
```

Given that a single hidden layer model has enough capacity to learn this problem, we will explore increasing the number of layers to the point where the learning algorithm becomes unstable and can no longer learn the problem.

If the chosen modeling problem was more complex, we could explore increasing the layers and review the improvements in model performance to a point of diminishing returns.

In this case, we will evaluate the model with 1 to 5 layers, with the expectation that at some point, the number of layers will result in a model that the chosen learning algorithm is unable to adapt to the training data.

```python
# evaluate model and plot learning curve of model with given number of layers
all_history = list()
num_layers = [1, 2, 3, 4, 5]
for n_layers in num_layers:
    # evaluate model with a given number of layers
    history, result = evaluate_model(n_layers, trainX, trainy, testX, testy)
    print('layers=%d: %.3f' % (n_layers, result))
    # plot learning curve
    pyplot.plot(history.history['loss'], label=str(n_layers))
pyplot.legend()
pyplot.show()
```

Tying these elements together, the complete example is listed below.

```python
# study of mlp learning curves given different number of layers for multi-class classification
from sklearn.datasets import make_blobs
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD
from keras.utils import to_categorical
from matplotlib import pyplot

# prepare multi-class classification dataset
def create_dataset():
    # generate classification dataset with 100 input features and 20 classes
    X, y = make_blobs(n_samples=1000, centers=20, n_features=100, cluster_std=2, random_state=2)
    # one hot encode output variable
    y = to_categorical(y)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    return trainX, trainy, testX, testy

# fit model with given number of layers, returns training history and test set accuracy
def evaluate_model(n_layers, trainX, trainy, testX, testy):
    # configure the model based on the data
    n_input, n_classes = trainX.shape[1], testy.shape[1]
    # define model
    model = Sequential()
    model.add(Dense(10, input_dim=n_input, activation='relu', kernel_initializer='he_uniform'))
    for _ in range(1, n_layers):
        model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(n_classes, activation='softmax'))
    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    # fit model
    history = model.fit(trainX, trainy, epochs=100, verbose=0)
    # evaluate model on test set
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    return history, test_acc

# get dataset
trainX, trainy, testX, testy = create_dataset()
# evaluate model and plot learning curve of model with given number of layers
all_history = list()
num_layers = [1, 2, 3, 4, 5]
for n_layers in num_layers:
    # evaluate model with a given number of layers
    history, result = evaluate_model(n_layers, trainX, trainy, testX, testy)
    print('layers=%d: %.3f' % (n_layers, result))
    # plot learning curve
    pyplot.plot(history.history['loss'], label=str(n_layers))
pyplot.legend()
pyplot.show()
```

Running the example first prints the test accuracy for each model configuration.

Your specific results will vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model is capable of learning the problem well with up to three layers, then begins to falter. We can see that performance really drops with five layers and is expected to continue to fall if the number of layers is increased further.

```
layers=1: 1.000
layers=2: 1.000
layers=3: 1.000
layers=4: 0.948
layers=5: 0.794
```

A line plot is also created showing cross entropy loss on the training dataset for each model configuration (1 to 5 layers) over the 100 training epochs.

We can see that the dynamics of the model with 1, 2, and 3 layers (blue, orange and green) are pretty similar, learning the problem quickly.

Surprisingly, training loss with four and five layers shows signs of initially doing well, then leaping up, suggesting that the model is likely stuck with a sub-optimal set of weights rather than overfitting the training dataset.

The analysis shows that increasing the capacity of the model via increasing depth is a very effective tool that must be used with caution as it can quickly result in a model with a large capacity that may not be capable of learning the training dataset easily.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Too Many Nodes**. Update the experiment of increasing nodes to find the point where the learning algorithm is no longer capable of learning the problem.
- **Repeated Evaluation**. Update an experiment to use the repeated evaluation of each configuration to counter the stochastic nature of the learning algorithm.
- **Harder Problem**. Repeat the experiment of increasing layers on a problem that requires the increased capacity provided by increased depth in order to perform well.
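The repeated evaluation idea can be sketched without Keras. This is a minimal illustration and not part of the original tutorial: `repeated_evaluation()` and the stand-in `fake_evaluate()` are hypothetical names, and the fixed scores are made-up values used only to keep the example deterministic.

```python
from statistics import mean, stdev

def repeated_evaluation(evaluate_fn, n_repeats=10):
    """Call evaluate_fn n_repeats times and summarize the returned scores."""
    scores = [evaluate_fn() for _ in range(n_repeats)]
    # the sample standard deviation gives a rough idea of run-to-run spread
    return mean(scores), stdev(scores), scores

# hypothetical stand-in for fitting and evaluating a model once;
# it simply hands back made-up accuracy values in turn
_fake_scores = iter([0.90, 0.94, 0.92, 0.96, 0.88])

def fake_evaluate():
    return next(_fake_scores)

avg, spread, scores = repeated_evaluation(fake_evaluate, n_repeats=5)
print('accuracy: %.3f (+/- %.3f)' % (avg, spread))
```

In the experiments above, the stand-in could be swapped for something like `lambda: evaluate_model(n_nodes, trainX, trainy, testX, testy)[1]`, so that each configuration is summarized by a mean and spread rather than a single run.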

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.
- Deep Learning, 2016.

- Keras Core Layers API
- Keras Convolutional Layers API
- Keras Recurrent Layers API
- Keras Utility Functions
- sklearn.datasets.make_blobs API

In this tutorial, you discovered how to control the capacity of a neural network model and how capacity impacts what a model is capable of learning.

Specifically, you learned:

- Neural network model capacity is controlled both by the number of nodes and the number of layers in the model.
- A model with a single hidden layer and a sufficient number of nodes has the capability of learning any mapping function, but the chosen learning algorithm may or may not be able to realize this capability.
- Increasing the number of layers provides a short-cut to increasing the capacity of the model with fewer resources, and modern techniques allow learning algorithms to successfully train deep models.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Control Neural Network Model Capacity With Nodes and Layers appeared first on Machine Learning Mastery.

The post Framework for Better Deep Learning appeared first on Machine Learning Mastery.

Modern deep learning libraries such as Keras allow you to define and start fitting a wide range of neural network models in minutes with just a few lines of code.

Nevertheless, it is still challenging to configure a neural network to get good performance on a new predictive modeling problem.

The challenge of getting good performance can be broken down into three main areas: problems with learning, problems with generalization, and problems with predictions.

Once you have diagnosed the specific type of problem that you are having with a network, a suite of classical and modern techniques can then be selected to address the issue and improve performance.

In this post, you will discover a framework for diagnosing performance problems with deep learning models and techniques that you can use to target and improve each specific performance problem.

After reading this post, you will know:

- Defining and fitting neural networks has never been easier, although getting good performance on new problems remains challenging.
- Neural network modeling performance problems can be decomposed into learning, generalization, and prediction type problems.
- There are decades of techniques as well as modern methods that can be used to target each type of model performance problem.

Let’s get started.

This tutorial is divided into seven parts; they are:

- Neural Network Renaissance
- Challenge of Configuring Neural Networks
- Framework for Systematically Better Deep Learning
- Better Learning Techniques
- Better Generalization Techniques
- Better Predictions Techniques
- How to Use the Framework

Historically, neural network models had to be coded from scratch.

You might spend days or weeks translating poorly described mathematics into code and days or weeks more debugging your code just to get a simple neural network model to run.

Those days are in the past.

Today, you can define and begin fitting most types of neural networks in minutes with just a few lines of code, thanks to open source libraries such as Keras built on top of sophisticated mathematical libraries such as TensorFlow.

This means that standard models such as Multilayer Perceptrons can be developed and evaluated rapidly, as well as more sophisticated models that may previously have been beyond the capabilities of most practitioners to implement such as Convolutional Neural Networks and Recurrent Neural Networks like the Long Short-Term Memory network.

As deep learning practitioners, we live in amazing and productive times.

Nevertheless, even though new neural network models can be defined and evaluated rapidly, there remains little guidance on how to actually configure neural network models in order to get the most out of them.

Configuring neural network models is often referred to as a “*dark art*.”

This is because there are no hard and fast rules for configuring a network for a given problem. We cannot analytically calculate the optimal model type or model configuration for a given dataset.

Instead, there are decades’ worth of techniques, heuristics, tips, tricks, and other tacit knowledge spread across code, papers, blog posts, and in people’s heads.

A shortcut to configuring a neural network on a problem is to copy the configuration of another network for a similar problem. But this strategy rarely leads to good results as model configurations are not transferable across problems. It is also likely that you work on predictive modeling problems that are most unlike other problems described in the literature.

Fortunately, there are techniques that are known to address specific issues when configuring and training a neural network that are available in modern deep learning libraries like Keras.

Further, discoveries have been made in the past 5 to 10 years in areas such as activation functions, adaptive learning rates, regularization methods, and ensemble techniques that have been shown to dramatically improve the performance of neural network models regardless of their specific type.

The techniques are available; you just need to know what they are and when to use them.


Unfortunately, you cannot simply grid search across the techniques used to improve deep learning performance.

Almost universally, they uniquely change aspects of the training data, learning process, model architecture, and more. Instead, you must diagnose the type of performance problem you are having with your model, then carefully choose and evaluate a given intervention tailored to that diagnosed problem.

There are three types of problems that are straightforward to diagnose with regard to poor performance of a deep learning neural network model; they are:

- **Problems with Learning**. Problems with learning manifest in a model that cannot effectively learn a training dataset or shows slow progress or bad performance when learning the training dataset.
- **Problems with Generalization**. Problems with generalization manifest in a model that overfits the training dataset and performs poorly on a holdout dataset.
- **Problems with Predictions**. Problems with predictions manifest in the stochastic training algorithm having a strong influence on the final model, causing a high variance in behavior and performance.

This breakdown provides a systematic approach to thinking about the performance of your deep learning model.

There is some natural overlap and interaction between these areas of concern. For example, problems with learning affect the ability of the model to generalize as well as the variance in the predictions made from a final model.

The sequential relationship between the three areas in the proposed breakdown allows the issue of deep learning model performance to be first isolated, then targeted with a specific technique or methodology.

We can summarize techniques that assist with each of these problems as follows:

- **Better Learning**. Techniques that improve or accelerate the adaptation of neural network model weights in response to a training dataset.
- **Better Generalization**. Techniques that improve the performance of a neural network model on a holdout dataset.
- **Better Predictions**. Techniques that reduce the variance in the performance of a final model.

Now that we have a framework for systematically diagnosing a performance problem with a deep learning neural network, let’s take a look at some examples of techniques that may be used in each of these three areas of concern.

Better learning techniques are those changes to a neural network model or learning algorithm that improve or accelerate the adaptation of the model weights in response to a training dataset.

In this section, we will review the techniques used to improve the adaptation of the model weights.

This begins with the careful configuration of the hyperparameters related to optimizing the neural network model using the stochastic gradient descent algorithm and updating the weights using the backpropagation of error algorithm; for example:

- **Configure Batch Size**. Including exploring whether variations such as batch, stochastic (online), or mini-batch gradient descent are more appropriate.
- **Configure Learning Rate**. Including understanding the effect of different learning rates on your problem and whether modern adaptive learning rate methods such as Adam would be appropriate.
- **Configure Loss Function**. Including understanding the way different loss functions must be interpreted and whether an alternate loss function would be appropriate for your problem.

This also includes simple data preparation and the automatic rescaling of inputs at deeper layers.

- **Data Scaling Techniques**. Including the sensitivity that small network weights have to the scale of input variables and the impact that large errors in the target variable have on weight updates.
- **Batch Normalization**. Including the sensitivity to changes in the distribution of inputs to layers deep in a network model and the benefits of standardizing layer inputs to add consistency of input and stability to the learning process.
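As a rough illustration of data scaling, the sketch below standardizes one input variable to zero mean and unit variance. The function names `fit_scaler()` and `standardize()` are hypothetical, and the values are made up; the point is that the statistics are estimated on the training data only and then reused on new data.

```python
from statistics import mean, pstdev

def fit_scaler(train_column):
    """Estimate the mean and standard deviation from training data only."""
    return mean(train_column), pstdev(train_column)

def standardize(column, mu, sigma):
    """Rescale values to zero mean and unit variance using fitted statistics."""
    if sigma == 0:
        # a constant column carries no information; map it to zeros
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# made-up values for a single input variable
train_values = [2.0, 4.0, 6.0]
test_values = [4.0, 8.0]

mu, sigma = fit_scaler(train_values)
scaled_train = standardize(train_values, mu, sigma)
scaled_test = standardize(test_values, mu, sigma)
```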

Stochastic gradient descent is a general optimization algorithm that can be applied to a wide range of problems. Nevertheless, the optimization process (or learning process) can become unstable and specific interventions are required; for example:

- **Vanishing Gradients**. Error gradients shrink as they are propagated back through a deep multilayered network, so layers close to the input layer do not have their weights updated; this can be addressed using modern activation functions such as the rectified linear activation function.
- **Exploding Gradients**. Large weight updates cause a numerical overflow or underflow, making the network weights take on a NaN or Inf value; this can be addressed using gradient scaling or gradient clipping.
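The two remedies for exploding gradients can be sketched in plain Python. This is an illustrative sketch with hypothetical function names, treating a gradient as a flat list of numbers: scaling shrinks the whole vector when its norm is too large, while clipping limits each element independently.

```python
import math

def clip_by_norm(gradient, max_norm):
    """Gradient scaling: rescale the vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]

def clip_by_value(gradient, limit):
    """Gradient clipping: force each element into the range [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in gradient]

scaled = clip_by_norm([3.0, 4.0], max_norm=1.0)
clipped = clip_by_value([3.0, -4.0, 0.5], limit=1.0)
```

In Keras, the optimizer classes accept `clipnorm` and `clipvalue` arguments that apply these ideas during training.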

The limitation of data on some predictive modeling problems can prevent effective learning. Specialized techniques can be used to jump-start the optimization process, providing a useful initial set of weights or even whole models that can be used for feature extraction; for example:

- **Greedy Layer-Wise Pretraining**. Where layers are added one at a time to a model, learning to interpret the output of prior layers and permitting the development of much deeper models: a milestone technique in the field of deep learning.
- **Transfer Learning**. Where a model is trained on a different, but somehow related, predictive modeling problem and then used to seed the weights or used wholesale as a feature extraction model to provide input to a model trained on the problem of interest.

Are there additional techniques that you use to improve learning?

Let me know in the comments below.

Better generalization techniques are those that change the neural network model or learning algorithm to reduce the effect of the model overfitting the training dataset and improve the performance of the model on a holdout validation or test dataset.

In this section, we will review the techniques to reduce the generalization error of the model during training.

Techniques that are designed to reduce generalization error are commonly referred to as regularization techniques. Almost universally, regularization is achieved by somehow reducing or limiting model complexity.

Perhaps the most widely understood measure of model complexity is the size or magnitude of the model weights. A model with large weights may be overly specialized to the inputs in the training data, making it unstable when making a prediction on new unseen data. Keeping weights small via weight regularization is a powerful and widely used technique.

**Weight Regularization**. A change to the loss function that penalizes a model in proportion to the norm (magnitude) of the model weights, encouraging smaller weights and, in turn, a lower complexity model.
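A minimal sketch of the weight regularization idea is shown below. The function names and the coefficient `lam` are assumptions for illustration: the penalty is computed from the weights and added to the loss calculated on the data, so larger weights make the total loss worse.

```python
def l2_penalty(weights, lam):
    """Penalty proportional to the sum of squared weights (L2 norm squared)."""
    return lam * sum(w * w for w in weights)

def l1_penalty(weights, lam):
    """Penalty proportional to the sum of absolute weights (L1 norm)."""
    return lam * sum(abs(w) for w in weights)

def regularized_loss(data_loss, weights, lam):
    """Loss on the data plus an L2 weight penalty."""
    return data_loss + l2_penalty(weights, lam)

weights = [1.0, -2.0, 3.0]   # made-up weight values
total = regularized_loss(0.5, weights, lam=0.01)
```

In Keras, a penalty of this kind can be attached to a layer via the `kernel_regularizer` argument.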

Rather than simply encouraging the weights to remain small via an updated loss function, it is possible to force the weights to be small using a constraint.

**Weight Constraint**. Update to the model to rescale the weights when the vector norm of the weights exceeds a threshold.

The output of a neural network layer, regardless of where that layer is in the stack of layers, can be thought of as an internal representation or set of extracted features with regard to the input. Simpler internal representations can have a regularizing effect on the model and can be encouraged through constraints that encourage sparsity (zero values).

**Activity Regularization**. A change to the loss function that penalizes a model in proportion to the norm (magnitude) of the layer activations, encouraging smaller or more sparse internal representations.

Noise can be added to the model to encourage robustness with regard to the raw inputs or outputs from prior layers during training; for example:

- **Input Noise**. Addition of statistical variation or noise at the input layer or between hidden layers to reduce the model’s dependence on specific input values.
- **Dropout**. Probabilistically dropping out nodes (and with them their connections) while training the network to break tight coupling between nodes across layers.

Often, overfitting can occur due simply to training the model for too long on the training dataset. A simple solution is to stop the training early.

**Early Stopping**. Monitor model performance on the hold out validation dataset during training and stop the training process when performance on the validation set starts to degrade.
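The early stopping rule can be sketched as a small function over a recorded sequence of validation losses. The function name, the `patience` and `min_delta` parameters, and the loss values below are illustrative assumptions; the logic only loosely mirrors what a library callback does.

```python
def early_stopping_epoch(val_losses, patience=1, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    min_delta for `patience` consecutive epochs.
    """
    best_loss = float('inf')
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # never triggered: training ran to the final epoch
    return len(val_losses) - 1, best_epoch

# made-up validation losses that improve, then degrade
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.75, 0.9]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=2)
```

Keras provides an `EarlyStopping` callback with `monitor` and `patience` arguments that can be passed to `fit()` for the same purpose.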

Are there additional techniques that you use to improve generalization?

Let me know in the comments below.

Better prediction techniques are those that complement the model training process in order to reduce the variance in the expected performance of the final model.

In this section, we will review the techniques to reduce the expected variance of a final deep learning neural network model.

The variance in the performance of the final model can be reduced by adding bias. The most common way to introduce bias to the final model is to combine the predictions from multiple models. This is referred to as ensemble learning.

More than reducing the variance of the performance of a final model, ensemble learning can also result in better predictive performance.

Effective ensemble learning methods require that each contributing model have skill, meaning that the models make predictions that are better than random, but that the prediction errors between the models have a low correlation. This means that the ensemble member models should have skill, but in different ways.

This can be achieved by varying one aspect of the ensemble; for example:

- Vary the training data used to fit each member.
- Vary the members that contribute to the ensemble prediction.
- Vary the way that the predictions from the ensemble members are combined.

The training data can be varied by fitting models on different subsamples of the dataset.

This might involve fitting and retaining models on different randomly selected subsets of the training dataset, retaining models for each fold in a k-fold cross-validation, or retaining models across different samples with replacement using the bootstrap method (e.g. bootstrap aggregation). Collectively, we can think of these methods as resampling ensembles.

**Resampling Ensemble**. Ensemble of models fit on different samples of the training dataset.

Perhaps the simplest way to vary the members of the ensemble involves gathering models from multiple runs of the learning algorithm on the training dataset. The stochastic learning algorithm will produce a slightly different fit on each run that, in turn, will have slightly different performance. Averaging the models across multiple runs helps the performance remain consistent.

**Model Averaging Ensemble**. Retrain models across multiple runs of the same learning algorithm on the same dataset.

Variations on this approach may involve training models with different hyperparameter configurations.

It can be expensive to train multiple final deep learning models, especially when one model may take days or weeks to fit.

An alternative is to collect models for use as contributing ensemble members during a single training run; for example:

- **Horizontal Ensemble**. Ensemble members collected from a contiguous block of training epochs towards the end of a single training run.
- **Snapshot Ensemble**. A training run using an aggressive cyclic learning rate where ensemble members are collected at the trough of each cycle of the learning rate.
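To make the cyclic learning rate behind a snapshot ensemble more concrete, here is a small sketch. It assumes a cosine-shaped schedule that restarts every cycle, which is one common choice and not something the text above specifies; the function names and settings are hypothetical.

```python
import math

def cyclic_learning_rate(epoch, epochs_per_cycle, lr_max):
    """Cosine-shaped rate: starts each cycle at lr_max and decays towards zero."""
    position = (epoch % epochs_per_cycle) / epochs_per_cycle
    return lr_max / 2.0 * (math.cos(math.pi * position) + 1.0)

def snapshot_epochs(n_epochs, epochs_per_cycle):
    """Epochs (0-indexed) at the trough of each cycle, where a model is saved."""
    return [e for e in range(n_epochs) if (e + 1) % epochs_per_cycle == 0]

# assumed settings: 30 epochs split into 3 cycles of 10 epochs each
rates = [cyclic_learning_rate(e, 10, 0.1) for e in range(30)]
save_at = snapshot_epochs(30, 10)
```

The model saved at each epoch in `save_at` becomes one ensemble member, after which the learning rate jumps back up to move the weights away from the current solution.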

The simplest way to combine the predictions from multiple ensemble members is to calculate the average of the predictions in the case of regression, or the statistical mode or most frequent prediction in the case of classification.
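These two simple combination rules can be sketched as follows. The function names and the prediction values are made up for illustration; each inner list holds one ensemble member's predictions for the same set of examples.

```python
from collections import Counter
from statistics import mean

def combine_regression(predictions):
    """Average the members' predictions for each example."""
    return [mean(column) for column in zip(*predictions)]

def combine_classification(predictions):
    """Take the most frequent predicted class label for each example."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]

# three members, two examples (regression)
reg = combine_regression([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
# three members, three examples (class labels)
cls = combine_classification([[0, 1, 2], [0, 1, 1], [1, 1, 2]])
```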

Alternately, the best way to combine the predictions from multiple models can be learned; for example:

- **Weighted Average Ensemble (blending)**. The contribution from each ensemble member to an ensemble prediction is weighted using learned coefficients that indicate the trust in each model.
- **Stacked Generalization (stacking)**. A new model is trained to learn how to best combine the predictions from the ensemble members.
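A weighted average of predicted class probabilities might look like the sketch below. The function name is hypothetical, and the weights are fixed, made-up numbers; in practice the coefficients would be learned or tuned on a holdout dataset.

```python
def weighted_average(member_probs, weights):
    """Combine per-member class probabilities for one example using weights."""
    total = sum(weights)
    n_classes = len(member_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, member_probs)) / total
        for c in range(n_classes)
    ]

# two members predicting probabilities for two classes on one example
member_probs = [[0.6, 0.4], [0.2, 0.8]]
equal = weighted_average(member_probs, [1.0, 1.0])
trusted_first = weighted_average(member_probs, [3.0, 1.0])
# the predicted class is the index of the largest combined probability
predicted = equal.index(max(equal))
```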

As an alternative to combining the predictions from the ensemble members, the models themselves may be combined; for example:

**Average Model Weight Ensemble**. Weights from multiple neural network models are averaged into a single model used to make a prediction.

Are there additional techniques that you use to reduce the variance of the final model?

Let me know in the comments below.

We can think of the organization of techniques into the three areas of better learning, generalization, and prediction as a systematic framework for improving the performance of your neural network model.

There are too many techniques to reasonably investigate and evaluate each in your project.

Instead, you need to be methodical and use the techniques in a targeted way to address a defined problem.

The first step in using this framework is to diagnose the performance problem that you are having with your model.

A robust diagnostic tool is to calculate a learning curve of loss and a problem-specific metric (like RMSE for regression or accuracy for classification) on a train and validation dataset over a given number of training epochs.

- If the loss on the training dataset is poor, stuck, or fails to improve, perhaps you have a learning problem.
- If the loss or problem-specific metric on the training dataset continues to improve and gets worse on the validation dataset, perhaps you have a generalization problem.
- If the loss or problem-specific metric on the validation dataset shows a high variance towards the end of the run, perhaps you have a prediction problem.
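The three rules of thumb can be turned into a rough diagnostic function. This is only a sketch: the function name and every threshold (`poor_loss`, `tolerance`, `tail`, `noisy_std`) are arbitrary assumptions that would need adjusting to the scale of the loss on a real problem.

```python
from statistics import pstdev

def diagnose(train_loss, val_loss, poor_loss=0.5, tolerance=0.1,
             tail=5, noisy_std=0.05):
    """Guess the type of problem from train and validation loss curves."""
    # training loss never gets low enough: the model is not learning
    if train_loss[-1] > poor_loss:
        return 'learning'
    # validation loss rose from its best value while training loss kept falling
    best_epoch = val_loss.index(min(val_loss))
    val_got_worse = val_loss[-1] > val_loss[best_epoch] * (1.0 + tolerance)
    train_kept_improving = train_loss[-1] < train_loss[best_epoch]
    if val_got_worse and train_kept_improving:
        return 'generalization'
    # validation loss jumps around over the last few epochs
    if pstdev(val_loss[-tail:]) > noisy_std:
        return 'prediction'
    return 'none'

# made-up curves where validation loss degrades after the third epoch
result = diagnose([1.0, 0.5, 0.2, 0.1, 0.05], [1.0, 0.6, 0.4, 0.6, 0.8])
```

A check like this is no substitute for looking at the plotted curves, but it shows how each symptom maps onto one of the three problem types.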

Review the techniques that are designed to address your problem.

Select a technique that appears to be a good fit for your model and problem. This may require some prior experience with the techniques and may be challenging for a beginner.

Thankfully, there are heuristics and best-practices that work well on most problems.

For example:

- **Learning Problem**: Dial in the hyperparameters of the learning algorithm; specifically, the learning rate offers the biggest leverage.
- **Generalization Problem**: Use weight regularization and early stopping, which work well on most models with most problems, or try dropout with early stopping.
- **Prediction Problem**: Average the predictions from models collected over multiple runs or multiple epochs on one run to add sufficient bias.

Pick an intervention, then read-up a little bit on it, including how it works, why it works, and importantly, find examples for how practitioners before you have used it to get an idea for how you might use it on your problem.

Once you have identified an issue and addressed it with an intervention, repeat the process.

Developing a better model is an iterative process that may require multiple interventions at multiple levels that complement each other.

This is an empirical process. This means that you are reliant on the robustness of your test harness to give you a reliable summary of performance before and after an intervention. Spend the time to ensure your test harness is robust, and check that the train, test, and validation datasets are clean and provide a suitably representative sample of observations from your problem domain.

This section provides more resources on the topic if you are looking to go deeper.

- Deep Learning, 2016.
- Pattern Recognition and Machine Learning, 2006.
- Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks, 1999.

In this post, you discovered a framework for diagnosing performance problems with deep learning models and techniques that you can use to target and improve each specific performance problem.

Specifically, you learned:

- Defining and fitting neural networks has never been easier, although getting good performance on new problems remains challenging.
- Neural network modeling performance problems can be decomposed into learning, generalization, and prediction type problems.
- There are decades of techniques as well as modern methods that can be used to target each type of model performance problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Improve Performance With Transfer Learning for Deep Learning Neural Networks appeared first on Machine Learning Mastery.

An interesting benefit of deep learning neural networks is that they can be reused on related problems.

Transfer learning refers to a technique for predictive modeling on a different but somehow similar problem that can then be reused partly or wholly to accelerate the training and improve the performance of a model on the problem of interest.

In deep learning, this means reusing the weights in one or more layers from a pre-trained network model in a new model and either keeping the weights fixed, fine tuning them, or adapting the weights entirely when training the model.

In this tutorial, you will discover how to use transfer learning to improve the performance of deep learning neural networks in Python with Keras.

After completing this tutorial, you will know:

- Transfer learning is a method for reusing a model trained on a related predictive modeling problem.
- Transfer learning can be used to accelerate the training of neural networks as either a weight initialization scheme or feature extraction method.
- How to use transfer learning to improve the performance of an MLP for a multiclass classification problem.

Let’s get started.

This tutorial is divided into six parts; they are:

- What Is Transfer Learning?
- Blobs Multi-Class Classification Problem
- Multilayer Perceptron Model for Problem 1
- Standalone MLP Model for Problem 2
- MLP With Transfer Learning for Problem 2
- Comparison of Models on Problem 2

Transfer learning generally refers to a process where a model trained on one problem is used in some way on a second related problem.

Transfer learning and domain adaptation refer to the situation where what has been learned in one setting (i.e., distribution P1) is exploited to improve generalization in another setting (say distribution P2).

— Page 536, Deep Learning, 2016.

In deep learning, transfer learning is a technique whereby a neural network model is first trained on a problem similar to the problem that is being solved. One or more layers from the trained model are then used in a new model trained on the problem of interest.

This is typically understood in a supervised learning context, where the input is the same but the target may be of a different nature. For example, we may learn about one set of visual categories, such as cats and dogs, in the first setting, then learn about a different set of visual categories, such as ants and wasps, in the second setting.

— Page 536, Deep Learning, 2016.

Transfer learning has the benefit of decreasing the training time for a neural network model and resulting in lower generalization error.

There are two main approaches to implementing transfer learning; they are:

- Weight Initialization.
- Feature Extraction.

The weights in re-used layers may be used as the starting point for the training process and adapted in response to the new problem. This usage treats transfer learning as a type of weight initialization scheme. This may be useful when the first related problem has a lot more labeled data than the problem of interest and the similarity in the structure of the problem may be useful in both contexts.

… the objective is to take advantage of data from the first setting to extract information that may be useful when learning or even when directly making predictions in the second setting.

— Page 538, Deep Learning, 2016.

Alternately, the weights of the network may not be adapted in response to the new problem, and only new layers after the reused layers may be trained to interpret their output. This usage treats transfer learning as a type of feature extraction scheme. An example of this approach is the re-use of deep convolutional neural network models trained for photo classification as feature extractors when developing photo captioning models.

Variations on these usages may involve not training the weights of the model on the new problem initially, but later fine tuning all weights of the learned model with a small learning rate.
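
The difference between the two usages comes down to which weights receive gradient updates. The following is a minimal sketch in plain NumPy, not Keras code: the helper `sgd_step`, the layer names, and the stand-in gradients are all hypothetical and exist only for illustration.

```python
import numpy as np

def sgd_step(weights, grads, trainable, lr=0.01):
    """Apply one gradient descent step, skipping layers marked as not trainable."""
    updated = {}
    for name, w in weights.items():
        if trainable[name]:
            updated[name] = w - lr * grads[name]
        else:
            # frozen layer: weights are carried over unchanged
            updated[name] = w.copy()
    return updated

# hypothetical reused weights and stand-in gradients (not from a real model)
weights = {'hidden1': np.ones((2, 5)), 'output': np.ones((5, 3))}
grads = {'hidden1': np.full((2, 5), 0.5), 'output': np.full((5, 3), 0.5)}

# weight initialization: every reused layer keeps learning
as_init = sgd_step(weights, grads, {'hidden1': True, 'output': True})
# feature extraction: the reused hidden layer is frozen, only the output layer learns
as_features = sgd_step(weights, grads, {'hidden1': False, 'output': True})
print(as_init['hidden1'][0, 0], as_features['hidden1'][0, 0])
```

Fine tuning after an initial frozen phase would correspond to switching the frozen entries back to trainable and continuing with a smaller `lr`.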

We will use a small multi-class classification problem as the basis to demonstrate transfer learning.

The scikit-learn library provides the make_blobs() function, which can be used to create a multi-class classification problem with a prescribed number of samples, input variables, classes, and variance of samples within a class.

We can configure the problem to have two input variables (to represent the *x* and *y* coordinates of the points) and a standard deviation of 2.0 for points within each group. We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.

    # generate 2d classification dataset
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=1)

The results are the input and output elements of a dataset that we can model.

The “*random_state*” argument can be varied to give different versions of the problem (different cluster centers). We can use this to generate samples from two different problems: train a model on one problem and re-use the weights to better learn a model for a second problem.

Specifically, we will refer to *random_state=1* as Problem 1 and *random_state=2* as Problem 2.

- **Problem 1**. Blobs problem with two input variables and three classes with the *random_state* argument set to one.
- **Problem 2**. Blobs problem with two input variables and three classes with the *random_state* argument set to two.
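
To see why changing the seed yields a different but related problem, here is a simplified stand-in for blob generation in NumPy. It is a sketch, not the scikit-learn implementation: the function `simple_blobs` is hypothetical, the box from which centers are drawn is an assumption, and the numbers it produces will not match those from make_blobs().

```python
import numpy as np

def simple_blobs(n_samples, centers, n_features, cluster_std, seed):
    """Simplified stand-in for make_blobs(); not the scikit-learn implementation."""
    rng = np.random.default_rng(seed)
    # draw one center per class uniformly from a box (box limits are an assumption)
    class_centers = rng.uniform(-10.0, 10.0, size=(centers, n_features))
    # assign class labels evenly, then add Gaussian noise around each class center
    y = np.arange(n_samples) % centers
    X = class_centers[y] + rng.normal(0.0, cluster_std, size=(n_samples, n_features))
    return X, y, class_centers

# same structure, different seeds: the cluster centers move, the scale does not
X1, y1, centers1 = simple_blobs(1000, 3, 2, 2.0, seed=1)
X2, y2, centers2 = simple_blobs(1000, 3, 2, 2.0, seed=2)
print(centers1)
print(centers2)
```

Both problems share the same number of inputs, number of classes, and noise level, and differ only in where the class centers fall.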

In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.

The complete example is listed below.

    # plot of blobs multiclass classification problems 1 and 2
    # note: older scikit-learn releases exposed make_blobs via sklearn.datasets.samples_generator
    from sklearn.datasets import make_blobs
    from numpy import where
    from matplotlib import pyplot

    # generate samples for blobs problem with a given random seed
    def samples_for_seed(seed):
        # generate samples
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=seed)
        return X, y

    # create a scatter plot of points colored by class value
    def plot_samples(X, y, classes=3):
        # plot points for each class
        for i in range(classes):
            # select indices of points with each class label
            samples_ix = where(y == i)
            # plot points for this class with a given color
            pyplot.scatter(X[samples_ix, 0], X[samples_ix, 1])

    # generate multiple problems
    n_problems = 2
    for i in range(1, n_problems+1):
        # specify subplot
        pyplot.subplot(210 + i)
        # generate samples
        X, y = samples_for_seed(i)
        # scatter plot of samples
        plot_samples(X, y)
    # plot figure
    pyplot.show()

Running the example generates a sample of 1,000 examples for Problem 1 and Problem 2 and creates a scatter plot for each sample, coloring the data points by their class value.

This provides a good basis for transfer learning as each version of the problem has similar input data with a similar scale, although with different target information (e.g. cluster centers).

We would expect aspects of a model fit on one version of the blobs problem (e.g. Problem 1) to be useful when fitting a model on a new version of the blobs problem (e.g. Problem 2).

In this section, we will develop a Multilayer Perceptron model (MLP) for Problem 1 and save the model to file so that we can reuse the weights later.

First, we will develop a function to prepare the dataset for modeling. After the make_blobs() function is called with a given random seed (e.g., one in this case for Problem 1), the target variable must be one hot encoded so that we can develop a model that predicts the probability of a given sample belonging to each of the target classes.

The prepared samples can then be split in half, with 500 examples for both the train and test datasets. The *samples_for_seed()* function below implements this, preparing the dataset for a given random number seed and returning the train and test sets split into input and output components.

    # prepare blobs examples with a given random seed
    def samples_for_seed(seed):
        # generate samples
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=seed)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, trainy, testX, testy

We can call this function to prepare a dataset for Problem 1 as follows.

    # prepare data
    trainX, trainy, testX, testy = samples_for_seed(1)
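
To make the one hot encoding step concrete, the sketch below reproduces the idea in NumPy on four made-up labels. The helper `one_hot` is hypothetical and is meant only to show the shape of the result, not how Keras implements to_categorical().

```python
import numpy as np

def one_hot(labels, n_classes):
    """Return a (n_samples, n_classes) matrix with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=int)
    encoded = np.zeros((labels.shape[0], n_classes))
    # place a 1.0 in the column that matches each integer class label
    encoded[np.arange(labels.shape[0]), labels] = 1.0
    return encoded

# made-up integer class labels for illustration
y = [0, 2, 1, 2]
encoded = one_hot(y, n_classes=3)
print(encoded)

# first half for training, second half for testing, mirroring the split above
n_train = len(y) // 2
train_y, test_y = encoded[:n_train], encoded[n_train:]
print(train_y.shape, test_y.shape)
```

Each row has one column per class, which matches the three-node softmax output layer used later.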

Next, we can define and fit a model on the training dataset.

The model will expect two inputs for the two variables in the data. The model will have two hidden layers with five nodes each and the rectified linear activation function. Two layers are probably not required for this function, although we’re interested in the model learning some deep structure that we can reuse across instances of this problem. The output layer has three nodes, one for each class in the target variable and the softmax activation function.

    # define model
    model = Sequential()
    model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(3, activation='softmax'))
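
As a quick check on how small this model is, the number of learnable parameters can be counted by hand. The sketch below assumes a hypothetical helper `dense_params` and uses the fact that a fully connected layer has one weight per input-node pair plus one bias per node.

```python
def dense_params(n_inputs, n_nodes):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_nodes + n_nodes

# layer sizes from the model above: 2 inputs -> 5 -> 5 -> 3 outputs
layer_sizes = [2, 5, 5, 3]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [15, 30, 18] 63
```

Calling `model.summary()` on the Keras model should report the same per-layer and total counts.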

Given that the problem is a multi-class classification problem, the categorical cross-entropy loss function is minimized, and stochastic gradient descent with the default learning rate and no momentum is used to fit the model.

    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
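
For intuition about what is being minimized, the following NumPy sketch computes the softmax and the categorical cross-entropy for two made-up samples. The function names and inputs are illustrative assumptions; Keras computes the loss internally and its numerical details may differ.

```python
import numpy as np

def softmax(logits):
    """Convert raw output-layer scores to class probabilities."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=1, keepdims=True)

def categorical_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-probability assigned to the true class."""
    y_prob = np.clip(y_prob, eps, 1.0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

# two made-up samples with one hot encoded targets
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
# an uninformed model assigns equal probability to all three classes
uniform = softmax(np.zeros((2, 3)))
# a confident and correct model puts most probability on the true class
confident = softmax(np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]))

loss_uniform = categorical_crossentropy(y_true, uniform)
loss_confident = categorical_crossentropy(y_true, confident)
print('uniform: %.3f, confident: %.3f' % (loss_uniform, loss_confident))
```

A model that guesses uniformly over three classes has a loss of ln(3), about 1.099, which is a useful reference point when reading the loss curves.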

The model is fit for 100 epochs on the training dataset and the test set is used as a validation dataset during training, evaluating the performance on both datasets at the end of each epoch so that we can plot learning curves.

    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)

The *fit_model()* function ties these elements together, taking the train and test datasets as arguments and returning the fit model and training history.

    # define and fit model on a training dataset
    def fit_model(trainX, trainy, testX, testy):
        # define model
        model = Sequential()
        model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(3, activation='softmax'))
        # compile model
        model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
        # fit model
        history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
        return model, history

We can call this function with the prepared dataset to obtain a fit model and the history collected during the training process.

    # fit model on train dataset
    model, history = fit_model(trainX, trainy, testX, testy)

Finally, we can summarize the performance of the model.

The classification accuracy of the model on the train and test sets can be evaluated.

    # evaluate the model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
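
The accuracy reported by the model is the fraction of samples whose most probable predicted class matches the true class. A minimal NumPy sketch of that calculation, using a hypothetical helper `classification_accuracy` and made-up predictions:

```python
import numpy as np

def classification_accuracy(y_true_onehot, y_prob):
    """Fraction of rows where the most probable class matches the true class."""
    true_class = np.argmax(y_true_onehot, axis=1)
    pred_class = np.argmax(y_prob, axis=1)
    return float(np.mean(true_class == pred_class))

# made-up one hot targets and predicted probabilities for four samples
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([
    [0.7, 0.2, 0.1],   # correct: class 0
    [0.1, 0.6, 0.3],   # correct: class 1
    [0.5, 0.3, 0.2],   # wrong: predicts class 0, true class is 2
    [0.2, 0.5, 0.3],   # correct: class 1
])
acc = classification_accuracy(y_true, y_prob)
print('%.3f' % acc)  # 0.750
```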

The history collected during training can be used to create line plots showing both the loss and classification accuracy for the model on the train and test sets over each training epoch, providing learning curves.

    # plot loss during training
    pyplot.subplot(211)
    pyplot.title('Loss')
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    # plot accuracy during training
    # note: older Keras versions store these under 'acc' and 'val_acc'
    pyplot.subplot(212)
    pyplot.title('Accuracy')
    pyplot.plot(history.history['accuracy'], label='train')
    pyplot.plot(history.history['val_accuracy'], label='test')
    pyplot.legend()
    pyplot.show()

The *summarize_model()* function below implements this, taking the fit model, training history, and dataset as arguments and printing the model performance and creating a plot of model learning curves.

    # summarize the performance of the fit model
    def summarize_model(model, history, trainX, trainy, testX, testy):
        # evaluate the model
        _, train_acc = model.evaluate(trainX, trainy, verbose=0)
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
        # plot loss during training
        pyplot.subplot(211)
        pyplot.title('Loss')
        pyplot.plot(history.history['loss'], label='train')
        pyplot.plot(history.history['val_loss'], label='test')
        pyplot.legend()
        # plot accuracy during training
        # note: older Keras versions store these under 'acc' and 'val_acc'
        pyplot.subplot(212)
        pyplot.title('Accuracy')
        pyplot.plot(history.history['accuracy'], label='train')
        pyplot.plot(history.history['val_accuracy'], label='test')
        pyplot.legend()
        pyplot.show()

We can call this function with the fit model and prepared data.

    # evaluate model behavior
    summarize_model(model, history, trainX, trainy, testX, testy)

At the end of the run, we can save the model to file so that we may load it later and use it as the basis for some transfer learning experiments.

Note that saving the model to file requires that you have the *h5py* library installed. This library can be installed via *pip* as follows:

sudo pip install h5py

The fit model can be saved by calling the *save()* function on the model.

    # save model to file
    model.save('model.h5')

Tying these elements together, the complete example of fitting an MLP on Problem 1, summarizing the model’s performance, and saving the model to file is listed below.

    # fit mlp model on problem 1 and save model to file
    # note: older scikit-learn releases exposed make_blobs via sklearn.datasets.samples_generator
    from sklearn.datasets import make_blobs
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from keras.utils import to_categorical
    from matplotlib import pyplot

    # prepare blobs examples with a given random seed
    def samples_for_seed(seed):
        # generate samples
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=seed)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, trainy, testX, testy

    # define and fit model on a training dataset
    def fit_model(trainX, trainy, testX, testy):
        # define model
        model = Sequential()
        model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(3, activation='softmax'))
        # compile model
        model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
        # fit model
        history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
        return model, history

    # summarize the performance of the fit model
    def summarize_model(model, history, trainX, trainy, testX, testy):
        # evaluate the model
        _, train_acc = model.evaluate(trainX, trainy, verbose=0)
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
        # plot loss during training
        pyplot.subplot(211)
        pyplot.title('Loss')
        pyplot.plot(history.history['loss'], label='train')
        pyplot.plot(history.history['val_loss'], label='test')
        pyplot.legend()
        # plot accuracy during training
        # note: older Keras versions store these under 'acc' and 'val_acc'
        pyplot.subplot(212)
        pyplot.title('Accuracy')
        pyplot.plot(history.history['accuracy'], label='train')
        pyplot.plot(history.history['val_accuracy'], label='test')
        pyplot.legend()
        pyplot.show()

    # prepare data
    trainX, trainy, testX, testy = samples_for_seed(1)
    # fit model on train dataset
    model, history = fit_model(trainX, trainy, testX, testy)
    # evaluate model behavior
    summarize_model(model, history, trainX, trainy, testX, testy)
    # save model to file
    model.save('model.h5')

Running the example fits and evaluates the performance of the model, printing the classification accuracy on the train and test sets.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model performed well on Problem 1, achieving a classification accuracy of about 92% on both the train and test datasets.

Train: 0.916, Test: 0.920

A figure is also created summarizing the learning curves of the model, showing both the loss (top) and accuracy (bottom) for the model on both the train (blue) and test (orange) datasets at the end of each training epoch.

Your plot may not look identical but is expected to show the same general behavior. If not, try running the example a few times.

In this case, we can see that the model learned the problem reasonably quickly and well, perhaps converging in about 40 epochs and remaining reasonably stable on both datasets.

Now that we have seen how to develop a standalone MLP for the blobs Problem 1, we can look at doing the same for Problem 2 to provide a baseline.

The example in the previous section can be updated to fit an MLP model to Problem 2.

It is important to get an idea of performance and learning dynamics on Problem 2 for a standalone model first as this will provide a baseline in performance that can be used to compare to a model fit on the same problem using transfer learning.

A single change is required: the call to *samples_for_seed()* must use a pseudorandom number generator seed of two instead of one.

    # prepare data
    trainX, trainy, testX, testy = samples_for_seed(2)

For completeness, the full example with this change is listed below.

    # fit mlp model on problem 2
    # note: older scikit-learn releases exposed make_blobs via sklearn.datasets.samples_generator
    from sklearn.datasets import make_blobs
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from keras.utils import to_categorical
    from matplotlib import pyplot

    # prepare blobs examples with a given random seed
    def samples_for_seed(seed):
        # generate samples
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=seed)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, trainy, testX, testy

    # define and fit model on a training dataset
    def fit_model(trainX, trainy, testX, testy):
        # define model
        model = Sequential()
        model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(3, activation='softmax'))
        # compile model
        model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
        # fit model
        history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
        return model, history

    # summarize the performance of the fit model
    def summarize_model(model, history, trainX, trainy, testX, testy):
        # evaluate the model
        _, train_acc = model.evaluate(trainX, trainy, verbose=0)
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
        # plot loss during training
        pyplot.subplot(211)
        pyplot.title('Loss')
        pyplot.plot(history.history['loss'], label='train')
        pyplot.plot(history.history['val_loss'], label='test')
        pyplot.legend()
        # plot accuracy during training
        # note: older Keras versions store these under 'acc' and 'val_acc'
        pyplot.subplot(212)
        pyplot.title('Accuracy')
        pyplot.plot(history.history['accuracy'], label='train')
        pyplot.plot(history.history['val_accuracy'], label='test')
        pyplot.legend()
        pyplot.show()

    # prepare data
    trainX, trainy, testX, testy = samples_for_seed(2)
    # fit model on train dataset
    model, history = fit_model(trainX, trainy, testX, testy)
    # evaluate model behavior
    summarize_model(model, history, trainX, trainy, testX, testy)

Running the example fits and evaluates the performance of the model, printing the classification accuracy on the train and test sets.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model performed okay on Problem 2, but not as well as was seen on Problem 1, achieving a classification accuracy of about 79% on both the train and test datasets.

Train: 0.794, Test: 0.794

A figure is also created summarizing the learning curves of the model. Your plot may not look identical but is expected to show the same general behavior. If not, try running the example a few times.

In this case, we can see that the model converged more slowly than we saw on Problem 1 in the previous section. This suggests that this version of the problem may be slightly more challenging, at least for the chosen model configuration.

Now that we have a baseline of performance and learning dynamics for an MLP on Problem 2, we can see how the addition of transfer learning affects the MLP on this problem.

The model that was fit on Problem 1 can be loaded and the weights can be used as the initial weights for a model fit on Problem 2.

This is a type of transfer learning where learning on a different but related problem is used as a type of weight initialization scheme.

This requires that the *fit_model()* function be updated to load the model and refit it on examples for Problem 2.

The model saved in ‘model.h5’ can be loaded using the *load_model()* Keras function.

    # load model
    model = load_model('model.h5')

Once loaded, the model can be compiled and fit as per normal.

The updated *fit_model()* with this change is listed below.

    # load and re-fit model on a training dataset
    def fit_model(trainX, trainy, testX, testy):
        # load model
        model = load_model('model.h5')
        # compile model
        model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
        # re-fit model
        history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
        return model, history

We would expect a model that starts from the weights of a model fit on a different but related problem to learn the problem faster, as seen in the learning curve, and perhaps to achieve lower generalization error, although both effects depend on the choice of problems and model.

For completeness, the full example with this change is listed below.

    # transfer learning with mlp model on problem 2
    # note: older scikit-learn releases exposed make_blobs via sklearn.datasets.samples_generator
    from sklearn.datasets import make_blobs
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from keras.utils import to_categorical
    from keras.models import load_model
    from matplotlib import pyplot

    # prepare blobs examples with a given random seed
    def samples_for_seed(seed):
        # generate samples
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=seed)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, trainy, testX, testy

    # load and re-fit model on a training dataset
    def fit_model(trainX, trainy, testX, testy):
        # load model
        model = load_model('model.h5')
        # compile model
        model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
        # re-fit model
        history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
        return model, history

    # summarize the performance of the fit model
    def summarize_model(model, history, trainX, trainy, testX, testy):
        # evaluate the model
        _, train_acc = model.evaluate(trainX, trainy, verbose=0)
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
        # plot loss during training
        pyplot.subplot(211)
        pyplot.title('Loss')
        pyplot.plot(history.history['loss'], label='train')
        pyplot.plot(history.history['val_loss'], label='test')
        pyplot.legend()
        # plot accuracy during training
        # note: older Keras versions store these under 'acc' and 'val_acc'
        pyplot.subplot(212)
        pyplot.title('Accuracy')
        pyplot.plot(history.history['accuracy'], label='train')
        pyplot.plot(history.history['val_accuracy'], label='test')
        pyplot.legend()
        pyplot.show()

    # prepare data
    trainX, trainy, testX, testy = samples_for_seed(2)
    # fit model on train dataset
    model, history = fit_model(trainX, trainy, testX, testy)
    # evaluate model behavior
    summarize_model(model, history, trainX, trainy, testX, testy)

Running the example fits and evaluates the performance of the model, printing the classification accuracy on the train and test sets.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that the model achieved a lower generalization error, achieving an accuracy of about 81% on the test dataset for Problem 2 as compared to the standalone model that achieved about 79% accuracy.

Train: 0.786, Test: 0.810

A figure is also created summarizing the learning curves of the model. Your plot may not look identical but is expected to show the same general behavior. If not, try running the example a few times.

In this case, the learning curve looks similar to that of the standalone model, although the curve for the test set (orange line) shows apparent improvements: performance is better earlier (from about epoch 20 onward) and sits above the performance of the model on the training set.

We have only looked at single runs of a standalone MLP model and an MLP with transfer learning.

Neural network algorithms are stochastic; therefore, an average of performance across multiple runs is required to see whether the observed behavior is real or a statistical fluke.

In order to determine whether using transfer learning for the blobs multi-class classification problem has a real effect, we must repeat each experiment multiple times and analyze the average performance across the repeats.

We will compare the performance of the standalone model trained on Problem 2 to a model using transfer learning, averaged over 30 repeats.

Further, we will investigate whether keeping the weights in some of the layers fixed improves model performance.

The model trained on Problem 1 has two hidden layers. By keeping the first or the first and second hidden layers fixed, the layers with unchangeable weights will act as a feature extractor and may provide features that make learning Problem 2 easier, affecting the speed of learning and/or the accuracy of the model on the test set.

As the first step, we will simplify the *fit_model()* function to fit the model and discard any training history so that we can focus on the final accuracy of the trained model.

    # define and fit model on a training dataset
    def fit_model(trainX, trainy):
        # define model
        model = Sequential()
        model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(3, activation='softmax'))
        # compile model
        model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)
        return model

Next, we can develop a function that will repeatedly fit a new standalone model on Problem 2 on the training dataset and evaluate accuracy on the test set.

The *eval_standalone_model()* function below implements this, taking the train and test sets as arguments as well as the number of repeats and returning a list of accuracy scores for models on the test dataset.

    # repeated evaluation of a standalone model
    def eval_standalone_model(trainX, trainy, testX, testy, n_repeats):
        scores = list()
        for _ in range(n_repeats):
            # define and fit a new model on the train dataset
            model = fit_model(trainX, trainy)
            # evaluate model on test dataset
            _, test_acc = model.evaluate(testX, testy, verbose=0)
            scores.append(test_acc)
        return scores

Summarizing the distribution of accuracy scores returned from this function will give an idea of how well the chosen standalone model performs on Problem 2.

    # repeated evaluation of standalone model
    standalone_scores = eval_standalone_model(trainX, trainy, testX, testy, n_repeats)
    print('Standalone %.3f (%.3f)' % (mean(standalone_scores), std(standalone_scores)))

Next, we need an equivalent function for evaluating a model using transfer learning.

In each loop, the model trained on Problem 1 must be loaded from file, fit on the training dataset for Problem 2, then evaluated on the test set for Problem 2.

In addition, we will configure 0, 1, or 2 of the hidden layers in the loaded model to remain fixed. Keeping 0 hidden layers fixed means that all of the weights in the model will be adapted when learning Problem 2, using transfer learning as a weight initialization scheme. In contrast, keeping both (2) of the hidden layers fixed means that only the output layer of the model will be adapted during training, using transfer learning as a feature extraction method.

The *eval_transfer_model()* function below implements this, taking the train and test datasets for Problem 2 as arguments as well as the number of hidden layers in the loaded model to keep fixed and the number of times to repeat the experiment.

The function returns a list of test accuracy scores and summarizing this distribution will give a reasonable idea of how well the model with the chosen type of transfer learning performs on Problem 2.

    # repeated evaluation of a model with transfer learning
    def eval_transfer_model(trainX, trainy, testX, testy, n_fixed, n_repeats):
        scores = list()
        for _ in range(n_repeats):
            # load model
            model = load_model('model.h5')
            # mark layer weights as fixed or not trainable
            for i in range(n_fixed):
                model.layers[i].trainable = False
            # re-compile model
            model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
            # fit model on train dataset
            model.fit(trainX, trainy, epochs=100, verbose=0)
            # evaluate model on test dataset
            _, test_acc = model.evaluate(testX, testy, verbose=0)
            scores.append(test_acc)
        return scores

We can call this function repeatedly, setting n_fixed to 0, 1, 2 in a loop and summarizing performance as we go; for example:

    # repeated evaluation of transfer learning model, vary fixed layers
    n_fixed = 3
    for i in range(n_fixed):
        scores = eval_transfer_model(trainX, trainy, testX, testy, i, n_repeats)
        print('Transfer (fixed=%d) %.3f (%.3f)' % (i, mean(scores), std(scores)))
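
It can help to see how much of the model is still free to learn for each value of n_fixed. The sketch below counts trainable and frozen parameters for the 2-5-5-3 network used in this tutorial; the helpers `dense_params` and `trainable_params` are hypothetical names introduced for illustration.

```python
def dense_params(n_inputs, n_nodes):
    """Weights plus biases for one fully connected layer."""
    return n_inputs * n_nodes + n_nodes

# parameters per layer for the 2 -> 5 -> 5 -> 3 model: hidden 1, hidden 2, output
layer_params = [dense_params(2, 5), dense_params(5, 5), dense_params(5, 3)]

def trainable_params(n_fixed):
    """Parameters still updated when the first n_fixed layers are frozen."""
    return sum(layer_params[n_fixed:])

for n_fixed in range(3):
    frozen = sum(layer_params[:n_fixed])
    print('fixed=%d trainable=%d frozen=%d' % (n_fixed, trainable_params(n_fixed), frozen))
```

With both hidden layers frozen, only the 18 parameters of the output layer can adapt to Problem 2, which gives a sense of how restrictive that configuration is.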

In addition to reporting the mean and standard deviation of each model, we can collect all scores and create a box and whisker plot to summarize and compare the distributions of model scores.

Tying all of these elements together, the complete example is listed below.

    # compare standalone mlp model performance to transfer learning
    # note: older scikit-learn releases exposed make_blobs via sklearn.datasets.samples_generator
    from sklearn.datasets import make_blobs
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from keras.utils import to_categorical
    from keras.models import load_model
    from matplotlib import pyplot
    from numpy import mean
    from numpy import std

    # prepare blobs examples with a given random seed
    def samples_for_seed(seed):
        # generate samples
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=seed)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, trainy, testX, testy

    # define and fit model on a training dataset
    def fit_model(trainX, trainy):
        # define model
        model = Sequential()
        model.add(Dense(5, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(5, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(3, activation='softmax'))
        # compile model
        model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)
        return model

    # repeated evaluation of a standalone model
    def eval_standalone_model(trainX, trainy, testX, testy, n_repeats):
        scores = list()
        for _ in range(n_repeats):
            # define and fit a new model on the train dataset
            model = fit_model(trainX, trainy)
            # evaluate model on test dataset
            _, test_acc = model.evaluate(testX, testy, verbose=0)
            scores.append(test_acc)
        return scores

    # repeated evaluation of a model with transfer learning
    def eval_transfer_model(trainX, trainy, testX, testy, n_fixed, n_repeats):
        scores = list()
        for _ in range(n_repeats):
            # load model
            model = load_model('model.h5')
            # mark layer weights as fixed or not trainable
            for i in range(n_fixed):
                model.layers[i].trainable = False
            # re-compile model
            model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
            # fit model on train dataset
            model.fit(trainX, trainy, epochs=100, verbose=0)
            # evaluate model on test dataset
            _, test_acc = model.evaluate(testX, testy, verbose=0)
            scores.append(test_acc)
        return scores

    # prepare data for problem 2
    trainX, trainy, testX, testy = samples_for_seed(2)
    n_repeats = 30
    dists, dist_labels = list(), list()

    # repeated evaluation of standalone model
    standalone_scores = eval_standalone_model(trainX, trainy, testX, testy, n_repeats)
    print('Standalone %.3f (%.3f)' % (mean(standalone_scores), std(standalone_scores)))
    dists.append(standalone_scores)
    dist_labels.append('standalone')

    # repeated evaluation of transfer learning model, vary fixed layers
    n_fixed = 3
    for i in range(n_fixed):
        scores = eval_transfer_model(trainX, trainy, testX, testy, i, n_repeats)
        print('Transfer (fixed=%d) %.3f (%.3f)' % (i, mean(scores), std(scores)))
        dists.append(scores)
        dist_labels.append('transfer f='+str(i))

    # box and whisker plot of score distributions
    pyplot.boxplot(dists, labels=dist_labels)
    pyplot.show()

Running the example first reports the mean and standard deviation of classification accuracy on the test dataset for each model.

In this case, we can see that the standalone model achieved an accuracy of about 79% on Problem 2 with a large standard deviation of about 10%. In contrast, the spread of each of the transfer learning models is much smaller, with standard deviations ranging from about 0.4% to 1.4%.

This difference in the standard deviations of the test accuracy scores shows the stability that transfer learning can bring to the model, reducing the variance in the performance of the final model introduced via the stochastic learning algorithm.

Comparing the mean test accuracy of the models, we can see that transfer learning that used the model as a weight initialization scheme (fixed=0) achieved about 80% accuracy, which is better than the standalone model.

Keeping all hidden layers fixed (fixed=2) and using them as a feature extraction scheme resulted in worse performance on average than the standalone model. It suggests that the approach is too restrictive in this case.

Interestingly, we see best performance when the first hidden layer is kept fixed (fixed=1) and the second hidden layer is adapted to the problem with a test classification accuracy of about 81%. This suggests that in this case, the problem benefits from both the feature extraction and weight initialization properties of transfer learning.

It may be interesting to see how results of this last approach compare to the same model where the weights of the second hidden layer (and perhaps the output layer) are re-initialized with random numbers. This comparison would demonstrate whether the feature extraction properties of transfer learning alone or both feature extraction and weight initialization properties are beneficial.

    Standalone 0.787 (0.101)
    Transfer (fixed=0) 0.805 (0.004)
    Transfer (fixed=1) 0.817 (0.005)
    Transfer (fixed=2) 0.750 (0.014)

A figure is created showing four box and whisker plots. The box shows the middle 50% of each data distribution, the orange line shows the median, and the dots show outliers.

The boxplot for the standalone model shows a number of outliers, indicating that on average, the model performs well, but there is a chance that it can perform very poorly.

Conversely, we see that the behavior of the models with transfer learning is more stable, showing a tighter distribution in performance.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Reverse Experiment**. Train and save a model for Problem 2 and see if it can help when using it for transfer learning on Problem 1.
- **Add Hidden Layer**. Update the example to keep both hidden layers fixed, but add a new hidden layer with randomly initialized weights after the fixed layers before the output layer and compare performance.
- **Randomly Initialize Layers**. Update the example to randomly initialize the weights of the second hidden layer and the output layer and compare performance.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Transfer Learning for Deep Learning
- How to Develop a Deep Learning Photo Caption Generator from Scratch

- Deep Learning of Representations for Unsupervised and Transfer Learning, 2011.
- Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach, 2011.
- Is Learning The n-th Thing Any Easier Than Learning The First?, 1996.

- Section 5.2 Transfer Learning and Domain Adaptation, Deep Learning, 2016.

In this tutorial, you discovered how to use transfer learning to improve the performance of deep learning neural networks in Python with Keras.

Specifically, you learned:

- Transfer learning is a method for reusing a model trained on a related predictive modeling problem.
- Transfer learning can be used to accelerate the training of neural networks as either a weight initialization scheme or feature extraction method.
- How to use transfer learning to improve the performance of an MLP for a multiclass classification problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Improve Performance With Transfer Learning for Deep Learning Neural Networks appeared first on Machine Learning Mastery.

The post How to Avoid Exploding Gradients in Neural Networks With Gradient Clipping appeared first on Machine Learning Mastery.

Training a neural network can become unstable given the choice of error function, learning rate, or even the scale of the target variable.

Large updates to weights during training can cause a numerical overflow or underflow often referred to as “*exploding gradients*.”

The problem of exploding gradients is more common with recurrent neural networks, such as LSTMs, given the accumulation of gradients unrolled over hundreds of input time steps.

A common and relatively easy solution to the exploding gradients problem is to change the derivative of the error before propagating it backward through the network and using it to update the weights. Two approaches include rescaling the gradients given a chosen vector norm and clipping gradient values that exceed a preferred range. Together, these methods are referred to as “*gradient clipping*.”

In this tutorial, you will discover the exploding gradient problem and how to improve neural network training stability using gradient clipping.

After completing this tutorial, you will know:

- Training neural networks can become unstable, leading to a numerical overflow or underflow referred to as exploding gradients.
- The training process can be made stable by changing the error gradients either by scaling the vector norm or clipping gradient values to a range.
- How to update an MLP model for a regression predictive modeling problem with exploding gradients to have a stable training process using gradient clipping methods.

Let’s get started.

This tutorial is divided into six parts; they are:

- Exploding Gradients and Clipping
- Gradient Clipping in Keras
- Regression Predictive Modeling Problem
- Multilayer Perceptron With Exploding Gradients
- MLP With Gradient Norm Scaling
- MLP With Gradient Value Clipping

Neural networks are trained using the stochastic gradient descent optimization algorithm.

This requires first the estimation of the loss on one or more training examples, then the calculation of the derivative of the loss, which is propagated backward through the network in order to update the weights. Weights are updated using a fraction of the back propagated error controlled by the learning rate.
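As a rough sketch of the update just described, plain stochastic gradient descent can be written as a single rule. The symbols are assumptions introduced here for illustration: w stands for the weights, \eta for the learning rate, and L for the loss estimated on the current batch. Momentum and other refinements are left out.

```latex
w \leftarrow w - \eta \, \nabla_{w} L(w)
```

Read this way, a very large gradient \nabla_{w} L(w) or a large learning rate \eta both translate directly into a very large change to the weights.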

It is possible for the updates to the weights to be so large that the weights either overflow or underflow their numerical precision. In practice, the weights can take on the value of an “*NaN*” or “*Inf*” when they overflow or underflow and for practical purposes the network will be useless from that point forward, forever predicting NaN values as signals flow through the invalid weights.
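The effect can be reproduced on a toy problem. The sketch below is only an illustration under stated assumptions: it minimizes the made-up function f(w) = w^2 with 32-bit floats and a deliberately oversized learning rate, rather than training a real network. Each update doubles the magnitude of the weight until it overflows to infinity, after which the arithmetic produces NaN.

```python
import numpy as np

# Toy problem (an assumption for illustration): minimize f(w) = w^2, gradient 2w.
w = np.float32(1.0)
learning_rate = np.float32(1.5)  # deliberately too large, so each step overshoots
first_bad_step = None

# silence the overflow/invalid warnings that NumPy would otherwise emit
with np.errstate(over='ignore', invalid='ignore'):
    for step in range(1, 201):
        grad = np.float32(2.0) * w
        w = w - learning_rate * grad
        # record the first update at which the weight is no longer a finite number
        if first_bad_step is None and not np.isfinite(w):
            first_bad_step = step

print('first non-finite weight at step:', first_bad_step)
print('final weight:', w)
```

Once the weight is Inf or NaN, every later update stays invalid, which mirrors the behavior described above for a full network.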

The difficulty that arises is that when the parameter gradient is very large, a gradient descent parameter update could throw the parameters very far, into a region where the objective function is larger, undoing much of the work that had been done to reach the current solution.

— Page 413, Deep Learning, 2016.

The underflow or overflow of weights is generally referred to as an instability of the network training process and is known by the name “*exploding gradients*” as the unstable training process causes the network to fail to train in such a way that the model is essentially useless.

In a given neural network, such as a Convolutional Neural Network or Multilayer Perceptron, this can happen due to a poor choice of configuration. Some examples include:

- Poor choice of learning rate that results in large weight updates.
- Poor choice of data preparation, allowing large differences in the target variable.
- Poor choice of loss function, allowing the calculation of large error values.

Exploding gradients is also a problem in recurrent neural networks such as the Long Short-Term Memory network given the accumulation of error gradients in the unrolled recurrent structure.

Exploding gradients can be avoided in general by careful configuration of the network model, such as choice of small learning rate, scaled target variables, and a standard loss function. Nevertheless, exploding gradients may still be an issue with recurrent networks with a large number of input time steps.

One difficulty when training LSTM with the full gradient is that the derivatives sometimes become excessively large, leading to numerical problems. To prevent this, [we] clipped the derivative of the loss with respect to the network inputs to the LSTM layers (before the sigmoid and tanh functions are applied) to lie within a predefined range.

— Generating Sequences With Recurrent Neural Networks, 2013.

A common solution to exploding gradients is to change the error derivative before propagating it backward through the network and using it to update the weights. By rescaling the error derivative, the updates to the weights will also be rescaled, dramatically decreasing the likelihood of an overflow or underflow.

There are two main methods for updating the error derivative; they are:

- Gradient Scaling.
- Gradient Clipping.

Gradient scaling involves normalizing the error gradient vector such that vector norm (magnitude) equals a defined value, such as 1.0.

… one simple mechanism to deal with a sudden increase in the norm of the gradients is to rescale them whenever they go over a threshold

— On the difficulty of training Recurrent Neural Networks, 2013.

Gradient clipping involves forcing the gradient values (element-wise) to a specific minimum or maximum value if the gradient exceeded an expected range.

Together, these methods are often simply referred to as “*gradient clipping*.”
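The two methods can be sketched in a few lines of NumPy. This is a minimal illustration, not how any particular library implements clipping; the function names `scale_by_norm` and `clip_by_value` and the example gradient are hypothetical. The scaling variant shown here follows the threshold form from the quote above: the gradient is only rescaled when its norm exceeds the chosen value.

```python
import numpy as np

def scale_by_norm(grad, max_norm=1.0):
    """Rescale grad so that its L2 norm is at most max_norm, keeping its direction."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

def clip_by_value(grad, limit=0.5):
    """Force each element of grad into the range [-limit, limit]."""
    return np.clip(grad, -limit, limit)

# hypothetical gradient vector with an L2 norm of about 5
grad = np.array([3.0, -4.0, 0.1])

scaled = scale_by_norm(grad, max_norm=1.0)
clipped = clip_by_value(grad, limit=0.5)

print('scaled :', scaled, 'norm:', np.linalg.norm(scaled))
print('clipped:', clipped)
```

Note the difference in behavior: scaling shrinks every element by the same factor, so the direction of the gradient is preserved, whereas element-wise clipping only changes the values that fall outside the range and can therefore change the direction.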

When the traditional gradient descent algorithm proposes to make a very large step, the gradient clipping heuristic intervenes to reduce the step size to be small enough that it is less likely to go outside the region where the gradient indicates the direction of approximately steepest descent.

— Page 289, Deep Learning, 2016.

It is a method that only addresses the numerical stability of training deep neural network models and does not offer any general improvement in performance.

The value for the gradient vector norm or preferred range can be configured by trial and error, by using common values used in the literature or by first observing common vector norms or ranges via experimentation and then choosing a sensible value.

In our experiments we have noticed that for a given task and model size, training is not very sensitive to this [gradient norm] hyperparameter and the algorithm behaves well even for rather small thresholds.

— On the difficulty of training Recurrent Neural Networks, 2013.

It is common to use the same gradient clipping configuration for all layers in the network. Nevertheless, there are examples where a larger range of error gradients are permitted in the output layer compared to hidden layers.

The output derivatives […] were clipped in the range [−100, 100], and the LSTM derivatives were clipped in the range [−10, 10]. Clipping the output gradients proved vital for numerical stability; even so, the networks sometimes had numerical problems late on in training, after they had started overfitting on the training data.

— Generating Sequences With Recurrent Neural Networks, 2013.


Keras supports gradient clipping on each optimization algorithm, with the same scheme applied to all layers in the model.

Gradient clipping can be used with an optimization algorithm, such as stochastic gradient descent, via including an additional argument when configuring the optimization algorithm.

Two types of gradient clipping can be used: gradient norm scaling and gradient value clipping.

Gradient norm scaling involves changing the derivatives of the loss function to have a given vector norm when the L2 vector norm (the square root of the sum of the squared values) of the gradient vector exceeds a threshold value.

For example, we could specify a norm of 1.0, meaning that if the vector norm for a gradient exceeds 1.0, then the values in the vector will be rescaled so that the norm of the vector equals 1.0.

This can be used in Keras by specifying the “*clipnorm*” argument on the optimizer; for example:

    ...
    # configure sgd with gradient norm clipping
    opt = SGD(lr=0.01, momentum=0.9, clipnorm=1.0)

Gradient value clipping involves clipping the derivatives of the loss function to have a given value if a gradient value is less than a negative threshold or more than the positive threshold.

For example, we could specify a clip value of 0.5, meaning that if a gradient value is less than -0.5, it is set to -0.5 and if it is more than 0.5, it is set to 0.5.

This can be used in Keras by specifying the “*clipvalue*” argument on the optimizer; for example:

    ...
    # configure sgd with gradient value clipping
    opt = SGD(lr=0.01, momentum=0.9, clipvalue=0.5)

A regression predictive modeling problem involves predicting a real-valued quantity.

We can use a standard regression problem generator provided by the scikit-learn library in the make_regression() function. This function will generate examples from a simple regression problem with a given number of input variables, statistical noise, and other properties.

We will use this function to define a problem that has 20 input features; 10 of the features will be meaningful and 10 will not be relevant. A total of 1,000 examples will be randomly generated. The pseudorandom number generator will be fixed to ensure that we get the same 1,000 examples each time the code is run.

    # generate regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

Each input variable has a Gaussian distribution, as does the target variable.

We can create plots of the target variable showing both the distribution and spread. The complete example is listed below.

    # regression predictive modeling problem
    from sklearn.datasets import make_regression
    from matplotlib import pyplot
    # generate regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
    # histogram of target variable
    pyplot.subplot(121)
    pyplot.hist(y)
    # boxplot of target variable
    pyplot.subplot(122)
    pyplot.boxplot(y)
    pyplot.show()

Running the example creates a figure with two plots showing a histogram and a box and whisker plot of the target variable.

The histogram shows the Gaussian distribution of the target variable. The box and whisker plot shows that the samples range between about -400 and 400 and are centered on about 0.0.

We can develop a Multilayer Perceptron (MLP) model for the regression problem.

A model will be demonstrated on the raw data, without any scaling of the input or output variables. This is a good example to demonstrate exploding gradients as a model trained to predict the unscaled target variable will result in error gradients with values in the hundreds or even thousands, depending on the batch size used during training. Such large gradient values are likely to lead to unstable learning or an overflow of the weight values.
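To get a feel for the size of these gradients, the sketch below computes the mean squared error gradient of a simple linear model at its starting point, once with a raw target and once with a standardized target. It is a rough illustration under stated assumptions: the data is generated with NumPy using made-up coefficients rather than with make_regression(), the model is linear rather than an MLP, and the helper name `mse_gradient` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical dataset: 500 examples, 20 Gaussian inputs, large-valued target
X = rng.normal(size=(500, 20))
true_w = rng.normal(scale=50.0, size=20)  # made-up coefficients
y = X @ true_w + rng.normal(scale=0.1, size=500)

def mse_gradient(X, y, w):
    """Gradient of mean((X @ w - y)^2) with respect to the weights w."""
    err = X @ w - y
    return 2.0 * X.T @ err / len(y)

# evaluate the gradient at an all-zero starting point
w0 = np.zeros(20)
raw_norm = np.linalg.norm(mse_gradient(X, y, w0))

# the same gradient after standardizing the target variable
y_std = (y - y.mean()) / y.std()
scaled_norm = np.linalg.norm(mse_gradient(X, y_std, w0))

print('gradient norm, raw target         : %.1f' % raw_norm)
print('gradient norm, standardized target: %.1f' % scaled_norm)
```

Because the error enters the gradient directly, the gradient grows in proportion to the spread of the target, which is why an unscaled target with values in the hundreds produces gradients of a similar magnitude.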

The first step is to split the data into train and test sets so that we can fit and evaluate a model. We will generate 1,000 examples from the domain and split the dataset in half, using 500 examples each for the train and test sets.

    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]

Next, we can define an MLP model.

The model will expect 20 inputs, one for each of the 20 input variables in the problem. A single hidden layer will be used with 25 nodes and a rectified linear activation function. The output layer has one node for the single target variable and a linear activation function to predict real values directly.

    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='linear'))

The mean squared error loss function will be used to optimize the model and the stochastic gradient descent optimization algorithm will be used with the sensible default configuration of a learning rate of 0.01 and a momentum of 0.9.

    # compile model
    model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9))

The model will be fit for 100 training epochs and the test set will be used as a validation set, evaluated at the end of each training epoch.

The mean squared error is calculated on the train and test datasets at the end of training to get an idea of how well the model learned the problem.

    # evaluate the model
    train_mse = model.evaluate(trainX, trainy, verbose=0)
    test_mse = model.evaluate(testX, testy, verbose=0)

Finally, learning curves of mean squared error on the train and test sets at the end of each training epoch are graphed using line plots, providing learning curves to get an idea of the dynamics of the model while learning the problem.

    # plot loss during training
    pyplot.title('Mean Squared Error')
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Tying these elements together, the complete example is listed below.

    # mlp with unscaled data for the regression problem
    from sklearn.datasets import make_regression
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from matplotlib import pyplot
    # generate regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='linear'))
    # compile model
    model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9))
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
    # evaluate the model
    train_mse = model.evaluate(trainX, trainy, verbose=0)
    test_mse = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
    # plot loss during training
    pyplot.title('Mean Squared Error')
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example fits the model and calculates the mean squared error on the train and test sets.

In this case, the model is unable to learn the problem, resulting in predictions of NaN values. The model weights exploded during training given the very large errors and in turn error gradients calculated for weight updates.

Train: nan, Test: nan

This demonstrates that some intervention is required with regard to the target variable for the model to learn this problem.

A line plot of training history is created but does not show anything as the model almost immediately results in a NaN mean squared error.

A traditional solution would be to rescale the target variable using either standardization or normalization, and this approach is recommended for MLPs. Nevertheless, an alternative that we will investigate in this case will be the use of gradient clipping.

We can update the training of the model in the previous section to add gradient norm scaling.

This can be implemented by setting the “*clipnorm*” argument on the optimizer.

For example, the gradients can be rescaled to have a vector norm (magnitude or length) of 1.0, as follows:

    # compile model
    opt = SGD(lr=0.01, momentum=0.9, clipnorm=1.0)
    model.compile(loss='mean_squared_error', optimizer=opt)

The complete example with this change is listed below.

    # mlp with unscaled data for the regression problem with gradient norm scaling
    from sklearn.datasets import make_regression
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from matplotlib import pyplot
    # generate regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='linear'))
    # compile model
    opt = SGD(lr=0.01, momentum=0.9, clipnorm=1.0)
    model.compile(loss='mean_squared_error', optimizer=opt)
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
    # evaluate the model
    train_mse = model.evaluate(trainX, trainy, verbose=0)
    test_mse = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
    # plot loss during training
    pyplot.title('Mean Squared Error')
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example fits the model and evaluates it on the train and test sets, printing the mean squared error.

Your results may vary given the stochastic nature of the learning algorithm. You can try running the example a few times.

In this case, we can see that scaling the gradient with a vector norm of 1.0 has resulted in a stable model capable of learning the problem and converging on a solution.

Train: 5.082, Test: 27.433

A line plot is also created showing the mean squared error loss on the train and test datasets over training epochs.

The plot shows how loss dropped from large values above 20,000 down to small values below 100 rapidly over 20 epochs.

There is nothing special about the vector norm value of 1.0, and other values could be evaluated and the performance of the resulting model compared.

Another solution to the exploding gradient problem is to clip the gradient if it becomes too large or too small.

We can update the training of the MLP to use gradient clipping by adding the “*clipvalue*” argument to the optimization algorithm configuration. For example, the code below clips the gradient to the range [-5, 5].

    # compile model
    opt = SGD(lr=0.01, momentum=0.9, clipvalue=5.0)
    model.compile(loss='mean_squared_error', optimizer=opt)

The complete example of training the MLP with gradient clipping is listed below.

    # mlp with unscaled data for the regression problem with gradient clipping
    from sklearn.datasets import make_regression
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from matplotlib import pyplot
    # generate regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='linear'))
    # compile model
    opt = SGD(lr=0.01, momentum=0.9, clipvalue=5.0)
    model.compile(loss='mean_squared_error', optimizer=opt)
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
    # evaluate the model
    train_mse = model.evaluate(trainX, trainy, verbose=0)
    test_mse = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
    # plot loss during training
    pyplot.title('Mean Squared Error')
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Running this example fits the model and evaluates it on the train and test sets, printing the mean squared error.

Your results may vary given the stochastic nature of the learning algorithm. You can try running the example a few times.

We can see that in this case, the model is able to learn the problem without exploding gradients, achieving an MSE of below 10 on both the train and test sets.

Train: 9.487, Test: 9.985

A line plot is also created showing the mean squared error loss on the train and test datasets over training epochs.

The plot shows that the model learns the problem fast, achieving a sub-100 MSE loss within just a few training epochs.

A clipped range of [-5, 5] was chosen arbitrarily; you can experiment with different sized ranges and compare the speed of learning and the final model performance.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Vector Norm Values**. Update the example to evaluate different gradient vector norm values and compare performance.
- **Vector Clip Values**. Update the example to evaluate different gradient value ranges and compare performance.
- **Vector Norm and Clip**. Update the example to use a combination of vector norm scaling and vector value clipping on the same training run and compare performance.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- Section 8.2.4 Cliffs and Exploding Gradients, Deep Learning, 2016.
- Section 10.11.1 Clipping Gradients, Deep Learning, 2016.

- On the difficulty of training Recurrent Neural Networks, 2013.
- Generating Sequences With Recurrent Neural Networks, 2013.

In this tutorial, you discovered the exploding gradient problem and how to improve neural network training stability using gradient clipping.

Specifically, you learned:

- Training neural networks can become unstable, leading to a numerical overflow or underflow referred to as exploding gradients.
- The training process can be made stable by changing the error gradients, either by scaling the vector norm or clipping gradient values to a range.
- How to update an MLP model for a regression predictive modeling problem with exploding gradients to have a stable training process using gradient clipping methods.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.


The post How to Improve Neural Network Stability and Modeling Performance With Data Scaling appeared first on Machine Learning Mastery.

Deep learning neural networks learn how to map inputs to outputs from examples in a training dataset.

The weights of the model are initialized to small random values and updated via an optimization algorithm in response to estimates of error on the training dataset.

Given the use of small weights in the model and the use of error between predictions and expected values, the scale of inputs and outputs used to train the model is an important factor. Unscaled input variables can result in a slow or unstable learning process, whereas unscaled target variables on regression problems can result in exploding gradients causing the learning process to fail.

Data preparation involves using techniques such as normalization and standardization to rescale input and output variables prior to training a neural network model.

In this tutorial, you will discover how to improve neural network stability and modeling performance by scaling data.

After completing this tutorial, you will know:

- Data scaling is a recommended pre-processing step when working with deep learning neural networks.
- Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
- How to apply standardization and normalization to improve the performance of a Multilayer Perceptron model on a regression predictive modeling problem.

Let’s get started.

This tutorial is divided into six parts; they are:

- The Scale of Your Data Matters
- Data Scaling Methods
- Regression Predictive Modeling Problem
- Multilayer Perceptron With Unscaled Data
- Multilayer Perceptron With Scaled Output Variables
- Multilayer Perceptron With Scaled Input Variables

Deep learning neural network models learn a mapping from input variables to an output variable.

As such, the scale and distribution of the data drawn from the domain may be different for each variable.

Input variables may have different units (e.g. feet, kilometers, and hours) that, in turn, may mean the variables have different scales.

Differences in the scales across input variables may increase the difficulty of the problem being modeled. An example of this is that large input values (e.g. a spread of hundreds or thousands of units) can result in a model that learns large weight values. A model with large weight values is often unstable, meaning that it may suffer from poor performance during learning and sensitivity to input values resulting in higher generalization error.

One of the most common forms of pre-processing consists of a simple linear rescaling of the input variables.

— Page 298, Neural Networks for Pattern Recognition, 1995.

A target variable with a large spread of values, in turn, may result in large error gradient values causing weight values to change dramatically, making the learning process unstable.

Scaling input and output variables is a critical step in using neural network models.

In practice it is nearly always advantageous to apply pre-processing transformations to the input data before it is presented to a network. Similarly, the outputs of the network are often post-processed to give the required output values.

— Page 296, Neural Networks for Pattern Recognition, 1995.

The input variables are those that the network takes on the input or visible layer in order to make a prediction.

A good rule of thumb is that input variables should be small values, probably in the range of 0-1 or standardized with a zero mean and a standard deviation of one.

Whether input variables require scaling depends on the specifics of your problem and of each variable.

You may have a sequence of quantities as inputs, such as prices or temperatures.

If the distribution of the quantity is normal, then it should be standardized, otherwise the data should be normalized. This applies if the range of quantity values is large (10s, 100s, etc.) or small (0.01, 0.0001).

If the quantity values are small (near 0-1) and the distribution is limited (e.g. standard deviation near 1) then perhaps you can get away with no scaling of the data.

Problems can be complex and it may not be clear how to best scale input data.

If in doubt, normalize the input sequence. If you have the resources, explore modeling with the raw data, standardized data, and normalized data and see if there is a beneficial difference in the performance of the resulting model.

If the input variables are combined linearly, as in an MLP [Multilayer Perceptron], then it is rarely strictly necessary to standardize the inputs, at least in theory. […] However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima.

— Should I normalize/standardize/rescale the data? Neural Nets FAQ

The output variable is the variable predicted by the network.

You must ensure that the scale of your output variable matches the scale of the activation function (transfer function) on the output layer of your network.

If your output activation function has a range of [0,1], then obviously you must ensure that the target values lie within that range. But it is generally better to choose an output activation function suited to the distribution of the targets than to force your data to conform to the output activation function.

— Should I normalize/standardize/rescale the data? Neural Nets FAQ

If your problem is a regression problem, then the output will be a real value.

This is best modeled with a linear activation function. If the distribution of the value is normal, then you can standardize the output variable. Otherwise, the output variable can be normalized.
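When the output variable is scaled, predictions come back in scaled units and have to be mapped back to the original units before they are reported or compared to raw targets. The sketch below shows that round trip for standardization using NumPy only. It is an illustration under stated assumptions: the target values are randomly generated, and `pred_scaled` is a stand-in for the output of a trained model, not a real prediction.

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical raw target values with a large spread
y_train = rng.normal(loc=0.0, scale=150.0, size=500)

# estimate the statistics on the training data only
y_mean, y_std = y_train.mean(), y_train.std()

# standardize: subtract the mean and divide by the standard deviation
y_scaled = (y_train - y_mean) / y_std

# stand-in for model predictions made in scaled units
pred_scaled = y_scaled[:5]

# invert the transform to recover predictions in the original units
pred = pred_scaled * y_std + y_mean

print('scaled mean %.3f, scaled std %.3f' % (y_scaled.mean(), y_scaled.std()))
print('recovered :', pred)
print('original  :', y_train[:5])
```

The same mean and standard deviation estimated from the training data would be reused for any new data, so that the model always sees targets on the scale it was trained with.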


There are two types of scaling of your data that you may want to consider: normalization and standardization.

These can both be achieved using the scikit-learn library.

Normalization is a rescaling of the data from the original range so that all values are within the range of 0 and 1.

Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data.

A value is normalized as follows:

y = (x - min) / (max - min)

Where the minimum and maximum values pertain to the value *x* being normalized.

For example, for a dataset, we could guesstimate the min and max observable values as -10 and 30. We can then normalize any value, like 18.8, as follows:

    y = (x - min) / (max - min)
    y = (18.8 - (-10)) / (30 - (-10))
    y = 28.8 / 40
    y = 0.72

You can see that if an *x* value is provided that is outside the bounds of the minimum and maximum values, the resulting value will not be in the range of 0 and 1. You could check for these observations prior to making predictions and either remove them from the dataset or limit them to the pre-defined maximum or minimum values.
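As a rough illustration of the formula and the out-of-range handling described above, the sketch below uses NumPy only. The helper name `normalize` and the choice to clip by default are assumptions made for this example; they are not part of scikit-learn.

```python
# minimal sketch of min-max normalization with optional clipping (illustrative helper)
import numpy as np

def normalize(x, min_value, max_value, clip=True):
    """Rescale x into [0, 1] using estimated min and max observable values."""
    x = np.asarray(x, dtype=float)
    if clip:
        # limit out-of-range observations to the pre-defined bounds
        x = np.clip(x, min_value, max_value)
    return (x - min_value) / (max_value - min_value)

# 45.0 lies outside the assumed bounds of -10 and 30
values = np.array([18.8, -10.0, 30.0, 45.0])
scaled = normalize(values, min_value=-10.0, max_value=30.0)
unclipped = normalize(values, min_value=-10.0, max_value=30.0, clip=False)
print(scaled)
print(unclipped)
```

With clipping disabled, the last value falls above 1, which is the situation the paragraph above warns about.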

You can normalize your dataset using the scikit-learn object MinMaxScaler.

Good practice usage with the *MinMaxScaler* and other scaling techniques is as follows:

- **Fit the scaler using available training data**. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the *fit()* function.
- **Apply the scale to training data**. This means you can use the normalized data to train your model. This is done by calling the *transform()* function.
- **Apply the scale to data going forward**. This means you can prepare new data in the future on which you want to make predictions.
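The three steps above can be sketched as follows. The tiny one-column arrays are made-up values chosen only for illustration; the scaler is fit on the training array and then reused unchanged for the new data.

```python
# sketch: fit the scaler on training data only, then reuse it for new data
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# made-up data: one column, values chosen only for illustration
train = np.array([[-10.0], [0.0], [18.8], [30.0]])
new_data = np.array([[10.0], [35.0]])

# step 1: estimate min and max from the training data
scaler = MinMaxScaler()
scaler.fit(train)
# step 2: apply the scale to the training data
train_scaled = scaler.transform(train)
# step 3: apply the same scale to data going forward
new_scaled = scaler.transform(new_data)
print(train_scaled.ravel())
print(new_scaled.ravel())
```

Because the second new value is larger than the training maximum, its scaled value lands outside [0,1]; the scaler does not refit itself on new data.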

The default scale for the *MinMaxScaler* is to rescale variables into the range [0,1], although a preferred scale can be specified via the “*feature_range*” argument, which takes a tuple containing the min and the max to use for all variables.

    # create scaler
    scaler = MinMaxScaler(feature_range=(-1,1))

If needed, the transform can be inverted. This is useful for converting predictions back into their original scale for reporting or plotting. This can be done by calling the *inverse_transform()* function.
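A small round trip may make this concrete. The sketch below assumes a made-up two-column array, scales it to [-1,1] via *feature_range*, and then inverts the transform to recover the original values.

```python
# sketch: scale to [-1, 1] and invert the transform (made-up data)
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

scaler = MinMaxScaler(feature_range=(-1, 1))
normalized = scaler.fit_transform(data)
# map the scaled values back to the original units
restored = scaler.inverse_transform(normalized)
print(normalized)
print(restored)
```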

The example below provides a general demonstration for using the *MinMaxScaler* to normalize data.

    # demonstrate data normalization with sklearn
    from sklearn.preprocessing import MinMaxScaler
    # load data
    data = ...
    # create scaler
    scaler = MinMaxScaler()
    # fit scaler on data
    scaler.fit(data)
    # apply transform
    normalized = scaler.transform(data)
    # inverse transform
    inverse = scaler.inverse_transform(normalized)

You can also perform the fit and transform in a single step using the *fit_transform()* function; for example:

    # demonstrate data normalization with sklearn
    from sklearn.preprocessing import MinMaxScaler
    # load data
    data = ...
    # create scaler
    scaler = MinMaxScaler()
    # fit and transform in one step
    normalized = scaler.fit_transform(data)
    # inverse transform
    inverse = scaler.inverse_transform(normalized)

Standardizing a dataset involves rescaling the distribution of values so that the mean of observed values is 0 and the standard deviation is 1. It is sometimes loosely referred to as “*whitening*,” although whitening in the strict sense also decorrelates the variables.

This can be thought of as subtracting the mean value or centering the data.

Like normalization, standardization can be useful, and even required in some machine learning algorithms when your data has input values with differing scales.

Standardization assumes that your observations fit a Gaussian distribution (bell curve) with a well behaved mean and standard deviation. You can still standardize your data if this expectation is not met, but you may not get reliable results.

Standardization requires that you know or are able to accurately estimate the mean and standard deviation of observable values. You may be able to estimate these values from your training data.

A value is standardized as follows:

y = (x - mean) / standard_deviation

Where the *mean* is calculated as:

mean = sum(x) / count(x)

And the *standard_deviation* is calculated as:

standard_deviation = sqrt( sum( (x - mean)^2 ) / count(x))

We can guesstimate a mean of 10 and a standard deviation of about 5. Using these values, we can standardize the first value of 20.7 as follows:

    y = (x - mean) / standard_deviation
    y = (20.7 - 10) / 5
    y = (10.7) / 5
    y = 2.14
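The calculation above can be checked with a short sketch. The helper name `standardize` and the five sample values are assumptions for illustration. The mean and standard deviation are computed with the formulas given earlier (dividing by the count), which is also what *StandardScaler* estimates, so the two results should agree.

```python
# sketch: standardize by hand and compare with scikit-learn (made-up data)
import numpy as np
from sklearn.preprocessing import StandardScaler

def standardize(x, mean, standard_deviation):
    """Apply y = (x - mean) / standard_deviation."""
    return (np.asarray(x, dtype=float) - mean) / standard_deviation

# the worked example from the text: guesstimated mean of 10 and std of 5
worked = standardize(20.7, mean=10.0, standard_deviation=5.0)
print(worked)

# estimate the statistics from data using the formulas above
data = np.array([[20.7], [17.9], [18.8], [14.6], [15.8]])
mean = data.sum() / data.size
standard_deviation = np.sqrt(((data - mean) ** 2).sum() / data.size)
manual = standardize(data, mean, standard_deviation)
library = StandardScaler().fit_transform(data)
print(manual.ravel())
print(library.ravel())
```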

The mean and standard deviation estimates of a dataset can be more robust to new data than the minimum and maximum.

You can standardize your dataset using the scikit-learn object StandardScaler.

    # demonstrate data standardization with sklearn
    from sklearn.preprocessing import StandardScaler
    # load data
    data = ...
    # create scaler
    scaler = StandardScaler()
    # fit scaler on data
    scaler.fit(data)
    # apply transform
    standardized = scaler.transform(data)
    # inverse transform
    inverse = scaler.inverse_transform(standardized)

You can also perform the fit and transform in a single step using the *fit_transform()* function; for example:

    # demonstrate data standardization with sklearn
    from sklearn.preprocessing import StandardScaler
    # load data
    data = ...
    # create scaler
    scaler = StandardScaler()
    # fit and transform in one step
    standardized = scaler.fit_transform(data)
    # inverse transform
    inverse = scaler.inverse_transform(standardized)

A regression predictive modeling problem involves predicting a real-valued quantity.

We can use a standard regression problem generator provided by the scikit-learn library in the make_regression() function. This function will generate examples from a simple regression problem with a given number of input variables, statistical noise, and other properties.

We will use this function to define a problem that has 20 input features; 10 of the features will be meaningful and 10 will not be relevant. A total of 1,000 examples will be randomly generated. The pseudorandom number generator will be fixed to ensure that we get the same 1,000 examples each time the code is run.

    # generate regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

Each input variable has a Gaussian distribution, as does the target variable.

We can demonstrate this by creating histograms of some of the input variables and the output variable.

    # regression predictive modeling problem
    from sklearn.datasets import make_regression
    from matplotlib import pyplot
    # generate regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
    # histograms of input variables
    pyplot.subplot(211)
    pyplot.hist(X[:, 0])
    pyplot.subplot(212)
    pyplot.hist(X[:, 1])
    pyplot.show()
    # histogram of target variable
    pyplot.hist(y)
    pyplot.show()

Running the example creates two figures.

The first shows histograms of the first two of the twenty input variables, showing that each has a Gaussian data distribution.

The second figure shows a histogram of the target variable, showing a much larger range for the variable as compared to the input variables and, again, a Gaussian data distribution.
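The difference in scale between inputs and target can also be checked numerically rather than by eye. The sketch below is one possible way to do this: it regenerates the same dataset and prints summary statistics. The variable names are chosen for this example only.

```python
# sketch: summarize the scale of the inputs and the target numerically
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)

# per-column statistics for the 20 input variables
x_mean, x_std = X.mean(axis=0), X.std(axis=0)
print('inputs: mean in [%.3f, %.3f], std in [%.3f, %.3f]' % (
    x_mean.min(), x_mean.max(), x_std.min(), x_std.max()))
# statistics for the target variable
print('target: mean %.3f, std %.3f, min %.3f, max %.3f' % (
    y.mean(), y.std(), y.min(), y.max()))
```

The inputs should have means near 0 and standard deviations near 1, while the target spreads over a far wider range.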

Now that we have a regression problem that we can use as the basis for the investigation, we can develop a model to address it.

We can develop a Multilayer Perceptron (MLP) model for the regression problem.

A model will be demonstrated on the raw data, without any scaling of the input or output variables. We expect that model performance will be generally poor.

The first step is to split the data into train and test sets so that we can fit and evaluate a model. We will generate 1,000 examples from the domain and split the dataset in half, using 500 examples each for the train and test datasets.

    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]

Next, we can define an MLP model. The model will expect 20 inputs, one for each of the 20 input variables in the problem.

A single hidden layer will be used with 25 nodes and a rectified linear activation function. The output layer has one node for the single target variable and a linear activation function to predict real values directly.

    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='linear'))

The mean squared error loss function will be used to optimize the model and the stochastic gradient descent optimization algorithm will be used with the sensible default configuration of a learning rate of 0.01 and a momentum of 0.9.

    # compile model
    model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9))

The model will be fit for 100 training epochs and the test set will be used as a validation set, evaluated at the end of each training epoch.

The mean squared error is calculated on the train and test datasets at the end of training to get an idea of how well the model learned the problem.

    # evaluate the model
    train_mse = model.evaluate(trainX, trainy, verbose=0)
    test_mse = model.evaluate(testX, testy, verbose=0)

Finally, the mean squared error on the train and test sets at the end of each training epoch is graphed using line plots, providing learning curves that give an idea of the dynamics of the model while learning the problem.

    # plot loss during training
    pyplot.title('Mean Squared Error')
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Tying these elements together, the complete example is listed below.

    # mlp with unscaled data for the regression problem
    from sklearn.datasets import make_regression
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from matplotlib import pyplot
    # generate regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='linear'))
    # compile model
    model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9))
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
    # evaluate the model
    train_mse = model.evaluate(trainX, trainy, verbose=0)
    test_mse = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
    # plot loss during training
    pyplot.title('Mean Squared Error')
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example fits the model and calculates the mean squared error on the train and test sets.

In this case, the model is unable to learn the problem, resulting in predictions of NaN values. The model weights exploded during training given the very large errors and, in turn, error gradients calculated for weight updates.

Train: nan, Test: nan

This demonstrates that, at the very least, some data scaling is required for the target variable.

A line plot of training history is created but does not show anything as the model almost immediately results in a NaN mean squared error.

The MLP model can be updated to scale the target variable.

Reducing the scale of the target variable will, in turn, reduce the size of the gradient used to update the weights and result in a more stable model and training process.
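To see why the target scale matters, the sketch below compares gradient sizes under a deliberately simplified setup. It assumes a plain linear model with a mean squared error loss rather than the MLP used in this tutorial, and uses synthetic data generated with NumPy. It computes the gradient at zero-initialized weights for the raw target and for the standardized target.

```python
# sketch: gradient size for a linear model with raw vs standardized targets
# (assumption: a linear model stands in for the MLP; data is synthetic)
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 20))
# large coefficients give a target with a wide spread
true_w = rng.uniform(0.0, 100.0, size=20)
y = X @ true_w

def mse_gradient(X, y, w):
    """Gradient of the mean squared error for the linear model X @ w."""
    error = X @ w - y
    return 2.0 * X.T @ error / len(y)

w0 = np.zeros(20)
grad_raw = np.linalg.norm(mse_gradient(X, y, w0))
y_scaled = (y - y.mean()) / y.std()
grad_scaled = np.linalg.norm(mse_gradient(X, y_scaled, w0))
print('gradient norm, raw target:          %.3f' % grad_raw)
print('gradient norm, standardized target: %.3f' % grad_scaled)
```

With a fixed learning rate, the much larger gradient on the raw target translates directly into much larger weight updates, which is consistent with the instability seen above.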

Given the Gaussian distribution of the target variable, a natural method for rescaling the variable would be to standardize the variable. This requires estimating the mean and standard deviation of the variable and using these estimates to perform the rescaling.

It is best practice to estimate the mean and standard deviation from the training dataset and use these estimates to scale both the train and test datasets. This is to avoid any data leakage during the model evaluation process.

The scikit-learn transformers expect input data to be matrices of rows and columns, therefore the 1D arrays for the target variable will have to be reshaped into 2D arrays prior to the transforms.

    # reshape 1d arrays to 2d arrays
    trainy = trainy.reshape(len(trainy), 1)
    testy = testy.reshape(len(testy), 1)

We can then create and apply the *StandardScaler* to rescale the target variable.

    # create scaler
    scaler = StandardScaler()
    # fit scaler on training dataset
    scaler.fit(trainy)
    # transform training dataset
    trainy = scaler.transform(trainy)
    # transform test dataset
    testy = scaler.transform(testy)

Rescaling the target variable means that estimating the performance of the model and plotting the learning curves will calculate an MSE in squared units of the scaled variable rather than squared units of the original scale. This can make interpreting the error within the context of the domain challenging.

In practice, it may be helpful to estimate the performance of the model by first inverting the transform on the test dataset target variable and on the model predictions and estimating model performance using the root mean squared error on the unscaled data. This is left as an exercise to the reader.
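One way to approach this is sketched below. To keep the example light, it assumes a scikit-learn *LinearRegression* as a stand-in for the Keras MLP; the same inversion steps would apply to predictions from any model trained on the standardized target.

```python
# sketch: report RMSE in the original units of the target
# (assumption: LinearRegression stands in for the MLP to keep the example small)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
n_train = 500
trainX, testX = X[:n_train, :], X[n_train:, :]
trainy, testy = y[:n_train].reshape(-1, 1), y[n_train:].reshape(-1, 1)

# standardize the target using statistics from the training set only
scaler = StandardScaler()
trainy_scaled = scaler.fit_transform(trainy)
testy_scaled = scaler.transform(testy)

# fit the stand-in model on the scaled target and predict
model = LinearRegression().fit(trainX, trainy_scaled)
yhat_scaled = model.predict(testX)

# invert the transform on both the predictions and the expected values
yhat = scaler.inverse_transform(yhat_scaled)
testy_original = scaler.inverse_transform(testy_scaled)

rmse_scaled = np.sqrt(np.mean((testy_scaled - yhat_scaled) ** 2))
rmse = np.sqrt(np.mean((testy_original - yhat) ** 2))
print('RMSE (scaled units): %.5f' % rmse_scaled)
print('RMSE (original units): %.5f' % rmse)
```

Because standardization is a linear rescaling, the RMSE in original units equals the scaled RMSE multiplied by the standard deviation estimated by the scaler.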

The complete example of standardizing the target variable for the MLP on the regression problem is listed below.

    # mlp with scaled outputs on the regression problem
    from sklearn.datasets import make_regression
    from sklearn.preprocessing import StandardScaler
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from matplotlib import pyplot
    # generate regression dataset
    X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
    # split into train and test
    n_train = 500
    trainX, testX = X[:n_train, :], X[n_train:, :]
    trainy, testy = y[:n_train], y[n_train:]
    # reshape 1d arrays to 2d arrays
    trainy = trainy.reshape(len(trainy), 1)
    testy = testy.reshape(len(testy), 1)
    # create scaler
    scaler = StandardScaler()
    # fit scaler on training dataset
    scaler.fit(trainy)
    # transform training dataset
    trainy = scaler.transform(trainy)
    # transform test dataset
    testy = scaler.transform(testy)
    # define model
    model = Sequential()
    model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(1, activation='linear'))
    # compile model
    model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9))
    # fit model
    history = model.fit(trainX, trainy, validation_data=(testX, testy), epochs=100, verbose=0)
    # evaluate the model
    train_mse = model.evaluate(trainX, trainy, verbose=0)
    test_mse = model.evaluate(testX, testy, verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_mse, test_mse))
    # plot loss during training
    pyplot.title('Loss / Mean Squared Error')
    pyplot.plot(history.history['loss'], label='train')
    pyplot.plot(history.history['val_loss'], label='test')
    pyplot.legend()
    pyplot.show()

Running the example fits the model and calculates the mean squared error on the train and test sets.

In this case, the model does appear to learn the problem and achieves near-zero mean squared error, at least to three decimal places.

Train: 0.003, Test: 0.007

A line plot of the mean squared error on the train (blue) and test (orange) dataset over each training epoch is created.

In this case, we can see that the model rapidly learns to effectively map inputs to outputs for the regression problem and achieves good performance on both datasets over the course of the run, neither overfitting nor underfitting the training dataset.

It may be interesting to repeat this experiment and normalize the target variable instead and compare results.

We have seen that data scaling can stabilize the training process when fitting a model for regression with a target variable that has a wide spread.

It is also possible to improve the stability and performance of the model by scaling the input variables.

In this section, we will design an experiment to compare the performance of different scaling methods for the input variables.

The input variables also have a Gaussian data distribution, like the target variable, therefore we would expect that standardizing the data would be the best approach. This is not always the case.

We can compare the performance of the unscaled input variables to models fit with either standardized or normalized input variables.

The first step is to define a function to create the same 1,000 data samples, split them into train and test sets, and apply the data scaling methods specified via input arguments. The *get_dataset()* function below implements this. It takes a scaler for the input variables and a scaler for the target variable (either can be None) and returns the train and test datasets split into input and output components, ready to train and evaluate a model.

    # prepare dataset with input and output scalers, can be none
    def get_dataset(input_scaler, output_scaler):
        # generate dataset
        X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        # scale inputs
        if input_scaler is not None:
            # fit scaler
            input_scaler.fit(trainX)
            # transform training dataset
            trainX = input_scaler.transform(trainX)
            # transform test dataset
            testX = input_scaler.transform(testX)
        if output_scaler is not None:
            # reshape 1d arrays to 2d arrays
            trainy = trainy.reshape(len(trainy), 1)
            testy = testy.reshape(len(testy), 1)
            # fit scaler on training dataset
            output_scaler.fit(trainy)
            # transform training dataset
            trainy = output_scaler.transform(trainy)
            # transform test dataset
            testy = output_scaler.transform(testy)
        return trainX, trainy, testX, testy

Next, we can define a function to fit an MLP model on a given dataset and return the mean squared error for the fit model on the test dataset.

The *evaluate_model()* function below implements this behavior.

    # fit and evaluate mse of model on test set
    def evaluate_model(trainX, trainy, testX, testy):
        # define model
        model = Sequential()
        model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(1, activation='linear'))
        # compile model
        model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9))
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)
        # evaluate the model
        test_mse = model.evaluate(testX, testy, verbose=0)
        return test_mse

Neural networks are trained using a stochastic learning algorithm. This means that the same model fit on the same data may result in a different performance.

We can address this in our experiment by repeating the evaluation of each model configuration, in this case a choice of data scaling, multiple times and report performance as the mean of the error scores across all of the runs. We will repeat each run 30 times to ensure the mean is statistically robust.

The *repeated_evaluation()* function below implements this, taking the scaler for input and output variables as arguments, evaluating a model 30 times with those scalers, printing error scores along the way, and returning a list of the calculated error scores from each run.

    # evaluate model multiple times with given input and output scalers
    def repeated_evaluation(input_scaler, output_scaler, n_repeats=30):
        # get dataset
        trainX, trainy, testX, testy = get_dataset(input_scaler, output_scaler)
        # repeated evaluation of model
        results = list()
        for _ in range(n_repeats):
            test_mse = evaluate_model(trainX, trainy, testX, testy)
            print('>%.3f' % test_mse)
            results.append(test_mse)
        return results

Finally, we can run the experiment and evaluate the same model on the same dataset three different ways:

- No scaling of inputs, standardized outputs.
- Normalized inputs, standardized outputs.
- Standardized inputs, standardized outputs.

The mean and standard deviation of the error for each configuration is reported, then box and whisker plots are created to summarize the error scores for each configuration.

    # unscaled inputs
    results_unscaled_inputs = repeated_evaluation(None, StandardScaler())
    # normalized inputs
    results_normalized_inputs = repeated_evaluation(MinMaxScaler(), StandardScaler())
    # standardized inputs
    results_standardized_inputs = repeated_evaluation(StandardScaler(), StandardScaler())
    # summarize results
    print('Unscaled: %.3f (%.3f)' % (mean(results_unscaled_inputs), std(results_unscaled_inputs)))
    print('Normalized: %.3f (%.3f)' % (mean(results_normalized_inputs), std(results_normalized_inputs)))
    print('Standardized: %.3f (%.3f)' % (mean(results_standardized_inputs), std(results_standardized_inputs)))
    # plot results
    results = [results_unscaled_inputs, results_normalized_inputs, results_standardized_inputs]
    labels = ['unscaled', 'normalized', 'standardized']
    pyplot.boxplot(results, labels=labels)
    pyplot.show()

Tying these elements together, the complete example is listed below.

    # compare scaling methods for mlp inputs on regression problem
    from sklearn.datasets import make_regression
    from sklearn.preprocessing import StandardScaler
    from sklearn.preprocessing import MinMaxScaler
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from matplotlib import pyplot
    from numpy import mean
    from numpy import std

    # prepare dataset with input and output scalers, can be none
    def get_dataset(input_scaler, output_scaler):
        # generate dataset
        X, y = make_regression(n_samples=1000, n_features=20, noise=0.1, random_state=1)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        # scale inputs
        if input_scaler is not None:
            # fit scaler
            input_scaler.fit(trainX)
            # transform training dataset
            trainX = input_scaler.transform(trainX)
            # transform test dataset
            testX = input_scaler.transform(testX)
        if output_scaler is not None:
            # reshape 1d arrays to 2d arrays
            trainy = trainy.reshape(len(trainy), 1)
            testy = testy.reshape(len(testy), 1)
            # fit scaler on training dataset
            output_scaler.fit(trainy)
            # transform training dataset
            trainy = output_scaler.transform(trainy)
            # transform test dataset
            testy = output_scaler.transform(testy)
        return trainX, trainy, testX, testy

    # fit and evaluate mse of model on test set
    def evaluate_model(trainX, trainy, testX, testy):
        # define model
        model = Sequential()
        model.add(Dense(25, input_dim=20, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(1, activation='linear'))
        # compile model
        model.compile(loss='mean_squared_error', optimizer=SGD(lr=0.01, momentum=0.9))
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)
        # evaluate the model
        test_mse = model.evaluate(testX, testy, verbose=0)
        return test_mse

    # evaluate model multiple times with given input and output scalers
    def repeated_evaluation(input_scaler, output_scaler, n_repeats=30):
        # get dataset
        trainX, trainy, testX, testy = get_dataset(input_scaler, output_scaler)
        # repeated evaluation of model
        results = list()
        for _ in range(n_repeats):
            test_mse = evaluate_model(trainX, trainy, testX, testy)
            print('>%.3f' % test_mse)
            results.append(test_mse)
        return results

    # unscaled inputs
    results_unscaled_inputs = repeated_evaluation(None, StandardScaler())
    # normalized inputs
    results_normalized_inputs = repeated_evaluation(MinMaxScaler(), StandardScaler())
    # standardized inputs
    results_standardized_inputs = repeated_evaluation(StandardScaler(), StandardScaler())
    # summarize results
    print('Unscaled: %.3f (%.3f)' % (mean(results_unscaled_inputs), std(results_unscaled_inputs)))
    print('Normalized: %.3f (%.3f)' % (mean(results_normalized_inputs), std(results_normalized_inputs)))
    print('Standardized: %.3f (%.3f)' % (mean(results_standardized_inputs), std(results_standardized_inputs)))
    # plot results
    results = [results_unscaled_inputs, results_normalized_inputs, results_standardized_inputs]
    labels = ['unscaled', 'normalized', 'standardized']
    pyplot.boxplot(results, labels=labels)
    pyplot.show()

Running the example prints the mean squared error for each model run along the way.

After each of the three configurations has been evaluated 30 times, the mean error for each is reported.

Your specific results may vary, but the general trend should be the same as is listed below.

In this case, we can see that scaling the input variables can result in a model with better performance, although only normalization helps here. Unexpectedly, better performance is seen using normalized inputs instead of standardized inputs, and standardized inputs perform about the same as unscaled inputs. This may be related to the choice of the rectified linear activation function in the first hidden layer.

    ...
    >0.010
    >0.012
    >0.005
    >0.008
    >0.008
    Unscaled: 0.007 (0.004)
    Normalized: 0.001 (0.000)
    Standardized: 0.008 (0.004)

A figure with three box and whisker plots is created summarizing the spread of error scores for each configuration.

The plots show that there was little difference between the distributions of error scores for the unscaled and standardized input variables, and that the normalized input variables result in better performance and a tighter, more stable distribution of error scores.

These results highlight that it is important to actually experiment and confirm the results of data scaling methods rather than assuming that a given data preparation scheme will work best based on the observed distribution of the data.

This section lists some ideas for extending the tutorial that you may wish to explore.

- **Normalize Target Variable**. Update the example to normalize instead of standardize the target variable and compare results.
- **Compare Scaling for Target Variable**. Update the example to compare standardizing and normalizing the target variable using repeated experiments and compare the results.
- **Other Scales**. Update the example to evaluate other min/max scales when normalizing and compare performance, e.g. [-1, 1] and [0.0, 0.5].

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

- How to Scale Data for Long Short-Term Memory Networks in Python
- How to Scale Machine Learning Data From Scratch With Python
- How to Normalize and Standardize Time Series Data in Python
- How to Prepare Your Data for Machine Learning in Python with Scikit-Learn

- Section 8.2 Input normalization and encoding, Neural Networks for Pattern Recognition, 1995.

- sklearn.datasets.make_regression API
- sklearn.preprocessing.MinMaxScaler API
- sklearn.preprocessing.StandardScaler API

In this tutorial, you discovered how to improve neural network stability and modeling performance by scaling data.

Specifically, you learned:

- Data scaling is a recommended pre-processing step when working with deep learning neural networks.
- Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
- How to apply standardization and normalization to improve the performance of a Multilayer Perceptron model on a regression predictive modeling problem.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Improve Neural Network Stability and Modeling Performance With Data Scaling appeared first on Machine Learning Mastery.

The post How to Develop Deep Learning Neural Networks With Greedy Layer-Wise Pretraining appeared first on Machine Learning Mastery.

Training deep neural networks was traditionally challenging as the vanishing gradient meant that weights in layers close to the input layer were not updated in response to errors calculated on the training dataset.

An innovation and important milestone in the field of deep learning was greedy layer-wise pretraining that allowed very deep neural networks to be successfully trained, achieving then state-of-the-art performance.

In this tutorial, you will discover greedy layer-wise pretraining as a technique for developing deep multi-layered neural network models.

After completing this tutorial, you will know:

- Greedy layer-wise pretraining provides a way to develop deep multi-layered neural networks whilst only ever training shallow networks.
- Pretraining can be used to iteratively deepen a supervised model or an unsupervised model that can be repurposed as a supervised model.
- Pretraining may be useful for problems with small amounts of labeled data and large amounts of unlabeled data.

Let’s get started.

This tutorial is divided into four parts; they are:

- Greedy Layer-Wise Pretraining
- Multi-Class Classification Problem
- Supervised Greedy Layer-Wise Pretraining
- Unsupervised Greedy Layer-Wise Pretraining

Traditionally, training deep neural networks with many layers was challenging.

As the number of hidden layers is increased, the amount of error information propagated back to earlier layers is dramatically reduced. This means that weights in hidden layers close to the output layer are updated normally, whereas weights in hidden layers close to the input layer are updated minimally or not at all. Generally, this problem prevented the training of very deep neural networks and was referred to as the *vanishing gradient problem*.

An important milestone in the resurgence of neural networking that initially allowed the development of deeper neural network models was the technique of greedy layer-wise pretraining, often simply referred to as “*pretraining*.”

The deep learning renaissance of 2006 began with the discovery that this greedy learning procedure could be used to find a good initialization for a joint learning procedure over all the layers, and that this approach could be used to successfully train even fully connected architectures.

— Page 528, Deep Learning, 2016.

Pretraining involves successively adding a new hidden layer to a model and refitting, allowing the newly added layer to learn from the outputs of the existing hidden layers, often while keeping the weights for the existing hidden layers fixed. This gives the technique the name “*layer-wise*” as the model is trained one layer at a time.

The technique is referred to as “*greedy*” because of the piecewise or layer-wise approach to solving the harder problem of training a deep network. As an optimization process, dividing the training process into a succession of layer-wise training processes is seen as a greedy shortcut that likely leads to an aggregate of locally optimal solutions, a shortcut to a good enough global solution.

Greedy algorithms break a problem into many components, then solve for the optimal version of each component in isolation. Unfortunately, combining the individually optimal components is not guaranteed to yield an optimal complete solution.

— Page 323, Deep Learning, 2016.

Pretraining is based on the assumption that it is easier to train a shallow network than a deep network, and contrives a layer-wise training process in which we are only ever fitting a shallow model.

… builds on the premise that training a shallow network is easier than training a deep one, which seems to have been validated in several contexts.

— Page 529, Deep Learning, 2016.

The key benefits of pretraining are:

- Simplified training process.
- Facilitates the development of deeper networks.
- Useful as a weight initialization scheme.
- Perhaps lower generalization error.

In general, pretraining may help both in terms of optimization and in terms of generalization.

— Page 325, Deep Learning, 2016.

There are two main approaches to pretraining; they are:

- Supervised greedy layer-wise pretraining.
- Unsupervised greedy layer-wise pretraining.

Broadly, supervised pretraining involves successively adding hidden layers to a model trained on a supervised learning task. Unsupervised pretraining involves using the greedy layer-wise process to build up an unsupervised autoencoder model, to which a supervised output layer is later added.
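The supervised variant can be sketched in a few lines of NumPy. This is a minimal illustration under several assumptions rather than a reference implementation: the data is synthetic, the helper `fit_new_layer` is hypothetical, training uses plain full-batch gradient descent, and the temporary output layer is discarded each time a new hidden layer is added. The final fine-tuning of all weights is omitted.

```python
# minimal numpy sketch of supervised greedy layer-wise pretraining (illustrative only)
import numpy as np

rng = np.random.default_rng(1)

def relu(z):
    return np.maximum(z, 0.0)

def fit_new_layer(H, y, n_hidden=10, epochs=500, lr=0.01):
    """Train one new hidden layer plus a linear output on fixed features H."""
    n_in = H.shape[1]
    W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_hidden))
    b = np.zeros(n_hidden)
    v = rng.normal(0.0, np.sqrt(1.0 / n_hidden), size=n_hidden)
    c = 0.0
    for _ in range(epochs):
        Z = H @ W + b
        A = relu(Z)
        error = A @ v + c - y
        # gradients of the mean squared error
        g_out = 2.0 * error / len(y)
        g_Z = np.outer(g_out, v) * (Z > 0)
        g_v, g_c = A.T @ g_out, g_out.sum()
        g_W, g_b = H.T @ g_Z, g_Z.sum(axis=0)
        # gradient descent update of the new layer and temporary output only
        W -= lr * g_W
        b -= lr * g_b
        v -= lr * g_v
        c -= lr * g_c
    loss = np.mean((relu(H @ W + b) @ v + c - y) ** 2)
    return (W, b), loss

# synthetic regression data with a standardized target
X = rng.standard_normal((200, 5))
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]
y = (y - y.mean()) / y.std()

frozen_layers, losses = [], []
features = X
for depth in range(3):
    # only the newest layer is trained; earlier layers stay fixed
    (W, b), loss = fit_new_layer(features, y)
    frozen_layers.append((W, b))
    losses.append(loss)
    # the frozen stack now supplies the inputs for the next layer
    features = relu(features @ W + b)
    print('layers=%d, train mse=%.3f' % (depth + 1, loss))
```

At each step only a shallow model (one hidden layer and an output) is being fit, which is the property the layer-wise procedure is designed to exploit.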

It is common to use the word “pretraining” to refer not only to the pretraining stage itself but to the entire two phase protocol that combines the pretraining phase and a supervised learning phase. The supervised learning phase may involve training a simple classifier on top of the features learned in the pretraining phase, or it may involve supervised fine-tuning of the entire network learned in the pretraining phase.

— Page 529, Deep Learning, 2016.

Unsupervised pretraining may be appropriate when you have a significantly larger number of unlabeled examples that can be used to initialize a model prior to using a much smaller number of examples to fine tune the model weights for a supervised task.

… we can expect unsupervised pretraining to be most helpful when the number of labeled examples is very small. Because the source of information added by unsupervised pretraining is the unlabeled data, we may also expect unsupervised pretraining to perform best when the number of unlabeled examples is very large.

— Page 532, Deep Learning, 2016.

Although the weights in prior layers are held constant, it is common to fine tune all weights in the network at the end after the addition of the final layer. As such, this allows pretraining to be considered a type of weight initialization method.

… it makes use of the idea that the choice of initial parameters for a deep neural network can have a significant regularizing effect on the model (and, to a lesser extent, that it can improve optimization).

— Page 530-531, Deep Learning, 2016.

Greedy layer-wise pretraining is an important milestone in the history of deep learning, one that allowed the early development of networks with more hidden layers than was previously possible. The approach can be useful on some problems; for example, it is best practice to use unsupervised pretraining for text data in order to provide a richer distributed representation of words and their interrelationships via word2vec.

Today, unsupervised pretraining has been largely abandoned, except in the field of natural language processing […] the advantage of pretraining is that one can pretrain once on a huge unlabeled set (for example with a corpus containing billions of words), learn a good representation (typically of words, but also of sentences), and then use this representation or fine-tune it for a supervised task for which the training set contains substantially fewer examples.

— Page 535, Deep Learning, 2016.

Nevertheless, it is likely that better performance can be achieved using modern methods such as better activation functions, weight initialization, variants of gradient descent, and regularization methods.

Today, we now know that greedy layer-wise pretraining is not required to train fully connected deep architectures, but the unsupervised pretraining approach was the first method to succeed.

— Page 528, Deep Learning, 2016.

We will use a small multi-class classification problem as the basis to demonstrate the effect of greedy layer-wise pretraining on model performance.

The scikit-learn library provides the make_blobs() function that can be used to create a multi-class classification problem with the prescribed number of samples, input variables, classes, and variance of samples within a class.

The problem will be configured with two input variables (to represent the *x* and *y* coordinates of the points) and a standard deviation of 2.0 for points within each group. We will use the same random state (seed for the pseudorandom number generator) to ensure that we always get the same data points.

    # generate 2d classification dataset
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
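To build intuition for what this kind of dataset looks like, the sketch below draws points around a few class centers using only the Python standard library. It is a rough stand-in for the idea and not scikit-learn's implementation; the function name *toy_blobs()* and the center coordinates are assumptions made for illustration.

```python
# Rough stand-in for the idea behind make_blobs(); NOT scikit-learn's code.
import random

def toy_blobs(n_samples, centers, cluster_std, seed):
    """Draw 2D points around each center from an isotropic Gaussian."""
    rng = random.Random(seed)  # fixed seed gives the same points every run
    X, y = [], []
    for i in range(n_samples):
        label = i % len(centers)  # cycle through the classes
        cx, cy = centers[label]
        X.append((rng.gauss(cx, cluster_std), rng.gauss(cy, cluster_std)))
        y.append(label)
    return X, y

# hypothetical class centers, chosen only for illustration
centers = [(-5.0, 0.0), (0.0, 5.0), (5.0, 0.0)]
X, y = toy_blobs(n_samples=1000, centers=centers, cluster_std=2.0, seed=2)
print(len(X), sorted(set(y)))
```

Increasing *cluster_std* spreads the points of each class further from its center, which is what makes the classes overlap.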

The results are the input and output elements of a dataset that we can model.

In order to get a feeling for the complexity of the problem, we can plot each point on a two-dimensional scatter plot and color each point by class value.

The complete example is listed below.

    # scatter plot of blobs dataset
    from sklearn.datasets import make_blobs
    from matplotlib import pyplot
    from numpy import where
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
    # scatter plot for each class value
    for class_value in range(3):
        # select indices of points with the class label
        row_ix = where(y == class_value)
        # scatter plot for points with a different color
        pyplot.scatter(X[row_ix, 0], X[row_ix, 1])
    # show plot
    pyplot.show()

Running the example creates a scatter plot of the entire dataset. We can see that the standard deviation of 2.0 means that the classes are not linearly separable (separable by a line), causing many ambiguous points.

This is desirable as it means that the problem is non-trivial and will allow a neural network model to find many different “good enough” candidate solutions.

In this section, we will use greedy layer-wise supervised learning to build up a deep Multilayer Perceptron (MLP) model for the blobs supervised learning multi-class classification problem.

Pretraining is not required to address this simple predictive modeling problem. Instead, this is a demonstration of how to perform supervised greedy layer-wise pretraining that can be used as a template for larger and more challenging supervised learning problems.

As a first step, we can develop a function to create 1,000 samples from the problem and split them evenly into train and test datasets. The *prepare_data()* function below implements this and returns the train and test sets in terms of the input and output components.

    # prepare the dataset
    def prepare_data():
        # generate 2d classification dataset
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, testX, trainy, testy
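The two data preparation steps in this function, one hot encoding and splitting, can be sketched in plain Python on a handful of labels. The helper names *one_hot()* and *split_rows()* are hypothetical and are not part of Keras or scikit-learn.

```python
# Plain-Python sketch of one hot encoding and a simple train/test split.
# Helper names are hypothetical; Keras' to_categorical() does the real work above.

def one_hot(labels, n_classes):
    """Turn each integer label into a vector with a single 1.0."""
    return [[1.0 if k == label else 0.0 for k in range(n_classes)]
            for label in labels]

def split_rows(rows, n_train):
    """First n_train rows for training, the rest for testing."""
    return rows[:n_train], rows[n_train:]

labels = [0, 2, 1, 1]
encoded = one_hot(labels, 3)
train, test = split_rows(encoded, 2)
print(encoded)
```

Each class label becomes a vector with one element per class, which is the shape a three-node softmax output layer is trained against.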

We can call this function to prepare the data.

    # prepare data
    trainX, testX, trainy, testy = prepare_data()

Next, we can train and fit a base model.

This will be an MLP that expects two inputs for the two input variables in the dataset and has one hidden layer with 10 nodes and uses the rectified linear activation function. The output layer has three nodes in order to predict the probability for each of the three classes and uses the softmax activation function.

    # define model
    model = Sequential()
    model.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
    model.add(Dense(3, activation='softmax'))
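To get a feel for the size of this model, the short sketch below counts weights and biases for a stack of fully connected layers. The helper name *dense_param_count()* is hypothetical, and the calculation assumes every layer is a standard dense layer with a bias term.

```python
# Count trainable parameters in a stack of fully connected layers.
# Assumes each layer has n_in * n_out weights plus n_out biases.

def dense_param_count(layer_sizes):
    """layer_sizes lists the input width followed by each layer's width."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total

# base model: 2 inputs -> 10 hidden nodes -> 3 outputs
base = dense_param_count([2, 10, 3])
# the same model after one extra 10-node hidden layer is inserted
one_more = dense_param_count([2, 10, 10, 3])
print(base, one_more, one_more - base)
```

Under these assumptions, each additional 10-node hidden layer contributes 110 parameters, so the model stays small even as it gets deeper.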

The model is fit using stochastic gradient descent with the sensible default learning rate of 0.01 and a high momentum value of 0.9. The model is optimized using cross entropy loss.

    # compile model
    opt = SGD(learning_rate=0.01, momentum=0.9)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
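To make the roles of the learning rate and momentum concrete, here is a minimal sketch of the classical momentum update applied to a one-dimensional toy problem. This is the textbook form of the rule, written out as an assumption about what the optimizer does; the function name *sgd_momentum_step()* and the toy loss are hypothetical.

```python
# Classical momentum update on a toy loss f(w) = w^2 (gradient 2w).
# Textbook form of the rule; not taken from the Keras source code.

def sgd_momentum_step(w, velocity, grad, learning_rate=0.01, momentum=0.9):
    """Blend the previous velocity with the new gradient step."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity

w, velocity = 5.0, 0.0
for _ in range(500):
    grad = 2.0 * w  # derivative of w^2
    w, velocity = sgd_momentum_step(w, velocity, grad)
print('w after 500 steps: %.6f' % w)
```

The velocity term carries over a fraction of the previous update, which is why a high momentum such as 0.9 can speed up progress along a consistent gradient direction.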

The model is then fit on the training dataset for 100 epochs with a default batch size of 32 examples.

    # fit model
    model.fit(trainX, trainy, epochs=100, verbose=0)
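As a quick piece of arithmetic, the sketch below works out how many weight updates this configuration implies, assuming the final partial batch of each epoch is still used for an update.

```python
# Number of weight updates implied by 500 samples, batch size 32, 100 epochs.
# Assumes the last, smaller batch in each epoch still triggers an update.
import math

n_train = 500
batch_size = 32
epochs = 100

updates_per_epoch = math.ceil(n_train / batch_size)
total_updates = updates_per_epoch * epochs
last_batch_size = n_train - (updates_per_epoch - 1) * batch_size
print(updates_per_epoch, total_updates, last_batch_size)
```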

The *get_base_model()* function below ties these elements together, taking the training dataset as arguments and returning a fit baseline model.

    # define and fit the base model
    def get_base_model(trainX, trainy):
        # define model
        model = Sequential()
        model.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(3, activation='softmax'))
        # compile model
        opt = SGD(learning_rate=0.01, momentum=0.9)
        model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)
        return model

We can call this function to prepare the base model to which we can later add layers one at a time.

    # get the base model
    model = get_base_model(trainX, trainy)

We need to be able to easily evaluate the performance of a model on the train and test sets.

The *evaluate_model()* function below takes the train and test sets as arguments as well as a model and returns the accuracy on both datasets.

    # evaluate a fit model
    def evaluate_model(model, trainX, testX, trainy, testy):
        _, train_acc = model.evaluate(trainX, trainy, verbose=0)
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        return train_acc, test_acc
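The accuracy value returned here can be understood as the fraction of examples where the most probable predicted class matches the true class. A minimal plain-Python sketch of that calculation follows; the helper names and the example probabilities are made up for illustration.

```python
# Classification accuracy from one hot targets and predicted probabilities.
# Helper names and example values are hypothetical.

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(y_true_one_hot, y_prob):
    """Fraction of rows where the predicted class equals the true class."""
    correct = sum(argmax(t) == argmax(p)
                  for t, p in zip(y_true_one_hot, y_prob))
    return correct / len(y_true_one_hot)

y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3], [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]]
acc = accuracy(y_true, y_prob)
print('accuracy=%.3f' % acc)
```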

We can call this function to calculate and report the accuracy of the base model and store the scores away in a dictionary against the number of layers in the model (currently two, one hidden and one output layer) so we can plot the relationship between layers and accuracy later.

    # evaluate the base model
    scores = dict()
    train_acc, test_acc = evaluate_model(model, trainX, testX, trainy, testy)
    print('> layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))
    # store scores for plotting
    scores[len(model.layers)] = (train_acc, test_acc)

We can now outline the process of greedy layer-wise pretraining.

A function is required that can add a new hidden layer and retrain the model but only update the weights in the newly added layer and in the output layer.

This requires first storing the current output layer including its configuration and current set of weights.

    # remember the current output layer
    output_layer = model.layers[-1]

Then removing the output layer from the stack of layers in the model.

    # remove the output layer
    model.pop()

All of the remaining layers in the model can then be marked as non-trainable, meaning that their weights cannot be updated when the *fit()* function is called again.

    # mark all remaining layers as non-trainable
    for layer in model.layers:
        layer.trainable = False

We can then add a new hidden layer, in this case with the same configuration as the first hidden layer added in the base model.

    # add a new hidden layer
    model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))

Finally, the output layer can be added back and the model can be refit on the training dataset.

    # re-add the output layer
    model.add(output_layer)
    # fit model
    model.fit(trainX, trainy, epochs=100, verbose=0)

We can tie all of these elements into a function named *add_layer()* that takes the model and the training dataset as arguments.

    # add one new layer and re-train only the new layer
    def add_layer(model, trainX, trainy):
        # remember the current output layer
        output_layer = model.layers[-1]
        # remove the output layer
        model.pop()
        # mark all remaining layers as non-trainable
        for layer in model.layers:
            layer.trainable = False
        # add a new hidden layer
        model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))
        # re-add the output layer
        model.add(output_layer)
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)
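The bookkeeping in this function can be easier to follow without a deep learning library in the way. The sketch below replays the same pop, freeze, add, and re-add steps on a plain list of toy layer objects. *ToyLayer* and *add_layer_toy()* are hypothetical stand-ins and no training takes place.

```python
# Framework-free replay of the layer bookkeeping; no weights, no training.
# ToyLayer and add_layer_toy are hypothetical stand-ins for Keras objects.

class ToyLayer:
    def __init__(self, name):
        self.name = name
        self.trainable = True

def add_layer_toy(layers, new_name):
    # remember and remove the current output layer
    output_layer = layers.pop()
    # freeze every hidden layer that remains
    for layer in layers:
        layer.trainable = False
    # add a new trainable hidden layer, then put the output layer back
    layers.append(ToyLayer(new_name))
    layers.append(output_layer)

layers = [ToyLayer('hidden_1'), ToyLayer('output')]
add_layer_toy(layers, 'hidden_2')
add_layer_toy(layers, 'hidden_3')
summary = [(layer.name, layer.trainable) for layer in layers]
print(summary)
```

After each call, only the most recently added hidden layer and the output layer remain trainable, which is the property that keeps every fit a shallow training problem.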

This function can then be called repeatedly based on the number of layers we wish to add to the model.

In this case, we will add 10 layers, one at a time, and evaluate the performance of the model after each additional layer is added to get an idea of how it is impacting performance.

Train and test accuracy scores are stored in the dictionary against the number of layers in the model.

    # add layers and evaluate the updated model
    n_layers = 10
    for i in range(n_layers):
        # add layer
        add_layer(model, trainX, trainy)
        # evaluate model
        train_acc, test_acc = evaluate_model(model, trainX, testX, trainy, testy)
        print('> layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))
        # store scores for plotting
        scores[len(model.layers)] = (train_acc, test_acc)

At the end of the run, a line plot is created showing the number of layers in the model (x-axis) compared to model accuracy on the train and test datasets (y-axis).

We would expect the addition of layers to improve the performance of the model on the training dataset and perhaps even on the test dataset.

    # plot number of added layers vs accuracy
    pyplot.plot(scores.keys(), [scores[k][0] for k in scores.keys()], label='train', marker='.')
    pyplot.plot(scores.keys(), [scores[k][1] for k in scores.keys()], label='test', marker='.')
    pyplot.legend()
    pyplot.show()

Tying all of these elements together, the complete example is listed below.

    # supervised greedy layer-wise pretraining for blobs classification problem
    from sklearn.datasets import make_blobs
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from keras.utils import to_categorical
    from matplotlib import pyplot

    # prepare the dataset
    def prepare_data():
        # generate 2d classification dataset
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, testX, trainy, testy

    # define and fit the base model
    def get_base_model(trainX, trainy):
        # define model
        model = Sequential()
        model.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(3, activation='softmax'))
        # compile model
        opt = SGD(learning_rate=0.01, momentum=0.9)
        model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)
        return model

    # evaluate a fit model
    def evaluate_model(model, trainX, testX, trainy, testy):
        _, train_acc = model.evaluate(trainX, trainy, verbose=0)
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        return train_acc, test_acc

    # add one new layer and re-train only the new layer
    def add_layer(model, trainX, trainy):
        # remember the current output layer
        output_layer = model.layers[-1]
        # remove the output layer
        model.pop()
        # mark all remaining layers as non-trainable
        for layer in model.layers:
            layer.trainable = False
        # add a new hidden layer
        model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))
        # re-add the output layer
        model.add(output_layer)
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)

    # prepare data
    trainX, testX, trainy, testy = prepare_data()
    # get the base model
    model = get_base_model(trainX, trainy)
    # evaluate the base model
    scores = dict()
    train_acc, test_acc = evaluate_model(model, trainX, testX, trainy, testy)
    print('> layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))
    scores[len(model.layers)] = (train_acc, test_acc)
    # add layers and evaluate the updated model
    n_layers = 10
    for i in range(n_layers):
        # add layer
        add_layer(model, trainX, trainy)
        # evaluate model
        train_acc, test_acc = evaluate_model(model, trainX, testX, trainy, testy)
        print('> layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))
        # store scores for plotting
        scores[len(model.layers)] = (train_acc, test_acc)
    # plot number of added layers vs accuracy
    pyplot.plot(scores.keys(), [scores[k][0] for k in scores.keys()], label='train', marker='.')
    pyplot.plot(scores.keys(), [scores[k][1] for k in scores.keys()], label='test', marker='.')
    pyplot.legend()
    pyplot.show()

Running the example reports the classification accuracy on the train and test sets for the base model (two layers), then after each additional layer is added (from three to 12 layers).

In this case, we can see that the baseline model does reasonably well on this problem. As the layers are increased, we can roughly see an increase in accuracy for the model on the training dataset, likely as it is beginning to overfit the data. We see a rough drop in classification accuracy on the test dataset, likely because of the overfitting.

    > layers=2, train=0.816, test=0.830
    > layers=3, train=0.834, test=0.830
    > layers=4, train=0.836, test=0.824
    > layers=5, train=0.830, test=0.824
    > layers=6, train=0.848, test=0.820
    > layers=7, train=0.830, test=0.826
    > layers=8, train=0.850, test=0.824
    > layers=9, train=0.840, test=0.838
    > layers=10, train=0.842, test=0.830
    > layers=11, train=0.850, test=0.830
    > layers=12, train=0.850, test=0.826

A line plot is also created showing the train (blue) and test set (orange) accuracy as each additional layer is added to the model.

In this case, the plot suggests a slight overfitting of the training dataset, but perhaps better test set performance after seven added layers.
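The same reading can be taken directly from the reported scores rather than from the plot. The sketch below copies the accuracy values printed in the run above into a dictionary and looks up the layer count with the best test accuracy, along with the gap between train and test accuracy. The variable names are illustrative, and your own numbers will differ from run to run.

```python
# Scores copied from the run reported above: layers -> (train, test) accuracy.
scores = {
    2: (0.816, 0.830), 3: (0.834, 0.830), 4: (0.836, 0.824),
    5: (0.830, 0.824), 6: (0.848, 0.820), 7: (0.830, 0.826),
    8: (0.850, 0.824), 9: (0.840, 0.838), 10: (0.842, 0.830),
    11: (0.850, 0.830), 12: (0.850, 0.826),
}

# layer count with the highest test accuracy
best_layers = max(scores, key=lambda k: scores[k][1])
# the base model had two layers, so the rest were added greedily
added_layers = best_layers - 2
# positive gap means the model scores better on train than on test
gaps = {k: round(train - test, 3) for k, (train, test) in scores.items()}
print(best_layers, added_layers, gaps[2], gaps[12])
```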

An interesting extension to this example would be to allow all weights in the model to be fine tuned with a small learning rate for a large number of training epochs to see if this can further reduce generalization error.

In this section, we will explore using greedy layer-wise pretraining with an unsupervised model.

Specifically, we will develop an autoencoder model that will be trained to reconstruct input data. In order to use this unsupervised model for classification, we will remove the output layer, then add and fit a new output layer for classification.

This is slightly more complex than the previous supervised greedy layer-wise pretraining, but we can reuse many of the same ideas and code from the previous section.

The first step is to define, fit, and evaluate an autoencoder model. We will use the same two-layer base model as we did in the previous section, except modify it to predict the input as the output and use mean squared error to evaluate how good the model is at reconstructing a given input sample.

The *base_autoencoder()* function below implements this, taking the train and test sets as arguments, then defines, fits, and evaluates the base unsupervised autoencoder model, printing the reconstruction error on the train and test sets and returning the model.

    # define, fit and evaluate the base autoencoder
    def base_autoencoder(trainX, testX):
        # define model
        model = Sequential()
        model.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(2, activation='linear'))
        # compile model
        model.compile(loss='mse', optimizer=SGD(learning_rate=0.01, momentum=0.9))
        # fit model
        model.fit(trainX, trainX, epochs=100, verbose=0)
        # evaluate reconstruction loss
        train_mse = model.evaluate(trainX, trainX, verbose=0)
        test_mse = model.evaluate(testX, testX, verbose=0)
        print('> reconstruction error train=%.3f, test=%.3f' % (train_mse, test_mse))
        return model
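The reconstruction error reported by this function is a mean squared error between each input and the model's attempt to reproduce it. A minimal plain-Python sketch of that measure is shown below; the helper name and the example values are made up for illustration.

```python
# Mean squared error between inputs and their reconstructions.
# Helper name and example values are hypothetical.

def mean_squared_error(inputs, reconstructions):
    """Average squared difference over every element of every row."""
    total, count = 0.0, 0
    for x_row, r_row in zip(inputs, reconstructions):
        for x, r in zip(x_row, r_row):
            total += (x - r) ** 2
            count += 1
    return total / count

inputs = [[1.0, 2.0], [3.0, 4.0]]
# a perfect autoencoder would reproduce its inputs exactly
perfect = mean_squared_error(inputs, inputs)
# an imperfect one gets some values wrong
noisy = mean_squared_error(inputs, [[1.0, 2.5], [2.0, 4.0]])
print(perfect, noisy)
```

A value of zero means the inputs were reproduced exactly, so lower is better when comparing autoencoders of different depths.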

We can call this function in order to prepare our base autoencoder to which we can add and greedily train layers.

    # get the base autoencoder
    model = base_autoencoder(trainX, testX)

Evaluating an autoencoder model on the blobs multi-class classification problem requires a few steps.

The hidden layers will be used as the basis of a classifier with a new output layer, which must be trained and then used to make predictions. Afterwards, the original output layer is added back so that we can continue to add layers to the autoencoder.

The first step is to reference, then remove the output layer of the autoencoder model.

    # remember the current output layer
    output_layer = model.layers[-1]
    # remove the output layer
    model.pop()

All of the remaining hidden layers in the autoencoder must be marked as non-trainable so that the weights are not changed when we train the new output layer.

    # mark all remaining layers as non-trainable
    for layer in model.layers:
        layer.trainable = False

We can now add a new output layer that predicts the probability of an example belonging to each of the three classes. The model must also be re-compiled using a new loss function suitable for multi-class classification.

    # add new output layer
    model.add(Dense(3, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=0.01, momentum=0.9), metrics=['acc'])

The model can then be re-fit on the training dataset, specifically training the output layer on how to make class predictions using the learned features from the autoencoder as input.

The classification accuracy of the fit model can then be evaluated on the train and test datasets.

    # fit model
    model.fit(trainX, trainy, epochs=100, verbose=0)
    # evaluate model
    _, train_acc = model.evaluate(trainX, trainy, verbose=0)
    _, test_acc = model.evaluate(testX, testy, verbose=0)

Finally, we can put the autoencoder back together by removing the classification output layer, adding back the original autoencoder output layer, and recompiling the model with an appropriate loss function for reconstruction.

    # put the model back together
    model.pop()
    model.add(output_layer)
    model.compile(loss='mse', optimizer=SGD(learning_rate=0.01, momentum=0.9))

We can tie this together into an *evaluate_autoencoder_as_classifier()* function that takes the model as well as the train and test sets, then returns the train and test set classification accuracy.

    # evaluate the autoencoder as a classifier
    def evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy):
        # remember the current output layer
        output_layer = model.layers[-1]
        # remove the output layer
        model.pop()
        # mark all remaining layers as non-trainable
        for layer in model.layers:
            layer.trainable = False
        # add new output layer
        model.add(Dense(3, activation='softmax'))
        # compile model
        model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=0.01, momentum=0.9), metrics=['acc'])
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)
        # evaluate model
        _, train_acc = model.evaluate(trainX, trainy, verbose=0)
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        # put the model back together
        model.pop()
        model.add(output_layer)
        model.compile(loss='mse', optimizer=SGD(learning_rate=0.01, momentum=0.9))
        return train_acc, test_acc
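The important property of this function is that the autoencoder leaves it with the same structure it had on entry. The framework-free sketch below replays the temporary head swap on a plain list of toy layers to show this. *ToyLayer* and *evaluate_with_temporary_head()* are hypothetical stand-ins, and no fitting or evaluation is performed.

```python
# Framework-free replay of swapping in a temporary classification head.
# ToyLayer and evaluate_with_temporary_head are hypothetical stand-ins.

class ToyLayer:
    def __init__(self, name):
        self.name = name
        self.trainable = True

def evaluate_with_temporary_head(layers, head_name):
    # remember and remove the reconstruction output layer
    output_layer = layers.pop()
    # freeze the hidden layers so only the new head would be trained
    for layer in layers:
        layer.trainable = False
    # attach the temporary classification head
    layers.append(ToyLayer(head_name))
    during = [layer.name for layer in layers]
    # (fitting and evaluating the classifier would happen here)
    # remove the head and restore the original output layer
    layers.pop()
    layers.append(output_layer)
    return during

layers = [ToyLayer('hidden_1'), ToyLayer('reconstruction_output')]
original_output = layers[-1]
during = evaluate_with_temporary_head(layers, 'softmax_head')
after = [layer.name for layer in layers]
print(during, after)
```

Note that the hidden layers stay frozen after the call, which is why marking them as non-trainable again later is redundant.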

This function can be called to evaluate the baseline autoencoder model and then store the accuracy scores in a dictionary against the number of layers in the model (in this case two).

    # evaluate the base model
    scores = dict()
    train_acc, test_acc = evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy)
    print('> classifier accuracy layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))
    scores[len(model.layers)] = (train_acc, test_acc)

We are now ready to define the process for adding and pretraining layers to the model.

The process for adding layers is much the same as the supervised case in the previous section, except we are optimizing reconstruction loss rather than classification accuracy for the new layer.

The *add_layer_to_autoencoder()* function below adds a new hidden layer to the autoencoder model, updates the weights for the new layer and the output layer, then reports the reconstruction error on the input data of the train and test sets. The function re-marks all prior layers as non-trainable, which is redundant because we already did this in the *evaluate_autoencoder_as_classifier()* function, but I have left it in, in case you decide to reuse this function in your own project.

    # add one new layer and re-train only the new layer
    def add_layer_to_autoencoder(model, trainX, testX):
        # remember the current output layer
        output_layer = model.layers[-1]
        # remove the output layer
        model.pop()
        # mark all remaining layers as non-trainable
        for layer in model.layers:
            layer.trainable = False
        # add a new hidden layer
        model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))
        # re-add the output layer
        model.add(output_layer)
        # fit model
        model.fit(trainX, trainX, epochs=100, verbose=0)
        # evaluate reconstruction loss
        train_mse = model.evaluate(trainX, trainX, verbose=0)
        test_mse = model.evaluate(testX, testX, verbose=0)
        print('> reconstruction error train=%.3f, test=%.3f' % (train_mse, test_mse))

We can now repeatedly call this function, adding layers, and evaluating the effect by using the autoencoder as the basis for evaluating a new classifier.

    # add layers and evaluate the updated model
    n_layers = 5
    for _ in range(n_layers):
        # add layer
        add_layer_to_autoencoder(model, trainX, testX)
        # evaluate model
        train_acc, test_acc = evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy)
        print('> classifier accuracy layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))
        # store scores for plotting
        scores[len(model.layers)] = (train_acc, test_acc)

As before, all accuracy scores are collected and we can use them to create a line graph of the number of model layers vs train and test set accuracy.

    # plot number of added layers vs accuracy
    keys = scores.keys()
    pyplot.plot(keys, [scores[k][0] for k in keys], label='train', marker='.')
    pyplot.plot(keys, [scores[k][1] for k in keys], label='test', marker='.')
    pyplot.legend()
    pyplot.show()

Tying all of this together, the complete example of unsupervised greedy layer-wise pretraining for the blobs multi-class classification problem is listed below.

    # unsupervised greedy layer-wise pretraining for blobs classification problem
    from sklearn.datasets import make_blobs
    from keras.layers import Dense
    from keras.models import Sequential
    from keras.optimizers import SGD
    from keras.utils import to_categorical
    from matplotlib import pyplot

    # prepare the dataset
    def prepare_data():
        # generate 2d classification dataset
        X, y = make_blobs(n_samples=1000, centers=3, n_features=2, cluster_std=2, random_state=2)
        # one hot encode output variable
        y = to_categorical(y)
        # split into train and test
        n_train = 500
        trainX, testX = X[:n_train, :], X[n_train:, :]
        trainy, testy = y[:n_train], y[n_train:]
        return trainX, testX, trainy, testy

    # define, fit and evaluate the base autoencoder
    def base_autoencoder(trainX, testX):
        # define model
        model = Sequential()
        model.add(Dense(10, input_dim=2, activation='relu', kernel_initializer='he_uniform'))
        model.add(Dense(2, activation='linear'))
        # compile model
        model.compile(loss='mse', optimizer=SGD(learning_rate=0.01, momentum=0.9))
        # fit model
        model.fit(trainX, trainX, epochs=100, verbose=0)
        # evaluate reconstruction loss
        train_mse = model.evaluate(trainX, trainX, verbose=0)
        test_mse = model.evaluate(testX, testX, verbose=0)
        print('> reconstruction error train=%.3f, test=%.3f' % (train_mse, test_mse))
        return model

    # evaluate the autoencoder as a classifier
    def evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy):
        # remember the current output layer
        output_layer = model.layers[-1]
        # remove the output layer
        model.pop()
        # mark all remaining layers as non-trainable
        for layer in model.layers:
            layer.trainable = False
        # add new output layer
        model.add(Dense(3, activation='softmax'))
        # compile model
        model.compile(loss='categorical_crossentropy', optimizer=SGD(learning_rate=0.01, momentum=0.9), metrics=['acc'])
        # fit model
        model.fit(trainX, trainy, epochs=100, verbose=0)
        # evaluate model
        _, train_acc = model.evaluate(trainX, trainy, verbose=0)
        _, test_acc = model.evaluate(testX, testy, verbose=0)
        # put the model back together
        model.pop()
        model.add(output_layer)
        model.compile(loss='mse', optimizer=SGD(learning_rate=0.01, momentum=0.9))
        return train_acc, test_acc

    # add one new layer and re-train only the new layer
    def add_layer_to_autoencoder(model, trainX, testX):
        # remember the current output layer
        output_layer = model.layers[-1]
        # remove the output layer
        model.pop()
        # mark all remaining layers as non-trainable
        for layer in model.layers:
            layer.trainable = False
        # add a new hidden layer
        model.add(Dense(10, activation='relu', kernel_initializer='he_uniform'))
        # re-add the output layer
        model.add(output_layer)
        # fit model
        model.fit(trainX, trainX, epochs=100, verbose=0)
        # evaluate reconstruction loss
        train_mse = model.evaluate(trainX, trainX, verbose=0)
        test_mse = model.evaluate(testX, testX, verbose=0)
        print('> reconstruction error train=%.3f, test=%.3f' % (train_mse, test_mse))

    # prepare data
    trainX, testX, trainy, testy = prepare_data()
    # get the base autoencoder
    model = base_autoencoder(trainX, testX)
    # evaluate the base model
    scores = dict()
    train_acc, test_acc = evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy)
    print('> classifier accuracy layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))
    scores[len(model.layers)] = (train_acc, test_acc)
    # add layers and evaluate the updated model
    n_layers = 5
    for _ in range(n_layers):
        # add layer
        add_layer_to_autoencoder(model, trainX, testX)
        # evaluate model
        train_acc, test_acc = evaluate_autoencoder_as_classifier(model, trainX, trainy, testX, testy)
        print('> classifier accuracy layers=%d, train=%.3f, test=%.3f' % (len(model.layers), train_acc, test_acc))
        # store scores for plotting
        scores[len(model.layers)] = (train_acc, test_acc)
    # plot number of added layers vs accuracy
    keys = scores.keys()
    pyplot.plot(keys, [scores[k][0] for k in keys], label='train', marker='.')
    pyplot.plot(keys, [scores[k][1] for k in keys], label='test', marker='.')
    pyplot.legend()
    pyplot.show()

Running the example reports both reconstruction error and classification accuracy on the train and test sets for the base model (two layers), then after each additional layer is added (from three to seven layers).

In this case, we can see that reconstruction error starts low, in fact near-perfect, then slowly increases as layers are added. Accuracy on the training dataset seems to decrease as layers are added to the encoder, although accuracy on the test set seems to improve, at least until the model has five layers, after which performance appears to crash.

    > reconstruction error train=0.000, test=0.000
    > classifier accuracy layers=2, train=0.830, test=0.832
    > reconstruction error train=0.001, test=0.002
    > classifier accuracy layers=3, train=0.826, test=0.842
    > reconstruction error train=0.002, test=0.002
    > classifier accuracy layers=4, train=0.820, test=0.838
    > reconstruction error train=0.016, test=0.028
    > classifier accuracy layers=5, train=0.828, test=0.834
    > reconstruction error train=2.311, test=2.694
    > classifier accuracy layers=6, train=0.764, test=0.762
    > reconstruction error train=2.192, test=2.526
    > classifier accuracy layers=7, train=0.764, test=0.760

A line plot is also created showing the train (blue) and test set (orange) accuracy as each additional layer is added to the model.

In this case, the plot suggests there may be some minor benefits in the unsupervised greedy layer-wise pretraining, but perhaps beyond five layers the model becomes unstable.

An interesting extension would be to explore whether fine tuning all weights in the model, either before or after fitting a classifier output layer, improves performance.

This section provides more resources on the topic if you are looking to go deeper.

- Greedy Layer-Wise Training of Deep Networks, 2007.
- Why Does Unsupervised Pre-training Help Deep Learning, 2010.

- Section 8.7.4 Supervised Pretraining, Deep Learning, 2016.
- Section 15.1 Greedy Layer-Wise Unsupervised Pretraining, Deep Learning, 2016.

In this tutorial, you discovered greedy layer-wise pretraining as a technique for developing deep multi-layered neural network models.

Specifically, you learned:

- Greedy layer-wise pretraining provides a way to develop deep multi-layered neural networks whilst only ever training shallow networks.
- Pretraining can be used to iteratively deepen a supervised model or an unsupervised model that can be repurposed as a supervised model.
- Pretraining may be useful for problems with small amounts of labeled data and large amounts of unlabeled data.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop Deep Learning Neural Networks With Greedy Layer-Wise Pretraining appeared first on Machine Learning Mastery.

]]>