Deep learning neural networks are challenging to configure and train.
There are decades of tips and tricks spread across hundreds of research papers, source code, and the heads of academics and practitioners.
The book “Neural Networks: Tricks of the Trade,” originally published in 1998 and updated in 2012 at the cusp of the deep learning renaissance, ties together the disparate tips and tricks into a single volume. It includes advice that is required reading for all deep learning neural network practitioners.
In this post, you will discover the book “Neural Networks: Tricks of the Trade” that provides advice by neural network academics and practitioners on how to get the most out of your models.
After reading this post, you will know:
- The motivation for why the book was written.
- A breakdown of the chapters and topics in the first and second editions.
- A list and summary of the must-read chapters for every neural network practitioner.
Let’s get started.
Neural Networks: Tricks of the Trade is a collection of papers on techniques to get better performance from neural network models.
The first edition was published in 1998 and comprises five parts and 17 chapters. The second edition was published right on the cusp of the new deep learning renaissance in 2012 and includes three more parts and 13 new chapters.
If you are a deep learning practitioner, then it is a must-read book.
I own and reference both editions.
The motivation for the book was to collate the empirical and theoretically grounded tips, tricks, and best practices used to get the best performance from neural network models in practice.
The editors’ concern is that many of the useful tips and tricks are tacit knowledge in the field, trapped in people’s heads, code bases, or at the end of conference papers, and that beginners to the field should be aware of them.
It is our belief that researchers and practitioners acquire, through experience and word-of-mouth, techniques and heuristics that help them successfully apply neural networks to difficult real-world problems. […] they are usually hidden in people’s heads or in the back pages of space-constrained conference papers.
The book is an effort to group the tricks together, following the success of a workshop of the same name at the 1996 NIPS conference.
This book is an outgrowth of a 1996 NIPS workshop called Tricks of the Trade whose goal was to begin the process of gathering and documenting these tricks. The interest that the workshop generated motivated us to expand our collection and compile it into this book.
— Page 1, Neural Networks: Tricks of the Trade, Second Edition, 2012.
Breakdown of First Edition
The first edition of the book was edited by Genevieve Orr and Klaus-Robert Muller. It comprises five parts and 17 chapters and was published 20 years ago, in 1998.
Each part includes a useful preface that summarizes what to expect in the upcoming chapters, and each chapter is written by one or more academics in the field.
The breakdown of this first edition was as follows:
Part 1: Speeding Learning
- Chapter 1: Efficient BackProp
Part 2: Regularization Techniques to Improve Generalization
- Chapter 2: Early Stopping – But When?
- Chapter 3: A Simple Trick for Estimating the Weight Decay Parameter
- Chapter 4: Controlling the Hyperparameter Search in MacKay’s Bayesian Neural Network Framework
- Chapter 5: Adaptive Regularization in Neural Network Modeling
- Chapter 6: Large Ensemble Averaging
Part 3: Improving Network Models and Algorithmic Tricks
- Chapter 7: Square Unit Augmented, Radically Extended, Multilayer Perceptrons
- Chapter 8: A Dozen Tricks with Multitask Learning
- Chapter 9: Solving the Ill-Conditioning in Neural Network Learning
- Chapter 10: Centering Neural Network Gradient Factors
- Chapter 11: Avoiding Roundoff Error in Backpropagating Derivatives
Part 4: Representation and Incorporating Prior Knowledge in Neural Network Training
- Chapter 12: Transformation Invariance in Pattern Recognition – Tangent Distance and Tangent Propagation
- Chapter 13: Combining Neural Networks and Context-Driven Search for On-Line Printed Handwriting Recognition in the Newton
- Chapter 14: Neural Network Classification and Prior Class Probabilities
- Chapter 15: Applying Divide and Conquer to Large Scale Pattern Recognition Tasks
Part 5: Tricks for Time Series
- Chapter 16: Forecasting the Economy with Neural Nets: A Survey of Challenges and Solutions
- Chapter 17: How to Train Neural Networks
It is an expensive book, and if you can pick up a cheap second-hand copy of this first edition, then I highly recommend it.
Additions in the Second Edition
The second edition of the book was released in 2012, seemingly right at the beginning of the large push that became “deep learning.” As such, the book captures the new techniques at the time such as layer-wise pretraining and restricted Boltzmann machines.
It was too early to focus on the ReLU, ImageNet with CNNs, and use of large LSTMs.
Nevertheless, the second edition included three new parts and 13 new chapters.
The breakdown of the additions in the second edition is as follows:
Part 6: Big Learning in Deep Neural Networks
- Chapter 18: Stochastic Gradient Descent Tricks
- Chapter 19: Practical Recommendations for Gradient-Based Training of Deep Architectures
- Chapter 20: Training Deep and Recurrent Networks with Hessian-Free Optimization
- Chapter 21: Implementing Neural Networks Efficiently
Part 7: Better Representations: Invariant, Disentangled and Reusable
- Chapter 22: Learning Feature Representations with K-Means
- Chapter 23: Deep Big Multilayer Perceptrons for Digit Recognition
- Chapter 24: A Practical Guide to Training Restricted Boltzmann Machines
- Chapter 25: Deep Boltzmann Machines and the Centering Trick
- Chapter 26: Deep Learning via Semi-supervised Embedding
Part 8: Identifying Dynamical Systems for Forecasting and Control
- Chapter 27: A Practical Guide to Applying Echo State Networks
- Chapter 28: Forecasting with Recurrent Neural Networks: 12 Tricks
- Chapter 29: Solving Partially Observable Reinforcement Learning Problems with Recurrent Neural Networks
- Chapter 30: 10 Steps and Some Tricks to Set up Neural Reinforcement Controllers
The whole book is a good read, although I don’t recommend reading all of it if you are looking for quick and useful tips that you can use immediately.
This is because many of the chapters focus on the writers’ pet projects, or on highly specialized methods. Instead, I recommend reading four specific chapters, two from the first edition and two from the second.
The second edition of the book is worth purchasing for these four chapters alone, and I highly recommend picking up a copy for yourself, your team, or your office.
Fortunately, there are pre-print PDFs of these chapters available for free online.
The recommended chapters are:
- Chapter 1: Efficient BackProp, by Yann LeCun, et al.
- Chapter 2: Early Stopping – But When?, by Lutz Prechelt.
- Chapter 18: Stochastic Gradient Descent Tricks, by Leon Bottou.
- Chapter 19: Practical Recommendations for Gradient-Based Training of Deep Architectures, by Yoshua Bengio.
Let’s take a closer look at each of these chapters in turn.
Efficient BackProp
This chapter focuses on providing very specific tips to get the most out of the stochastic gradient descent optimization algorithm and the backpropagation weight update algorithm.
Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks, and offers explanations of why they work.
— Page 9, Neural Networks: Tricks of the Trade, First Edition, 1998.
The chapter proceeds to provide a dense and theoretically supported list of tips for configuring the algorithm, preparing input data, and more.
The chapter is so dense that it is hard to summarize, although a good list of recommendations is provided in the “Discussion and Conclusion” section at the end, quoted from the book below:
– shuffle the examples
– center the input variables by subtracting the mean
– normalize the input variable to a standard deviation of 1
– if possible, decorrelate the input variables.
– pick a network with the sigmoid function shown in figure 1.4
– set the target values within the range of the sigmoid, typically +1 and -1.
– initialize the weights to random values as prescribed by 1.16.
The preferred method for training the network should be picked as follows:
– if the training set is large (more than a few hundred samples) and redundant, and if the task is classification, use stochastic gradient with careful tuning, or use the stochastic diagonal Levenberg Marquardt method.
– if the training set is not too large, or if the task is regression, use conjugate gradient.
— Pages 47-48, Neural Networks: Tricks of the Trade, First Edition, 1998.
The field of applied neural networks has come a long way in the twenty years since this was published (e.g. the comments on sigmoid activation functions are no longer relevant), yet the basics have not changed.
This chapter is required reading for all deep learning practitioners.
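To make the input-preparation tips quoted above concrete, here is a minimal NumPy sketch. It is my own illustration, not code from the book. The function names are hypothetical, decorrelation is done with a PCA rotation (one of several possible choices), and the activation and weight scale are what I assume figure 1.4 and equation 1.16 refer to: a scaled tanh, and zero-mean weights with a standard deviation of one over the square root of the fan-in.

```python
import numpy as np

def prepare_inputs(X, eps=1e-8, seed=0):
    """Shuffle, center, decorrelate, and scale X (rows = examples, columns = inputs)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(X.shape[0])      # shuffle the examples
    X = X[order]
    X = X - X.mean(axis=0)                   # center each input variable
    # Decorrelate by rotating onto the eigenvectors of the covariance matrix (PCA).
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    X = X @ eigvecs
    # Scale each decorrelated component to a standard deviation of 1.
    X = X / np.sqrt(np.maximum(eigvals, 0.0) + eps)
    return X, order                          # reuse `order` to shuffle the targets

def scaled_tanh(a):
    """Assumed form of the recommended sigmoid: 1.7159 * tanh(2a / 3)."""
    return 1.7159 * np.tanh(2.0 * a / 3.0)

def init_weights(fan_in, fan_out, rng):
    """Assumed form of the initialization: zero mean, std = fan_in ** -0.5."""
    return rng.normal(0.0, fan_in ** -0.5, size=(fan_in, fan_out))
```

The function returns the permutation so that the target values can be shuffled in the same order, for example `y = y[order]`. In practice, the mean and rotation should be estimated on the training set only and then reused on validation and test data.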
Early Stopping – But When?
This chapter describes the simple yet powerful regularization method called early stopping that will halt the training of a neural network when the performance of the model begins to degrade on a hold-out validation dataset.
Validation can be used to detect when overfitting starts during supervised training of a neural network; training is then stopped before convergence to avoid the overfitting (“early stopping”)
— Page 55, Neural Networks: Tricks of the Trade, First Edition, 1998.
The challenge of early stopping is the choice and configuration of the trigger used to stop the training process, and the systematic configuration of early stopping is the focus of the chapter.
The general early stopping criteria are described as:
- GL: stop as soon as the generalization loss exceeds a specified threshold.
- PQ: stop as soon as the quotient of generalization loss and progress exceeds a threshold.
- UP: stop when the generalization error increases in strips.
Three recommendations are provided, i.e. “the trick”:
1. Use fast stopping criteria unless small improvements of network performance (e.g. 4%) are worth large increases of training time (e.g. factor 4).
2. To maximize the probability of finding a “good” solution (as opposed to maximizing the average quality of solutions), use a GL criterion.
3. To maximize the average quality of solutions, use a PQ criterion if the network overfits only very little or an UP criterion otherwise.
— Page 60, Neural Networks: Tricks of the Trade, First Edition, 1998.
The rules are analyzed empirically over a large number of training runs and test problems. The crux of the finding is that being more patient with the early stopping criteria results in better hold-out performance at the cost of additional training time.
I conclude slower stopping criteria allow for small improvements in generalization (here: about 4% on average), but cost much more training time (here: about factor 4 longer on average).
— Page 55, Neural Networks: Tricks of the Trade, First Edition, 1998.
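The three stopping criteria can be sketched in a few lines of plain Python. This is my own illustration under stated assumptions, not code from the chapter. I assume generalization loss is the percentage by which the latest validation error exceeds the lowest validation error seen so far, that training progress is measured over a "strip" of the most recent epochs, and that positive error values are recorded once per epoch. The function names and default values are hypothetical.

```python
def generalization_loss(val_errors):
    """Percent increase of the latest validation error over the best so far."""
    best = min(val_errors)                   # assumes strictly positive errors
    return 100.0 * (val_errors[-1] / best - 1.0)

def training_progress(train_errors, strip=5):
    """How far the mean training error of the last strip sits above its minimum (per mille)."""
    recent = train_errors[-strip:]
    return 1000.0 * (sum(recent) / (len(recent) * min(recent)) - 1.0)

def stop_gl(val_errors, alpha):
    """GL: stop once the generalization loss exceeds the threshold alpha."""
    return generalization_loss(val_errors) > alpha

def stop_pq(train_errors, val_errors, alpha, strip=5):
    """PQ: stop once generalization loss divided by training progress exceeds alpha."""
    # Written as a product so that zero progress does not divide by zero.
    return generalization_loss(val_errors) > alpha * training_progress(train_errors, strip)

def stop_up(val_errors, successive=2, strip=5):
    """UP: stop once validation error has risen across `successive` strips in a row."""
    needed = successive * strip + 1
    if len(val_errors) < needed:
        return False
    ends = val_errors[-needed::strip]        # validation error at each strip boundary
    return all(later > earlier for earlier, later in zip(ends, ends[1:]))
```

Each function is intended to be called at the end of a strip, with the lists holding one error value per epoch so far. A larger `alpha` or a larger `successive` makes the criterion more patient, which is the trade-off between training time and generalization that the chapter measures.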
Stochastic Gradient Descent Tricks
This chapter focuses on a detailed review of the stochastic gradient descent optimization algorithm and tips to help get the most out of it.
This chapter provides background material, explains why SGD is a good learning algorithm when the training set is large, and provides useful recommendations.
— Page 421, Neural Networks: Tricks of the Trade, Second Edition, 2012.
There is a lot of overlap with Chapter 1: Efficient BackProp, and although the chapter calls out tips along the way with boxes, a useful list of tips is not summarized at the end of the chapter.
Nevertheless, it is a compulsory read for all neural network practitioners.
Below is my own summary of the tips called out in boxes throughout the chapter, mostly quoting directly from the second edition:
- Use stochastic gradient descent (batch=1) when training time is the bottleneck.
- Randomly shuffle the training examples.
- Use preconditioning techniques.
- Monitor both the training cost and the validation error.
- Check the gradients using finite differences.
- Experiment with the learning rates [with] a small sample of the training set.
- Leverage the sparsity of the training examples.
- Use a decaying learning rate.
- Try averaged stochastic gradient (i.e. a specific variant of the algorithm).
Some of these tips are pithy without context; I recommend reading the chapter.
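To give a few of these tips some context, the sketch below applies them to a toy linear regression model in NumPy. It is my own illustration, not code from the chapter. It shows a finite-difference gradient check, per-epoch shuffling, single-example updates, a decaying learning rate, and a running average of the weights. The `lr0 / (1 + decay * t)` schedule and all names and default values are assumptions chosen for illustration.

```python
import numpy as np

def loss_and_grad(w, x, y):
    """Squared error of a linear model on one example, with its analytic gradient."""
    residual = x @ w - y
    return 0.5 * residual ** 2, residual * x

def check_gradient(w, x, y, h=1e-6):
    """Largest gap between the analytic gradient and central finite differences."""
    _, analytic = loss_and_grad(w, x, y)
    numeric = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = h
        upper, _ = loss_and_grad(w + step, x, y)
        lower, _ = loss_and_grad(w - step, x, y)
        numeric[i] = (upper - lower) / (2.0 * h)
    return np.max(np.abs(analytic - numeric))

def sgd(X, y, epochs=20, lr0=0.1, decay=0.01, seed=0):
    """Single-example SGD with shuffling, a decaying learning rate, and weight averaging."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    w_avg = np.zeros_like(w)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(len(X)):    # reshuffle the examples every epoch
            _, grad = loss_and_grad(w, X[i], y[i])
            lr = lr0 / (1.0 + decay * t)     # decaying learning rate (assumed schedule)
            w -= lr * grad                   # batch=1 update
            t += 1
            w_avg += (w - w_avg) / t         # running average of all iterates
    return w, w_avg
```

Running `check_gradient` before training is a cheap way to catch a wrong derivative. The averaged weights `w_avg` include the early, poor iterates, so in practice averaging is often started only after an initial period.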
Practical Recommendations for Gradient-Based Training of Deep Architectures
This chapter focuses on the effective training of neural networks and early deep learning models.
It ties together the classical advice from Chapters 1 and 18 but adds comments on (at the time) recent deep learning developments like greedy layer-wise pretraining, modern hardware like GPUs, efficient code libraries like BLAS, and advice from real projects on tuning the training of models, like the order in which to tune hyperparameters.
This chapter is meant as a practical guide with recommendations for some of the most commonly used hyper-parameters, in particular in the context of learning algorithms based on backpropagated gradient and gradient-based optimization.
— Page 437, Neural Networks: Tricks of the Trade, Second Edition, 2012.
It’s also long, divided into six main sections:
- Deep Learning Innovations. Including greedy layer-wise pretraining, denoising autoencoders, and online learning.
- Gradients. Including mini-batch gradient descent and automatic differentiation.
- Hyperparameters. Including learning rate, mini-batch size, epochs, momentum, nodes, weight regularization, activity regularization, hyperparameter search, and recommendations.
- Debugging and Analysis. Including monitoring loss for overfitting, visualization, and statistics.
- Other Recommendations. Including GPU hardware and use of efficient linear algebra libraries such as BLAS.
- Open Questions. Including the difficulty of training deep models and adaptive learning rates.
There’s far too much for me to summarize; the chapter is dense with useful advice for configuring and tuning neural network models.
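As one small example of the hyperparameter search topic, here is a sketch of random search in plain Python. It is my own illustration rather than a procedure taken from the chapter. The search ranges, the choice to sample the learning rate on a log scale, and the `evaluate` callback (assumed to train a model and return a validation error) are all hypothetical.

```python
import math
import random

def sample_config(rng):
    """Draw one hypothetical configuration; ranges are illustrative only."""
    return {
        "learning_rate": 10 ** rng.uniform(-5, -1),  # log-uniform in [1e-5, 1e-1]
        "batch_size": rng.choice([16, 32, 64, 128]),
        "momentum": rng.uniform(0.0, 0.99),
    }

def random_search(evaluate, n_trials=20, seed=0):
    """Return the sampled configuration with the lowest score from `evaluate`."""
    rng = random.Random(seed)
    best_config, best_score = None, math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)             # e.g. validation error; lower is better
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Sampling the learning rate on a log scale spreads the trials evenly across orders of magnitude, which matters because useful learning rates can differ by factors of ten or more between problems.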
Without a doubt, this is required reading. It provided the seeds for the recommendations later described in the 2016 book Deep Learning, of which Yoshua Bengio was one of three authors.
The chapter finishes on a strong, optimistic note.
The practice summarized here, coupled with the increase in available computing power, now allows researchers to train neural networks on a scale that is far beyond what was possible at the time of the first edition of this book, helping to move us closer to artificial intelligence.
— Page 473, Neural Networks: Tricks of the Trade, Second Edition, 2012.
Get the Book on Amazon
- Neural Networks: Tricks of the Trade, First Edition, 1998.
- Neural Networks: Tricks of the Trade, Second Edition, 2012.
Other Book Pages
- Neural Networks: Tricks of the Trade, Second Edition, 2012. Springer Homepage.
- Neural Networks: Tricks of the Trade, Second Edition, 2012. Google Books
Pre-Prints of Recommended Chapters
- Efficient BackProp, 1998.
- Early Stopping – But When?, 1998.
- Stochastic Gradient Descent Tricks, 2012.
- Practical Recommendations for Gradient-Based Training of Deep Architectures, 2012.
In this post, you discovered the book “Neural Networks: Tricks of the Trade” that provides advice from neural network academics and practitioners on how to get the most out of your models.
Have you read some or all of this book? What do you think of it?
Let me know in the comments below.