Last Updated on August 14, 2020

I often see practitioners expressing confusion about how to evaluate a deep learning model.

This is often obvious from questions like:

- What random seed should I use?
- Do I need a random seed?
- Why don’t I get the same results on subsequent runs?

In this post, you will discover the procedure that you can use to evaluate deep learning models and the rationale for using it.

You will also discover useful related statistics that you can calculate to present the skill of your model, such as standard deviation, standard error, and confidence intervals.


Let’s get started.

## The Beginner’s Mistake

You fit the model to your training data and evaluate it on the test dataset, then report the skill.

Perhaps you use k-fold cross validation to evaluate the model, then report the skill of the model.

This is a mistake made by beginners.

It looks like you’re doing the right thing, but there is a key issue you have not accounted for:

**Deep learning models are stochastic.**

Artificial neural networks use randomness while being fit on a dataset, such as random initial weights and random shuffling of the data at each training epoch of stochastic gradient descent.

This means that each time the same model is fit on the same data, it may give different predictions and in turn have different overall skill.
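The effect is easy to reproduce on a small scale. The sketch below is not a deep network: it assumes a tiny logistic regression trained by stochastic gradient descent on made-up one-dimensional data, and the helper names `fit` and `accuracy` are hypothetical. It uses the same two sources of randomness (random initial weights and per-epoch shuffling), and fitting it twice on the same data produces two different sets of weights.

```python
import math
import random

# Made-up dataset, generated once and then held fixed.
data_rng = random.Random(0)
data = []
for _ in range(200):
    label = data_rng.randint(0, 1)
    center = 1.0 if label == 1 else -1.0
    data.append((data_rng.gauss(center, 1.5), label))
train, test = data[:150], data[150:]


def fit(rows, rng, epochs=3, learning_rate=0.1):
    """Fit a one-feature logistic regression with stochastic gradient descent."""
    weight = rng.uniform(-1.0, 1.0)  # random initial weights
    bias = rng.uniform(-1.0, 1.0)
    rows = list(rows)
    for _ in range(epochs):
        rng.shuffle(rows)  # random shuffling of the data each epoch
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
            weight += learning_rate * (y - p) * x
            bias += learning_rate * (y - p)
    return weight, bias


def accuracy(model, rows):
    """Fraction of rows whose predicted class matches the label."""
    weight, bias = model
    hits = sum((weight * x + bias > 0) == (y == 1) for x, y in rows)
    return hits / len(rows)


# Same model, same data, two fits: only the random numbers differ.
model_a = fit(train, random.Random(1))
model_b = fit(train, random.Random(2))
print("fit A:", model_a, "accuracy:", accuracy(model_a, test))
print("fit B:", model_b, "accuracy:", accuracy(model_b, test))
```

On a problem this simple the two accuracy values may happen to coincide even though the weights differ; on a larger network the differences in skill are usually easier to see.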

## Estimating Model Skill

(*Controlling for Model Variance*)

We don’t have all possible data; if we did, we would not need to make predictions.

We have a limited sample of data, and from it we need to discover the best model we can.

### Use a Train-Test Split

We do that by splitting the data into two parts, fitting a model or specific model configuration on the first part of the data and using the fit model to make predictions on the rest, then evaluating the skill of those predictions. This is called a train-test split and we use the skill as an estimate for how well we think the model will perform in practice when it makes predictions on new data.

For example, here’s some pseudocode for evaluating a model using a train-test split:

```
train, test = split(data)
model = fit(train.X, train.y)
predictions = model.predict(test.X)
skill = compare(test.y, predictions)
```

A train-test split is a good approach to use if you have a lot of data or a very slow model to train, but the resulting skill score for the model will be noisy because of the randomness in the data (variance of the model).

This means that the same model fit on different data will give different model skill scores.
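Here is one way the train-test split procedure could look as runnable Python. It is a sketch under stated assumptions: the data is made up, the "model" is a deliberately simple nearest-class-mean classifier standing in for whatever you are evaluating, and `split`, `fit`, `predict`, and `compare` are hypothetical helpers named after the pseudocode.

```python
import random

# Made-up data: one numeric feature per row, label 0 or 1.
rng = random.Random(0)
data = []
for _ in range(300):
    label = rng.randint(0, 1)
    data.append((rng.gauss(2.0 * label, 1.0), label))


def split(rows, train_fraction=0.67, seed=1):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def fit(train):
    """Learn the mean feature value of each class (a stand-in model)."""
    model = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        model[label] = sum(values) / len(values)
    return model


def predict(model, x):
    """Predict the class whose mean is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))


def compare(expected, predicted):
    """Classification accuracy."""
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)


train, test = split(data)
model = fit(train)
predictions = [predict(model, x) for x, _ in test]
skill = compare([y for _, y in test], predictions)
print("skill: %.3f" % skill)

# A different split of the same data can give a different estimate.
for seed in (2, 3, 4):
    train_s, test_s = split(data, seed=seed)
    model_s = fit(train_s)
    skill_s = compare([y for _, y in test_s],
                      [predict(model_s, x) for x, _ in test_s])
    print("seed %d skill: %.3f" % (seed, skill_s))
```

The loop at the end shows the noise discussed above: the model and the data are unchanged, and only the rows that land in each part differ.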

### Use k-Fold Cross Validation

We can often tighten this up and get more accurate estimates of model skill using techniques like k-fold cross validation. This is a technique that systematically splits the available data into k folds, fits the model on k-1 folds, evaluates it on the held-out fold, and repeats this process for each fold.

This results in k different models that have k different sets of predictions, and in turn, k different skill scores.

For example, here’s some pseudocode for evaluating a model using k-fold cross validation:

```
scores = list()
for i in k:
	train, test = split_old(data, i)
	model = fit(train.X, train.y)
	predictions = model.predict(test.X)
	skill = compare(test.y, predictions)
	scores.append(skill)
```
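The k-fold loop can be sketched in plain Python as follows. The data is made up, the nearest-class-mean model is only a stand-in, and `make_folds`, `fit`, and `evaluate` are hypothetical helper names rather than functions from any library.

```python
import random

# Made-up data: one numeric feature per row, label 0 or 1.
rng = random.Random(0)
data = []
for _ in range(200):
    label = rng.randint(0, 1)
    data.append((rng.gauss(2.0 * label, 1.0), label))


def make_folds(n_rows, k, seed=1):
    """Shuffle the row indices once, then deal them into k folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit(train):
    """Stand-in model: the mean feature value of each class."""
    model = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        model[label] = sum(values) / len(values)
    return model


def evaluate(model, test):
    """Accuracy of nearest-class-mean predictions on the test rows."""
    hits = 0
    for x, y in test:
        predicted = min(model, key=lambda label: abs(x - model[label]))
        hits += predicted == y
    return hits / len(test)


k = 10
folds = make_folds(len(data), k)
scores = []
for held_out in folds:
    held_out_set = set(held_out)
    # Train on every row outside the held-out fold, test on the fold itself.
    train = [row for i, row in enumerate(data) if i not in held_out_set]
    test = [data[i] for i in held_out]
    scores.append(evaluate(fit(train), test))

print(["%.2f" % s for s in scores])
```

Each row appears in exactly one test fold, so every observation is used for evaluation once and for training k-1 times.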

A population of skill scores is more useful as we can take the mean and report the average expected performance of the model, which is likely to be closer to the actual performance of the model in practice. For example:

```
mean_skill = sum(scores) / count(scores)
```

We can also calculate a standard deviation using the mean_skill to get an idea of the average spread of scores around the mean_skill:

```
standard_deviation = sqrt(1/count(scores) * sum( (score - mean_skill)^2 ))
```
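In Python, both statistics are available in the standard library. The sketch below uses a handful of illustrative scores (not results from a real model) and also writes the formula above out by hand. Note that `^` in the pseudocode means "raised to the power of"; in Python that is `**`, because `^` is bitwise XOR.

```python
import math
import statistics

# Illustrative fold scores only, not results from a real model.
scores = [0.81, 0.79, 0.84, 0.80, 0.83]

mean_skill = statistics.fmean(scores)

# Written out to mirror the formula above (divides by the number of scores).
manual_sd = math.sqrt(sum((s - mean_skill) ** 2 for s in scores) / len(scores))

population_sd = statistics.pstdev(scores)  # divides by n, same as manual_sd
sample_sd = statistics.stdev(scores)       # divides by n - 1

print("mean=%.3f population_sd=%.4f sample_sd=%.4f"
      % (mean_skill, population_sd, sample_sd))
```

The formula in the text is the population form. With only a few folds, the sample form (`statistics.stdev`) gives a slightly larger value, and the two converge as the number of scores grows.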

## Estimating a Stochastic Model’s Skill

(*Controlling for Model Stability*)

Stochastic models, like deep neural networks, add an additional source of randomness.

This additional randomness gives the model more flexibility when learning, but can make the model less stable (e.g. different results when the same model is trained on the same data).

This is different from model variance that gives different results when the same model is trained on different data.

To get a robust estimate of the skill of a stochastic model, we must take this additional source of variance into account; we must control for it.

### Fix the Random Seed

One way is to use the same randomness every time the model is fit. We can do that by fixing the random number seed used by the system and then evaluating or fitting the model. For example:

```
seed(1)
scores = list()
for i in k:
	train, test = split_old(data, i)
	model = fit(train.X, train.y)
	predictions = model.predict(test.X)
	skill = compare(test.y, predictions)
	scores.append(skill)
```

This is good for tutorials and demonstrations when the same result is needed every time your code is run.

This is fragile and not recommended for evaluating models.

See the post Embrace Randomness in Machine Learning, listed under Further Reading.
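To see what fixing the seed does, here is a small sketch that uses Python's built-in `random` module. The data is made up and `fit` is a hypothetical one-feature logistic regression trained by stochastic gradient descent, standing in for a real network.

```python
import math
import random

# Made-up dataset, drawn from its own generator so it never changes.
data_rng = random.Random(0)
data = []
for _ in range(120):
    label = data_rng.randint(0, 1)
    data.append((data_rng.gauss(2.0 * label - 1.0, 1.5), label))


def fit(rows, epochs=3, learning_rate=0.1):
    """SGD logistic regression that draws from the global random module."""
    weight = random.uniform(-1.0, 1.0)  # random initial weights
    bias = random.uniform(-1.0, 1.0)
    rows = list(rows)
    for _ in range(epochs):
        random.shuffle(rows)  # random shuffling each epoch
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
            weight += learning_rate * (y - p) * x
            bias += learning_rate * (y - p)
    return weight, bias


random.seed(1)
first = fit(data)

random.seed(1)  # same seed, so the same sequence of random numbers
second = fit(data)

random.seed(2)  # different seed, so a different fit
third = fit(data)

print("same seed, same model:", first == second)
print("different seed, same model:", first == third)
```

This also shows why the approach is fragile: the reported result describes one particular seed, not the model. A real deep learning stack usually has more than one random number generator to seed (for example the numerical library and the framework itself), so a single call like this may not be enough there.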

### Repeat Evaluation Experiments

A more robust approach is to repeat the experiment of evaluating a stochastic model multiple times.

For example:

```
scores = list()
for i in repeats:
	run_scores = list()
	for j in k:
		train, test = split_old(data, j)
		model = fit(train.X, train.y)
		predictions = model.predict(test.X)
		skill = compare(test.y, predictions)
		run_scores.append(skill)
	scores.append(mean(run_scores))
```

Note that taking the mean of these scores gives the mean of the estimated mean model skill, the so-called grand mean.

This is my recommended procedure for estimating the skill of a deep learning model.
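A runnable sketch of repeated k-fold cross validation might look like the following. It assumes made-up data and a small logistic regression trained by stochastic gradient descent as the stochastic model; `fit`, `accuracy`, and `repeated_kfold` are hypothetical helper names. Each repeat uses a different seed, which changes both the fold assignment and the model's own randomness.

```python
import math
import random
import statistics

# Made-up dataset, generated once and then held fixed.
data_rng = random.Random(0)
data = []
for _ in range(200):
    label = data_rng.randint(0, 1)
    data.append((data_rng.gauss(2.0 * label - 1.0, 1.5), label))


def fit(rows, rng, epochs=3, learning_rate=0.1):
    """Stochastic stand-in model: logistic regression trained by SGD."""
    weight, bias = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    rows = list(rows)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
            weight += learning_rate * (y - p) * x
            bias += learning_rate * (y - p)
    return weight, bias


def accuracy(model, rows):
    """Fraction of rows whose predicted class matches the label."""
    weight, bias = model
    return sum((weight * x + bias > 0) == (y == 1) for x, y in rows) / len(rows)


def repeated_kfold(rows, k, repeats):
    """Return one mean cross-validation score per repeat."""
    scores = []
    for repeat in range(repeats):
        rng = random.Random(repeat)  # a different seed for each repeat
        indices = list(range(len(rows)))
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        run_scores = []
        for held_out in folds:
            held_out_set = set(held_out)
            train = [rows[i] for i in indices if i not in held_out_set]
            test = [rows[i] for i in held_out]
            run_scores.append(accuracy(fit(train, rng), test))
        scores.append(statistics.fmean(run_scores))
    return scores


scores = repeated_kfold(data, k=5, repeats=30)
grand_mean = statistics.fmean(scores)
print("grand mean skill: %.3f over %d repeats" % (grand_mean, len(scores)))
```

The list `scores` holds one mean per repeat, so it can be passed straight to the summary statistics discussed in this section.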

Because the number of repeats is often 30 or more, we can easily calculate the standard error of the mean model skill, which estimates how much the estimated mean model skill is expected to differ from the unknown actual mean model skill (e.g. how wrong mean_skill might be):

```
standard_error = standard_deviation / sqrt(count(scores))
```

Further, we can use the standard_error to calculate a confidence interval for mean_skill. This assumes that the distribution of the results is Gaussian, which you can check by looking at a histogram or Q-Q plot, or by running statistical tests on the collected scores.

For example, the 95% confidence interval is (1.96 * standard_error) on either side of the mean skill.

```
interval = standard_error * 1.96
lower_interval = mean_skill - interval
upper_interval = mean_skill + interval
```
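Put together in Python, the standard error and the 95% interval take only a few lines. The scores below are synthetic draws used as a stand-in for 30 repeat scores, so the printed numbers mean nothing in themselves. This sketch also uses the sample standard deviation (dividing by n - 1), which is an assumption on my part; with 30 or more scores it differs little from the population form.

```python
import math
import random
import statistics

# Synthetic stand-in for 30 repeat scores; replace with your own results.
rng = random.Random(7)
scores = [rng.gauss(0.80, 0.02) for _ in range(30)]

mean_skill = statistics.fmean(scores)
standard_deviation = statistics.stdev(scores)  # sample form, divides by n - 1
standard_error = standard_deviation / math.sqrt(len(scores))

# 1.96 is the two-sided 95% critical value of the Gaussian distribution.
interval = 1.96 * standard_error
lower_interval = mean_skill - interval
upper_interval = mean_skill + interval
print("mean skill %.3f, 95%% CI [%.3f, %.3f]"
      % (mean_skill, lower_interval, upper_interval))
```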

There are other perhaps more statistically robust methods for calculating confidence intervals than using the standard error of the grand mean, such as:

- Calculating the Binomial proportion confidence interval.
- Using the bootstrap to estimate an empirical confidence interval.
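Both alternatives can be sketched with the standard library. The binomial version below uses the common normal approximation and applies to classification accuracy measured on a single test set of known size. The bootstrap version is a simple percentile bootstrap of the mean. The function names and the example numbers are my own illustrative choices.

```python
import math
import random
import statistics


def binomial_interval(accuracy, n_test, z=1.96):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)


def bootstrap_interval(scores, n_resamples=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the mean of the scores."""
    rng = random.Random(seed)
    # Resample the scores with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# 80% accuracy measured on 100 test rows (illustrative numbers).
print("binomial: 0.80 +/- %.3f" % binomial_interval(0.80, 100))

# Synthetic stand-in for 30 repeat scores.
score_rng = random.Random(7)
scores = [score_rng.gauss(0.80, 0.02) for _ in range(30)]
lower, upper = bootstrap_interval(scores)
print("bootstrap 95%% interval: [%.3f, %.3f]" % (lower, upper))
```

The bootstrap makes no Gaussian assumption about the scores, which is its main appeal when a histogram of your results looks skewed.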

## How Unstable Are Neural Networks?

It depends on your problem, on the network, and on its configuration.

I would recommend performing a sensitivity analysis to find out.

Evaluate the same model on the same data many times (30, 100, or thousands) and only vary the seed for the random number generator.

Then review the mean and standard deviation of the skill scores produced. The standard deviation (average distance of scores from the mean score) will give you an idea of just how unstable your model is.
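A sensitivity analysis of this kind can be sketched as below. The setup is assumed for illustration: made-up data, a single fixed train/test split, and a small logistic regression trained by stochastic gradient descent in place of a real network. The only thing that changes between runs is the seed.

```python
import math
import random
import statistics

# Made-up dataset and a single fixed train/test split.
data_rng = random.Random(0)
data = []
for _ in range(200):
    label = data_rng.randint(0, 1)
    data.append((data_rng.gauss(2.0 * label - 1.0, 1.5), label))
train, test = data[:150], data[150:]


def fit(rows, rng, epochs=3, learning_rate=0.1):
    """Stochastic stand-in model: logistic regression trained by SGD."""
    weight, bias = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
    rows = list(rows)
    for _ in range(epochs):
        rng.shuffle(rows)
        for x, y in rows:
            p = 1.0 / (1.0 + math.exp(-(weight * x + bias)))
            weight += learning_rate * (y - p) * x
            bias += learning_rate * (y - p)
    return weight, bias


def accuracy(model, rows):
    """Fraction of rows whose predicted class matches the label."""
    weight, bias = model
    return sum((weight * x + bias > 0) == (y == 1) for x, y in rows) / len(rows)


# Only the seed changes between runs; the data and configuration do not.
scores = [accuracy(fit(train, random.Random(seed)), test) for seed in range(30)]

mean_skill = statistics.fmean(scores)
spread = statistics.pstdev(scores)
print("mean=%.3f sd=%.3f min=%.3f max=%.3f"
      % (mean_skill, spread, min(scores), max(scores)))
```

A standard deviation close to zero suggests the configuration is stable on this data; a large one suggests that any single run says little about the model.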

### How Many Repeats?

I would recommend at least 30, perhaps 100, even thousands, limited only by your time and computer resources, and diminishing returns (e.g. standard error on the mean_skill).

More rigorously, I would recommend an experiment that looked at the impact on estimated model skill versus the number of repeats and the calculation of the standard error (how much the mean estimated performance differs from the true underlying population mean).
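The shape of such an experiment is sketched below. To keep it fast, the scores are synthetic Gaussian draws rather than real model evaluations, which is a simplifying assumption; in practice you would substitute the scores collected from your own repeats.

```python
import math
import random
import statistics

# Synthetic stand-in for 1,000 repeat scores; not results from a real model.
rng = random.Random(3)
all_scores = [rng.gauss(0.80, 0.02) for _ in range(1000)]

standard_errors = {}
for n_repeats in (10, 30, 100, 300, 1000):
    scores = all_scores[:n_repeats]  # pretend we stopped after n_repeats
    standard_error = statistics.stdev(scores) / math.sqrt(n_repeats)
    standard_errors[n_repeats] = standard_error
    print("repeats=%4d mean=%.4f standard_error=%.5f"
          % (n_repeats, statistics.fmean(scores), standard_error))
```

Because the standard error shrinks with the square root of the number of repeats, halving it takes roughly four times as many runs. That is where the diminishing returns come from.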

## Further Reading

- Embrace Randomness in Machine Learning
- How to Train a Final Machine Learning Model
- Comparing Different Species of Cross-Validation
- Empirical Methods for Artificial Intelligence, Cohen, 1995.
- Standard error on Wikipedia

## Summary

In this post, you discovered how to evaluate the skill of deep learning models.

Specifically, you learned:

- The common mistake made by beginners when evaluating deep learning models.
- The rationale for using repeated k-fold cross validation to evaluate deep learning models.
- How to calculate related model skill statistics, such as standard deviation, standard error, and confidence intervals.

Do you have any questions about estimating the skill of deep learning models?

Post your questions in the comments and I will do my best to answer.

How is running training multiple times on the same dataset different from running training over the same data for multiple epochs?

Can I say that running multiple epochs over the data serves the same purpose, reducing the variation in results that comes from the stochastic nature of the deep learning algorithm?

Each epoch updates the network weights.

Each run through all epochs results in a different model given different random initial conditions.

Does that help?

It certainly helps me. I don’t know why these basic things are not more emphasized (or even mentioned) in books and courses.

Thanks, this might also help:

https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/

I have a strong background in the ERP consulting services industry, but I have not done any programming in the last several years. I want to know whether, if I learn machine learning, I can use it to start a small startup business.

Yes. Consider starting with Weka to learn applied machine learning without any programming:

https://machinelearningmastery.com/start-here/#weka

Thank you so much for your great article. Would you mind me asking three questions?

1. Could you please tell me when we do an evaluation of the Skill of Deep Learning Models? Before hyperparameter tuning or after having selected the best model? Based on my understanding, I chose after. Please correct me if I am wrong.

2. For deep learning, it is such a time-consuming process. If we have to do 30, even 100 or 1000 repeats, is it feasible for a real project?

3. If the final model is an ensemble model which includes several different models, how can I evaluate the skill of the model?

1. Evaluating the skill in any circumstance.

2. It may well not be feasible for a large project, and using checkpointing to save the “best” model seen so far during training may be a better approach.

3. In theory, the same approach could be used.

A little off-topic but when do you think the deep learning bubble will pop?

When the techniques stop delivering value to business in general, or practitioners continue to fail to deliver the value promised.

Hi Jason,

Is there any rule of thumb in order to select the number of repeats? I am a bit confused in this respect as my database is huge and I have 500 epochs in my CNN network.

Great question.

You can perform a sensitivity analysis to find the variance in the score and the point of diminishing returns for the number of repeats. Perhaps 30, perhaps 100. Back in grad school, we used 1000.

Thanks Jason. I will follow that.

Regards

Hi Jason,

Assuming that after a k-fold cross validation run the standard deviation is so small that one can say that the model is relatively independent of the data partitioning on the given data.

Is it still necessary to repeat the k-fold cross validation 30+ times? Can’t you just do a k-fold cross validation and, if it was satisfactory, then do 30+ training repetitions with a statistical summary of these results?

30 runs with the same training data will measure the stochastic properties of the model.

30 runs on different training data will measure both model variance and the stochastic properties.

Hi Jason. Great article as usual. I have a question which might not be relevant to this topic. How do I know if/when my NN model might fail? Is there a way to check how well my NN model can generalize?

The k-fold cross-validation score is designed to give an idea of how well a model generalizes to new data.

Does that help?

Hi Jason. Great article!

I have a few questions, if you don’t mind.

1. My model gives a different accuracy each time I change the random_state value when training the model, like random_state(45) or random_state(32). What is the reason behind this?

2. Is there any optimum value for random_state?

3. Is there any difference between random_seed and random_state?

Thanks for your time.

The algorithm is stochastic, learn more here:

https://machinelearningmastery.com/randomness-in-machine-learning/

There is no optimal random seed, we are evaluating the average performance, learn more about random numbers here:

https://machinelearningmastery.com/introduction-to-random-number-generators-for-machine-learning/

random_state/seed are the same thing.

Hi Jason,

thanks for this great post.

Please tell me if I understand it right: when I repeat evaluation experiments, I split the data in each run randomly, correct? (in contrast to k fold cross validation where the split is fixed). The percentage of split, e.g. 70% training data and 30% test data, then depends on the amount of data available?

Thanks in advance for your reply.

Best regards,

David

Yes.

ok! how about comparing two deep convolutional neural networks that differ only in their training data? (suppose data augmentation vs. no data augmentation) would you run each algorithm several times and average the performance determined on random parts of the data held out as the test sets? or would it here be important to evaluate the two algorithms in each run on the same held out test set to guarantee an equal comparison?

thanks again for your answer!

Great question.

The same data splits used across each case is required (train/test), although the specific data used to train the treatment case would be different because of the augmentation.

Repeating the evaluation of each case on the same data splits will measure variance in each model in response to random initialization/training.

Repeating the evaluation of each case on different but paired split (same splits used for each case) would measure variance in the model behavior more generally – across specific training data, and I think this is what you want to aim for.

Does that help?

yes, that helps! thank you very much!

Hi Jason,

A few questions:

1) How do I choose the best model after evaluation and getting the average performance? Is that even a correct question?

2) Like when we do a grid search we can get the “best” model, anything similar?

3) If that is not what we are trying to do here, how do I use this in the production env?

4) Can I pickle the complete thing (.h5)? And just the predict() for new values.

This is for any deep learning algorithm. I have never been able to get consistent results, I tried all the ideas.

Thank you,

Satish

Choose the model that best meets the project/stakeholder goals, e.g. low complexity and high skill.

You can save a final model, here’s an example:

https://machinelearningmastery.com/save-load-keras-deep-learning-models/

This is a great article. I am currently working on Sentiment analysis and using LSTM model. The data is highly imbalanced, 46%, 45% and 8.7% is the distribution for positive, negative and neutral classes. So, my model is kind of overfitting. I used max sentence length 100, however actual length is 5000 (but if I use 5000 I am facing memory issue). Could you please suggest.

Perhaps try a cost-sensitive model:

https://machinelearningmastery.com/cost-sensitive-neural-network-for-imbalanced-classification/

Hi Jason, thanks for the insight.

Is there a way I can visualize how my neural network is faring after using k-fold cross validation, i.e. is there a way I can plot a learning curve after applying k-fold cross validation to a neural network?

You can estimate model performance using k-fold cross-validation.

You could calculate learning curve for each of the k-folds and plot them all together – I believe I have a few examples of this on the blog, for example:

https://machinelearningmastery.com/how-to-develop-a-cnn-from-scratch-for-fashion-mnist-clothing-classification/

Hi Jason,

I would like to ask you a very specific question, one I could not find an answer to.

I understand that in ML we need to extract the features of the data in order to create the feature vector. This is a hard process, but it is a necessary one.

On the other hand, DL performs this feature extraction automatically (it learns the best way to represent the data) and after that it creates the “prediction”.

However, checking all the examples on the internet (dozens, really), I see that everyone applies ML processes before applying DL (that is, removing stop words, stemming, lemmatization, tokenization, removing duplicated words, word embedding, one hot encoding, and so on).

Is that correct? Should I help DL by manipulating the data before training instead of using raw data?

is this a good practice?

I appreciate if you can help me

Best

Carlos G

Yes, reducing the size of the vocab in NLP problems makes the problem simpler and easier to model – removes irrelevant data.

It may or may not be required, but it helps a lot in practice to speed up the model and improve performance. Try it yourself – with and without such preparation.

Sure, thanks for your dedicated time.

Best

Carlos G.

You’re welcome.

Hi,

You mention using averages and standard deviations to summarize the performance of a given network (i.e. fixed network type, number of layers, activation functions in these layers and so on) based on repeated re-fitting of the network using the same data set. I want to drill down on the issue of summary statistics for this problem.

My Q: do you have any thoughts on whether median and e.g. interquartile range could be used instead of average and standard deviation?

The reason why I am asking is that I imagine that in general the distribution of the performance metric for repeated fits of the network will not be normal, so the average + std might be wrong statistics to use.

I am especially interested in the case where the number of model refits would be in the range 100-300 (this just happens to be the computational boundary that I am facing): even if we assume that – very roughly speaking – the CLT applies to the performance metric used, then 100-300 might not give good enough convergence.

In any case, for the problem I am working on (this does not necessarily generalize…) there does NOT seem to be much difference whether median or avg is used.

Regards!

Hi Artur…The following resource may be of interest to you:

https://machinelearningmastery.com/regression-metrics-for-machine-learning/
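If you want to compare the two kinds of summary on your own scores, a minimal sketch with Python's `statistics` module follows. The scores are illustrative and deliberately skewed by one poor run; they are not results from a real model.

```python
import statistics

# Illustrative scores with one poor run, to make the distribution skewed.
scores = [0.91, 0.90, 0.92, 0.89, 0.91, 0.90, 0.62, 0.93, 0.90, 0.88]

mean_skill = statistics.fmean(scores)
standard_deviation = statistics.stdev(scores)

median_skill = statistics.median(scores)
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
interquartile_range = q3 - q1

print("mean=%.3f sd=%.3f" % (mean_skill, standard_deviation))
print("median=%.3f iqr=%.3f" % (median_skill, interquartile_range))
```

With a single outlying run, the mean and standard deviation move noticeably while the median and interquartile range barely change. When the two summaries agree closely, as you report for your problem, the choice matters less.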