Evaluate the Performance of Deep Learning Models in Keras

Last Updated on August 7, 2022

Keras is an easy-to-use and powerful Python library for deep learning.

There are a lot of decisions to make when designing and configuring your deep learning models. Most of these decisions must be resolved empirically through trial and error and by evaluating them on real data.

As such, it is critically important to have a robust way to evaluate the performance of your neural networks and deep learning models.

In this post, you will discover a few ways to evaluate model performance using Keras.

Let’s get started.

• May/2016: Original post
• Update Oct/2016: Updated examples for Keras 1.1.0 and scikit-learn v0.18
• Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0
• Update Jun/2022: Update to TensorFlow 2.x syntax

Empirically Evaluate Network Configurations

You must make a myriad of decisions when designing and configuring your deep learning models.

Many of these decisions can be resolved by copying the structure of other people’s networks and using heuristics. Ultimately, the best technique is to design small experiments and empirically evaluate the options using real data.

This includes high-level decisions like the number, size, and type of layers in your network. It also includes the lower-level decisions like the choice of the loss function, activation functions, optimization procedure, and the number of epochs.

Deep learning is often used on problems that have very large datasets, that is, tens of thousands or hundreds of thousands of instances.

As such, you need to have a robust test harness that allows you to estimate the performance of a given configuration on unseen data and reliably compare the performance to other configurations.

Data Splitting

The large amount of data and the complexity of the models require very long training times.

As such, it is typical to separate data into training and test datasets or training and validation datasets.

Keras provides two convenient ways of evaluating your deep learning algorithms this way:

1. Use an automatic verification dataset
2. Use a manual verification dataset

Use an Automatic Verification Dataset

Keras can separate a portion of your training data into a validation dataset and evaluate the performance of your model on that validation dataset in each epoch.

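You can do this by setting the validation_split argument on the fit() function to the fraction of your training dataset to hold back, expressed as a value between 0 and 1. Keras takes this portion from the end of the arrays you pass in, before any shuffling.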

For example, a reasonable value might be 0.2 or 0.33 for 20% or 33% of your training data held back for validation.

The example below demonstrates the use of an automatic validation dataset on a small binary classification problem. All examples in this post use the Pima Indians onset of diabetes dataset. You can download it from the UCI Machine Learning Repository and save the data file in your current working directory with the filename pima-indians-diabetes.csv.
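A minimal sketch of such an example, assuming TensorFlow 2.x. The 12-8-1 network, the epoch count, and the load_data() helper are illustrative assumptions rather than settings prescribed by the text; the helper falls back to random stand-in data of the same shape when the CSV file is not present.

```python
# MLP with an automatic validation split (illustrative sketch)
import os
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense


def load_data(path="pima-indians-diabetes.csv", seed=7):
    """Load the Pima CSV if present; otherwise build random stand-in data."""
    if os.path.exists(path):
        dataset = np.loadtxt(path, delimiter=",")
    else:
        # Assumption: synthetic data with the same layout (768 rows, 8 inputs + 1 label)
        rng = np.random.default_rng(seed)
        features = rng.random((768, 8))
        labels = (features.sum(axis=1) > 4.0).astype(float)
        dataset = np.column_stack([features, labels])
    # Columns 0..7 are inputs, column 8 is the class label
    return dataset[:, 0:8], dataset[:, 8]


X, Y = load_data()

# Small fully connected network for binary classification
model = Sequential([
    Input(shape=(8,)),
    Dense(12, activation="relu"),
    Dense(8, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# Hold back the last 33% of the rows for validation
history = model.fit(X, Y, validation_split=0.33, epochs=10, batch_size=10, verbose=1)
```

The returned history object keeps the per-epoch values, so history.history["val_loss"] and history.history["val_accuracy"] can be inspected or plotted after training.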

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example, you can see that the verbose output on each epoch shows the loss and accuracy on both the training dataset and the validation dataset.

Use a Manual Verification Dataset

Keras also allows you to manually specify the dataset to use for validation during training.

In this example, you can use the handy train_test_split() function from the Python scikit-learn machine learning library to separate your data into a training and test dataset. Use 67% for training and the remaining 33% of the data for validation.

The validation dataset can be specified to the fit() function in Keras by the validation_data argument. It takes a tuple of the input and output datasets.
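A minimal sketch of this approach, assuming TensorFlow 2.x and scikit-learn. The random stand-in data, the network layout, and the epoch count are assumptions made to keep the listing self-contained; swap in np.loadtxt("pima-indians-diabetes.csv", delimiter=",") to use the real file.

```python
# MLP with a manually specified validation set (illustrative sketch)
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

# Assumption: random stand-in data shaped like the Pima dataset (768 rows, 8 inputs)
rng = np.random.default_rng(7)
X = rng.random((768, 8))
Y = (X.sum(axis=1) > 4.0).astype(float)

# 67% of the rows for training, 33% for validation
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.33, random_state=7)

model = Sequential([
    Input(shape=(8,)),
    Dense(12, activation="relu"),
    Dense(8, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# validation_data takes a tuple of (inputs, outputs)
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=10, batch_size=10, verbose=1)
```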

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Like before, running the example provides a verbose output of training that includes the loss and accuracy of the model on both the training and validation datasets for each epoch.

Manual k-Fold Cross Validation

The gold standard for machine learning model evaluation is k-fold cross validation.

It provides a robust estimate of the performance of a model on unseen data. It does this by splitting the training dataset into k subsets, taking turns training models on all subsets except one, which is held out, and evaluating model performance on the held-out validation dataset. The process is repeated until all subsets are given an opportunity to be the held-out validation set. The performance measure is then averaged across all models that are created.
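The splitting step can be sketched with NumPy alone. The function name k_fold_indices and the sample counts are hypothetical, chosen only to show that each sample is held out exactly once.

```python
# Plain NumPy illustration of how k-fold splitting partitions the data
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is held out exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)  # k nearly equal subsets
    for i in range(k):
        val_idx = folds[i]  # the held-out subset for this round
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=20, k=5))
for train_idx, val_idx in splits:
    print("train on %d samples, validate on %d" % (len(train_idx), len(val_idx)))

# Collect every held-out index to confirm full coverage
held_out = np.sort(np.concatenate([val for _, val in splits]))
```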

It is important to understand that cross validation evaluates a model design (e.g., a 3-layer vs. a 4-layer neural network) rather than a specific fitted model. You do not want to fit the candidate designs on a single dataset and compare the results, since any difference may simply reflect that particular dataset suiting one design better. Instead, you fit each design on multiple subsets of the data, producing multiple fitted models of the same design, and compare designs by their average performance.

Cross validation is often not used for evaluating deep learning models because of the greater computational expense. For example, k-fold cross validation is often used with 5 or 10 folds. As such, 5 or 10 models must be constructed and evaluated, significantly adding to the evaluation time of a model.

Nevertheless, when the problem is small enough or if you have sufficient computing resources, k-fold cross validation can give you a less-biased estimate of the performance of your model.

In the example below, you will use the handy StratifiedKFold class from the scikit-learn Python machine learning library to split the training dataset into 10 folds. The folds are stratified, meaning that the algorithm attempts to balance the number of instances of each class in each fold.

The example creates and evaluates 10 models using the 10 splits of the data and collects all the scores. The verbose output for each epoch is turned off by passing verbose=0 to the fit() and evaluate() functions on the model.

The performance is printed for each model, and it is stored. The average and standard deviation of the model performance are then printed at the end of the run to provide a robust estimate of model accuracy.
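A minimal sketch of the procedure described above, assuming TensorFlow 2.x and scikit-learn. The stand-in data, the build_model() helper, and the short epoch count are assumptions for illustration; a new model is built for every fold so that the folds stay independent.

```python
# MLP evaluated with 10-fold stratified cross validation (illustrative sketch)
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

seed = 7
# Assumption: random stand-in data shaped like the Pima dataset
rng = np.random.default_rng(seed)
X = rng.random((768, 8))
Y = (X.sum(axis=1) > 4.0).astype(int)


def build_model():
    """Create and compile a fresh, untrained network."""
    model = Sequential([
        Input(shape=(8,)),
        Dense(12, activation="relu"),
        Dense(8, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model


kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X, Y):
    model = build_model()
    model.fit(X[train], Y[train], epochs=5, batch_size=10, verbose=0)
    # evaluate() returns [loss, accuracy] for this compile configuration
    loss, accuracy = model.evaluate(X[test], Y[test], verbose=0)
    print("accuracy: %.2f%%" % (accuracy * 100))
    cvscores.append(accuracy * 100)

print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
```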

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example should take less than a minute. It prints the accuracy of each of the 10 models, followed by the mean and standard deviation of those accuracies.

Summary

In this post, you discovered the importance of having a robust way to estimate the performance of your deep learning models on unseen data.

You discovered three ways that you can estimate the performance of your deep learning models in Python using the Keras library:

• Use Automatic Verification Datasets
• Use Manual Verification Datasets
• Use Manual k-Fold Cross Validation

Do you have any questions about deep learning with Keras or this post? Ask your question in the comments, and I will do my best to answer it.

265 Responses to Evaluate the Performance of Deep Learning Models in Keras

1. DR Venugopala Rao Manneni July 19, 2016 at 2:14 am #

how to print network diagram

2. Hendrik August 30, 2016 at 9:42 pm #

Could you explain how one can use a different evaluation metric (F1-score or even a custom one) for evaluation?

• Jason Brownlee August 31, 2016 at 9:46 am #

Hi Hendrik, you can use a suite of objectives with Keras models, here’s a list:
https://keras.io/objectives/

• Hendrik September 1, 2016 at 5:59 pm #

Thanks for the reply, but I don’t mean the “optimizer” parameter but the “metrics” at compilation, which currently can only be “accuracy”. I’d like to change it to another evaluation metric (F1-score for instance or AUC).

• Rasika Karle February 5, 2017 at 1:17 am #

Hey Hendrik, did you get the solution to how to use a different evaluation metric in Keras?
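One possible approach in current releases, sketched under the assumption of TensorFlow 2.x: metric objects from tf.keras.metrics can be passed in the metrics list at compile time. The stand-in data and network are assumptions, and the F1 value here is derived from precision and recall over the whole evaluation set rather than reported by Keras directly.

```python
# Tracking AUC, precision and recall alongside accuracy (illustrative sketch)
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.metrics import AUC, Precision, Recall

# Assumption: random stand-in data for a binary classification problem
rng = np.random.default_rng(7)
X = rng.random((300, 8))
Y = (X.sum(axis=1) > 4.0).astype(float)

model = Sequential([
    Input(shape=(8,)),
    Dense(12, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(
    loss="binary_crossentropy",
    optimizer="adam",
    metrics=["accuracy", AUC(name="auc"), Precision(name="precision"), Recall(name="recall")],
)
model.fit(X, Y, validation_split=0.33, epochs=5, batch_size=10, verbose=0)

# return_dict=True gives the results keyed by metric name
results = model.evaluate(X, Y, verbose=0, return_dict=True)
precision, recall = results["precision"], results["recall"]
# F1 as the harmonic mean of precision and recall (small epsilon avoids 0/0)
f1 = 2 * precision * recall / (precision + recall + 1e-7)
print("AUC: %.3f, F1: %.3f" % (results["auc"], f1))
```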

3. shixudong September 19, 2016 at 6:28 pm #

Could you give some instruction on how to train a deep model if X, y are so large that they cannot fit into memory?

model.fit(X_train, y_train, validation_data=(X_test,y_test), nb_epoch=150, batch_size=10)

• Jason Brownlee September 20, 2016 at 8:31 am #

Great question shixudong.

Keras has a data generator for image data that does not fit into memory:
https://keras.io/preprocessing/image/

The same approach could be used for tabular data:
https://github.com/fchollet/keras/issues/107
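For tabular data, the generator idea can be sketched in plain Python: read the file in small batches and yield (inputs, labels) pairs so that only one batch is in memory at a time. The function name batch_generator, the demo file, and the column layout are assumptions for illustration.

```python
# Batch generator that streams a CSV file from disk (illustrative sketch)
import csv
import os
import tempfile
import numpy as np


def batch_generator(csv_path, batch_size, n_features):
    """Yield (inputs, labels) batches forever, reading the CSV one row at a time."""
    while True:  # generators used for training are expected to loop over the data
        with open(csv_path, newline="") as handle:
            rows = []
            for row in csv.reader(handle):
                rows.append([float(value) for value in row])
                if len(rows) == batch_size:
                    batch = np.asarray(rows, dtype="float32")
                    yield batch[:, :n_features], batch[:, n_features]
                    rows = []
            if rows:  # final, smaller batch at the end of the file
                batch = np.asarray(rows, dtype="float32")
                yield batch[:, :n_features], batch[:, n_features]


# Hypothetical demo file: 25 rows, 8 inputs and 1 label per row
rng = np.random.default_rng(0)
demo = np.column_stack([rng.random((25, 8)), rng.integers(0, 2, 25)])
csv_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
np.savetxt(csv_path, demo, delimiter=",")

gen = batch_generator(csv_path, batch_size=10, n_features=8)
shapes = [next(gen)[0].shape for _ in range(4)]
print(shapes)  # [(10, 8), (10, 8), (5, 8), (10, 8)]
```

In TensorFlow 2.x, fit() accepts a generator like this directly; since the generator never ends, steps_per_epoch has to be given (3 for this demo file) so Keras knows where one epoch stops.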

• Toqi Tahamid March 5, 2017 at 10:33 pm #

My dataset are in a data folder like this structure–

-data

–Train
——Dog
——Cat

–Test
——Dog
——Cat

1. How do I know the y_true value of the dataset from ImageDataGenerator, if I use the function flow_from_directory?

2. How do I use k-fold cross validation using the fit_generator function in Keras?

• Jason Brownlee March 6, 2017 at 10:58 am #

Sorry, I don’t have examples of using the ImageDataGenerator other than for image augmentation.

• shuang April 21, 2021 at 11:05 pm #

Hi,Do you have to update it now?

4. andrew jeremy September 22, 2016 at 4:41 am #

How can I use k-fold cross validation, the gold standard for evaluating machine learning, with the fit function in Keras?

Also, how can I get an history of accuracy and loss with the cross_val_score module for plotting ?
thanks,
andrew

5. Watterson October 27, 2016 at 12:16 am #

Hey Jason, thanks for the great tutorials!
I wanted to do a CV but read out the accuracy for each fold not only for the training but also for the test data. Would this ansatz be right:

kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
for train, test in kfold.split(X, Y):
    model = Sequential()
    asd = model.fit(X[train], Y[train], nb_epoch=epoch, validation_data=(X[test], Y[test]), batch_size=10, verbose=1)
    cv_acc_train = asd.history['acc']
    cv_acc_test = asd.history['val_acc']

• Jason Brownlee October 27, 2016 at 7:46 am #

Looks good to me off the cuff Watterson.

• Seun January 15, 2017 at 3:15 am #

asd = model.fit(X[train], Y[train], nb_epoch=epoch, validation_data=(X[test], Y[test])

Please, is the validation_data not supposed to be validation_data=(Y[test], Y[test])? Also, can I use categorical_crossentropy when my activation is softmax? Thanks so much.

• Jason Brownlee January 15, 2017 at 5:30 am #

Hi Seun, the validation_data must include X and y components.

Yes, I think you can use logloss with softmax, try and see.

6. Jonas December 18, 2016 at 2:07 am #

Hi Jason,

When you use the “automatic verification dataset” the val_loss is lower than the loss.
“768/768 [==============================] – 0s – loss: 0.4593 – acc: 0.7839 – val_loss: 0.4177 – val_acc: 0.8386”

How can it be possible?

• Jason Brownlee December 18, 2016 at 5:32 am #

Sorry, I don’t understand your question, perhaps you could be more specific?

• Jonas December 18, 2016 at 8:06 am #

I understood from my previous lectures that a model is fitting well when the validation error is low and slightly higher than the training error.

But in your first example (Automatic Verification Datasets), the validation error is lower than the training error. I can’t figure how the model can perform better on the validation set rather than on the training set.

Does it mean that the validation split isn’t randomly defined?

• Jason Brownlee December 19, 2016 at 5:29 am #

Great question Jonas,

It might be a statistical fluke and a sign of an unstable model.

It might also be a sign of a poor split of the data, and a sign that a strategy with repeated splits might be warranted.

• Sachin singh October 5, 2020 at 5:56 pm #

Why are we taking batch size = 10? I mean, how does it affect the model performance?

7. Roger January 11, 2017 at 3:45 pm #

Hi, I am no expert. But it looks like you are training on the binary labels:

Y = dataset[:,8] => Labels exist in column 8 correct?
X = dataset[:,0:8] => This includes column 8, i.e. labels

I could be wrong, haven’t looked at the data set.

• Jason Brownlee January 12, 2017 at 9:25 am #

No, I believe the code is correct (at least in Python 2.7).

I’m happy to hear if you get different results on your system.
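A quick check with NumPy shows why the slicing is safe: the stop index of a slice is excluded, so 0:8 covers columns 0 through 7 only. The small array below is a hypothetical stand-in for the dataset.

```python
# NumPy slicing: the stop index is exclusive
import numpy as np

# Hypothetical stand-in: 3 rows with 9 columns (8 inputs followed by 1 label)
dataset = np.arange(27).reshape(3, 9)

X = dataset[:, 0:8]  # columns 0..7; column 8 is not included
Y = dataset[:, 8]    # column 8 only

print(X.shape, Y.shape)  # (3, 8) (3,)
print(X[0], Y[0])        # [0 1 2 3 4 5 6 7] 8
```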

8. wqp89324 March 2, 2017 at 4:58 am #

Hi, Jason, for a simple feedforward MLP, are there any intuitive criteria for choosing between Keras and Sklearn?

Thanks!

• Jason Brownlee March 2, 2017 at 8:23 am #

Speed, Keras will be faster given it is using optimized symbolic math libs as a backend on CPU or GPUs, whereas sklearn is limited to linear algebra libs on the CPU.

9. pattijane March 5, 2017 at 7:57 am #

Hello,

Thanks a lot for your tutorials, they are great!

This might be a bit trivial but I’d like to ask the difference between when we used a validation split in “model.fit” and we didn’t. And, for instance instead of using separate train/validation/test sets, will using train/test sets with validation split be enough?

Thanks a lot!

• Jason Brownlee March 6, 2017 at 10:57 am #

If you can spare the data, it is a good idea to hold back a validation set for final checking. If you can afford the time, use k-fold cross-validation with multiple repeats to eval your model. We often don’t have the time, so we use train/test splits.

10. Jason March 17, 2017 at 7:39 am #

Hi Jason,

I got one question, how to decide the number of epoch and the batch size?

Thanks

• Jason Brownlee March 17, 2017 at 8:33 am #

Great question! I recommend trial and error.

11. Carolyn March 21, 2017 at 8:36 am #

Hi Jason,

This is a great post!

I’m having trouble combining categorical cross-entropy and StratifiedKFold.

StratifiedKFold assumes the response is a (number,) shape, according to:
http://stackoverflow.com/questions/35022463/stratifiedkfold-indexerror-too-many-indices-for-array

But as you’ve explained before, Keras’s categorical cross-entropy expects the response to be a one-hot matrix. How can I resolve this?

Thank you!

• Jason Brownlee March 21, 2017 at 8:47 am #

You might have to move away from cross validation and rely on repeated random train/test sets.

Alternatively, you could try pre-computing the folds manually or using a modified version of the dataset, then running the cross-validation manually.

• Moji September 19, 2017 at 7:13 am #

I guess you can do the following:
change :
Y=numpy.argmax(Y,axis=1)
and then using : loss='sparse_categorical_crossentropy'

it works but would that be correct way?

• Jason Brownlee September 19, 2017 at 7:51 am #

It is a way. I recommend evaluating many approaches and see what works best for your data.
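Another option, sketched here with hypothetical data, is to use the integer class ids only for computing the stratified folds and keep the one-hot matrix for training, so the loss can stay categorical cross-entropy.

```python
# StratifiedKFold with one-hot encoded labels (illustrative sketch)
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical data: 90 samples, 3 balanced classes, one-hot encoded labels
rng = np.random.default_rng(1)
n_samples, n_classes = 90, 3
class_ids = np.repeat(np.arange(n_classes), n_samples // n_classes)
X = rng.random((n_samples, 4))
Y_onehot = np.eye(n_classes)[class_ids]

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
# StratifiedKFold needs a 1-D label array, so recover the class ids for splitting only
labels_for_split = np.argmax(Y_onehot, axis=1)

fold_shapes = []
for train, test in kfold.split(X, labels_for_split):
    # The one-hot matrix is still what the network would be trained on
    y_train, y_test = Y_onehot[train], Y_onehot[test]
    fold_shapes.append((y_train.shape, y_test.shape))
    # model.fit(X[train], y_train, ...) with loss="categorical_crossentropy" would go here

print(fold_shapes[0])  # ((72, 3), (18, 3))
```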

12. Luis April 10, 2017 at 7:13 am #

Hi, Jason.

I’m new to Keras. In your last example you build and compile the Keras model inside each for iteration (I show your code below). I don’t know if it is possible, but it looks like it would be more efficient to build and compile the model one time (outside the for loop) and to fit it with the right data each time inside the loop. Isn’t this possible in Keras?

for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    # Compile model

• Jason Brownlee April 10, 2017 at 7:42 am #

Yes, but you may need to re-initialize the weights.

I am demonstrating complete independence of the model within each loop.
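A minimal sketch of the re-initialization idea, assuming TensorFlow 2.x: take a snapshot of the untrained weights once and restore it at the start of every fold. The data and network are stand-ins. Note that every fold then starts from the same initial weights, and that the model is recompiled in the loop so the optimizer state is fresh as well.

```python
# Reusing one model across folds by restoring its initial weights (illustrative sketch)
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

# Assumption: random stand-in data for a binary classification problem
rng = np.random.default_rng(7)
X = rng.random((200, 8))
Y = (X.sum(axis=1) > 4.0).astype(int)

model = Sequential([
    Input(shape=(8,)),
    Dense(12, activation="relu"),
    Dense(1, activation="sigmoid"),
])
initial_weights = model.get_weights()  # snapshot taken before any training

kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=7)
fold_accuracy = []
for train, test in kfold.split(X, Y):
    model.set_weights(initial_weights)  # undo the training from the previous fold
    # Recompiling creates a fresh optimizer, so its running statistics are reset too
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    model.fit(X[train], Y[train], epochs=5, batch_size=10, verbose=0)
    _, accuracy = model.evaluate(X[test], Y[test], verbose=0)
    fold_accuracy.append(accuracy)

print(fold_accuracy)
```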

13. Manasi April 13, 2017 at 11:23 pm #

I am trying cross-validation code with LSTM and getting the following error:
Found array with dim 3. Estimator expected <= 2.

My code is as follows:

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
# create model
model=Sequential()
model.add(LSTM(50,input_shape=(max_seq,number_of_features),return_sequences = 1, activation = 'relu'))

print(model.summary())

model.compile(loss='binary_crossentropy',
metrics=['accuracy'])

print('Train…')
#model.fit(X_train, y_train,batch_size=16,nb_epoch=1000,validation_data=(X_test,y_test),verbose=2)
#model.fit(X_train, y_train,batch_size=1,nb_epoch=1000,validation_data=(X_test,y_test),verbose=2)

#score, acc = model.evaluate(X_test, y_test, batch_size=16,verbose=0)
#score, acc = model.evaluate(X_test, y_test, batch_size=1,verbose=0)
print('Test score:', score)
print('Test accuracy:', acc)

scores = model.evaluate(X[test], Y[test], verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], acc*100))
cvscores.append(acc * 100)

14. Arno April 17, 2017 at 3:53 pm #

Hi Jason,
In the last part of this article, you are training 10 different models instead of training one 10 times on each fold.
In other cases, how can I select the best model out of the 10 trained ? Is it a good practice in Machine Learning to do so ?
Thanks,
Arno

15. Nirmala May 15, 2017 at 4:01 pm #

Hi sir,

When i try to run the code after building the layers i am facing this error

FileNotFoundError: [WinError 3] The system cannot find the path specified: ‘C:/deeplearning/openblas-0.2.14-int32/bin’

I have changed theano flag path using this

variable = THEANO_FLAGS value = floatX=float32,device=cpu,blas.ldflags=-LC:\openblas -lopenblas

but still i am facing the same problem…

Thank you!!

• Jason Brownlee May 16, 2017 at 8:37 am #

Sorry, I have not seen this error.

Consider posting it as a question to stackoverflow or the theano users group.

16. Connie May 15, 2017 at 5:58 pm #

Hi Jason,
How to give prediction score(not prediction label or prediction probability) of each test instance instead of evaluate result on whole test set?
Thanks.

• Jason Brownlee May 16, 2017 at 8:38 am #

You can predict probabilities with:
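For instance, with a sigmoid output layer, predict() returns one probability per input row. A minimal sketch, assuming TensorFlow 2.x; the data, network, and the 0.5 threshold are assumptions for illustration.

```python
# Per-instance predicted probabilities (illustrative sketch)
import numpy as np
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense

# Assumption: random stand-in data for a binary classification problem
rng = np.random.default_rng(7)
X = rng.random((100, 8))
Y = (X.sum(axis=1) > 4.0).astype(float)

model = Sequential([
    Input(shape=(8,)),
    Dense(12, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(X, Y, epochs=5, batch_size=10, verbose=0)

# One row per instance, one column per output unit: shape (5, 1) here
probabilities = model.predict(X[:5], verbose=0)
# Thresholding turns the scores into class labels if those are needed
predicted_labels = (probabilities > 0.5).astype(int)
for p, label in zip(probabilities[:, 0], predicted_labels[:, 0]):
    print("%.3f -> %d" % (p, label))
```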

17. José June 19, 2017 at 10:07 am #

Hi Jason, thanks for the post. Your blog is great to learn Machine Learning with Python, I am very very grateful for you are sharing this with us. Thanks and great job! Muito obrigado, você é muito generoso em compartilhar seu conhecimento.

• Jason Brownlee June 20, 2017 at 6:33 am #

Thanks José, I’m glad that it is helping.

18. Aymane June 30, 2017 at 1:16 am #

Hey Jason, Thank you for your many posts and responses.
I have tried to follow this tutorial to train and evaluate a multioutput (3) regression deep network using the keras’ Model class API, and here is my code:

# Define cross validation scheme
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X_training, [Y_training10,Y_training20,Y_training30]):
    # Create model
    inputs=Input(shape=(72,),name='input_layer')
    x=Dense(200,activation='relu',name='hidden_layer1')(inputs)
    x=Dropout(.2)(x)
    y1=Dense(1,name='GHI10')(x)
    y2=Dense(1,name='GHI20')(x)
    y3=Dense(1,name='GHI30')(x)
    model=Model(inputs=inputs,outputs=[y1,y2,y3])
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='rmsprop',metrics='mean_absolute_error')
    # Fit model
    model.fit(X_training[train],[Y_training10[train],Y_training20[train],Y_training30[train]] ,epochs=100, batch_size=10, verbose=1)
    # Evaluate model
    scores=model.evaluate(X_training[test],[Y_training10[test],Y_training20[test],Y_training30[test]], verbose=1, sample_weight=None)
    print("%s: %.2f%% (MSE)" % (model.metrics_names[1], scores[1]))
    print("%s: %.2f%% (MSE)" % (model.metrics_names[2], scores[2]))
    print("%s: %.2f%% (MSE)" % (model.metrics_names[3], scores[3]))
    cvscores.append([scores[1],scores[2],scores[3]])
print("%.2f MSE of training" % (numpy.mean(cvscores,axis=0)))

But unfortunately I get this error :
Found input variables with inconsistent number of samples: [15000,3].

I have a [15000sample x 72predictor] as X_training and [15000samples x 3 outputs] as [Y_training10, Y_training20, Y_training30].

Any sort of help would be appreciated.

• Jason Brownlee June 30, 2017 at 8:13 am #

You have to make sure your input and output data match the shape of your network and that train and test have the same number of features.

19. Ashley August 8, 2017 at 6:20 am #

Hi, Thank you for an awesome relevant post. I have a question about implementing training\test\validation 80\10\10. Which is sort of outlined in this question: https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set – how would I further split my test set?

• Jason Brownlee August 8, 2017 at 7:52 am #

Why would you further split your data?

• Ashley August 9, 2017 at 12:55 am #

It’s something from the ML course on Udacity. They use a validation set along side test set. It’s outlined in this lecture: https://classroom.udacity.com/courses/ud730/lessons/6370362152/concepts/63798118330923 Lesson 1 #22. They explain it as your test data bleeding into your training data over time, and this biases the training – so to further split your data, train on train data, validate on validation set and only at the very end test on test data.

• Ashley August 9, 2017 at 12:57 am #

This link gets you to #22 https://classroom.udacity.com/courses/ud730/lessons/6370362152/concepts/63798118300923

• Ashley August 9, 2017 at 1:08 am #

As I say this – I realise that one way to do it is to train and validate and save the model, then load model and test on the test set.

• Jason Brownlee August 9, 2017 at 6:36 am #

Sure, what is your question exactly?

• Ashley August 9, 2017 at 8:07 am #

Initially, it was: how do I take in a validation set and a test set and have two different testing outputs after the final run, i.e., how do I set up my model to have a split test/validation set (so have training, validation, test all in one session)?
Now I am just assuming that I should train (fit) on my data, optimise my net according to my validation data (evaluate) and save the model. Then reload it and evaluate again but on test data that it has never seen before – which I am hoping will help me with the small data set I have.

• Jason Brownlee August 10, 2017 at 6:37 am #

You can do that, sounds fine.
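An 80/10/10 split like the one discussed here can be sketched with two calls to scikit-learn’s train_test_split(). The data below is a hypothetical stand-in, and stratifying on the labels is an optional choice.

```python
# Train / validation / test split of 80% / 10% / 10% (illustrative sketch)
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples, 8 inputs, balanced binary labels
rng = np.random.default_rng(7)
X = rng.random((1000, 8))
y = np.arange(1000) % 2

# First split: 80% for training, 20% set aside
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)
# Second split: divide the remainder evenly into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=7, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```

The validation portion would then go to fit() through validation_data, while the test portion is only passed to evaluate() once tuning is finished.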

20. Michael August 17, 2017 at 1:15 am #

Hey, Jason. Great website! I like how you’ve laid out your posts and how you explain the concepts. I haven’t been able to figure out the final result from using k-fold. Let’s assume I execute a k-fold just like you’ve done in your example. Then, I use the code from another of your great posts to save the model (to JSON) and save the weights. Which of the 10 models (created during the k-fold loop) will be saved? The last of the 10?
I want to have a saved model that I can use on new datasets in the future. How does creating 10 models with k-fold help me get a better model than using an automatic validation split (as described in this post)?

Thank you!

• Jason Brownlee August 17, 2017 at 6:45 am #

CV gives a less biased estimate of the model’s skill on unseen data than a train/test split, at least in general with smallish datasets (less than millions of obs).

Once you have tuned your model, throw all the trained models away and train a final model with all your data and start using it to make predictions.

Some models are very expensive to train, in which case don’t use CV and keep the best models you train, use them in an ensemble as final models.

Does that help?

• Michael August 17, 2017 at 2:15 pm #

Yes, that post was exactly what I needed. Thank you!

21. Minkyu Ha August 21, 2017 at 5:34 pm #

Hello, Jason

Let me ask about a strange behavior in my case.

When I train & validate my model with dataset..like

model.fit( train_x, train_y, validation_split=0.1, epochs=15, batch_size=100)

it seems to be overfitting, judging by the gap between acc and val_acc.

67473/67473 [==============================] – 27s – loss: 2.9052 – acc: 0.6370 – val_loss: 6.0345 – val_acc: 0.3758
Epoch 2/15
67473/67473 [==============================] – 26s – loss: 1.7335 – acc: 0.7947 – val_loss: 6.2073 – val_acc: 0.3788
Epoch 3/15
67473/67473 [==============================] – 26s – loss: 1.5050 – acc: 0.8207 – val_loss: 6.1922 – val_acc: 0.3952
Epoch 4/15
67473/67473 [==============================] – 26s – loss: 1.4130 – acc: 0.8380 – val_loss: 6.2896 – val_acc: 0.4092
Epoch 5/15
67473/67473 [==============================] – 26s – loss: 1.3750 – acc: 0.8457 – val_loss: 6.3136 – val_acc: 0.3953
Epoch 6/15
67473/67473 [==============================] – 26s – loss: 1.3350 – acc: 0.8573 – val_loss: 6.4355 – val_acc: 0.4098
Epoch 7/15
67473/67473 [==============================] – 26s – loss: 1.3045 – acc: 0.8644 – val_loss: 6.3992 – val_acc: 0.4018
Epoch 8/15
67473/67473 [==============================] – 26s – loss: 1.2687 – acc: 0.8710 – val_loss: 6.5578 – val_acc: 0.3897
Epoch 9/15
67473/67473 [==============================] – 26s – loss: 1.2552 – acc: 0.8745 – val_loss: 6.4178 – val_acc: 0.4104
Epoch 10/15
67473/67473 [==============================] – 26s – loss: 1.2195 – acc: 0.8796 – val_loss: 6.5593 – val_acc: 0.4044
Epoch 11/15
67473/67473 [==============================] – 26s – loss: 1.1977 – acc: 0.8833 – val_loss: 6.5514 – val_acc: 0.4041
Epoch 12/15
67473/67473 [==============================] – 26s – loss: 1.1828 – acc: 0.8874 – val_loss: 6.5972 – val_acc: 0.3973
Epoch 13/15
67473/67473 [==============================] – 26s – loss: 1.1665 – acc: 0.8890 – val_loss: 6.5879 – val_acc: 0.3882
Epoch 14/15
67473/67473 [==============================] – 26s – loss: 1.1466 – acc: 0.8931 – val_loss: 6.5610 – val_acc: 0.4104
Epoch 15/15
67473/67473 [==============================] – 27s – loss: 1.1394 – acc: 0.8925 – val_loss: 6.5062 – val_acc: 0.4100

but , when I split the data into train & test with StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
and evaluate it with test data, it shows good result of accuracy.
score = model.evaluate(test_x, test_y)
[1.3547255955601791, 0.82816451482507525]

so.. I tried to validate with test dataset which was split with StratifiedKFold.
model.fit( train_x, train_y, validation_data=(test_x, test_y), epochs=15, batch_size=100)

and it shows good result of val_acc.

Epoch 1/14
67458/67458 [==============================] – 27s – loss: 2.9200 – acc: 0.6006 – val_loss: 1.9954 – val_acc: 0.7508
Epoch 2/14
67458/67458 [==============================] – 26s – loss: 1.8138 – acc: 0.7536 – val_loss: 1.6458 – val_acc: 0.7844
Epoch 3/14
67458/67458 [==============================] – 26s – loss: 1.5869 – acc: 0.7852 – val_loss: 1.5848 – val_acc: 0.7876
Epoch 4/14
67458/67458 [==============================] – 25s – loss: 1.4980 – acc: 0.8056 – val_loss: 1.5353 – val_acc: 0.8015
Epoch 5/14
67458/67458 [==============================] – 25s – loss: 1.4375 – acc: 0.8202 – val_loss: 1.4870 – val_acc: 0.8117
Epoch 6/14
67458/67458 [==============================] – 25s – loss: 1.3795 – acc: 0.8324 – val_loss: 1.4738 – val_acc: 0.8139
Epoch 7/14
67458/67458 [==============================] – 26s – loss: 1.3437 – acc: 0.8400 – val_loss: 1.4677 – val_acc: 0.8146
Epoch 8/14
67458/67458 [==============================] – 26s – loss: 1.3059 – acc: 0.8462 – val_loss: 1.4127 – val_acc: 0.8263
Epoch 9/14
67458/67458 [==============================] – 26s – loss: 1.2758 – acc: 0.8533 – val_loss: 1.4087 – val_acc: 0.8219
Epoch 10/14
67458/67458 [==============================] – 25s – loss: 1.2381 – acc: 0.8602 – val_loss: 1.4095 – val_acc: 0.8242
Epoch 11/14
67458/67458 [==============================] – 26s – loss: 1.2188 – acc: 0.8644 – val_loss: 1.3960 – val_acc: 0.8272
Epoch 12/14
67458/67458 [==============================] – 25s – loss: 1.1991 – acc: 0.8677 – val_loss: 1.3898 – val_acc: 0.8226
Epoch 13/14
67458/67458 [==============================] – 25s – loss: 1.1671 – acc: 0.8733 – val_loss: 1.3370 – val_acc: 0.8380
Epoch 14/14
67458/67458 [==============================] – 25s – loss: 1.1506 – acc: 0.8750 – val_loss: 1.3363 – val_acc: 0.8315

Do you have any idea why the results differ between the automatic validation_split and validation with the test dataset?

• Jason Brownlee August 22, 2017 at 6:37 am #

There is no need to validate when using cross validation. The model is doing twice the work.

Perhaps the stratification of the data sample is important to your model?

Perhaps the model performs better on the smaller sample (e.g. 1/10th of data if 10-folds).

• tianyu zhou August 9, 2018 at 1:44 am #

I am having the same problem here. When you set shuffle=False when doing k-fold CV, you will get low accuracy as well. The automatic validation_split doesn’t shuffle the data before taking the validation portion.
you can try:
StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
model.fit( train_x, train_y, validation_split=0.1, epochs=15, verbose=1 ,batch_size=100)
score = model.evaluate(test_x, test_y)

you will see val_acc is very low and final score is good.
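The effect can be illustrated with NumPy alone. Because validation_split takes the last rows before any shuffling, data that is sorted by class can leave the validation portion with a single class; shuffling inputs and labels with the same permutation beforehand avoids this. The sorted dataset below is a hypothetical example.

```python
# Why shuffling before validation_split can matter (illustrative sketch)
import numpy as np

# Hypothetical data sorted by class: 90 rows of class 0 followed by 10 rows of class 1
X = np.arange(100, dtype="float32").reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)

# validation_split=0.1 would hold out the LAST 10 rows, here all of class 1
tail_unshuffled = y[-10:]
print("unshuffled validation labels:", tail_unshuffled)

# Shuffle inputs and labels with the same permutation so rows stay aligned
rng = np.random.default_rng(7)
order = rng.permutation(len(X))
X_shuffled, y_shuffled = X[order], y[order]
tail_shuffled = y_shuffled[-10:]
print("shuffled validation labels:  ", tail_shuffled)
```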

22. Macarena September 6, 2017 at 8:18 pm #

Hello,
First of all congrats for these tutorials, they are great!
I’m trying to use StratifiedKFold validation in a multiple inputs network (3 inputs) for a regression problem, and I’m having several problems when using it. First of all, in the step:
“for train,test in kfold.split()” I’m introducing just one of the inputs and the labels structure, this way: “for train, test in kfold.split(X1, Y):”, and then inside the loop I define “X1_train = X1[train], X2_train = X2[train], X3_train=X3[train]” and so on. This way, when fitting my model I use “model.fit([X1_train, X2_train, X3_train], Y_train….)”. But I’m getting the error “n_splits=10 cannot be greater than the number of members in each class”, and I don’t know how to fix it.

I have also try the option you give in this tutorial: https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
But in this case the error I get is “Found input variables with inconsistent numbers of samples”.

I don’t know how can I implement this, I would appreciate any help. Thanks.

• Jason Brownlee September 7, 2017 at 12:53 pm #

All rows must have the same number of columns, if that helps?

• Macarena September 7, 2017 at 9:40 pm #

X1, X2 and X3 have shape (nb_samples, 2, 8, 10), while Y has shape (nb_samples, 4). I don’t know if it is not able to recognize that the common axis is nb_samples (although I read in the documentation that it takes by default the first axis).
I have resolved it by creating an array of zeros with just one dimension: X_split = np.zeros((X1.shape[0])) and Y_split = np.zeros((Y.shape[0])), and I use those two arrays to create the for loop. But I don’t know why I cannot do it the other way.

• Despoina February 15, 2019 at 8:03 am #

Hello, I am facing the same issue. Could you please provide the code example?
You are creating X_split = np.zeros((X1.shape[0])) for X2 and X3 and Y_split before
kfold.split(X1, Y)?

Thank you
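A sketch of one way to handle this, under the assumption that the targets are continuous: stratification needs class labels, so plain KFold is used instead, and it is split over a 1-D array of sample indices that is then applied to every input array. The shapes follow the ones mentioned in this thread, while the sample count and data are made up.

```python
# K-fold splitting for a multi-input regression model (illustrative sketch)
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical data: three inputs of shape (n, 2, 8, 10) and a 4-column regression target
n_samples = 50
rng = np.random.default_rng(3)
X1 = rng.random((n_samples, 2, 8, 10))
X2 = rng.random((n_samples, 2, 8, 10))
X3 = rng.random((n_samples, 2, 8, 10))
Y = rng.random((n_samples, 4))

# KFold only needs to know how many samples there are, so a 1-D index array is enough
kfold = KFold(n_splits=10, shuffle=True, random_state=3)
sample_ids = np.arange(n_samples)

fold_sizes = []
for train, test in kfold.split(sample_ids):
    # The same row indices are applied to every input and to the targets
    train_inputs = [X1[train], X2[train], X3[train]]
    test_inputs = [X1[test], X2[test], X3[test]]
    y_train, y_test = Y[train], Y[test]
    fold_sizes.append((len(train), len(test)))
    # model.fit(train_inputs, y_train, ...) and model.evaluate(test_inputs, y_test) would go here

print(fold_sizes[0])  # (45, 5)
```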

23. Siyan September 27, 2017 at 12:59 am #

Hi Jason, thanks for your blogs and I learned a lot from your posts.
I encountered a strange problem while using Keras. My problem is a regression problem and I would like to show and record the loss and validation loss while training. But when I only assign “validation_split”, I can only get the “loss” without any “validation loss”. After I manually assign the “validation_data” in model.fit, then I can get both “loss” and “validation loss”.
From the document of Keras, “validation_split” will use last XX% of data without shuffle as the validation data, I assume it should have the “validation loss” as well, but I cannot find and get it. Do you have any ideas about it, thanks in advance!

• Jason Brownlee September 27, 2017 at 5:43 am #

If you set validation data or a validation split, then the validation loss will be printed each epoch if verbose=1 and available in the history object at the end of the run.
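
As a rough illustration of what validation_split does, the sketch below mimics the documented behavior in plain NumPy: the last fraction of the samples is held out before any shuffling. This is not the Keras implementation, and the helper name split_last_fraction is hypothetical.

```python
import numpy as np

def split_last_fraction(x, y, validation_split):
    """Hold out the last `validation_split` fraction of the rows (hypothetical helper)."""
    split_at = int(len(x) * (1.0 - validation_split))
    return (x[:split_at], y[:split_at]), (x[split_at:], y[split_at:])

x = np.arange(20).reshape(10, 2)
y = np.arange(10)
(x_train, y_train), (x_val, y_val) = split_last_fraction(x, y, 0.2)

# With 10 rows and validation_split=0.2, the last 2 rows become validation data.
print(len(x_train), len(x_val))
```

When a validation set is supplied either way, the per-epoch values are stored in the history object returned by fit(), for example under history.history["val_loss"].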

24. Nafiz September 27, 2017 at 11:35 am #

Hi Jason, after doing the k-fold CV, how do you train the NN on your whole data set? Because usually, we train it as long as the validation set accuracy is increasing. Before applying the NN into the wild, we would like to train it on the whole data set, if our data set size is small. How do we train it then?

25. Shahab October 21, 2017 at 8:49 pm #

Hi Jason. I have trained a network using Keras for segmentation of MRI images. My test data has no ground truth. I need to save the output of the network for the test set as images and submit them for evaluation. I would like to ask how I can do this in the testing part. As far as I know, for evaluation in Keras I need both the test samples and the corresponding ground truth!

• Jason Brownlee October 22, 2017 at 5:18 am #

Correct, you need ground truth in order to evaluate any model.

26. gana October 24, 2017 at 3:13 pm #

Thank you for your precious tutorials

I have a question about confusion between test set and validation set in tensorflow+keras.

In your tutorials the validation dataset does not affect training and is totally independent of the training procedure.
The validation dataset is only for monitoring and early stopping. In addition, the validation set is not used in training (updating weights, gradient descent, etc).

However i found that wikipedia says in different way as follows:
A validation dataset is a set of examples used to tune the hyperparameters (i.e. the architecture) of a classifier.

It sounds like validation dataset is for tuning parameters means that it is used in training procedure.
If it is true we will face overfitting.

if validation data set does not affect to training as your tutorial then does keras use some part of training dataset automatically for validating and tuning parameters?

• Jason Brownlee October 24, 2017 at 4:02 pm #

Yes, validation set is used for tuning the model and is a subset of the training dataset.

Perhaps this post will clear things up:
https://machinelearningmastery.com/difference-test-validation-datasets/

• gana October 24, 2017 at 6:28 pm #

Let me clear my question again, i am asking this question not because i do not know the concept of three datasets.
it is because i do not know how background of Keras use the datasets.

For example, in this tutorial: https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

You defined only two datasets as:

(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)

You said that validation dataset may disappear if there is K-fold validation (in that case validation is picked from training set), however in the tutorial we did not use k-fold validation. So where is validation set? is it still in the training set?

in the code below, fit function uses validation_data for tuning parameters isn’t it? and also you assigned test data to validation data. in that case we need new test data for evaluation, is it right?

model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

in the code below, evaluate function results unbiased score isn’t it? then where is validation data? does keras background code automatically split X_train to train and validation parts?

model.fit(X_train, y_train, nb_epoch=3, batch_size=64)
# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)

• Jason Brownlee October 25, 2017 at 6:43 am #

We do not have to use a validation dataset and in many tutorials I exclude that part of the process for brevity.

• gana October 25, 2017 at 1:14 pm #

Means that keras picks part of training dataset automatically for validating and tuning parameters?
If we do not use validation dataset how to tune parameters?

• Jason Brownlee October 25, 2017 at 4:03 pm #

It can, or we can specify it.

You can tune on the training dataset.

27. Estelle October 27, 2017 at 7:56 am #

Hi Jason,

Thank you very much for your blog and examples, it is great!

Look I merged two of your examples: the one above and the Save and Load Your Keras Deep Learning Models (https://machinelearningmastery.com/save-load-keras-deep-learning-models/). You can see the code below:

# MLP for Pima Indians Dataset with 10-fold cross validation
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import StratifiedKFold
from keras.models import model_from_json
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]

#make model
def make_model():
    model = Sequential()
    # Compile model
    return(model)

# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X, Y):
    # create model
    model = make_model()
    # Fit the model
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)
    # serialize model to JSON
    model_json = model.to_json()
    with open("model.json", "w") as json_file:
        json_file.write(model_json)
    # serialize weights to HDF5
    model.save_weights("model.h5")
    print("Saved model to disk")

    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

    cvscores.append(scores[1] * 100)

    del model_json
    del model
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))

# load json and create model
json_file = open('model.json', 'r')
json_file.close()
# load weights into new model

# evaluate loaded model on test data

And the output is as follow:

Saved model to disk
acc: 76.62%
Saved model to disk
acc: 74.03%
Saved model to disk
acc: 71.43%
Saved model to disk
acc: 72.73%
Saved model to disk
acc: 70.13%
Saved model to disk
acc: 64.94%
Saved model to disk
acc: 66.23%
Saved model to disk
acc: 64.94%
Saved model to disk
acc: 63.16%
Saved model to disk
acc: 72.37%
69.66% (+/- 4.32%)

acc: 75.91%

Naively I was expecting to get the save accuracy as the last model I saved (which was 72.37%), but I got 75.91%. Could you please explain how the weights are saved inside a k-fold cross validation?

Thanks,

Estelle

28. Kongpon December 12, 2017 at 1:38 am #

Hi Jason,

I try to use your example in my Classification Model, but I got this Error

ValueError: Supported target types are: (‘binary’, ‘multiclass’). Got ‘multilabel-indicator’ instead.
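
This error typically appears when one-hot encoded (2-D) labels are passed to StratifiedKFold, which expects a 1-D array of class labels. Below is a minimal sketch of one workaround, assuming each row of the label array contains exactly one 1: stratify on the integer class index and keep the one-hot array for training. The data and the names Y_onehot and labels are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(1)
n_samples, n_classes = 40, 4

# Placeholder data: 10 samples per class, labels stored one-hot.
class_ids = np.repeat(np.arange(n_classes), n_samples // n_classes)
Y_onehot = np.eye(n_classes)[class_ids]
X = rng.normal(size=(n_samples, 8))

# Collapse one-hot rows to 1-D integer labels for stratification only.
labels = Y_onehot.argmax(axis=1)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
per_fold_counts = []
for train, test in kfold.split(X, labels):
    # A model would be trained here on X[train], Y_onehot[train].
    counts = np.bincount(labels[test], minlength=n_classes)
    per_fold_counts.append(counts.tolist())

print(per_fold_counts)
```

For genuinely multi-label targets (more than one 1 per row) argmax does not apply, and plain KFold is the simpler choice.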

29. AJ December 19, 2017 at 9:27 pm #

Sir, can you suggest how to do stratified k-fold cross-validation for this case?
https://gist.github.com/dirko/1d596ca757a541da96ac3caa6f291229

• Jason Brownlee December 20, 2017 at 5:43 am #

Sorry, I do not have the capacity to review your code.

• AJ December 20, 2017 at 6:31 am #

Sorry, it’s not my code. I just found it on the internet and I am learning. I would like to know: if forward and backward training and testing data exist, how do I pass two parameters to kfold.split(X, Y)?
In the link, X_enc_f, X_enc_b and y_enc are given as the forward, backward and label encoders.

30. Sam Miller January 5, 2018 at 1:28 am #

HI Jason,
How do you apply model.predict when using k-folds?
I want to be able to create a classification report for my model and possibly an AUC plot, heat map, precision-recall graph as well

• Jason Brownlee January 5, 2018 at 5:27 am #

Good question, see this post on creating a final model:
https://machinelearningmastery.com/train-final-machine-learning-model/

• Sam Miller January 6, 2018 at 2:19 am #

Sorry what I meant was, how do you code model.predict/predict_proba when you use the kfold.split method?

The examples I’ve seen that don’t use k-folds have code like model.predict(x_test) after applying model.fit.

I’d like to use precision_recall_curve and roc_curve with k-folds

• Jason Brownlee January 6, 2018 at 5:54 am #

You do not. CV is for evaluating a model, then you can build a final model. Please see the post that I linked.

31. raj kumar January 26, 2018 at 5:20 pm #

We are doing 10-fold cross validation on some optical character recognition data set. we used kfold, kerasclassifier functions. A snap shot of our output is

18000/18000 [==============================] – 1s – loss: 0.5963 – acc: 0.8219
Epoch 144/150
18000/18000 [==============================] – 1s – loss: 0.5951 – acc: 0.8217
Epoch 145/150
18000/18000 [==============================] – 1s – loss: 0.5941 – acc: 0.8219
Epoch 146/150
18000/18000 [==============================] – 1s – loss: 0.5928 – acc: 0.8225
Epoch 147/150
18000/18000 [==============================] – 1s – loss: 0.5908 – acc: 0.8234
Epoch 148/150
18000/18000 [==============================] – 2s – loss: 0.5903 – acc: 0.8199
Epoch 149/150
18000/18000 [==============================] – 1s – loss: 0.5892 – acc: 0.8217
Epoch 150/150
18000/18000 [==============================] – 1s – loss: 0.5917 – acc: 0.8235
1720/2000 [========================>…..] – ETA: 0sAccuracy: 81.24% (0.94%)

How to interpret this output?

• Jason Brownlee January 27, 2018 at 5:55 am #

It shows the progress (epoch number n of m), the loss (minimizing) and the accuracy (maximizing).

What is the problem exactly?

32. Anton January 30, 2018 at 8:07 am #

Hello Jason,
Did i understand correctly: using K-folds won’t necessarily increase the accuracy of the model but instead give a more realistic (or “accurate”) accuracy rate? I’m getting a slightly lower score or accuracy on my model which uses your example of kfolds, vs. a very simple model that simply evaluated on test data
scores = model.evaluate(X_test, Y_test)

• Jason Brownlee January 30, 2018 at 9:59 am #

Yes, the idea is that k-fold cross validation provides a less biased estimate of the skill of your model on unseen data.

This is on average.

A difficult problem or a bad k value can still give poor skill estimates.

33. Maryam March 3, 2018 at 6:51 am #

Hi Jason,
I appreciate your tutorial. tell you the truth I want to write k-fold cross validation from scratch for the first time, but I do not know which tutorial teach novice student better. I want to write k- fold cross validation for lstm, Rnn, cnn.
would u please recommend me which link is the best one to do the issue??
1-https://machinelearningmastery.com/use-keras-deep-learning-models-scikit-learn-python/

2-https://machinelearningmastery.com/evaluate-skill-deep-learning-models/

3-https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6

If there is any better tutorial link that teaches k-fold cross validation for deep learning in Keras with TensorFlow, please share it with us.
Best wishes
Maryam

• Jason Brownlee March 3, 2018 at 8:19 am #

Choose a tutorial that teaches in a style that suits you.

• Maryam March 3, 2018 at 10:47 am #

Tell u the truth they are different with each other and I do not know which is proper for writing k-fold cross validation for RNN,CNN, Lstm?
the written codes for k-fold cross validation are different.
please show me which code will work fine for the issue?
Thank u

• Jason Brownlee March 4, 2018 at 5:59 am #

For RNNs, you may want to use walk-forward validation instead of k-fold cross validation:
https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/

• Maryam March 5, 2018 at 12:26 am #

Hi Jason,
I am grateful for the link. but I have a sentiment analysis binary classification. the given link is for Time Series Forecasting which is not my issue.
I do not know which codes are proper for writing k-fold cross-validation for binary sentiment analysis(text classification) as codes are different and I have not seen a k-fold cross-validation code for cnn or lstm.
Would u please introduce me a similar sample code?

• Jason Brownlee March 5, 2018 at 6:25 am #

I would recommend scikit-learn for cross validation of deep learning models, if you have the resources.

As in the link I provided.

The specific dataset used in the example is irrelevant, you are interested in the cross validation.

I cannot write the code for you. You have everything you need.

34. Maryam March 4, 2018 at 4:02 am #

Hi Jason,
I should appreciate the tutorial but when I copy your code and paste it into my spyder, it gave me error in this command line “model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’]).
the error is this: model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])
^
IndentationError: unindent does not match any outer indentation level.
but I am sure I have written codes as the same as yours.

To solve the problem I remove the indent and written the code as below but gave me just the final result=acc: 64.47%..
But I want to give each fold’s results like yours:
acc: 77.92%
acc: 68.83%
acc: 72.73%
acc: 64.94%
acc: 77.92%
acc: 35.06%
acc: 74.03%
acc: 68.83%
acc: 34.21%
acc: 72.37%
64.68% (+/- 15.50%)
when I remove the indent space the code just gives me this result:acc: 64.47%
64.47% (+/- 0.00%)
My own written code after removing unindent is this:

for train, test in kfold.split(X, Y):
# create model
model = Sequential()
# Compile model
model.fit(X[train], Y[train], epochs=15, batch_size=10, verbose=0)
# evaluate the model
scores = model.evaluate(X[test], Y[test], verbose=0)
print(“%s: %.2f%%” % (model.metrics_names[1], scores[1]*100))
cvscores.append(scores[1] * 100)
print(“%.2f%% (+/- %.2f%%)” % (numpy.mean(cvscores), numpy.std(cvscores)))

I have only included the section that I changed by removing the indent.
The only differences between my code and yours are the removed indents for these 5 command lines, which are:

model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)

scores = model.evaluate(X[test], Y[test], verbose=0)
print(“%s: %.2f%%” % (model.metrics_names[1], scores[1]*100))
cvscores.append(scores[1] * 100)

what is the reason which cause me this error (model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])
^
IndentationError: unindent does not match any outer indentation level) when i write code as yours.
how can i fix the error?
Sorry for noting long.
Best REgards
Maryam

• Jason Brownlee March 4, 2018 at 6:05 am #

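This is a Python indentation error. Ensure that each statement is on one line and that the lines inside the loop are indented consistently (for example, spaces only, not a mix of tabs and spaces).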

• Maryam March 5, 2018 at 7:19 am #

It is working fine now.
Thank u.

35. SUH March 6, 2018 at 1:56 am #

Hi, Jason. Thank you for your tutorial. I’d like to apply StratifiedKFold to my code using Keras, but I don’t know how to do it. This is based on the tutorial from the Keras blog post “Building powerful image classification models using very little data”. There, the author of the code uses ‘fit_generator’ instead of ‘X = dataset[:,0:8], Y = dataset[:,8]’.

How can I make this work? I’ve been scratching my head for weeks, but I am out of idea…

I’m open to all suggestions and any answers will be appreciated.

Regards,

SUH

p.s: Here’s the full code.

• Jason Brownlee March 6, 2018 at 6:17 am #

Sorry, I cannot debug your code for you. Perhaps post to stackoverflow.

36. mark March 7, 2018 at 4:42 am #

hi,
thanks for your hard work. every time I have a question google usually sent me to your site.

I came here because I googled ” how to calculate AUC with model.fit_generator” I found that one of your reader had similar issue but you only use “ImageDataGenerator”to augment.
I tried
——————————————————————————————————————
from sklearn.metrics import roc_curve, auc, roc_auc_score

y_pred = InceptionV3_model.predict_generator(test_generator,2000 )
but dont know how to get y_test
roc_auc_score(y_test, y_pred)

• Jason Brownlee March 7, 2018 at 6:17 am #

y_test would be the expected outputs for the test dataset.

37. CHIRANJEEVI March 8, 2018 at 5:12 am #

HI Jason
please elaborate the difference between the validation,training_score and test score?

38. alice March 11, 2018 at 6:11 pm #

Hi,

Many Thanks for this great post.. learnt a lot…I followed the 5 fold cross validation approach for my dataset, that contains 2000 posts and used 25 epochs.. The accuracy keep increasing, after every fold and finally reached more than 97%… But, in your blog, the acc either increases or decreases in all the folds.. could you please explain the reason, why my results are different…

PLOTTING ACCURACY, PRECISION, RECALL FOR FOLD: 1
Accuracy: : 92.18%

PLOTTING ACCURACY, PRECISION, RECALL FOR FOLD: 2
Accuracy: : 95.84%

PLOTTING ACCURACY, PRECISION, RECALL FOR FOLD: 3
Accuracy: : 98.03%

PLOTTING ACCURACY, PRECISION, RECALL FOR FOLD: 4
Accuracy: : 99.26%

PLOTTING ACCURACY, PRECISION, RECALL FOR FOLD: 5
Accuracy: : 100.00%

Accuracy of 5-Fold Cross Validation with standard deviation:

97.21% (+/- 2.57%)

39. Mary March 27, 2018 at 3:51 pm #

Hi, Jason
I’m new in python and deep learning machine
thank you for your all tutorials I learnt too much so far
I have question can you explain to me simply what does it mean
what is val_acc , loss and val-loss in the model what it does tell
I read many comments and articles but I could not get it

Epoch 145/150
514/514 [==============================] – 0s – loss: 0.4847 – acc: 0.7704 – val_loss: 0.5668 – val_acc: 0.7323
Epoch 146/150
514/514 [==============================] – 0s – loss: 0.4853 – acc: 0.7549 – val_loss: 0.5768 – val_acc: 0.7087
Epoch 147/150
514/514 [==============================] – 0s – loss: 0.4864 – acc: 0.7743 – val_loss: 0.5604 – val_acc: 0.7244
Epoch 148/150
514/514 [==============================] – 0s – loss: 0.4831 – acc: 0.7665 – val_loss: 0.5589 – val_acc: 0.7126
Epoch 149/150
514/514 [==============================] – 0s – loss: 0.4961 – acc: 0.7782 – val_loss: 0.5663 – val_acc: 0.7126
Epoch 150/150
514/514 [==============================] – 0s – loss: 0.4967 – acc: 0.7588 – val_loss: 0.5810 – val_acc: 0.6929

Thank you once again

• Jason Brownlee March 27, 2018 at 4:20 pm #

val_loss is the calculated loss on the validation dataset.
val_acc is the calculated accuracy on the validation dataset.

They are different from loss and acc that are calculated on the training dataset.

Does that help?

• Mary March 28, 2018 at 12:25 pm #

Yes Thank you 🙂

40. Harrison April 5, 2018 at 9:09 am #

Hello, thanks a lot for this tutorial. I have been searching for something like this and I am glad I finally found it. I would like to know how I can visualize the training of the neural network in the example. In one of your tutorials I could plot “val_acc” against “acc”, but I cannot do the same here because there is no “val_acc” in k-fold validation. How do I do this if I evaluate with k-fold validation? Thank you.

41. John April 5, 2018 at 4:22 pm #

Hi!

Thanks for the tutorial, very helpful. I used the validation_data approach, and it seems to be working and producing different accuracies for each epoch (presumably between train and validation), but it gives a puzzling statement before the model starts:

“Train on 15,000 samples, validate on 15,000 samples.”

Does that mean I messed up and fed it the same data for both train and validation or am I ok? In my case, the train has 15,000 samples and the validation file has 10,000 samples.

Thanks so much!

• Jason Brownlee April 6, 2018 at 6:21 am #

Perhaps confirm the size/shape of the train and validation dataset, just to make sure you have set things up the way that you expect.

• John April 6, 2018 at 2:44 pm #

Yeah, it is clearly reading in the train dataset twice. Thanks!

42. Don April 13, 2018 at 3:46 am #

Hi,
Thank you for this. How can I change the manual Kfold cross validation to work for multiclass? Say Y was a 100 by 4 array of zeros and ones?

• Jason Brownlee April 13, 2018 at 6:44 am #

The model is multiclass, not cross validation.

A deep learning model can support multi-class classification by having one neuron for each class in the output layer and using the softmax activation function.

43. Victor Vargas April 16, 2018 at 10:21 am #

If the goal of a training phase is to improve our model’s accuracy and reduce its loss on every epoch, why do you create and compile a new model on every fold iteration?

• Jason Brownlee April 16, 2018 at 2:58 pm #

Great question.

We want to know how skillful the model is on average, when trained on a random sample from the domain, and making predictions on other data in the domain.

To calculate this estimate, we use resampling methods like k-fold cross-validation that make good economical use of a fixed sized training dataset.

Once we select a model+config that has the best estimated skill, we can then train it on all of the available data and use it to start making predictions.

Does that help Victor?

44. Nicholas Angelucci July 4, 2018 at 12:19 am #

Hello!
Is it possible to plot the accuracy and loss graphs (history) resulting from a validation made with the StratifiedKFold class?

• Jason Brownlee July 4, 2018 at 8:24 am #

Yes, but you will have one plot per fold. You may also have to iterate the folds manually to capture the history and plot it.

45. Nicholas Angelucci July 4, 2018 at 7:20 pm #

Yes, I am able to make a plot per fold, but i want to make only two graphs, one with the accuracy mean and the other with mean loss for k-fold.

• Jason Brownlee July 5, 2018 at 7:40 am #

You can create two plots and add a line to each plot for each fold.
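
If a single averaged curve per metric is preferred, one option is to stack the per-fold histories and average them epoch by epoch. The sketch below assumes every fold was trained for the same number of epochs; the numbers are made-up placeholders standing in for the lists that fit() returns in its history object (the key is "accuracy" in recent Keras versions and "acc" in older ones).

```python
import numpy as np

# Placeholder per-fold accuracy curves: one list per fold, one value per epoch.
# In practice each list would be history.history["accuracy"] from one fold.
fold_accuracy = [
    [0.60, 0.70, 0.75],
    [0.62, 0.68, 0.77],
    [0.58, 0.72, 0.76],
]

acc = np.array(fold_accuracy)   # shape: (n_folds, n_epochs)
mean_acc = acc.mean(axis=0)     # average over folds at each epoch
std_acc = acc.std(axis=0)       # spread over folds at each epoch

print(mean_acc)
print(std_acc)
```

The same two lines applied to the per-fold loss lists give the mean loss curve, and both arrays can be passed directly to a plotting call.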

46. Tien Wang July 7, 2018 at 8:13 am #

I found that if your training data input “X” is a Pandas DataFrame, then you will have to use X.loc[train] for it to work. Otherwise, indexing the DataFrame directly with the NumPy arrays that kfold.split returns will throw a KeyError.

• Jason Brownlee July 8, 2018 at 6:15 am #

In the above tutorial we are loading data as a NumPy array.
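
The indices returned by kfold.split are integer positions, so when the data is kept in a DataFrame one option is positional indexing with .iloc, which works regardless of the DataFrame's index labels. A minimal sketch, with a made-up DataFrame whose index is deliberately non-numeric:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Placeholder DataFrame with string index labels.
df = pd.DataFrame(
    {"a": np.arange(6), "b": np.arange(6) * 10},
    index=list("uvwxyz"),
)

kfold = KFold(n_splits=3)
shapes = []
for train, test in kfold.split(df):
    X_train = df.iloc[train]   # positional selection of the training rows
    X_test = df.iloc[test]     # positional selection of the held-out rows
    shapes.append((X_train.shape, X_test.shape))

print(shapes)
```

Converting once with df.to_numpy() before the loop is another option, and matches how the tutorial loads its data.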

47. kuda July 9, 2018 at 10:26 pm #

Hi Doc. Thanks for your examples, they are straight forward. Can you do the pima implementation of cnn in r?

• Jason Brownlee July 10, 2018 at 6:48 am #

No, it has no temporal or spatial structure.

48. Abdur Rahaman July 30, 2018 at 6:30 am #

Hi,
I have a problem printing the confusion matrix and classification report and drawing the AUC curve.
It will be helpful if you provide me how I can print confusion matrix, classification report and draw AUC curve in Neural network using keras in 10 fold cross validation.
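
One common approach, sketched here under stated assumptions, is to collect the out-of-fold predictions from every fold and compute a single confusion matrix and report over all of them. LogisticRegression is only a stand-in so the example stays small; a freshly built Keras model would take its place, with its predicted probabilities thresholded (for example at 0.5) to get class labels. The data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold

# Synthetic binary classification data (placeholder).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
y_pred = np.empty_like(y)
for train, test in kfold.split(X, y):
    model = LogisticRegression()          # stand-in for a new Keras model
    model.fit(X[train], y[train])
    y_pred[test] = model.predict(X[test])  # each row is predicted exactly once

cm = confusion_matrix(y, y_pred)
print(cm)
print(classification_report(y, y_pred))
```

Storing predicted probabilities instead of class labels in the same way would allow roc_curve or precision_recall_curve to be computed over the pooled out-of-fold predictions.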

Thanks

49. Alay August 16, 2018 at 7:54 am #

Hi Jason..
Thank you so much for all tutorials.

How can use Cross validation with flow_from_directory and fit_generator ??

• Jason Brownlee August 16, 2018 at 1:57 pm #

Sorry, I don’t have an example of this combination.

50. Samin September 12, 2018 at 2:44 am #

Hi Jason,

In your cross-validation code, you used 150 epochs, but the results show only 1 full round of 10-fold cross validation. You only printed the outputs for first epoch?

I really have problems understanding cross validation and epochs. Here is how I understood the code:
is it like each epoch consists of one full run of cross validation? So, in first epoch we run 10-fold cross validation and report the average validation error, in second epoch we continue with the model prepared in first epoch and again run k-fold cross validation and so on till epoch 150? And about the mini-batches, is it like that we divide each training set to 10 mini-batches to do forward backward pass, and after completing the training on whole 10 batches, we refer to validation set to estimate validation error?

Could you tell me whether I perceived the concept of cross validation correctly or not?

thanks,
Samin
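
It may help to see how the loops nest in the manual k-fold code shown in this thread: folds are the outer loop, and for every fold a new model is trained for all epochs before it is evaluated once on the held-out fold, which is why only one accuracy line is printed per fold. The arithmetic below is a sketch that assumes the 768-row Pima Indians dataset with the settings used in the code above (10 folds, 150 epochs, batch size 10).

```python
import math

n_samples = 768   # assumed dataset size (Pima Indians diabetes)
n_splits = 10     # folds: one new model is trained per fold
epochs = 150      # full passes over the training part of a fold
batch_size = 10   # rows per weight update

# Each fold holds out roughly 1/10 of the rows and trains on the rest.
smallest_train = n_samples - math.ceil(n_samples / n_splits)

# Within one epoch the training rows are processed in mini-batches.
batches_per_epoch = math.ceil(smallest_train / batch_size)

# Weight updates needed to train the model for one fold.
updates_per_fold = epochs * batches_per_epoch

print(smallest_train, batches_per_epoch, updates_per_fold, n_splits)
# prints: 691 70 10500 10
```

So an epoch does not contain a round of cross validation. Each of the 10 models gets its own 150 epochs, and the held-out fold is only used for the final evaluation of that model.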

51. Akim Borbuev October 1, 2018 at 4:06 am #

Hello Jason,

First of all, thank you for your comprehensive and insightful tutorials; your website is my first stop when I look for answers. I have implemented a deep neural network for time series forecasting. The problem I am facing is that the fitting and evaluation parts go well and I get around 0.8 accuracy for both of them, but when I try to predict for a new batch of samples, I get zeros all over my predictions. I would be very grateful if you could suggest where to start debugging my model. I can also send my model if you would like to take a look at it.
Thank you.

Regards,
Akim

52. Okocha October 11, 2018 at 7:02 pm #

Should not we set_learning_phase to 0 right before calling the evaluate function? According to keras documentation while doing inference we should set it to 0.

• Jason Brownlee October 12, 2018 at 6:37 am #

Perhaps it is set automatically when evaluate() is called?

53. Manikandan Sathiyanarayanan October 26, 2018 at 12:28 am #

hi .i would like to know some basic things about validation data. shall i keep 0.2 % percentage of total training data in validation data folder manually without use validation data split function . will it useful while training model

54. ipek October 31, 2018 at 1:49 am #

Why do you create model for each iteration in kfold instead of creating model once and then calling model.fit in iterations?

• Jason Brownlee October 31, 2018 at 6:29 am #

A new set of weights is required to fit a new model on each iteration.

55. Effe December 2, 2018 at 3:43 am #

Hi! How can I use K-fold validation in multiple label problems?

• Jason Brownlee December 2, 2018 at 6:22 am #

Directly, no change. What problem are you having exactly?

56. Joseph December 7, 2018 at 4:39 am #

Hi Jason,

Great article first of all!

In your example you perform model compilation within each fold. That is very slow. I am wondering whether I will achieve the same results if I move the model compilation outside of the for loop like the following?

I guess what I am really asking is, once I’ve initialized and compiled the model, will each call to model.fit() perform an independent fitting using the current folds of training and validation data set without being interfered by weights obtained from the last loop?

If yes, then I suppose it’s faster to do my version of the code as the result will be the same.
If no, and if I still want to only initialize and compile the model once before the for loop, is there any way to reset the model after each model.fit()?

Many thanks
Joseph

• Jason Brownlee December 7, 2018 at 5:26 am #

To be safe, I’d rather re-define and re-compile the model each loop to ensure that each iteration we get a fresh initial set of weights.

57. sanghita December 21, 2018 at 3:27 pm #

How to modify the below code for plotting graph with k-fold cross validation?

def build_classifier():
    classifier = Sequential()
    classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 13))
    classifier.add(Dense(units = 6, kernel_initializer = 'VarianceScaling', activation = 'relu'))
    classifier.add(Dense(units = 6, kernel_initializer = 'VarianceScaling', activation = 'relu'))
    classifier.add(Dense(units = 3, kernel_initializer = 'VarianceScaling', activation = 'softmax'))
    classifier.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
    return classifier
classifier = KerasClassifier(build_fn = build_classifier, batch_size = 5, epochs = 200, verbose=1)
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = 1)
mean = accuracies.mean()
variance = accuracies.std()

Thanks and regards
Sanghita

• Jason Brownlee December 22, 2018 at 6:01 am #

For k-fold cross-validation, I would recommend plotting the distribution of scores across the folds, e.g. with a box and whisker plot.

Collect the scores in a list and pass them to pyplot.boxplot()
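
As a small sketch of what such a summary contains, the snippet below computes the mean, standard deviation and quartiles that a box and whisker plot visualizes. The scores reuse the ten per-fold accuracies quoted in an earlier comment in this thread and are for illustration only; the plotting call is left as a comment because it needs matplotlib and a display.

```python
import numpy as np

# Per-fold accuracies (percent) quoted earlier in this thread.
scores = [77.92, 68.83, 72.73, 64.94, 77.92, 35.06, 74.03, 68.83, 34.21, 72.37]

mean, std = np.mean(scores), np.std(scores)
q1, median, q3 = np.percentile(scores, [25, 50, 75])

print("%.2f%% (+/- %.2f%%)" % (mean, std))
print("median %.2f, quartiles %.2f to %.2f" % (median, q1, q3))

# To draw the plot itself:
#   from matplotlib import pyplot
#   pyplot.boxplot(scores)
#   pyplot.show()
```

The median and quartiles are less affected than the mean by the two folds that scored near 35%, which is one reason to look at the whole distribution.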

58. ammara December 31, 2018 at 4:24 pm #

Hi, jason
What is the limitation/disadvantage of using a manual verification dataset?

59. Partha Nayak February 19, 2019 at 3:33 am #

Can I write like this:

result = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)

and then:

plot_loss_accuracy(result)

so that result can be used for kfold validation of scikit-learn as well as confusion matrix display of Keras?

60. Gabby February 26, 2019 at 8:44 pm #

Im just wondering why you use X_test,y_test for validation_data?

• Jason Brownlee February 27, 2019 at 7:25 am #

An easy shortcut for the tutorial. Ideally we would use a separate dataset.

61. Meisam March 4, 2019 at 9:50 pm #

Shouldn’t each fit during cross-validation be saved so that the best fit can be used later?

62. Jacob MB March 16, 2019 at 12:03 am #

Dear Jason
Some Question here that in my eyes make little sense:

So far, I have produced the following code in python using Keras with Tensorflow backend (1 batch, sequence of 1).

#Define model
model = Sequential()
model.add(LSTM(128, batch_size=BATCH_SIZE, input_shape=(train_x.shape[1],train_x.shape[2]), return_sequences=True, stateful=False))

#Compile model
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=opt,
    metrics=['accuracy']
)

model.fit(
    train_x, train_y,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    verbose=1)

#Now I want to make sure that we can predict the training set (using evaluate) and that it is the same result as during training
score = model.evaluate(train_x, train_y, batch_size=BATCH_SIZE, verbose=0)
print(' Train accuracy:', score[1])

The Output of the code is

Epoch 1/10 5872/5872 [==============================] – 0s 81us/sample – loss: 0.6954 – acc: 0.4997
Epoch 2/10 5872/5872 [==============================] – 0s 13us/sample – loss: 0.6924 – acc: 0.5229
Epoch 3/10 5872/5872 [==============================] – 0s 14us/sample – loss: 0.6910 – acc: 0.5256
Epoch 4/10 5872/5872 [==============================] – 0s 13us/sample – loss: 0.6906 – acc: 0.5243
Epoch 5/10 5872/5872 [==============================] – 0s 13us/sample – loss: 0.6908 – acc: 0.5238

Train accuracy: 0.52480716

So the problem is that the final training accuracy (0.5238) should be equal to the evaluation accuracy (0.52480716), which it is not. This makes no sense: why can’t we use evaluate on our training data and obtain the same result as during training? There is no dropout or anything else that should make training different from evaluation. The same happens if I use a validation set.

• Jason Brownlee March 16, 2019 at 7:53 am #

The score during training is estimated across the batches I believe, and is reported before weight updates.

You could use early stopping to save the weights for the model after a specific batch I believe. Perhaps a custom early stopping/checkpoint callback?
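
The effect described in the reply can be reproduced without Keras. The sketch below is a toy NumPy logistic regression, not the Keras training loop: it measures accuracy on each mini-batch with the weights as they are before that batch's update, averages those values over the epoch, and then evaluates once on all of the training data with the final weights. All names and data here are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)

w = np.zeros(3)
batch_size, lr = 20, 0.5

def accuracy(w, X, y):
    """Fraction of rows where the sign of X @ w matches the label."""
    return float(np.mean(((X @ w) > 0) == (y == 1)))

batch_accs = []
for start in range(0, len(X), batch_size):
    xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    batch_accs.append(accuracy(w, xb, yb))   # measured with the current weights
    p = 1.0 / (1.0 + np.exp(-(xb @ w)))      # sigmoid predictions
    w -= lr * xb.T @ (p - yb) / len(xb)      # then the weights are updated

running_acc = float(np.mean(batch_accs))  # analogous to the progress-bar value
final_acc = accuracy(w, X, y)             # analogous to evaluate() afterwards

print(running_acc, final_acc)
```

The two numbers generally differ because the first mixes measurements taken with many different sets of weights, while the second uses only the final weights.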

63. Tiger229 March 16, 2019 at 12:05 am #

hi Jason ,
thank you for the helpful tutorial .
I have trained keras model for semantic segmentation with ICNET .
how could evaluate my trained model with mIOU (Mean intersection over union ) on the validation set ? any tips or useful articles

• Jason Brownlee March 16, 2019 at 7:54 am #

Sorry, I don’t have a tutorial on calculating mIOU.

64. Eyitayo March 28, 2019 at 1:42 am #

Hello Jason,

Thank you for the detailed work.

Two quick questions:

1) Does it suffice to say that a CVscores.mean of 0.78 has better accuracy than a CVscores.mean of 0.58.
2) What is the implication of former?

Thanks

• Jason Brownlee March 28, 2019 at 8:17 am #

Yes.

What do you mean by implication exactly?

65. Tom F March 29, 2019 at 2:05 am #

Mr. Brownlee,

Maybe I’m overthinking, but once I get a model with good results it is wrapped in a function, so I can’t call it from a console. I’m used to using “model.predict(x)”. However, with this code, I get “‘model’ is not defined”. Is finalizing just copying the final model definition to the prompt, compiling it, then fitting it on the learning data, and predicting the unknown data?

Will your code produce the same results as defining the model at the prompt then running it through a KFold loop?

Thanks again!

66. Neel April 19, 2019 at 11:52 pm #

Hi Jason,

Since k-fold cross-validation RANDOMLY splits the data (3, 5, 10, etc.) and then trains the model, is it recommended to use GridSearch / k-fold on time series data (multi-class classification using LSTM)? Because the moment the time series data is randomised, the LSTM would lose its meaning, right?

67. MAK May 22, 2019 at 6:15 am #

Hello,
I see you split the data in a k-fold manner via scikit-learn tools to get a more accurate estimate.
My question is: is there a built-in function for doing the k-fold split in the temporal domain (for example, stock prices), where the order of the samples is meaningful?
Thanks,
MAK
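For ordered data, scikit-learn provides `TimeSeriesSplit`, which keeps every training index earlier than every test index. The following is a minimal sketch on a toy array; the data and the number of splits are illustrative assumptions, and a Keras model would be built and fit inside the loop.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 observations, already in time order

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, test_idx in splits:
    # The training window grows; the test window always lies in the future.
    print("train:", train_idx, "test:", test_idx)
```

Unlike shuffled k-fold, no future observation is ever used to predict a past one, which is the property that matters for data such as stock prices.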

68. Arjun June 17, 2019 at 4:02 am #

Hi Jason,

Great tutorial.
I am facing an issue where my loss is not decreasing.
I used categorical labels, but StratifiedKFold threw an error, which led me to convert the labels to numerical values.

Modified code:

input_img = Input(shape=(242, 242, 1))

kfold = StratifiedKFold(n_splits=6, shuffle=True, random_state=13)
cvscores = []
for train, test in kfold.split(images, labels):
    model = Model(input_img, model(input_img))
    # Compile model
    # Fit the model
    #labels = to_categorical(labels)
    model.fit(images[train], labels[train], epochs=epochs, batch_size=batch_size, verbose=1)
    # evaluate the model
    scores = model.evaluate(images[test], labels[test], verbose=1)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))

Result:

Epoch 1/100
4000/4000 [==============================] – 8s 2ms/step – loss: 7.1741 – acc: 0.5500
Epoch 2/100
4000/4000 [==============================] – 4s 917us/step – loss: 7.1741 – acc: 0.5500
Epoch 3/100
4000/4000 [==============================] – 4s 922us/step – loss: 7.1741 – acc: 0.5500
Epoch 4/100
4000/4000 [==============================] – 4s 955us/step – loss: 7.1741 – acc: 0.5500
Epoch 5/100
4000/4000 [==============================] – 4s 950us/step – loss: 7.1741 – acc: 0.5500
Epoch 6/100
4000/4000 [==============================] – 4s 948us/step – loss: 7.1741 – acc: 0.5500
Epoch 7/100
4000/4000 [==============================] – 4s 918us/step – loss: 7.1741 – acc: 0.5500
Epoch 8/100
4000/4000 [==============================] – 4s 918us/step – loss: 7.1741 – acc: 0.5500

The loss doesn’t seem to decrease. Could you let me know where I am going wrong?

Thanks

69. Namthy July 29, 2019 at 3:10 am #

Sir

To my understanding, Cross validation is a method to help choose a model (or) its hyper parameters.

Like if i have

Neural Network 1 with 10 layers
Neural Network 2 with 100 layers

It can guide me choosing either 1 or 2

After identifying the network, to get my final model I should use the entire training set, run it for a certain number of epochs, and choose the parameters that gave good accuracy during the runs.

Is my presumption correct? Kindly clarify this doubt.

• Jason Brownlee July 29, 2019 at 6:19 am #

Almost. CV is only used to estimate the performance of the model.

You must then interpret the estimated performance of each model/config and choose.

You can do this directly or use statistical hypothesis testing methods, or other methods.

Yes, afterward, you fit on all data and start using the model.

70. EL BOUNY July 29, 2019 at 8:35 pm #

Hi Mr. Jason

Firstly thank you for your very well explained tutorials

I would like to know how to compute the overall confusion matrix of a deep learning model when using the k-fold cross validation ?

Thanks.

• Jason Brownlee July 30, 2019 at 6:10 am #

You cannot.

A confusion matrix is for a single run only.

Cross validation estimates model performance over multiple runs.

• EL BOUNY July 30, 2019 at 10:05 am #

My purpose is that if I can get the confusion matrix at each fold, then the overall confusion matrix can be obtained as the sum of all confusion matrices resulting from all folds. In fact, the performance measure (i.e. accuracy) of the model is the average value across all folds. Thus, by summing all confusion matrices, the accuracy of the model can be computed as the ratio between the sum of the diagonal elements of the resulting confusion matrix and the sum of all elements.

Thanks again!

• EL BOUNY July 30, 2019 at 10:13 am #

The confusion matrix at each fold is computed only from the results of the model on the test set. I asked this question because I have found, in various works in my field, that the authors use k-fold cross-validation to evaluate the model and at the same time draw the confusion matrix of the model.

• Jason Brownlee July 30, 2019 at 2:08 pm #

I would expect the confusion matrix is reported based on a standard test set for the dataset.

• Jason Brownlee July 30, 2019 at 2:07 pm #

I would not recommend it as each cell of the matrix would need to report mean and variance. It would be confusing.

• EL BOUNY July 30, 2019 at 7:45 pm #

Thanks a lot. In this case, do you have an idea of how to draw the total confusion matrix? In numerous papers that I have read, the authors used k-fold cross-validation and also presented a confusion matrix, for example under the title “confusion matrix obtained using 10-fold cross validation”.

Note : The sum of all elements of the matrix is equal to the size of the overall dataset.

Thanks.
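The summed matrix described in this exchange can be sketched as follows. This is one possible reading of what those papers report, not a method confirmed by the post: the synthetic dataset is an assumption, and `LogisticRegression` is only a stand-in for a freshly built Keras model in each fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

# Synthetic binary data (assumption, for illustration only).
X, y = make_classification(n_samples=200, n_features=8, random_state=7)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

total_cm = np.zeros((2, 2), dtype=int)
for train, test in kfold.split(X, y):
    model = LogisticRegression(max_iter=1000)  # stand-in for a new Keras model
    model.fit(X[train], y[train])
    y_pred = model.predict(X[test])
    # Each sample is in exactly one test fold, so the sum covers the dataset once.
    total_cm += confusion_matrix(y[test], y_pred, labels=[0, 1])

pooled_accuracy = np.trace(total_cm) / total_cm.sum()
print(total_cm)
print("Pooled accuracy: %.3f" % pooled_accuracy)
```

The pooled accuracy equals the mean of the per-fold accuracies only when all folds have the same size, and the summed matrix hides the fold-to-fold variance that the reply above is concerned about.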

71. Ziko July 31, 2019 at 11:38 pm #

Hi Jason, thanks a lot, I’m getting lots of help from your tutorials. I’m new to both Python and machine learning.
I took your code sample for stratified k-fold here and used it with some changes on my data and got good results.
I am saving the best fitted model from the k-fold for future predictions. My question is how to save an average model, or somehow save the entire model.
Once again, thanks a lot.
My code (sorry, I’m not sure how to post it as code):

seed = 6
np.random.seed(seed)

# split into input (X) and output (Y) variables
X = train_inputs.copy()
Y = train_targ.copy()
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
model1 = Sequential()
accc_trn = 0
accc_tst = 0
for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    print('model', type(model))
    # Compile model
    # Fit the model
    history = model.fit(X[train], Y[train],
                        validation_data=(X[test], Y[test]),
                        epochs=150, batch_size=32, verbose=2)
    _, train_acc = model.evaluate(X[train], Y[train], verbose=0)
    _, test_acc = model.evaluate(X[test], Y[test], verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))
    if test_acc > accc_tst and train_acc > accc_trn:
        accc_tst = test_acc
        accc_trn = train_acc
        model1 = deepcopy(model)

    # plot history
    plt.plot(history.history['acc'], label='train')
    plt.plot(history.history['val_acc'], label='test')
    plt.legend()
    plt.show()
    cvscores.append(test_acc * 100)

print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
model1.save('my_class.h5')
_, test_newacc = new_model.evaluate(valid_inputs, valid_targ, verbose=0)
print('Test: %.3f' % (test_newacc))

72. Ziko August 3, 2019 at 8:33 pm #

Hi Jason thx a lot
I have another question concerning stratified k-fold. I read in some places that unbalanced inputs cause a model to be fitted to produce unbalanced predictions (whether binary or multi-class), so it’s recommended to drop inputs in order to train the model on balanced data. If I understood you correctly, stratified k-fold takes the unbalanced nature of the data into consideration, but I’m not sure which is better: train a stratified k-fold model with unbalanced input that actually represents the population, or balance the data before training.
thx
Ziko

73. Ziko August 4, 2019 at 4:49 pm #

Hi again, maybe I should try to generalize my question. If I train a model on data with a certain class proportion, does the model actually capture that proportion, and will it therefore be better at predicting data with a similar proportion? So if I suspect that future data will have a different proportion, should I prepare training data whose proportion is closer to that of the future data?
And if the model does capture the training proportion, and my data changes proportion over time or for any other reason, should I train it again for the different proportions?

• Jason Brownlee August 5, 2019 at 6:47 am #

It tries to.

Yes, a given dataset should be generally representative of the problem.

74. crakama August 7, 2019 at 1:57 am #

Hi Jason. Does use of ” Manual Verification Datasets.” help prevent overfitting ? or K-Fold is much better ?

• Jason Brownlee August 7, 2019 at 8:02 am #

It really depends.

Validation dataset is really good for hyperparameter tuning.

75. ziko August 25, 2019 at 7:44 pm #

Hello, Jason.
Using a Keras model, I get zero accuracy for a perfectly linear relation of output vs input. I’m not sure if I interpreted the accuracy wrongly or am doing something wrong in my code. Any help will be appreciated.

I’ve tried adding more layers, more epochs and so on; nothing changed.

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

from keras import models
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras import optimizers
from sklearn.model_selection import KFold

from sklearn.preprocessing import MinMaxScaler
tf.reset_default_graph()
from keras.optimizers import SGD

siz = 100000
inp = np.random.randint(100, 1000000, size=[siz, 1])
a1 = 1.5
uop = np.dot(inp, a1)
normzer_inp = MinMaxScaler()
inp_norm = normzer_inp.fit_transform(inp)
normzer_uop = MinMaxScaler()
uop_norm = normzer_uop.fit_transform(uop)

X = inp_norm
Y = uop_norm

kfold = KFold(n_splits=2, random_state=None, shuffle=False)
cvscores = []
opti_SGD = SGD(lr=0.01, momentum=0.9)
model1 = Sequential()

for train, test in kfold.split(X, Y):
    model = Sequential()

    model.compile(loss='mean_squared_error', optimizer=opti_SGD,
                  metrics=['accuracy'])

    history = model.fit(X[train], Y[train],
                        validation_data=(X[test], Y[test]),
                        epochs=10, batch_size=2048, verbose=2)
    _, train_acc = model.evaluate(X[train], Y[train], verbose=0)
    _, test_acc = model.evaluate(X[test], Y[test], verbose=0)
    print('Train: %.3f, Test: %.3f' % (train_acc, test_acc))

    plt.plot(history.history['acc'], label='train')
    plt.plot(history.history['val_acc'], label='test')
    plt.legend()
    plt.show()
    cvscores.append(test_acc * 100)

print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))

76. mustafa mohammed September 1, 2019 at 4:22 am #

Hello dear
How to do k-Fold Cross Validation on this code?
model = Sequential()

# Fit the model
history = model.fit(train_X, train_y, epochs=150,validation_data=(test_X, test_y),batch_size=24,verbose=2,shuffle=False)

#pyplot.plot(history.history['loss'], label='train')

77. mustafa mohammed September 1, 2019 at 9:22 pm #

hello again

When adding accuracy there is no change in the accuracy of the training values and the accuracy remains zero with the test values? why?

# design network
learning_rate = 0.001
model = Sequential()

# Fit the model
history = model.fit(train_X, train_y, epochs=50,validation_data=(test_X, test_y),batch_size=24,verbose=2,shuffle=False)

78. mustafa mohammed September 4, 2019 at 2:25 am #

Thank you very much for the informative answers.
I have one last request: I would like particle swarm optimization (PSO) code in Python to find the best weights and biases, the best number of hidden layers, and the number of nodes in each hidden layer.
Thank you very much; I am grateful for the knowledge you share.

• Jason Brownlee September 4, 2019 at 6:02 am #

Sorry, I don’t have any tutorials on PSO for neural nets.

79. Jack September 9, 2019 at 6:20 am #

Hi Jason, I’m very interested in your StratifiedKFold part, I tried it myself but I got an error, the code is as below:
from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

# normalize the data
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

test_data -= mean
test_data /= std

X_train = train_data
y_train = train_targets
X_test = test_data
y_test = test_targets
#%%
from keras import models
from keras import layers
from sklearn.model_selection import StratifiedKFold

def build_model():
    # Because we will need to instantiate
    # the same model multiple times,
    # we use a function to construct it.
    model = models.Sequential()
    input_shape=(train_data.shape[1],)))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

model = build_model()
model.fit(X_train, y_train)

kfold = StratifiedKFold(n_splits=10)
for train, test in kfold.split(X_train, y_train):
    model = build_model()
    model.fit(train, test)
    print(model.evaluate(X_test, y_test, verbose=0))

And the error is ValueError: Supported target types are: ('binary', 'multiclass'). Got 'continuous' instead.
It seems that this method doesn’t apply for float training set since my X_train looks like:

array([[-0.27224633, -0.48361547, -0.43576161, …, 1.14850044,
0.44807713, 0.8252202 ],
[-0.40342651, 2.99178419, -1.33391162, …, -1.71818909,
0.43190599, -1.32920239],
[ 0.1249402 , -0.48361547, 1.0283258 , …, 0.78447637,
0.22061726, -1.30850006],

So how am I supposed to do cross validation in deep learning with this kind of data?

• Jason Brownlee September 9, 2019 at 1:53 pm #

Sorry to hear that.

I believe StratifiedKFold is only appropriate for classification predictive modeling problems, not regression problems.
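For a regression target such as this one, plain `KFold` can be used instead, since it does not look at the labels. The sketch below uses random stand-in data of the same width as the housing features (an assumption) and leaves the model calls as comments; `build_model` refers to the function in the question above.

```python
import numpy as np
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 13))  # stand-in features (assumption)
y = rng.normal(size=50)        # continuous targets

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = []
for train, test in kfold.split(X):
    # train and test are index arrays, so they must be used to index X and y.
    X_tr, y_tr = X[train], y[train]
    X_te, y_te = X[test], y[test]
    # model = build_model(); model.fit(X_tr, y_tr); model.evaluate(X_te, y_te)
    fold_sizes.append((len(X_tr), len(X_te)))
print(fold_sizes)
```

Note that passing `train` and `test` directly to `fit()`, as in the question, would fit on the row indices rather than on the data.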

80. David November 5, 2019 at 6:34 am #

In the last line, you compute numpy.std(cvscores). How do you use this information?

81. Sana November 28, 2019 at 4:44 am #

Hi Jason,
I want to use kfold.split(x,y) but I have this error: ValueError: Found array with dim 3. Estimator expected <= 2. because my x.shape=(2000, 1400, 3, 6) and y.shape=(2000, 1400, 3)
Should I reshape my input data to a 2D array?
Thank you

82. Kamal Pandey December 12, 2019 at 12:44 am #

How can i use k-fold validation with this one? Help is appreciated.

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_sequences(X_train)
X_test = tokenizer.texts_to_sequences(X_test)
vocab_size = len(tokenizer.word_index) + 1
def model():
    input_shape = (MAX_SEQUENCE_LENGTH,)
    model_input = Input(shape=input_shape, name="input", dtype='int32')
    embedding = glove_embd(model_input)
    lstm = LSTM(100, dropout=0.3, recurrent_dropout=0.3, name="lstm")(embedding)
    model_output = Dense(2, activation='softmax', name="softmax")(lstm)
    model = Model(inputs=model_input, outputs=model_output)
    return model
model = model()
model.compile(loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(X_train, Y_train, batch_size=1500, epochs=50, verbose=1, validation_data=(X_test, Y_test))
loss, accuracy = model.evaluate(X_train, Y_train, verbose=False)

• Jason Brownlee December 12, 2019 at 6:27 am #

You must use repeated walk-forward validation. k-fold cross validation is invalid for sequence data.

• Kamal Pandey December 13, 2019 at 2:46 am #

Thank you for this. I am working on sentiment analysis. Is walk-forward validation used for text classification tasks like sentiment analysis? I am new to this topic.

• Jason Brownlee December 13, 2019 at 6:04 am #

Not really needed. You can use cross-validation if the docs are independent.

83. Sruthy December 15, 2019 at 6:51 pm #

HI Jason,

Thank you for the information. It helped me a lot. I have a question. After k-fold validation, say 5-fold, we get an accuracy value for each of the 5 folds, right? After the whole 5-fold validation, if I need to do another test using outside data, we can call the model to evaluate, right? So I was wondering which trained network I am calling. Is it the one from the last fold, or an averaged model of all 5 folds? Which is the final network?

Kindly advise me on this.

Many Thanks,
Sruthy

• Jason Brownlee December 16, 2019 at 6:15 am #

You can calculate the mean value from cross validation to estimate the skill of a model.

To make a prediction on new data, you fit a new model on all available data and make a prediction.

84. Mina January 5, 2020 at 3:50 am #

Hi Jason,
When I use
scores = model.evaluate(X[test], Y[test], verbose=0), while metric that has been used in compile is ‘accuracy’, the result that I get is very different from when I compute accuracy for the predicted results. What is the problem?

• Jason Brownlee January 5, 2020 at 7:07 am #

Accuracy on the same data via evaluate and via manual calculation should be identical.

If not, check for a bug in your code.
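One bug that commonly produces this kind of mismatch is a shape mismatch in the manual calculation: a sigmoid output has shape (n, 1) while the labels often have shape (n,), and comparing the two broadcasts to an (n, n) matrix. The sketch below uses made-up probabilities and labels to show the pitfall; whether it is the cause in this particular case is an assumption.

```python
import numpy as np

probs = np.array([[0.9], [0.2], [0.7], [0.4]])  # shape (4, 1), like a sigmoid output
y_true = np.array([1, 0, 0, 0])                 # shape (4,)

# Pitfall: (4, 1) compared with (4,) broadcasts to a (4, 4) matrix of comparisons.
wrong = ((probs > 0.5) == y_true).mean()

# Flatten the predictions first so the comparison is element by element.
y_pred = (probs.ravel() > 0.5).astype(int)
accuracy = (y_pred == y_true).mean()

print("broadcast result: %.2f, correct accuracy: %.2f" % (wrong, accuracy))
```

Other things worth checking are that the same rows are passed to both calculations and that the decision threshold matches the one Keras uses for the metric.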

85. Matthew February 16, 2020 at 6:29 am #

Hey Jason!

Love the article, I’ve been looking for a Tensorflow CV solution for a long time and this explains it perfectly. I do have a question though. Not looking for you to correct my code, just maybe give your insight into what’s going on here.

I’m using almost an identical code to yours above for my dataset (I’ll show code below), but I’m running into an issue where after each training run of 200 epochs, the model gets closer and closer to 100% accuracy? I thought maybe my test dataset was remaining the same and the model was being fed a new training dataset fold after each iteration, but I checked and the test dataset does change each time, so I’m at a loss. My code and output looks as below:

Any ideas? Love the site! Thanks again!

• Jason Brownlee February 17, 2020 at 7:37 am #

Yes, you must re-define the model for each fold of the CV. Otherwise the model just continues learning from the last fold.

• Matthew February 17, 2020 at 4:45 pm #

Right on, thanks so much for taking the time! And keep up the great work! We really appreciate it.

86. AGGELOS PAPOUTSIS February 20, 2020 at 9:00 pm #

hi jason,

i am a little confused about your implementation. You say in the description that stratified k fold splits the training data in k folds. But then you use :

for train, test in kfold.split(X, Y)

So why do you pass the label Y?

• Jason Brownlee February 21, 2020 at 8:21 am #

So the folds are stratified by the class label.

87. aggelos February 20, 2020 at 9:02 pm #

hi jason,

Do you have any examples of LOO with an LSTM in Keras?

88. Mira March 14, 2020 at 8:21 am #

Hi, when evaluating my model, which accuracy and loss should I look at to get a good estimate of its prediction performance?

• Jason Brownlee March 14, 2020 at 9:53 am #

I recommend tuning model performance based on loss.

I recommend choosing a model based on out of sample performance using a metric that best captures your project goals.

89. Bevan Smith April 16, 2020 at 7:56 pm #

• Jason Brownlee April 17, 2020 at 6:18 am #

You’re welcome, I’m happy to hear that!

90. Manohar May 28, 2020 at 5:01 am #

Does placing

model = Sequential()

and recompiling not reset the model again?

• Jason Brownlee May 28, 2020 at 6:21 am #

It redefines the model.

• Manohar May 28, 2020 at 5:12 pm #

I thought we want to actually keep the same model with its weights etc. and retrain it.

91. MD MAHMUDUL HASAN June 26, 2020 at 9:39 pm #

Hi Jason, Very helpful tutorial. Are there any ways out there to calculate specificity in case of cross-validation?
Thanks.
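Specificity is the true negative rate, TN / (TN + FP), so it can be computed on the held-out predictions of each fold and then summarized like accuracy. This is a minimal sketch for binary labels with 0 as the negative class; the helper name `specificity` and the per-fold label lists are hypothetical.

```python
import numpy as np

def specificity(y_true, y_pred):
    """True negative rate for binary labels, with 0 as the negative class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tn / (tn + fp)

# Hypothetical (true labels, predicted labels) collected from two held-out folds.
fold_results = [
    ([0, 0, 0, 1, 1], [0, 0, 1, 1, 0]),
    ([0, 0, 1, 1, 1], [0, 0, 1, 0, 1]),
]
fold_specificity = [specificity(t, p) for t, p in fold_results]
print("%.3f (+/- %.3f)" % (np.mean(fold_specificity), np.std(fold_specificity)))
```

In a real run, the predicted labels would come from thresholding `model.predict(X[test])` inside the cross-validation loop.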

92. Ali July 26, 2020 at 7:57 pm #

Hi Jason
How can I use the above model to calculate the confusion matrix, F1-score, recall, and precision?

93. ali July 27, 2020 at 11:17 pm #

Thank you Jason, but I mean: how can I calculate the confusion matrix, etc. after 10-fold cross-validation?
I have two more questions.
The following is my deep model:
1. Every time I run it, I get a different accuracy. What should I do to keep the accuracy value constant?
2. The training and validation accuracy and loss differ a lot from each other. I think overfitting is happening; I even used dropout, but without success. I have 8000 samples. Is this because of a lack of data, or is there another reason? Is there any way to reduce overfitting?

.
.
.
# my input data is (8000, 100)

MAX_SEQUENCE_LENGTH = 100
MAX_NB_WORDS = 30000
EMBEDDING_DIM = 100
nb_words = max(MAX_NB_WORDS, len(tokenizer.word_index))
model = Sequential()

model.summary()

# Run LSTM Model
batch = 64
epoch = 500
LSTM_model = model.fit(X_train, Y_train, batch_size=batch, epochs=epoch, verbose=1, shuffle=True, validation_split=0.1)

test_loss, test_acc = model.evaluate(X_test, Y_test)
pred = model.predict(X_test)

Y_pred = []
for p in pred:
    if p > 0.5:
        Y_pred.append(1)
    else:
        Y_pred.append(0)
class_names = ['Negative', 'Positive']
print(classification_report(Y_test.to_list(), Y_pred, target_names=class_names))
cm = confusion_matrix(Y_test.to_list(), Y_pred)

94. Jim August 14, 2020 at 12:06 am #

Hi Jason. Thank you for your help on cross-validation in Keras. I would like to do stratified validation with an LSTM for text classification (sentiment analysis). Is it possible? If yes, can you send me a link to GitHub or something else? I can’t find a good tutorial on the Internet. Thank you.

95. Anurag Maji September 8, 2020 at 4:39 am #

Hi Jason,
Loved your work. I am currently working on multi-label classification for an audio tagging problem. To improve performance I am planning to apply k-fold, but is it possible to apply stratified k-fold to multi-label data easily? Or will k-fold help me get good results even without stratification?

• Jason Brownlee September 8, 2020 at 6:53 am #

Thanks!

Not sure stratified k-fold CV is aware of multi-label problems. It is designed for multi-class problems as far as I know.

Perhaps check the doco or test?

96. yongkai LIU December 20, 2020 at 3:00 pm #

where to put validate dataset in the cross validation?

• Jason Brownlee December 21, 2020 at 6:31 am #

Within each fold of cross-validation, you can split the training portion into train and validation and use the validation set to tune the model hyperparameters or early stopping for training.
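The split described in this reply can be sketched with index arrays as follows. The toy data, the 20% validation fraction, and the commented-out model calls are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

X = np.arange(100).reshape(-1, 1)  # toy features (assumption)
y = np.array([0, 1] * 50)          # balanced binary labels

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
sizes = []
for train, test in kfold.split(X, y):
    # Carve the validation set out of the training portion only,
    # so the test fold stays untouched until the final evaluation.
    fit_idx, val_idx = train_test_split(
        train, test_size=0.2, stratify=y[train], random_state=1)
    # model.fit(X[fit_idx], y[fit_idx], validation_data=(X[val_idx], y[val_idx]))
    # model.evaluate(X[test], y[test])
    sizes.append((len(fit_idx), len(val_idx), len(test)))
print(sizes)
```

The validation indices can then drive early stopping or hyperparameter choices, while the reported score still comes from the held-out test fold.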

97. Speedster February 11, 2021 at 4:21 am #

Hi!

I’m trying to fit a neural network onto a dataset (regression) and I want to know whether my steps are correct to get a good and unbiased model

1) Hyperparameter tuning based on 5-fold cross validation. I’m fitting on the training data and validate on my validation data while training. After training I’m calculating my metrics on the validation data. (I do not have a testing set)
I use Early Stopping on my loss (not validation loss) to save some time.

2) Train ALL of my data on the optimal model from the hyperparameter study.

I’m wondering if this approach is okay, since I do not have a testing set. On the other hand I’ve already found a model which does not overfit and generalizes well. So the additional data I have should only improve my model.

I am a little confused.

• Jason Brownlee February 11, 2021 at 5:57 am #

There is no idea of “objectively correct”. Choose a process that gives you confidence you have a robust model for your dataset.

98. Tom June 2, 2021 at 4:20 am #

Hi Jason,

Since for each epoch, we will have different validation accuracy what would be your suggestion to report the final result for publication.
For example, I used 70% of my data for training and 30% for validation.
Validation accuracy for 100 epochs started from 62% (epoch 1) and reached 91% (epoch 100).
How can I report a single number in my paper for my model accuracy?
Should I average the accuracy of 100 epochs?

Thank you.

• Jason Brownlee June 2, 2021 at 5:47 am #

You can use any evaluation procedure you want as long as it is clearly stated, consistent in comparison of methods, and reproducible.

99. NORAH October 19, 2021 at 3:43 am #

Hi,
why I get this error?

TypeError Traceback (most recent call last)
in
11 print(train, test)
12 # Fit the model
---> 13 model.fit(X[train], Y[train], epochs=20, batch_size=7, verbose=1)
14 # evaluate the model
15 scores = model.evaluate(X[test], Y[test], verbose=1)

TypeError: only integer scalar arrays can be converted to a scalar index
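One common cause of this message is that `X` or `Y` is a plain Python list rather than a NumPy array: a list cannot be indexed with the index arrays that `kfold.split()` returns. The sketch below reproduces the situation with a small made-up list and shows the conversion; whether this is the cause in the traceback above is an assumption.

```python
import numpy as np

X_list = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # plain Python list
train = np.array([0, 2])                        # index array, as kfold.split() yields

try:
    X_list[train]
    message = None
except TypeError as err:
    message = str(err)  # a list does not accept an array of indices
print(message)

# Converting to a NumPy array makes index-array selection work.
X = np.asarray(X_list)
X_train = X[train]
print(X_train)
```

Applying `np.asarray()` to both inputs before the cross-validation loop is usually enough.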

100. John January 24, 2022 at 5:25 am #

Hello Dr. Brownlee! Excellent tutorial…One question only…

When it runs this command:

print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

I get this error:

TypeError: 'float' object is not subscriptable

How can I fix that?
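This error typically means `scores` is a single number. In Keras, `evaluate()` returns a scalar loss when the model was compiled without any `metrics`, and a list only when metrics are present, so adding `metrics=['accuracy']` to `compile()` is the usual fix. The helper below is a hypothetical, defensive way to print either form; `describe_scores` is not part of Keras.

```python
import numpy as np

def describe_scores(scores, names):
    """Format evaluate() output whether it is a scalar or a list."""
    scores = np.atleast_1d(scores)  # a bare float becomes a 1-element array
    return ["%s: %.4f" % (name, value) for name, value in zip(names, scores)]

# Scalar case: model compiled with a loss only.
print(describe_scores(0.6931, ["loss"]))
# List case: model compiled with metrics=['accuracy'].
print(describe_scores([0.5, 0.75], ["loss", "accuracy"]))
```

In real code, `names` would be `model.metrics_names`.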

101. Josh March 17, 2022 at 2:32 pm #

Hello!

After reading your article, there is just one part I do not fully understand.
Suppose I have a TensorFlow program for a CNN trained with categorical cross-entropy, with the code

model.compile(loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(xTrain, yTrain, epochs=100)
loss, accuracy = model.evaluate(xTest, yTest)
print(accuracy)

Can I ask how “accuracy”, which is returned by evaluate(), is calculated?
I would like to know whether it is the fraction of correct predictions (total number of correct predictions / total number of samples).

However, I was not able to find how the accuracy from evaluate() in TensorFlow is calculated.
If my understanding is correct, could you point me to the TensorFlow documentation page I could refer to, so that I can be sure? Thank you!
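With one-hot labels and a categorical cross-entropy loss, Keras resolves `'accuracy'` to categorical accuracy: a prediction counts as correct when the index of the largest predicted probability matches the index of the 1 in the label. The NumPy sketch below reproduces that calculation on made-up values; it illustrates the definition and is not the library's own code.

```python
import numpy as np

# Hypothetical one-hot labels and softmax outputs for four samples.
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]])
y_prob = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.6, 0.3],
                   [0.3, 0.4, 0.3],
                   [0.2, 0.5, 0.3]])

# A sample is correct when the predicted class equals the labelled class.
correct = np.argmax(y_true, axis=1) == np.argmax(y_prob, axis=1)
accuracy = correct.mean()  # correct predictions / total samples
print(accuracy)
```

The Keras documentation for the `CategoricalAccuracy` metric describes the same calculation.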

102. Tarun July 12, 2022 at 11:59 pm #

How to find Train MAPE and Test MAPE as well as Train Median Absolute Error and Test Median Absolute Error in MLP / CNN / LSTM ?
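Both metrics depend only on the true and predicted values, so they can be computed the same way for an MLP, CNN, or LSTM by calling `model.predict()` on the train and test inputs separately. The following is a minimal NumPy sketch with made-up values; the function names are illustrative.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent. Undefined if y_true has zeros."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)

def median_absolute_error(y_true, y_pred):
    """Median of the absolute errors, in the units of the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.median(np.abs(y_true - y_pred)))

# Hypothetical targets and predictions; use model.predict(X_train) etc. in practice.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
print("MAPE: %.2f%%" % mape(y_true, y_pred))
print("MedAE: %.2f" % median_absolute_error(y_true, y_pred))
```

scikit-learn offers equivalents in `sklearn.metrics`: `median_absolute_error`, and `mean_absolute_percentage_error` in recent versions, which returns a fraction rather than a percentage.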