Evaluate the Performance of Deep Learning Models in Keras

Keras is an easy-to-use and powerful Python library for deep learning.

There are a lot of decisions to make when designing and configuring your deep learning models. Most of these decisions must be resolved empirically through trial and error and by evaluating them on real data.

As such, it is critically important to have a robust way to evaluate the performance of your neural networks and deep learning models.

In this post, you will discover a few ways to evaluate model performance using Keras.


Let’s get started.

  • May/2016: Original post
  • Update Oct/2016: Updated examples for Keras 1.1.0 and scikit-learn v0.18
  • Update Mar/2017: Updated example for Keras 2.0.2, TensorFlow 1.0.1 and Theano 0.9.0
  • Update Mar/2018: Added alternate link to download the dataset as the original appears to have been taken down
  • Update Jun/2022: Update to TensorFlow 2.x syntax

Empirically Evaluate Network Configurations

You must make a myriad of decisions when designing and configuring your deep learning models.

Many of these decisions can be resolved by copying the structure of other people’s networks and using heuristics. Ultimately, the best technique is to actually design small experiments and empirically evaluate problems using real data.

This includes high-level decisions like the number, size, and type of layers in your network. It also includes the lower-level decisions like the choice of the loss function, activation functions, optimization procedure, and the number of epochs.

Deep learning is often used on problems that have very large datasets, that is, tens or hundreds of thousands of instances.

As such, you need to have a robust test harness that allows you to estimate the performance of a given configuration on unseen data and reliably compare the performance to other configurations.

Data Splitting

The large amount of data and the complexity of the models require very long training times.

As such, it is typical to separate data into training and test datasets or training and validation datasets.

Keras provides two convenient ways of evaluating your deep learning algorithms this way:

  1. Use an automatic verification dataset
  2. Use a manual verification dataset

Use an Automatic Verification Dataset

Keras can separate a portion of your training data into a validation dataset and evaluate the performance of your model on that validation dataset in each epoch.

You can do this by setting the validation_split argument on the fit() function to the fraction of your training dataset to hold back, given as a value between 0 and 1.

For example, a reasonable value might be 0.2 or 0.33 for 20% or 33% of your training data held back for validation.

The example below demonstrates the use of an automatic validation dataset on a small binary classification problem. All examples in this post use the Pima Indians onset of diabetes dataset. You can download it from the UCI Machine Learning Repository and save the data file in your current working directory with the filename pima-indians-diabetes.csv (update: download from here).
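A minimal sketch of such an example follows. It assumes the TensorFlow 2.x Keras API and a small 12-8-1 fully connected network trained for 150 epochs with a batch size of 10; the function names split_xy, build_model and run are illustrative, not part of Keras.

```python
import numpy as np


def split_xy(dataset):
    """Split the 9-column array into 8 input columns and the class column."""
    return dataset[:, 0:8], dataset[:, 8]


def build_model(n_inputs=8):
    """Small fully connected binary classifier (illustrative architecture)."""
    # Imported here so the data helper above can be used without TensorFlow.
    from tensorflow.keras import Input
    from tensorflow.keras.layers import Dense
    from tensorflow.keras.models import Sequential

    model = Sequential([
        Input(shape=(n_inputs,)),
        Dense(12, activation="relu"),
        Dense(8, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam",
                  metrics=["accuracy"])
    return model


def run(path="pima-indians-diabetes.csv"):
    """Train with 33% of the rows held back by Keras for validation."""
    dataset = np.loadtxt(path, delimiter=",")
    X, y = split_xy(dataset)
    model = build_model()
    # Keras reports training and validation loss/accuracy after every epoch.
    return model.fit(X, y, validation_split=0.33, epochs=150, batch_size=10)
```

Calling run() with the CSV file in the working directory returns the Keras History object, whose history attribute holds the per-epoch metrics.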

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example, you can see that the verbose output on each epoch shows the loss and accuracy on both the training dataset and the validation dataset.

Use a Manual Verification Dataset

Keras also allows you to manually specify the dataset to use for validation during training.

In this example, you can use the handy train_test_split() function from the Python scikit-learn machine learning library to separate your data into a training and test dataset. Use 67% for training and the remaining 33% of the data for validation.

The validation dataset can be specified to the fit() function in Keras by the validation_data argument. It takes a tuple of the input and output datasets.
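The split and the fit() call described above might be sketched as follows. The helper name fit_with_manual_validation and its default seed, epochs and batch size are assumptions for illustration; model stands for any compiled Keras model.

```python
from sklearn.model_selection import train_test_split


def fit_with_manual_validation(model, X, y, seed=7, epochs=150, batch_size=10):
    """Hold back 33% of the rows and pass them to fit() as validation_data."""
    X_train, X_val, y_train, y_val = train_test_split(
        X, y, test_size=0.33, random_state=seed)
    # validation_data takes a tuple of (inputs, outputs).
    return model.fit(X_train, y_train,
                     validation_data=(X_val, y_val),
                     epochs=epochs, batch_size=batch_size)
```

Fixing random_state makes the split repeatable, so different network configurations are compared on the same held-out rows.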

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Like before, running the example provides a verbose output of training that includes the loss and accuracy of the model on both the training and validation datasets for each epoch.

Manual k-Fold Cross Validation

The gold standard for machine learning model evaluation is k-fold cross validation.

It provides a robust estimate of the performance of a model on unseen data. It does this by splitting the training dataset into k subsets, taking turns training models on all subsets except one, which is held out, and evaluating model performance on the held-out validation dataset. The process is repeated until all subsets are given an opportunity to be the held-out validation set. The performance measure is then averaged across all models that are created.

It is important to understand that cross validation evaluates a model design (e.g., a 3-layer vs. a 4-layer neural network) rather than a specific fitted model. You do not want to fit competing designs on a single dataset and compare the results, since any difference may simply reflect that particular dataset suiting one design better. Instead, you fit each design on multiple datasets, producing multiple fitted models of the same design, and compare the average performance measure.

Cross validation is often not used for evaluating deep learning models because of the greater computational expense. For example, k-fold cross validation is often used with 5 or 10 folds. As such, 5 or 10 models must be constructed and evaluated, significantly adding to the evaluation time of a model.

Nevertheless, when the problem is small enough or if you have sufficient computing resources, k-fold cross validation can give you a less-biased estimate of the performance of your model.

In the example below, you will use the handy StratifiedKFold class from the scikit-learn Python machine learning library to split the training dataset into 10 folds. The folds are stratified, meaning that the algorithm attempts to balance the number of instances of each class in each fold.
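To illustrate what stratification means in practice, here is a small sketch on toy labels. The 30/10 class mix is an assumption chosen for illustration, not the diabetes data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (an assumption for illustration): 30 negatives, 10 positives.
y = np.array([0] * 30 + [1] * 10)
X = np.zeros((len(y), 1))  # feature values do not affect how folds are drawn

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
# Fraction of positive labels in each held-out fold.
fold_positive_rates = [y[test].mean() for _, test in kfold.split(X, y)]
print(fold_positive_rates)
```

Each held-out fold should show the same positive rate as the full label array, which is what keeps per-fold accuracy figures comparable.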

The example creates and evaluates 10 models using the 10 splits of the data and collects all the scores. The verbose output for each epoch is turned off by passing verbose=0 to the fit() and evaluate() functions on the model.

The performance is printed for each model, and it is stored. The average and standard deviation of the model performance are then printed at the end of the run to provide a robust estimate of model accuracy.
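The procedure described above can be sketched as follows. The function name cross_validate and the build_fn argument (any callable that returns a freshly compiled Keras model with metrics=["accuracy"]) are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def cross_validate(build_fn, X, y, n_splits=10, seed=7, epochs=150, batch_size=10):
    """Train and score one fresh model per fold; return per-fold accuracies (%)."""
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    cvscores = []
    for train, test in kfold.split(X, y):
        model = build_fn()  # new, untrained model for every fold
        model.fit(X[train], y[train], epochs=epochs,
                  batch_size=batch_size, verbose=0)
        # evaluate() returns [loss, accuracy] for a model compiled with accuracy.
        scores = model.evaluate(X[test], y[test], verbose=0)
        print("fold accuracy: %.2f%%" % (scores[1] * 100))
        cvscores.append(scores[1] * 100)
    print("%.2f%% (+/- %.2f%%)" % (np.mean(cvscores), np.std(cvscores)))
    return cvscores
```

Building the model inside the loop keeps the folds independent: no weights learned on one fold carry over to the next.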

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Running the example takes less than a minute and prints the accuracy for each fold, followed by the mean and standard deviation across all folds.


In this post, you discovered the importance of having a robust way to estimate the performance of your deep learning models on unseen data.

You discovered three ways that you can estimate the performance of your deep learning models in Python using the Keras library:

  • Use Automatic Verification Datasets
  • Use Manual Verification Datasets
  • Use Manual k-Fold Cross Validation

Do you have any questions about deep learning with Keras or this post? Ask your question in the comments, and I will do my best to answer it.

272 Responses to Evaluate the Performance of Deep Learning Models in Keras

  1. Avatar
    DR Venugopala Rao Manneni July 19, 2016 at 2:14 am #

    how to print network diagram

  2. Avatar
    Hendrik August 30, 2016 at 9:42 pm #

    Could you explain how can one use different evaluation metric (F1-score or even custom one) for evaluation?

    • Avatar
      Jason Brownlee August 31, 2016 at 9:46 am #

      Hi Hendrik, you can use a suite of objectives with Keras models, here’s a list:

      • Avatar
        Hendrik September 1, 2016 at 5:59 pm #

        Thanks for the reply, but I don’t mean the “optimizer” parameter but the “metrics” at compilation, which currently can only be “accuracy”. I’d like to change it to another evaluation metric (F1-score, for instance, or AUC).

    • Avatar
      Rasika Karle February 5, 2017 at 1:17 am #

      Hey Hendrik, did you get the solution to how to use a different evaluation metric in Keras?
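One way to get metrics such as F1 or AUC, offered here as a sketch rather than as the author's answer, is to score the model's predicted probabilities with scikit-learn after training. The function name evaluate_extra_metrics and the 0.5 decision threshold are assumptions; y_prob would typically come from model.predict() on held-out inputs.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score


def evaluate_extra_metrics(y_true, y_prob, threshold=0.5):
    """Compute F1 and ROC AUC from predicted probabilities of the positive class."""
    y_prob = np.asarray(y_prob).ravel()  # flatten (n, 1) sigmoid output to (n,)
    y_pred = (y_prob >= threshold).astype(int)  # hard labels for F1
    return {
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),  # AUC uses the raw probabilities
    }
```

Inside a cross-validation loop, this could be called on each held-out fold and the results averaged in the same way as accuracy.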

  3. Avatar
    shixudong September 19, 2016 at 6:28 pm #

    Could you give some instruction on how to train a deep model if X, y are so large that they cannot fit into memory?

    model.fit(X_train, y_train, validation_data=(X_test,y_test), nb_epoch=150, batch_size=10)

    • Avatar
      Jason Brownlee September 20, 2016 at 8:31 am #

      Great question shixudong.

      Keras has a data generator for image data that does not fit into memory:

      The same approach could be used for tabular data:
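For tabular data, a plain Python generator that reads the CSV file in batches is one possible sketch of this idea. The function name batch_generator and the assumed column layout (n_inputs feature columns followed by one label column) are illustrative assumptions.

```python
import csv

import numpy as np


def batch_generator(path, batch_size=10, n_inputs=8):
    """Yield (X, y) batches forever, holding only one batch in memory at a time."""
    while True:  # restart from the top of the file after each full pass
        with open(path, newline="") as f:
            rows = []
            for row in csv.reader(f):
                if not row:
                    continue  # skip blank lines
                rows.append([float(v) for v in row])
                if len(rows) == batch_size:
                    batch = np.asarray(rows)
                    yield batch[:, :n_inputs], batch[:, n_inputs]
                    rows = []
            if rows:  # final, smaller batch of the pass
                batch = np.asarray(rows)
                yield batch[:, :n_inputs], batch[:, n_inputs]
```

Because the generator never ends, fit() would need to be told how many batches make up one epoch, for example model.fit(batch_generator(path), steps_per_epoch=steps, epochs=150), where steps is the number of rows divided by the batch size, rounded up.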

      • Avatar
        Toqi Tahamid March 5, 2017 at 10:33 pm #

        My dataset are in a data folder like this structure–




        1. How do I know the y_true value of the dataset from ImageDataGenerator, if I use the function flow_from_directory?

        2. How do I use k-fold cross validation using the fit_generator function in Keras?

        • Avatar
          Jason Brownlee March 6, 2017 at 10:58 am #

          Sorry, I don’t have examples of using the ImageDataGenerator other than for image augmentation.

          • Avatar
            shuang April 21, 2021 at 11:05 pm #

            Hi,Do you have to update it now?

  4. Avatar
    andrew jeremy September 22, 2016 at 4:41 am #

    How can I use k-fold cross validation, the gold standard for evaluating machine learning, with the fit function in Keras?

    Also, how can I get an history of accuracy and loss with the cross_val_score module for plotting ?

  5. Avatar
    Watterson October 27, 2016 at 12:16 am #

    Hey Jason, thanks for the great tutorials!
    I wanted to do a CV but read out the accuracy for each fold not only for the training but also for the test data. Would this ansatz be right:

    kfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    for train, test in kfold.split(X, Y):
        model = Sequential()
        model.add(Dense(n, input_dim=dim, init='uniform', activation='sigmoid'))
        model.add(Dense(1, init='uniform', activation='sigmoid'))
        model.compile(loss='mse', optimizer='adam', metrics=['accuracy'])
        asd = model.fit(X[train], Y[train], nb_epoch=epoch, validation_data=(X[test], Y[test]), batch_size=10, verbose=1)
        cv_acc_train = asd.history['acc']
        cv_acc_test = asd.history['val_acc']

    • Avatar
      Jason Brownlee October 27, 2016 at 7:46 am #

      Looks good to me off the cuff Watterson.

      • Avatar
        Seun January 15, 2017 at 3:15 am #

        asd = model.fit(X[train], Y[train], nb_epoch=epoch, validation_data=(X[test], Y[test])

        Please, is the validation_data not supposed to be validation_data=(Y[test], Y[test])? Also, can I use categorical_crossentropy when my activation is softmax? Thanks so much.

        • Avatar
          Jason Brownlee January 15, 2017 at 5:30 am #

          Hi Seun, the validation_data must include X and y components.

          Yes, I think you can use logloss with softmax, try and see.

  6. Avatar
    Jonas December 18, 2016 at 2:07 am #

    Hi Jason,

    When you use the “automatic verification dataset”, the val_loss is lower than the loss.
    “768/768 [==============================] – 0s – loss: 0.4593 – acc: 0.7839 – val_loss: 0.4177 – val_acc: 0.8386”

    How can it be possible?

    Thank you very much for your help and your work!

    • Avatar
      Jason Brownlee December 18, 2016 at 5:32 am #

      Sorry, I don’t understand your question, perhaps you could be more specific?

      • Avatar
        Jonas December 18, 2016 at 8:06 am #

        I understood from my previous lectures that a model is fitting well when the validation error is low and slightly higher than the training error.

        But in your first example (Automatic Verification Datasets), the validation error is lower than the training error. I can’t figure how the model can perform better on the validation set rather than on the training set.

        Does it mean that the validation split isn’t randomly defined?

        • Avatar
          Jason Brownlee December 19, 2016 at 5:29 am #

          Great question Jonas,

          It might be a statistical fluke and a sign of an unstable model.

          It might also be a sign of a poor split of the data, and a sign that a strategy with repeated splits might be warranted.

    • Avatar
      Sachin singh October 5, 2020 at 5:56 pm #

      Why are we using a batch size of 10? I mean, how does it affect the model performance?

  7. Avatar
    Roger January 11, 2017 at 3:45 pm #

    Hi, I am no expert. But it looks like you are training on the binary labels:

    Y = dataset[:,8] => Labels exist in column 8 correct?
    X = dataset[:,0:8] => This includes column 8, i.e. labels

    I could be wrong, haven’t looked at the data set.

    • Avatar
      Jason Brownlee January 12, 2017 at 9:25 am #

      No, I believe the code is correct (at least in Python 2.7).

      I’m happy to hear if you get different results on your system.
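A quick check with a stand-in array shows why the slicing is safe: the end index of a slice is exclusive, so 0:8 stops at column 7. The small array below is an assumption used only for illustration.

```python
import numpy as np

# Stand-in for the loaded CSV (an assumption): 3 rows, 9 columns, values 0..26.
dataset = np.arange(27).reshape(3, 9)

X = dataset[:, 0:8]  # columns 0 through 7; the end index 8 is excluded
Y = dataset[:, 8]    # column 8 only, the label

print(X.shape)  # (3, 8)
print(Y)        # [ 8 17 26]
```

None of the label values appear in X, so the network is not trained on its own targets.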

  8. Avatar
    wqp89324 March 2, 2017 at 4:58 am #

    Hi, Jason, for a simple feedforward MLP, are there any intuitive criteria for choosing between Keras and Sklearn?


    • Avatar
      Jason Brownlee March 2, 2017 at 8:23 am #

      Speed. Keras will be faster given it is using optimized symbolic math libs as a backend on CPU or GPUs, whereas sklearn is limited to linear algebra libs on the CPU.

  9. Avatar
    pattijane March 5, 2017 at 7:57 am #


    Thanks a lot for your tutorials, they are great!

    This might be a bit trivial but I’d like to ask the difference between when we used a validation split in “model.fit” and we didn’t. And, for instance instead of using separate train/validation/test sets, will using train/test sets with validation split be enough?

    Thanks a lot!

    • Avatar
      Jason Brownlee March 6, 2017 at 10:57 am #

      It depends on your problem.

      If you can spare the data, it is a good idea to hold back a validation set for final checking. If you can afford the time, use k-fold cross-validation with multiple repeats to eval your model. We often don’t have the time, so we use train/test splits.

  10. Avatar
    Jason March 17, 2017 at 7:39 am #

    Hi Jason,

    Very nice example, I enjoyed reading your blog.

    I got one question, how to decide the number of epoch and the batch size?


  11. Avatar
    Carolyn March 21, 2017 at 8:36 am #

    Hi Jason,

    This is a great post!

    I’m having trouble combining categorical cross-entropy and StratifiedKFold.

    StratifiedKFold assumes the response is a (number,) shape, according to:

    But as you’ve explained before, Keras’s categorical cross-entropy expects the response to be a one-hot matrix. How can I resolve this?

    Thank you!

    • Avatar
      Jason Brownlee March 21, 2017 at 8:47 am #

      You might have to move away from cross validation and rely on repeated random train/test sets.

      Alternatively, you could try pre-computing the folds manually or using a modified version of the dataset, then running the cross-validation manually.

      • Avatar
        Moji September 19, 2017 at 7:13 am #

        I guess you can do the following:
        change :
        and then using : loss=’sparse_categorical_crossentropy’

        it works but would that be correct way?

        • Avatar
          Jason Brownlee September 19, 2017 at 7:51 am #

          It is a way. I recommend evaluating many approaches and see what works best for your data.

  12. Avatar
    Luis April 10, 2017 at 7:13 am #

    Hi, Jason.

    your blog is great!

    I’m new to Keras. In your last example you build and compile the Keras model inside each for iteration (I show your code below). I don’t know if it is possible, but it looks like it would be more efficient to build and compile the model one time (outside the for loop) and to fit it with the right data each time inside the loop. Isn’t this possible in Keras?

    for train, test in kfold.split(X, Y):
        # create model
        model = Sequential()
        model.add(Dense(12, input_dim=8, activation='relu'))
        model.add(Dense(8, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    • Avatar
      Jason Brownlee April 10, 2017 at 7:42 am #

      Yes, but you may need to re-initialize the weights.

      I am demonstrating complete independence of the model within each loop.

  13. Avatar
    Manasi April 13, 2017 at 11:23 pm #

    I am trying the cross-validation code with an LSTM and getting the following error:
    Found array with dim 3. Estimator expected <= 2.

    My code is as follows:

    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    cvscores = []
    for train, test in kfold.split(task_features_padded, task_label_padded):
        # create model
        model.add(LSTM(50,input_shape=(max_seq,number_of_features),return_sequences = 1, activation = 'relu'))



        model.fit(task_features_padded[train], task_label_padded[train],batch_size=16,nb_epoch=1000,validation_data=(X_test,y_test),verbose=2)
        #model.fit(X_train, y_train,batch_size=16,nb_epoch=1000,validation_data=(X_test,y_test),verbose=2)
        #model.fit(X_train, y_train,batch_size=1,nb_epoch=1000,validation_data=(X_test,y_test),verbose=2)

        score, acc = model.evaluate(task_features_padded[test], task_label_padded[test], batch_size=16,verbose=0)
        #score, acc = model.evaluate(X_test, y_test, batch_size=16,verbose=0)
        #score, acc = model.evaluate(X_test, y_test, batch_size=1,verbose=0)
        print('Test score:', score)
        print('Test accuracy:', acc)

        scores = model.evaluate(X[test], Y[test], verbose=0)
        print("%s: %.2f%%" % (model.metrics_names[1], acc*100))
        cvscores.append(acc * 100)

    the shape of task_features_padded is (876, 6, 11)
    and shape of task_label_padded is (876, 6, 1)

  14. Avatar
    Arno April 17, 2017 at 3:53 pm #

    Hi Jason,
    In the last part of this article, you are training 10 different models instead of training one 10 times on each fold.
    In other cases, how can I select the best model out of the 10 trained ? Is it a good practice in Machine Learning to do so ?

  15. Avatar
    Nirmala May 15, 2017 at 4:01 pm #

    Hi sir,

    When i try to run the code after building the layers i am facing this error

    FileNotFoundError: [WinError 3] The system cannot find the path specified: ‘C:/deeplearning/openblas-0.2.14-int32/bin’

    I have changed theano flag path using this

    variable = THEANO_FLAGS value = floatX=float32,device=cpu,blas.ldflags=-LC:\openblas -lopenblas

    but still i am facing the same problem…

    Thank you!!

    • Avatar
      Jason Brownlee May 16, 2017 at 8:37 am #

      Sorry, I have not seen this error.

      Consider posting it as a question to stackoverflow or the theano users group.

  16. Avatar
    Connie May 15, 2017 at 5:58 pm #

    Hi Jason,
    How to give prediction score(not prediction label or prediction probability) of each test instance instead of evaluate result on whole test set?

    • Avatar
      Jason Brownlee May 16, 2017 at 8:38 am #

      You can predict probabilities with:

  17. Avatar
    José June 19, 2017 at 10:07 am #

    Hi Jason, thanks for the post. Your blog is great to learn Machine Learning with Python, I am very very grateful for you are sharing this with us. Thanks and great job! Muito obrigado, você é muito generoso em compartilhar seu conhecimento.

  18. Avatar
    Aymane June 30, 2017 at 1:16 am #

    Hey Jason, Thank you for your many posts and responses.
    I have tried to follow this tutorial to train and evaluate a multioutput (3) regression deep network using the keras’ Model class API, and here is my code:

    # Define cross validation scheme
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    cvscores = []
    for train, test in kfold.split(X_training, [Y_training10,Y_training20,Y_training30]):
        # Create model
        # Compile model
        model.compile(loss='mean_squared_error', optimizer='rmsprop',metrics='mean_absolute_error')
        # Fit model
        model.fit(X_training[train],[Y_training10[train],Y_training20[train],Y_training30[train]] ,epochs=100, batch_size=10, verbose=1)
        # Evaluate model
        scores=model.evaluate(X_training[test],[Y_training10[test],Y_training20[test],Y_training30[test]], verbose=1, sample_weight=None)
        print("%s: %.2f%% (MSE)" % (model.metrics_names[1], scores[1]))
        print("%s: %.2f%% (MSE)" % (model.metrics_names[2], scores[2]))
        print("%s: %.2f%% (MSE)" % (model.metrics_names[3], scores[3]))
    print("%.2f MSE of training" % (numpy.mean(cvscores,axis=0)))

    But unfortunately I get this error :
    Found input variables with inconsistent number of samples: [15000,3].

    I have a [15000sample x 72predictor] as X_training and [15000samples x 3 outputs] as [Y_training10, Y_training20, Y_training30].

    Any sort of help would be appreciated.

    • Avatar
      Jason Brownlee June 30, 2017 at 8:13 am #

      You have to make sure your input and output data match the shape of your network and that train and test have the same number of features.

  19. Avatar
    Ashley August 8, 2017 at 6:20 am #

    Hi, Thank you for an awesome relevant post. I have a question about implementing training\test\validation 80\10\10. Which is sort of outlined in this question: https://stats.stackexchange.com/questions/19048/what-is-the-difference-between-test-set-and-validation-set – how would I further split my test set?

    • Avatar
      Jason Brownlee August 8, 2017 at 7:52 am #

      I’m glad it helped Ashley.

      Why would you further split your data?

      • Avatar
        Ashley August 9, 2017 at 12:55 am #

        It’s something from the ML course on Udacity. They use a validation set along side test set. It’s outlined in this lecture: https://classroom.udacity.com/courses/ud730/lessons/6370362152/concepts/63798118330923 Lesson 1 #22. They explain it as your test data bleeding into your training data over time, and this biases the training – so to further split your data, train on train data, validate on validation set and only at the very end test on test data.

        • Avatar
          Ashley August 9, 2017 at 12:57 am #

          This link gets you to #22 https://classroom.udacity.com/courses/ud730/lessons/6370362152/concepts/63798118300923

        • Avatar
          Ashley August 9, 2017 at 1:08 am #

          As I say this – I realise that one way to do it is to train and validate and save the model, then load model and test on the test set.

          • Avatar
            Jason Brownlee August 9, 2017 at 6:36 am #

            Sure, what is your question exactly?

          • Avatar
            Ashley August 9, 2017 at 8:07 am #

            Initially, it was: how do I take in a validation set and a test set and have two different testing outputs after the final run? I.e., how do I set up my model to have a split test/validation set (so have training, validation, test all in one session)?
            Now I am just assuming that I should train (fit) on my data, optimise my net according to my validation data (evaluate) and save the model. Then reload it and evaluate again but on test data that it has never seen before – which I am hoping will help me with the small data set I have.

          • Avatar
            Jason Brownlee August 10, 2017 at 6:37 am #

            You can do that, sounds fine.

  20. Avatar
    Michael August 17, 2017 at 1:15 am #

    Hey, Jason. Great website! I like how you’ve laid out your posts and how you explain the concepts. I haven’t been able to figure out the final result from using k-fold. Let’s assume I execute a k-fold just like you’ve done in your example. Then, I use the code from another of your great posts to save the model (to JSON) and save the weights. Which of the 10 models (created during the k-fold loop) will be saved? The last of the 10?
    I want to have a saved model that I can use on new datasets in the future. How does creating 10 models with k-fold help me get a better model than using an automatic validation split (as described in this post)?

    Thank you!

    • Avatar
      Jason Brownlee August 17, 2017 at 6:45 am #

      CV gives a less biased estimate of the model’s skill on unseen data than a train/test split, at least in general with smallish datasets (less than millions of obs).

      Once you have tuned your model, throw all the trained models away and train a final model with all your data and start using it to make predictions.

      See this post:

      Some models are very expensive to train, in which case don’t use CV and keep the best models you train, use them in an ensemble as final models.

      Does that help?

      • Avatar
        Michael August 17, 2017 at 2:15 pm #

        Yes, that post was exactly what I needed. Thank you!

  21. Avatar
    Minkyu Ha August 21, 2017 at 5:34 pm #

    Hello, Jason
    I’m one of your ebook readers 🙂

    Let me ask about a strange behavior in my case.

    When I train & validate my model with dataset..like

    model.fit( train_x, train_y, validation_split=0.1, epochs=15, batch_size=100)

    it seems to be overfitting, judging by the gap between acc and val_acc.

    67473/67473 [==============================] – 27s – loss: 2.9052 – acc: 0.6370 – val_loss: 6.0345 – val_acc: 0.3758
    Epoch 2/15
    67473/67473 [==============================] – 26s – loss: 1.7335 – acc: 0.7947 – val_loss: 6.2073 – val_acc: 0.3788
    Epoch 3/15
    67473/67473 [==============================] – 26s – loss: 1.5050 – acc: 0.8207 – val_loss: 6.1922 – val_acc: 0.3952
    Epoch 4/15
    67473/67473 [==============================] – 26s – loss: 1.4130 – acc: 0.8380 – val_loss: 6.2896 – val_acc: 0.4092
    Epoch 5/15
    67473/67473 [==============================] – 26s – loss: 1.3750 – acc: 0.8457 – val_loss: 6.3136 – val_acc: 0.3953
    Epoch 6/15
    67473/67473 [==============================] – 26s – loss: 1.3350 – acc: 0.8573 – val_loss: 6.4355 – val_acc: 0.4098
    Epoch 7/15
    67473/67473 [==============================] – 26s – loss: 1.3045 – acc: 0.8644 – val_loss: 6.3992 – val_acc: 0.4018
    Epoch 8/15
    67473/67473 [==============================] – 26s – loss: 1.2687 – acc: 0.8710 – val_loss: 6.5578 – val_acc: 0.3897
    Epoch 9/15
    67473/67473 [==============================] – 26s – loss: 1.2552 – acc: 0.8745 – val_loss: 6.4178 – val_acc: 0.4104
    Epoch 10/15
    67473/67473 [==============================] – 26s – loss: 1.2195 – acc: 0.8796 – val_loss: 6.5593 – val_acc: 0.4044
    Epoch 11/15
    67473/67473 [==============================] – 26s – loss: 1.1977 – acc: 0.8833 – val_loss: 6.5514 – val_acc: 0.4041
    Epoch 12/15
    67473/67473 [==============================] – 26s – loss: 1.1828 – acc: 0.8874 – val_loss: 6.5972 – val_acc: 0.3973
    Epoch 13/15
    67473/67473 [==============================] – 26s – loss: 1.1665 – acc: 0.8890 – val_loss: 6.5879 – val_acc: 0.3882
    Epoch 14/15
    67473/67473 [==============================] – 26s – loss: 1.1466 – acc: 0.8931 – val_loss: 6.5610 – val_acc: 0.4104
    Epoch 15/15
    67473/67473 [==============================] – 27s – loss: 1.1394 – acc: 0.8925 – val_loss: 6.5062 – val_acc: 0.4100

    but , when I split the data into train & test with StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
    and evaluate it with test data, it shows good result of accuracy.
    score = model.evaluate(test_x, test_y)
    [1.3547255955601791, 0.82816451482507525]

    so.. I tried to validate with test dataset which was split with StratifiedKFold.
    model.fit( train_x, train_y, validation_data=(test_x, test_y), epochs=15, batch_size=100)

    and it shows good result of val_acc.

    Epoch 1/14
    67458/67458 [==============================] – 27s – loss: 2.9200 – acc: 0.6006 – val_loss: 1.9954 – val_acc: 0.7508
    Epoch 2/14
    67458/67458 [==============================] – 26s – loss: 1.8138 – acc: 0.7536 – val_loss: 1.6458 – val_acc: 0.7844
    Epoch 3/14
    67458/67458 [==============================] – 26s – loss: 1.5869 – acc: 0.7852 – val_loss: 1.5848 – val_acc: 0.7876
    Epoch 4/14
    67458/67458 [==============================] – 25s – loss: 1.4980 – acc: 0.8056 – val_loss: 1.5353 – val_acc: 0.8015
    Epoch 5/14
    67458/67458 [==============================] – 25s – loss: 1.4375 – acc: 0.8202 – val_loss: 1.4870 – val_acc: 0.8117
    Epoch 6/14
    67458/67458 [==============================] – 25s – loss: 1.3795 – acc: 0.8324 – val_loss: 1.4738 – val_acc: 0.8139
    Epoch 7/14
    67458/67458 [==============================] – 26s – loss: 1.3437 – acc: 0.8400 – val_loss: 1.4677 – val_acc: 0.8146
    Epoch 8/14
    67458/67458 [==============================] – 26s – loss: 1.3059 – acc: 0.8462 – val_loss: 1.4127 – val_acc: 0.8263
    Epoch 9/14
    67458/67458 [==============================] – 26s – loss: 1.2758 – acc: 0.8533 – val_loss: 1.4087 – val_acc: 0.8219
    Epoch 10/14
    67458/67458 [==============================] – 25s – loss: 1.2381 – acc: 0.8602 – val_loss: 1.4095 – val_acc: 0.8242
    Epoch 11/14
    67458/67458 [==============================] – 26s – loss: 1.2188 – acc: 0.8644 – val_loss: 1.3960 – val_acc: 0.8272
    Epoch 12/14
    67458/67458 [==============================] – 25s – loss: 1.1991 – acc: 0.8677 – val_loss: 1.3898 – val_acc: 0.8226
    Epoch 13/14
    67458/67458 [==============================] – 25s – loss: 1.1671 – acc: 0.8733 – val_loss: 1.3370 – val_acc: 0.8380
    Epoch 14/14
    67458/67458 [==============================] – 25s – loss: 1.1506 – acc: 0.8750 – val_loss: 1.3363 – val_acc: 0.8315

    Do you have any idea why the results of the automatic validation_split and of validation with the test dataset differ?

    Thanks in advance.

    • Avatar
      Jason Brownlee August 22, 2017 at 6:37 am #

      There is no need to validate when using cross validation. The model is doing twice the work.

      Perhaps the stratification of the data sample is important to your model?

      Perhaps the model performs better on the smaller sample (e.g. 1/10th of data if 10-folds).

    • Avatar
      tianyu zhou August 9, 2018 at 1:44 am #

      I am having the same problem here. When you set shuffle=False for k-fold CV, you will get low accuracy as well. The automatic validation_split doesn’t shuffle the validation data.
      you can try:
      StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
      model.fit( train_x, train_y, validation_split=0.1, epochs=15, verbose=1 ,batch_size=100)
      score = model.evaluate(test_x, test_y)

      you will see val_acc is very low and final score is good.
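The Keras documentation states that validation_split takes the validation data from the last samples of the arrays, before any shuffling. The sketch below approximates that behavior with a hypothetical tail_split helper and shows one way around it: shuffling the rows yourself before calling fit(). Both function names, and the exact rounding in tail_split, are assumptions for illustration.

```python
import numpy as np


def tail_split(X, y, validation_split):
    """Approximate validation_split: hold back the LAST rows, without shuffling."""
    split_at = int(len(X) * (1.0 - validation_split))
    return (X[:split_at], y[:split_at]), (X[split_at:], y[split_at:])


def shuffle_together(X, y, seed=7):
    """Apply the same random row order to inputs and labels."""
    order = np.random.default_rng(seed).permutation(len(X))
    return X[order], y[order]
```

If the file is sorted by class or by time, the held-back tail can look very different from the rest of the data, which would explain a low val_acc next to a good score on a shuffled test fold. Calling shuffle_together(X, y) before model.fit(..., validation_split=0.1) is one way to check for this.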

  22. Avatar
    Macarena September 6, 2017 at 8:18 pm #

    First of all congrats for these tutorials, they are great!
    I’m trying to use StratifiedKFold validation in a multiple inputs network (3 inputs) for a regression problem, and I’m having several problems when using it. First of all, in the step:
    “for train,test in kfold.split()” I’m introducing just one of the inputs and the labels structure, this way: “for train, test in kfold.split(X1, Y):”, and then inside the loop I define “X1_train = X1[train], X2_train = X2[train], X3_train=X3[train]” and so on. This way, when fitting my model I use “model.fit([X1_train, X2_train, X3_train], Y_train….)”. But I’m getting the error “n_splits=10 cannot be greater than the number of members in each class”, and I don’t know how to fix it.

    I have also tried the option you give in this tutorial: https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/
    But in this case the error I get is “Found input variables with inconsistent numbers of samples”.

    I don’t know how I can implement this; I would appreciate any help. Thanks.

    • Avatar
      Jason Brownlee September 7, 2017 at 12:53 pm #

      All rows must have the same number of columns, if that helps?

      • Avatar
        Macarena September 7, 2017 at 9:40 pm #

        X1, X2 and X3 have shape (nb_samples, 2, 8, 10), while Y has shape (nb_samples, 4). I don’t know if it is not able to recognize that the common axis is nb_samples (although I read in the documentation that it takes by default the first axis).
        I have resolved it by creating a structure of zeros with just one dimension: X_split = np.zeros((X1.shape[0])) and Y_split = np.zeros((Y.shape[0])), and I use those two arrays to create the for loop. But I don’t know why I cannot do it the other way.

        • Avatar
          Despoina February 15, 2019 at 8:03 am #

          Hello, I am facing the same issue. Could you please provide the code example?
          Are you creating X_split = np.zeros((X1.shape[0])) for X2 and X3 and Y_split before
          kfold.split(X1, Y)?

          Thank you
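
The workaround discussed in this thread amounts to splitting on row indices and then applying the same indices to every input. Below is a minimal sketch of that idea, assuming random stand-in arrays with the shapes given above; `kfold_indices` is a hypothetical helper written with NumPy only. It is a plain (unstratified) k-fold, which is the usual choice for regression because continuous targets have no classes to stratify on; scikit-learn's KFold class provides the same kind of split.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed=7):
    """Yield (train, test) index arrays for a shuffled, unstratified k-fold split."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for i in range(n_splits):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

# Hypothetical stand-ins with the shapes described in the comment above.
nb_samples = 20
rng = np.random.default_rng(0)
X1, X2, X3 = (rng.normal(size=(nb_samples, 2, 8, 10)) for _ in range(3))
Y = rng.normal(size=(nb_samples, 4))

fold_shapes = []
seen_test = []
for train, test in kfold_indices(nb_samples, n_splits=5):
    # The same row indices select matching samples from every input and the target.
    train_inputs = [X1[train], X2[train], X3[train]]
    test_inputs = [X1[test], X2[test], X3[test]]
    Y_train, Y_test = Y[train], Y[test]
    fold_shapes.append((train_inputs[0].shape, Y_test.shape))
    seen_test.append(test)
    # model.fit(train_inputs, Y_train, ...) would go here

print(fold_shapes[0])  # ((16, 2, 8, 10), (4, 4))
```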

  23. Avatar
    Siyan September 27, 2017 at 12:59 am #

    Hi Jason, thanks for your blogs and I learned a lot from your posts.
    I encountered a strange problem while using Keras. My problem is a regression problem, and I would like to show and record the loss and validation loss while training. But when I only assign “validation_split”, I only get the “loss” without any “validation loss”; after I manually pass “validation_data” to model.fit, then I get both “loss” and “validation loss”.
    From the Keras documentation, “validation_split” will use the last XX% of the data, without shuffling, as the validation data. I assume it should report the “validation loss” as well, but I cannot find it. Do you have any ideas about this? Thanks in advance!

    • Avatar
      Jason Brownlee September 27, 2017 at 5:43 am #

      If you set validation data or a validation split, then the validation loss will be printed each epoch if verbose=1 and available in the history object at the end of the run.

  24. Avatar
    Nafiz September 27, 2017 at 11:35 am #

    Hi Jason, after doing the k-fold CV, how do you train the NN on your whole data set? Because usually, we train it as long as the validation set accuracy is increasing. Before applying the NN into the wild, we would like to train it on the whole data set, if our data set size is small. How do we train it then?

  25. Avatar
    Shahab October 21, 2017 at 8:49 pm #

    Hi Jason. I have trained a network using Keras for segmentation of MRI images. My test data has no ground truth. I need to save the output of the network for the test set as images and submit them for evaluation. How can I do this in the testing part? As far as I know, for evaluation in Keras I need both the test samples and the corresponding ground truth!

    • Avatar
      Jason Brownlee October 22, 2017 at 5:18 am #

      Correct, you need ground truth in order to evaluate any model.

  26. Avatar
    gana October 24, 2017 at 3:13 pm #

    Thank you for your precious tutorials

    I have a question about the confusion between the test set and the validation set in TensorFlow + Keras.

    In your tutorials, the validation dataset does not affect training and is totally independent of the training procedure.
    The validation dataset is only for monitoring and early stopping. In addition, the validation set is not used in training (updating weights, gradient descent, etc.).

    However, I found that Wikipedia describes it differently, as follows:
    A validation dataset is a set of examples used to tune the hyperparameters (i.e. the architecture) of a classifier.
    LINK: https://en.wikipedia.org/wiki/Training,_test,_and_validation_sets

    It sounds like the validation dataset is for tuning parameters, which means that it is used in the training procedure.
    If that is true, we will face overfitting.

    If the validation dataset does not affect training, as in your tutorial, then does Keras automatically use some part of the training dataset for validating and tuning parameters?

    • Avatar
      Jason Brownlee October 24, 2017 at 4:02 pm #

      Yes, validation set is used for tuning the model and is a subset of the training dataset.

      Perhaps this post will clear things up:

      • Avatar
        gana October 24, 2017 at 6:28 pm #

        Thank you Mr @Jason Brownlee for your quick answer

        Let me clarify my question. I am asking not because I do not know the concept of the three datasets,
        but because I do not know how Keras uses the datasets behind the scenes.

        For example, in this tutorial https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/

        You defined only two datasets as:

        (X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=top_words)

        You said that the validation dataset may disappear if there is k-fold validation (in that case validation data is picked from the training set); however, in the tutorial we did not use k-fold validation. So where is the validation set? Is it still in the training set?

        In the code below, the fit function uses validation_data for tuning parameters, doesn’t it? You also assigned the test data to the validation data. In that case we need new test data for evaluation, is that right?

        model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=3, batch_size=64)

        In the code below, the evaluate function gives an unbiased score, doesn’t it? Then where is the validation data? Does the Keras background code automatically split X_train into train and validation parts?

        model.fit(X_train, y_train, nb_epoch=3, batch_size=64)
        # Final evaluation of the model
        scores = model.evaluate(X_test, y_test, verbose=0)

        • Avatar
          Jason Brownlee October 25, 2017 at 6:43 am #

          We do not have to use a validation dataset and in many tutorials I exclude that part of the process for brevity.

          • Avatar
            gana October 25, 2017 at 1:14 pm #

            Does that mean Keras picks part of the training dataset automatically for validating and tuning parameters?
            If we do not use a validation dataset, how do we tune parameters?

          • Avatar
            Jason Brownlee October 25, 2017 at 4:03 pm #

            It can, or we can specify it.

            You can tune on the training dataset.
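
The three roles discussed in this thread (train for fitting weights, validation for tuning, test for the final unbiased score) can be made concrete with an explicit three-way split. This is a minimal sketch in plain Python; `three_way_split` and its default fractions are hypothetical choices for illustration, not something the tutorial prescribes.

```python
import random

def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=7):
    """Return disjoint train, validation and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test = indices[:n_test]                  # touched once, for the final score
    val = indices[n_test:n_test + n_val]     # used to compare configurations
    train = indices[n_test + n_val:]         # used to fit the weights
    return train, val, test

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

With such a split, the validation rows would be passed as validation_data during fitting, and the test rows would only be passed to evaluate at the very end.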

  27. Avatar
    Estelle October 27, 2017 at 7:56 am #

    Hi Jason,

    Thank you very much for your blog and examples, it is great!

    Look I merged two of your examples: the one above and the Save and Load Your Keras Deep Learning Models (https://machinelearningmastery.com/save-load-keras-deep-learning-models/). You can see the code below:

    # MLP for Pima Indians Dataset with 10-fold cross validation
    from keras.models import Sequential
    from keras.layers import Dense
    from sklearn.model_selection import StratifiedKFold
    from keras.models import model_from_json
    import numpy
    # fix random seed for reproducibility
    seed = 7
    # load pima indians dataset
    dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
    # split into input (X) and output (Y) variables
    X = dataset[:,0:8]
    Y = dataset[:,8]

    # make model
    def make_model():
        model = Sequential()
        model.add(Dense(12, input_dim=8, activation='relu'))
        model.add(Dense(8, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

    # define 10-fold cross validation test harness
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    cvscores = []
    for train, test in kfold.split(X, Y):
        # create model
        model = make_model()
        # Fit the model
        model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)
        # serialize model to JSON
        model_json = model.to_json()
        with open("model.json", "w") as json_file:
            json_file.write(model_json)
        # serialize weights to HDF5 (file name assumed; this line was missing from the comment)
        model.save_weights("model.h5")
        print("Saved model to disk")

        # evaluate the model
        scores = model.evaluate(X[test], Y[test], verbose=0)
        print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

        cvscores.append(scores[1] * 100)

        del model_json
        del model
    print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))

    # load json and create model
    json_file = open('model.json', 'r')
    loaded_model_json = json_file.read()
    json_file.close()
    loaded_model = model_from_json(loaded_model_json)
    # load weights into new model (file name assumed, as above)
    loaded_model.load_weights("model.h5")
    print("Loaded model from disk")

    # evaluate loaded model on test data
    loaded_model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
    score = loaded_model.evaluate(X, Y, verbose=0)
    print("%s: %.2f%%" % (loaded_model.metrics_names[1], score[1]*100))

    And the output is as follow:

    Saved model to disk
    acc: 76.62%
    Saved model to disk
    acc: 74.03%
    Saved model to disk
    acc: 71.43%
    Saved model to disk
    acc: 72.73%
    Saved model to disk
    acc: 70.13%
    Saved model to disk
    acc: 64.94%
    Saved model to disk
    acc: 66.23%
    Saved model to disk
    acc: 64.94%
    Saved model to disk
    acc: 63.16%
    Saved model to disk
    acc: 72.37%
    69.66% (+/- 4.32%)

    Loaded model from disk
    acc: 75.91%

    Naively, I was expecting to get the same accuracy as the last model I saved (which was 72.37%), but I got 75.91%. Could you please explain how the weights are saved inside a k-fold cross validation?



  28. Avatar
    Kongpon December 12, 2017 at 1:38 am #

    Hi Jason,

    I try to use your example in my Classification Model, but I got this Error

    ValueError: Supported target types are: (‘binary’, ‘multiclass’). Got ‘multilabel-indicator’ instead.

  29. Avatar
    AJ December 19, 2017 at 9:27 pm #

    Sir, can you suggest how to do stratified k-fold cross-validation for this case?

  30. Avatar
    Sam Miller January 5, 2018 at 1:28 am #

    HI Jason,
    How do you apply model.predict when using k-folds?
    I want to be able to create a classification report for my model and possibly an AUC plot, heat map, precision-recall graph as well

    • Avatar
      Jason Brownlee January 5, 2018 at 5:27 am #

      Good question, see this post on creating a final model:

      • Avatar
        Sam Miller January 6, 2018 at 2:19 am #

        Sorry what I meant was, how do you code model.predict/predict_proba when you use the kfold.split method?

        The examples I’ve seen that don’t use k-folds have code like model.predict(x_test) after applying model.fit.

        I’d like to use precision_recall_curve and roc_curve with k-folds

        • Avatar
          Jason Brownlee January 6, 2018 at 5:54 am #

          You do not. CV is for evaluating a model, then you can build a final model. Please see the post that I linked.

  31. Avatar
    raj kumar January 26, 2018 at 5:20 pm #

    We are doing 10-fold cross validation on an optical character recognition dataset. We used the KFold and KerasClassifier functions. A snapshot of our output is:

    18000/18000 [==============================] – 1s – loss: 0.5963 – acc: 0.8219
    Epoch 144/150
    18000/18000 [==============================] – 1s – loss: 0.5951 – acc: 0.8217
    Epoch 145/150
    18000/18000 [==============================] – 1s – loss: 0.5941 – acc: 0.8219
    Epoch 146/150
    18000/18000 [==============================] – 1s – loss: 0.5928 – acc: 0.8225
    Epoch 147/150
    18000/18000 [==============================] – 1s – loss: 0.5908 – acc: 0.8234
    Epoch 148/150
    18000/18000 [==============================] – 2s – loss: 0.5903 – acc: 0.8199
    Epoch 149/150
    18000/18000 [==============================] – 1s – loss: 0.5892 – acc: 0.8217
    Epoch 150/150
    18000/18000 [==============================] – 1s – loss: 0.5917 – acc: 0.8235
    1720/2000 [========================>…..] – ETA: 0sAccuracy: 81.24% (0.94%)

    How to interpret this output?

    • Avatar
      Jason Brownlee January 27, 2018 at 5:55 am #

      It shows the progress (epoch number n of m), the loss (minimizing) and the accuracy (maximizing).

      What is the problem exactly?

  32. Avatar
    Anton January 30, 2018 at 8:07 am #

    Hello Jason,
    Did I understand correctly: using k-fold won’t necessarily increase the accuracy of the model but will instead give a more realistic (or “accurate”) accuracy rate? I’m getting a slightly lower accuracy on my model that uses your k-fold example vs. a very simple model that was simply evaluated on test data:
    scores = model.evaluate(X_test, Y_test)

    • Avatar
      Jason Brownlee January 30, 2018 at 9:59 am #

      Yes, the idea is that k-fold cross validation provides a less biased estimate of the skill of your model on unseen data.

      This is on average.

      A difficult problem or a bad k value can still give poor skill estimates.
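
The "on average" point can be seen in how the summary line of a cross-validation run is computed. The sketch below uses the ten fold accuracies reported in Estelle's comment above and reproduces her summary with the standard library; `statistics.pstdev` is the population standard deviation, which matches the default of `numpy.std`.

```python
import statistics

# Fold accuracies (in percent) copied from the 10-fold output in comment 27 above.
fold_scores = [76.62, 74.03, 71.43, 72.73, 70.13, 64.94, 66.23, 64.94, 63.16, 72.37]

mean_score = statistics.mean(fold_scores)
std_score = statistics.pstdev(fold_scores)  # population std, like numpy.std

print("%.2f%% (+/- %.2f%%)" % (mean_score, std_score))  # 69.66% (+/- 4.32%)
```

A single train/test split corresponds to just one of these ten numbers, which is why it can land noticeably above or below the cross-validated mean.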

  33. Avatar
    Maryam March 3, 2018 at 6:51 am #

    Hi Jason,
    I appreciate your tutorial. To tell you the truth, I want to write k-fold cross validation from scratch for the first time, but I do not know which tutorial teaches novice students better. I want to write k-fold cross validation for LSTM, RNN, and CNN.
    Would you please recommend which link is the best one for this?



    4- https://www.kaggle.com/stefanie04736/simple-keras-model-with-k-fold-cross-validation

    If there is any better tutorial link to teach k-fold cross validation for deep learning in Keras with TensorFlow, please share it with us.
    Any answers will be appreciated.
    Best wishes

    • Avatar
      Jason Brownlee March 3, 2018 at 8:19 am #

      Choose a tutorial that teaches in a style that suits you.

      • Avatar
        Maryam March 3, 2018 at 10:47 am #

        To tell you the truth, they are different from each other, and I do not know which is proper for writing k-fold cross validation for RNN, CNN, and LSTM.
        The code written for k-fold cross validation differs between them.
        Please show me which code will work fine for this.
        Thank you.

        • Avatar
          Jason Brownlee March 4, 2018 at 5:59 am #

          For RNNs, you may want to use walk-forward validation instead of k-fold cross validation:

          • Avatar
            Maryam March 5, 2018 at 12:26 am #

            Hi Jason,
            I am grateful for the link, but I have a binary sentiment analysis classification problem. The given link is for time series forecasting, which is not my problem.
            I do not know which code is proper for writing k-fold cross-validation for binary sentiment analysis (text classification), as the code examples differ and I have not seen k-fold cross-validation code for a CNN or LSTM.
            Would you please point me to similar sample code?
            Any reply will be appreciated.

          • Avatar
            Jason Brownlee March 5, 2018 at 6:25 am #

            I would recommend scikit-learn for cross validation of deep learning models, if you have the resources.

            As in the link I provided.

            The specific dataset used in the example is irrelevant, you are interested in the cross validation.

            I cannot write the code for you. You have everything you need.
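
For readers unfamiliar with the walk-forward validation mentioned earlier in this thread, here is a minimal sketch of how its splits differ from k-fold splits. It assumes time-ordered data and an expanding training window; `walk_forward_splits` is a hypothetical helper, and other variants (for example a sliding window) exist.

```python
def walk_forward_splits(n_samples, n_test_steps):
    """Yield (train_indices, test_index) pairs with an expanding training window."""
    first_test = n_samples - n_test_steps
    for t in range(first_test, n_samples):
        # Train only on observations that come before the test step.
        yield list(range(t)), t

splits = list(walk_forward_splits(6, 3))
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
# [0, 1, 2] -> 3
# [0, 1, 2, 3] -> 4
# [0, 1, 2, 3, 4] -> 5
```

Unlike k-fold, the model is never evaluated on a step that precedes its training data. For problems without a time order, such as the sentiment classification case raised here, ordinary k-fold splitting applies.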

  34. Avatar
    Maryam March 4, 2018 at 4:02 am #

    Hi Jason,
    I appreciate the tutorial, but when I copy your code and paste it into Spyder, it gives me an error on this line: “model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])”.
    The error is this: model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])
    IndentationError: unindent does not match any outer indentation level.
    But I am sure I have written the code the same as yours.

    To solve the problem I removed the indent and wrote the code as below, but it gave me just the final result: acc: 64.47%.
    But I want to get each fold’s results like yours:
    acc: 77.92%
    acc: 68.83%
    acc: 72.73%
    acc: 64.94%
    acc: 77.92%
    acc: 35.06%
    acc: 74.03%
    acc: 68.83%
    acc: 34.21%
    acc: 72.37%
    64.68% (+/- 15.50%)
    When I remove the indent, the code just gives me this result: acc: 64.47%
    64.47% (+/- 0.00%)
    My own code after removing the indent is this:

    for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation=’relu’))
    model.add(Dense(8, activation=’relu’))
    model.add(Dense(1, activation=’sigmoid’))
    # Compile model
    model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])
    model.fit(X[train], Y[train], epochs=15, batch_size=10, verbose=0)
    # evaluate the model
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print(“%s: %.2f%%” % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    print(“%.2f%% (+/- %.2f%%)” % (numpy.mean(cvscores), numpy.std(cvscores)))

    I have only written the section that I changed by removing the indent.
    The only difference between my code and yours is removing the indent for these 5 lines:

    model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])

    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)

    scores = model.evaluate(X[test], Y[test], verbose=0)
    print(“%s: %.2f%%” % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)

    What causes this error (model.compile(loss=’binary_crossentropy’, optimizer=’adam’, metrics=[‘accuracy’])
    IndentationError: unindent does not match any outer indentation level) when I write the code as yours?
    How can I fix the error?
    I really need your help.
    Sorry for the long note.
    Best regards

    • Avatar
      Jason Brownlee March 4, 2018 at 6:05 am #

      Ensure that the code is on one line, e.g. this is a python syntax error.

      • Avatar
        Maryam March 5, 2018 at 7:19 am #

        It is working fine now.
        Thank u.
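
The single result with a deviation of 0.00% reported in this thread is what Python produces when the scoring statements sit outside the loop body. The sketch below shows the effect with a hypothetical `evaluate_fold` stand-in that returns made-up scores in place of building, fitting and evaluating a model.

```python
import statistics

def evaluate_fold(fold_number):
    """Hypothetical stand-in for building, fitting and scoring a model on one fold."""
    return 60.0 + fold_number  # made-up score, for illustration only

# Correct: the append is indented inside the loop, so it runs once per fold.
cvscores = []
for fold in range(10):
    score = evaluate_fold(fold)
    cvscores.append(score)

# Incorrect: the append is dedented, so it runs once, after the last fold.
last_only = []
for fold in range(10):
    score = evaluate_fold(fold)
last_only.append(score)

print(len(cvscores), len(last_only))  # 10 1
print(statistics.pstdev(last_only))   # 0.0
```

Using the same indentation characters (all spaces or all tabs) for every line of the loop body also avoids the "unindent does not match any outer indentation level" error.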

  35. Avatar
    SUH March 6, 2018 at 1:56 am #

    Hi, Jason. Thank you for your tutorial. I’d like to apply StratifiedKFold to my code using Keras, but I don’t know how to do it. This is based on the tutorial from the Keras blog post “Building powerful image classification models using very little data”.
    There, the author of the code uses ‘fit_generator’ instead of ‘X = dataset[:,0:8], Y = dataset[:,8]’.

    How can I make this work? I’ve been scratching my head for weeks, but I am out of ideas…

    I’m open to all suggestions and any answers will be appreciated.



    p.s: Here’s the full code.

    • Avatar
      Jason Brownlee March 6, 2018 at 6:17 am #

      Sorry, I cannot debug your code for you. Perhaps post to stackoverflow.

  36. Avatar
    mark March 7, 2018 at 4:42 am #

    Thanks for your hard work. Every time I have a question, Google usually sends me to your site.

    I came here because I googled “how to calculate AUC with model.fit_generator”. I found that one of your readers had a similar issue, but you only use “ImageDataGenerator” to augment.
    I tried
    from sklearn.metrics import roc_curve, auc, roc_auc_score

    y_pred = InceptionV3_model.predict_generator(test_generator,2000 )
    but I don’t know how to get y_test
    roc_auc_score(y_test, y_pred)

    • Avatar
      Jason Brownlee March 7, 2018 at 6:17 am #

      y_test would be the expected outputs for the test dataset.

  37. Avatar
    CHIRANJEEVI March 8, 2018 at 5:12 am #

    HI Jason
    Please elaborate on the difference between the validation score, the training score, and the test score.

  38. Avatar
    alice March 11, 2018 at 6:11 pm #


    Many thanks for this great post, I learnt a lot. I followed the 5-fold cross validation approach for my dataset, which contains 2000 posts, and used 25 epochs. The accuracy keeps increasing after every fold and finally reached more than 97%. But in your blog, the accuracy either increases or decreases across the folds. Could you please explain why my results are different?

    Accuracy: : 92.18%

    Accuracy: : 95.84%

    Accuracy: : 98.03%

    Accuracy: : 99.26%

    Accuracy: : 100.00%

    Accuracy of 5-Fold Cross Validation with standard deviation:

    97.21% (+/- 2.57%)

  39. Avatar
    Mary March 27, 2018 at 3:51 pm #

    Hi, Jason
    I’m new to Python and deep learning.
    Thank you for all your tutorials; I have learnt a lot so far.
    I have a question: can you explain to me simply
    what val_acc, loss and val_loss mean in the model and what they tell us?
    I have read many comments and articles, but I could not get it.

    Epoch 145/150
    514/514 [==============================] – 0s – loss: 0.4847 – acc: 0.7704 – val_loss: 0.5668 – val_acc: 0.7323
    Epoch 146/150
    514/514 [==============================] – 0s – loss: 0.4853 – acc: 0.7549 – val_loss: 0.5768 – val_acc: 0.7087
    Epoch 147/150
    514/514 [==============================] – 0s – loss: 0.4864 – acc: 0.7743 – val_loss: 0.5604 – val_acc: 0.7244
    Epoch 148/150
    514/514 [==============================] – 0s – loss: 0.4831 – acc: 0.7665 – val_loss: 0.5589 – val_acc: 0.7126
    Epoch 149/150
    514/514 [==============================] – 0s – loss: 0.4961 – acc: 0.7782 – val_loss: 0.5663 – val_acc: 0.7126
    Epoch 150/150
    514/514 [==============================] – 0s – loss: 0.4967 – acc: 0.7588 – val_loss: 0.5810 – val_acc: 0.6929

    Thank you once again

    • Avatar
      Jason Brownlee March 27, 2018 at 4:20 pm #

      val_loss is the calculated loss on the validation dataset.
      val_acc is the calculated accuracy on the validation dataset.

      They are different from loss and acc that are calculated on the training dataset.

      Does that help?

      • Avatar
        Mary March 28, 2018 at 12:25 pm #

        Yes Thank you 🙂
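
To make the four numbers in the log above concrete: loss and acc are computed on the training rows, and val_loss and val_acc are the same two quantities computed on the validation rows. The following is a minimal sketch of both calculations for a binary classifier, assuming made-up labels and predicted probabilities; the function names and the clipping constant are illustrative choices, not Keras internals.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of rows whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (p > threshold) == bool(y))
    return hits / len(y_true)

# Hypothetical validation labels and model outputs.
y_val = [1, 0, 1, 0]
p_val = [0.9, 0.2, 0.4, 0.1]

val_loss = binary_crossentropy(y_val, p_val)
val_acc = accuracy(y_val, p_val)
print("val_loss: %.4f - val_acc: %.4f" % (val_loss, val_acc))
```

Here three of the four predictions fall on the correct side of 0.5, so the accuracy is 0.75, while the loss also penalises how confident each prediction was.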

  40. Avatar
    Harrison April 5, 2018 at 9:09 am #

    Hello, thanks a lot for this tutorial. I have been searching for something like this, and I am glad I finally found it. I would like to know how I can visualize the training of the neural network in this example. In one of your tutorials I could plot “val_acc” against “acc”, but I cannot do the same here because there is no “val_acc” in k-fold validation. How do I do this if I evaluate with k-fold validation? Thank you.

  41. Avatar
    John April 5, 2018 at 4:22 pm #


    Thanks for the tutorial, very helpful. I used the validation_data approach, and it seems to be working and producing different accuracies for each epoch (presumably between train and validation), but it gives a puzzling statement before the model starts:

    “Train on 15,000 samples, validate on 15,000 samples.”

    Does that mean I messed up and fed it the same data for both train and validation or am I ok? In my case, the train has 15,000 samples and the validation file has 10,000 samples.

    Thanks so much!

    • Avatar
      Jason Brownlee April 6, 2018 at 6:21 am #

      Perhaps confirm the size/shape of the train and validation dataset, just to make sure you have set things up the way that you expect.

      • Avatar
        John April 6, 2018 at 2:44 pm #

        Yeah, it is clearly reading in the train dataset twice. Thanks!

  42. Avatar
    Don April 13, 2018 at 3:46 am #

    Thank you for this. How can I change the manual Kfold cross validation to work for multiclass? Say Y was a 100 by 4 array of zeros and ones?
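
One common way to handle a target like the one described here is to stratify on integer class labels recovered from the one-hot rows, while still training on the one-hot array. The sketch below assumes every row contains exactly one 1 (a multi-class target, not a multi-label one) and uses randomly generated stand-in data.

```python
import numpy as np

rng = np.random.default_rng(7)
class_ids = rng.integers(0, 4, size=100)  # hypothetical class of each row
Y = np.eye(4, dtype=int)[class_ids]       # 100 x 4 array of zeros and ones

# The position of the single 1 in each row is the integer class label.
labels = Y.argmax(axis=1)

print(Y.shape, labels.shape)  # (100, 4) (100,)
```

The 1-D `labels` array is what would be passed as the second argument of `kfold.split(X, labels)`, which also avoids the ‘multilabel-indicator’ error quoted in an earlier comment; inside the loop the model would still be fit on `Y[train]`.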

  43. Avatar
    Victor Vargas April 16, 2018 at 10:21 am #

    If the goal of a training phase is to improve our model’s accuracy and reduce its loss on every epoch, why do you create and compile a new model on every fold iteration?

    • Avatar
      Jason Brownlee April 16, 2018 at 2:58 pm #

      Great question.

      We want to know how skillful the model is on average, when trained on a random sample from the domain, and making predictions on other data in the domain.

      To calculate this estimate, we use resampling methods like k-fold cross-validation that make good economical use of a fixed sized training dataset.

      Once we select a model+config that has the best estimated skill, we can then train it on all of the available data and use it to start making predictions.

      Does that help Victor?

  44. Avatar
    Nicholas Angelucci July 4, 2018 at 12:19 am #

    Is it possible to plot the accuracy and loss graphs (history) resulting from a validation made with the StratifiedKFold class?

    • Avatar
      Jason Brownlee July 4, 2018 at 8:24 am #

      Yes, but you will have one plot per fold. You may also have to iterate the folds manually to capture the history and plot it.

  45. Avatar
    Nicholas Angelucci July 4, 2018 at 7:20 pm #

    Yes, I am able to make a plot per fold, but i want to make only two graphs, one with the accuracy mean and the other with mean loss for k-fold.

    • Avatar
      Jason Brownlee July 5, 2018 at 7:40 am #

      You can create two plots and add a line to each plot for each fold.
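
If a single mean curve is wanted in addition to one line per fold, the per-epoch values collected from each fold can be averaged epoch by epoch. This is a minimal sketch with made-up numbers, assuming every fold was trained for the same number of epochs and that each inner list was collected from that fold's history object.

```python
import statistics

# Hypothetical per-epoch accuracy recorded for three folds (made-up numbers).
fold_histories = [
    [0.60, 0.70, 0.75],
    [0.58, 0.69, 0.77],
    [0.62, 0.71, 0.76],
]

# zip(*...) groups the values of all folds for the same epoch together.
mean_curve = [statistics.mean(epoch_values) for epoch_values in zip(*fold_histories)]

print([round(v, 2) for v in mean_curve])  # [0.6, 0.7, 0.76]
```

The resulting list has one value per epoch and can be plotted as a single line; the same approach applies to the loss.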

  46. Avatar
    Tien Wang July 7, 2018 at 8:13 am #

    I found that if your training data input “X” is a Pandas DataFrame, then you will have to use X.loc[train] for it to work. Otherwise, indexing a DataFrame directly with the NumPy index arrays that kfold.split returns will throw a KeyError.

    • Avatar
      Jason Brownlee July 8, 2018 at 6:15 am #

      In the above tutorial we are loading data as a NumPy array.

  47. Avatar
    kuda July 9, 2018 at 10:26 pm #

    Hi Doc. Thanks for your examples, they are straightforward. Can you do the Pima implementation of a CNN in R?

  48. Avatar
    Abdur Rahaman July 30, 2018 at 6:30 am #

    I have a problem printing the confusion matrix and classification report and drawing the AUC curve.
    It would be helpful if you could show how to print the confusion matrix and classification report and draw the AUC curve for a neural network using Keras with 10-fold cross validation.


  49. Avatar
    Alay August 16, 2018 at 7:54 am #

    Hi Jason..
    Thank you so much for all tutorials.

    How can I use cross validation with flow_from_directory and fit_generator?

    • Avatar
      Jason Brownlee August 16, 2018 at 1:57 pm #

      Sorry, I don’t have an example of this combination.

  50. Avatar
    Samin September 12, 2018 at 2:44 am #

    Hi Jason,

    Thanks for your fantastic posts about machine learning.

    In your cross-validation code, you used 150 epochs, but the results show only 1 full round of 10-fold cross validation. Did you only print the outputs for the first epoch?

    I really have problems understanding cross validation and epochs. Here is how I understood the code:
    Is it that each epoch consists of one full run of cross validation? So, in the first epoch we run 10-fold cross validation and report the average validation error, in the second epoch we continue with the model prepared in the first epoch and again run k-fold cross validation, and so on until epoch 150? And about the mini-batches, is it that we divide each training set into 10 mini-batches to do the forward and backward pass, and after completing the training on all 10 batches, we refer to the validation set to estimate the validation error?

    Could you tell me whether I perceived the concept of cross validation correctly or not?
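
In the cross-validation code quoted in the comments above, the fold loop is the outer loop: each fold builds a new model and trains it for all 150 epochs before it is scored once on the held-out fold. Also, batch_size=10 means 10 rows per batch, not 10 batches. The sketch below counts the resulting work, assuming 768 rows (the size of the Pima Indians dataset) and folds of roughly equal size; the variable names are illustrative.

```python
import math

n_samples = 768   # assumed dataset size (Pima Indians diabetes data)
k_folds = 10
epochs = 150
batch_size = 10   # rows per weight update, not the number of batches

fold_size = n_samples // k_folds       # about 76 rows held out per fold
train_size = n_samples - fold_size     # rows used to fit each fold's model
batches_per_epoch = math.ceil(train_size / batch_size)
updates_per_fold = epochs * batches_per_epoch

print(k_folds, "models trained, one score each")
print(batches_per_epoch, "weight updates per epoch")    # 70
print(updates_per_fold, "weight updates per model")     # 10500
```

So the ten printed accuracies are ten separate models, each evaluated once after its own 150 epochs, rather than ten results from the first epoch.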


  51. Avatar
    Akim Borbuev October 1, 2018 at 4:06 am #

    Hello Jason,

    First of all, thank you for your comprehensive and insightful tutorials; your website is my first stop when I look for answers. I have implemented a deep neural network for time series forecasting. The problem I am facing is that the fitting and evaluation parts go well, with around 0.8 accuracy for both, but when I try to predict for a new batch of samples, I get zeros all over my predictions. I would be very grateful if you could suggest where to start debugging my model. I can also send my model if you would like to take a look at it.
    Thank you.


  52. Avatar
    Okocha October 11, 2018 at 7:02 pm #

    Shouldn’t we call set_learning_phase(0) right before calling the evaluate function? According to the Keras documentation, it should be set to 0 for inference.

    • Avatar
      Jason Brownlee October 12, 2018 at 6:37 am #

      Perhaps it is set automatically when evaluate() is called?

  53. Avatar
    Manikandan Sathiyanarayanan October 26, 2018 at 12:28 am #

    Hi. I would like to know some basic things about validation data. Shall I keep 0.2% of the total training data in a validation data folder manually, without using the validation split function? Will it be useful while training the model?

  54. Avatar
    ipek October 31, 2018 at 1:49 am #

    Why do you create model for each iteration in kfold instead of creating model once and then calling model.fit in iterations?

    • Avatar
      Jason Brownlee October 31, 2018 at 6:29 am #

      A new set of weights is required to fit a new model on each iteration.

  55. Avatar
    Effe December 2, 2018 at 3:43 am #

    Hi! How can I use K-fold validation in multiple label problems?

    • Avatar
      Jason Brownlee December 2, 2018 at 6:22 am #

      Directly, no change. What problem are you having exactly?

  56. Avatar
    Joseph December 7, 2018 at 4:39 am #

    Hi Jason,

    Great article first of all!

    In your example you perform model compilation within each fold. That is very slow. I am wondering whether I will achieve the same results if I move the model compilation outside of the for loop like the following?

    I guess what I am really asking is, once I’ve initialized and compiled the model, will each call to model.fit() perform an independent fitting using the current folds of training and validation data set without being interfered by weights obtained from the last loop?

    If yes, then I suppose it’s faster to do my version of the code as the result will be the same.
    If no, and if I still want to only initialize and compile the model once before the for loop, is there any way to reset the model after each model.fit()?

    Many thanks

    • Avatar
      Jason Brownlee December 7, 2018 at 5:26 am #

      To be safe, I’d rather re-define and re-compile the model each loop to ensure that each iteration we get a fresh initial set of weights.

  57. Avatar
    sanghita December 21, 2018 at 3:27 pm #

    How to modify the below code for plotting graph with k-fold cross validation?

    def build_classifier():
        classifier = Sequential()
        classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 13))
        # Keras 2 names the dropout fraction `rate` (it was `p` in Keras 1)
        classifier.add(Dropout(rate = 0.1))
        classifier.add(Dense(units = 6, kernel_initializer = 'VarianceScaling', activation = 'relu'))
        classifier.add(Dense(units = 6, kernel_initializer = 'VarianceScaling', activation = 'relu'))
        classifier.add(Dense(units = 3, kernel_initializer = 'VarianceScaling', activation = 'softmax'))
        classifier.compile(optimizer = 'adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
        return classifier
    classifier = KerasClassifier(build_fn = build_classifier, batch_size = 5, epochs = 200, verbose=1)
    accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10, n_jobs = 1)
    mean = accuracies.mean()
    # note: despite the name, this is the standard deviation of the fold accuracies
    variance = accuracies.std()

    Thanks and regards

    • Avatar
      Jason Brownlee December 22, 2018 at 6:01 am #

      For k-fold cross-validation, I would recommend plotting the distribution of scores across the folds, e.g. with a box and whisker plot.

      Collect the scores in a list and pass them to pyplot.boxplot().
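
      A rough sketch of that suggestion, assuming a scikit-learn LogisticRegression and synthetic data in place of the Keras classifier from the question (the dataset, the model and the output file name are all illustrative assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
from matplotlib import pyplot
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic 3-class problem with 13 inputs, loosely mirroring the question.
X, y = make_classification(n_samples=300, n_features=13, n_informative=5,
                           n_classes=3, random_state=1)

model = LogisticRegression(max_iter=1000)
accuracies = cross_val_score(model, X, y, cv=10)

# Collect the per-fold scores in a list and draw one box for them.
scores = list(accuracies)
pyplot.boxplot(scores)
pyplot.ylabel("Accuracy")
pyplot.title("10-fold cross-validation scores")
pyplot.savefig("cv_scores_boxplot.png")
```

      With several configurations, a list of score lists can be passed to pyplot.boxplot() to draw the boxes side by side.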

  58. Avatar
    ammara December 31, 2018 at 4:24 pm #

    Hi Jason,
    What is the limitation/disadvantage of using a manual verification dataset?

  59. Avatar
    Partha Nayak February 19, 2019 at 3:33 am #

    Can I write like this:

    result = KerasClassifier(build_fn=baseline_model, epochs=200, batch_size=5, verbose=0)

    and then:


    so that result can be used for kfold validation of scikit-learn as well as confusion matrix display of Keras?

  60. Avatar
    Gabby February 26, 2019 at 8:44 pm #

    I’m just wondering why you use X_test,y_test for validation_data?

    • Avatar
      Jason Brownlee February 27, 2019 at 7:25 am #

      An easy shortcut for the tutorial. Ideally we would use a separate dataset.

  61. Avatar
    Meisam March 4, 2019 at 9:50 pm #

    Shouldn’t each fit during cross-validation be saved so that the best fit can be used later?

  62. Avatar
    Jacob MB March 16, 2019 at 12:03 am #

    Dear Jason
    I have a question about something that, in my eyes, makes little sense:

    So far, I have produced the following code in python using Keras with Tensorflow backend (1 batch, sequence of 1).

    #Define model
    model = Sequential()
    model.add(LSTM(128, batch_size=BATCH_SIZE, input_shape=(train_x.shape[1],train_x.shape[2]), return_sequences=True, stateful=False ))#,,return_sequences=Tru# stateful=True
    model.add(Dense(2, activation=’softmax’))
    opt = tf.keras.optimizers.Adam(lr=0.01, decay=1e-6)

    #Compile model

    train_x, train_y,

    #Now I want to make sure that we can predict the training set (using evaluate) and that it is the same result as during training
    score = model.evaluate(train_x, train_y, batch_size=BATCH_SIZE, verbose=0)
    print(‘ Train accuracy:’, score[1])

    The Output of the code is

    Epoch 1/10 5872/5872 [==============================] – 0s 81us/sample – loss: 0.6954 – acc: 0.4997
    Epoch 2/10 5872/5872 [==============================] – 0s 13us/sample – loss: 0.6924 – acc: 0.5229
    Epoch 3/10 5872/5872 [==============================] – 0s 14us/sample – loss: 0.6910 – acc: 0.5256
    Epoch 4/10 5872/5872 [==============================] – 0s 13us/sample – loss: 0.6906 – acc: 0.5243
    Epoch 5/10 5872/5872 [==============================] – 0s 13us/sample – loss: 0.6908 – acc: 0.5238

    Train accuracy: 0.52480716

    So the problem is that the final training accuracy (0.5238) should equal the evaluation accuracy (0.52480716), but it does not. This makes no sense: why can’t we use evaluate on our training data and obtain the same result as during training? There is no dropout or anything else that should make training different from evaluation. The same happens if I use a validation set.

    • Avatar
      Jason Brownlee March 16, 2019 at 7:53 am #

      The score during training is estimated across the batches I believe, and is reported before weight updates.

      You could use early stopping to save the weights for the model after a specific batch I believe. Perhaps a custom early stopping/checkpoint callback?
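
      The gap between the two numbers can be illustrated with a rough NumPy sketch. This is not Keras's actual implementation; it assumes a hand-written logistic regression trained with mini-batch gradient descent on synthetic data, and only shows why an average of per-batch scores taken while the weights are still changing need not equal a score computed afterwards with the final weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.5, -2.0, 0.5, 1.0])
y = (X @ true_w + rng.normal(scale=0.5, size=512) > 0).astype(float)


def predict_proba(w, X):
    # Sigmoid of the linear score.
    return 1.0 / (1.0 + np.exp(-(X @ w)))


def accuracy(w, X, y):
    return float(np.mean((predict_proba(w, X) > 0.5) == (y == 1)))


w = np.zeros(4)
batch_size, lr = 32, 0.1
batch_accs = []
for start in range(0, len(X), batch_size):
    xb, yb = X[start:start + batch_size], y[start:start + batch_size]
    # Score the batch with the current weights, before the update.
    batch_accs.append(accuracy(w, xb, yb))
    # One gradient step of logistic regression on this batch.
    grad = xb.T @ (predict_proba(w, xb) - yb) / len(xb)
    w -= lr * grad

running_acc = float(np.mean(batch_accs))  # analogous to the per-epoch progress figure
final_acc = accuracy(w, X, y)             # analogous to a later evaluate() call
print("running average over batches: %.4f" % running_acc)
print("evaluation with final weights: %.4f" % final_acc)
```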

  63. Avatar
    Tiger229 March 16, 2019 at 12:05 am #

    Hi Jason,
    Thank you for the helpful tutorial.
    I have trained a Keras model for semantic segmentation with ICNET.
    How could I evaluate my trained model with mIOU (mean intersection over union) on the validation set? Any tips or useful articles?

    • Avatar
      Jason Brownlee March 16, 2019 at 7:54 am #

      Sorry, I don’t have a tutorial on calculating mIOU.

  64. Avatar
    Eyitayo March 28, 2019 at 1:42 am #

    Hello Jason,

    Thank you for the detailed work.

    Two quick questions:

    1) Does it suffice to say that a CVscores.mean of 0.78 has better accuracy than a CVscores.mean of 0.58?
    2) What is the implication of the former?


  65. Avatar
    Tom F March 29, 2019 at 2:05 am #

    Mr. Brownlee,

    Thank you much for your efforts. Very helpful.

    Maybe I’m overthinking, but once I get a model with good results it is wrapped in a function, so I can’t call it from a console. I’m used to using “model.predict(x)”. However, with this code, I get “‘model’ is not defined”. Is finalizing just copying the final model definition to the prompt, compiling it, then fitting it on the learning data, and predicting the unknown data?

    Will your code produce the same results as defining the model at the prompt then running it through a KFold loop?

    Thanks again!

  66. Avatar
    Neel April 19, 2019 at 11:52 pm #

    Hi Jason,

    Since k-fold cross-validation RANDOMLY splits the data (3, 5, 10, etc.) and then trains the model, is it recommended to use GridSearch / k-fold on time series data (multi-class classification using LSTM)? Because the moment the time series data is randomised, the LSTM would lose its meaning, right?

  67. Avatar
    MAK May 22, 2019 at 6:15 am #

    I see you split the data in the k-fold manner via scikit-learn tools to get a more accurate estimate.
    My question is: is there a built-in function for doing the k-fold split in the temporal domain (for example, stock prices), where the order of the samples is meaningful?

  68. Avatar
    Arjun June 17, 2019 at 4:02 am #

    Hi Jason,

    Great tutorial.
    I am facing an issue where my loss is not getting decreased.
    I used categorical labels, but StratifiedKFold threw an error, which led me to convert the labels to numerical.

    Modified code:

    input_img = Input(shape = (242, 242, 1))

    kfold = StratifiedKFold(n_splits=6, shuffle=True, random_state=13)
    cvscores = []
    for train, test in kfold.split(images, labels):
        model = Model(input_img, model(input_img))
        # Compile model
        model.compile(loss='binary_crossentropy', optimizer = Adam(), metrics=['accuracy'])
        # Fit the model
        #labels = to_categorical(labels)
        model.fit(images[train], labels[train], epochs=epochs, batch_size=batch_size, verbose=1)
        # evaluate the model
        scores = model.evaluate(images[test], labels[test], verbose=1)
        print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
        cvscores.append(scores[1] * 100)
    print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))


    Epoch 1/100
    4000/4000 [==============================] – 8s 2ms/step – loss: 7.1741 – acc: 0.5500
    Epoch 2/100
    4000/4000 [==============================] – 4s 917us/step – loss: 7.1741 – acc: 0.5500
    Epoch 3/100
    4000/4000 [==============================] – 4s 922us/step – loss: 7.1741 – acc: 0.5500
    Epoch 4/100
    4000/4000 [==============================] – 4s 955us/step – loss: 7.1741 – acc: 0.5500
    Epoch 5/100
    4000/4000 [==============================] – 4s 950us/step – loss: 7.1741 – acc: 0.5500
    Epoch 6/100
    4000/4000 [==============================] – 4s 948us/step – loss: 7.1741 – acc: 0.5500
    Epoch 7/100
    4000/4000 [==============================] – 4s 918us/step – loss: 7.1741 – acc: 0.5500
    Epoch 8/100
    4000/4000 [==============================] – 4s 918us/step – loss: 7.1741 – acc: 0.5500

    The loss doesn’t seem to decrease. Could you let me know where I am going wrong?


  69. Avatar
    Namthy July 29, 2019 at 3:10 am #


    To my understanding, Cross validation is a method to help choose a model (or) its hyper parameters.

    Like if i have

    Neural Network 1 with 10 layers
    Neural Network 2 with 100 layers

    It can guide me choosing either 1 or 2

    After identifying the network, to get my final model I should use the entire training set, run it for a certain number of epochs, and choose the parameters which provided good accuracy during the runs.

    Is my presumption correct? Kindly clarify this doubt.

    • Avatar
      Jason Brownlee July 29, 2019 at 6:19 am #

      Almost. CV is only used to estimate the performance of the model.

      You must then interpret the estimated performance of each model/config and choose.

      You can do this directly or use statistical hypothesis testing methods, or other methods.

      Yes, afterward, you fit on all data and start using the model.
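
      A minimal sketch of that workflow, assuming two hypothetical MLPClassifier configurations from scikit-learn in place of the 10-layer and 100-layer Keras networks, and synthetic data. Picking the higher mean score is just one simple way to interpret the estimates; a statistical test could be used instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=3)

# Two hypothetical candidate configurations (a small and a larger network).
configs = {
    "small": dict(hidden_layer_sizes=(8,)),
    "large": dict(hidden_layer_sizes=(32, 32)),
}

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)
results = {}
for name, params in configs.items():
    model = MLPClassifier(max_iter=500, random_state=3, **params)
    scores = cross_val_score(model, X, y, cv=kfold)
    results[name] = (scores.mean(), scores.std())
    print("%s: %.3f (+/- %.3f)" % (name, scores.mean(), scores.std()))

# Interpret the estimates and choose; here simply the higher mean accuracy.
best = max(results, key=lambda name: results[name][0])

# Fit the chosen configuration on all available data to get the final model.
final_model = MLPClassifier(max_iter=500, random_state=3, **configs[best])
final_model.fit(X, y)
```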

  70. Avatar
    EL BOUNY July 29, 2019 at 8:35 pm #

    Hi Mr. Jason

    Firstly thank you for your very well explained tutorials

    I would like to know how to compute the overall confusion matrix of a deep learning model when using the k-fold cross validation ?


    • Avatar
      Jason Brownlee July 30, 2019 at 6:10 am #

      You cannot.

      A confusion matrix is for a single run only.

      Cross validation estimates model performance over multiple runs.

      • Avatar
        EL BOUNY July 30, 2019 at 10:05 am #

        My point is that if I can get the confusion matrix at each fold, then the overall confusion matrix can be obtained as the sum of the confusion matrices from all folds. In fact, the performance measure (i.e. accuracy) of the model is the average value across all folds. Thus, by summing all confusion matrices, the accuracy of the model can be computed as the ratio between the sum of the diagonal elements of the resulting confusion matrix and the sum of all its elements.

        Thanks again!

        • Avatar
          EL BOUNY July 30, 2019 at 10:13 am #

          The confusion matrix at each fold is computed based only on the results of the model on the test set. I asked this because I have found that in various works in my field the authors use k-fold cross-validation to evaluate the model and, at the same time, draw the confusion matrix of the model.

          • Avatar
            Jason Brownlee July 30, 2019 at 2:08 pm #

            I would expect the confusion matrix is reported based on a standard test set for the dataset.

        • Avatar
          Jason Brownlee July 30, 2019 at 2:07 pm #

          I would not recommend it as each cell of the matrix would need to report mean and variance. It would be confusing.

          • Avatar
            EL BOUNY July 30, 2019 at 7:45 pm #

            Thanks a lot. In this case, do you have an idea of how to draw the total confusion matrix? In numerous papers that I have read, the authors used k-fold cross-validation and also reported the confusion matrix; for example, they write as a title “confusion matrix obtained using 10-fold cross validation”.

            Note: The sum of all elements of the matrix is equal to the size of the overall dataset.
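
            For reference, the summed matrix described in this sub-thread could be sketched as follows. This illustrates the commenter's proposal, not something the reply above recommends; it assumes scikit-learn's LogisticRegression as a stand-in model and synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=200, n_features=8, random_state=5)
labels = np.unique(y)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=5)
total_cm = np.zeros((len(labels), len(labels)), dtype=int)
for train, test in kfold.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train], y[train])
    y_pred = model.predict(X[test])
    # Each sample lands in exactly one test fold, so the sum counts it once.
    total_cm += confusion_matrix(y[test], y_pred, labels=labels)

# Accuracy pooled over all folds: diagonal sum divided by the total count.
pooled_accuracy = np.trace(total_cm) / total_cm.sum()
print(total_cm)
print("pooled accuracy: %.3f" % pooled_accuracy)
```

            The pooled figure summarises predictions from ten different fitted models, so it hides the fold-to-fold variance mentioned in the reply above.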


  71. Avatar
    Ziko July 31, 2019 at 11:38 pm #

    Hi Jason, thanks a lot, I’m getting lots of help from your tutorials. I’m new to both Python and machine learning.
    I took your code sample for stratified k-fold here and used it with some changes on my data and got good results.
    I am saving the best fitted model from the k-fold for future predictions. My question is how to save an average model, or somehow save the entire model.
    Once again, thanks a lot.
    My code (sorry, I’m not sure how to format it as code):

    seed = 6

    # split into input (X) and output (Y) variables
    X = train_inputs.copy()
    Y = train_targ.copy()
    # define 10-fold cross validation test harness
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    cvscores = []
    model1 = Sequential()
    for train, test in kfold.split(X, Y):
    # create model
    model = Sequential()
    model.add(Dense(128, input_dim=8, activation=’relu’))
    model.add(Dense(128, activation=’relu’))
    model.add(Dense(1, activation=’sigmoid’))
    # Compile model
    model.compile(loss=’binary_crossentropy’, optimizer=adam1, metrics=[‘accuracy’])
    # Fit the model
    history = model.fit(X[train], Y[train], \
    validation_data=(X[test], Y[test]), \
    epochs=150,batch_size=32, verbose=2)
    _, train_acc = model.evaluate(X[train], Y[train], verbose=0)
    _, test_acc = model.evaluate(X[test], Y[test], verbose=0)
    print(‘Train: %.3f, Test: %.3f’ % (train_acc, test_acc))
    if test_acc>accc_tst and train_acc>accc_trn:

    # plot history
    plt.plot(history.history[‘acc’], label=’train’)
    plt.plot(history.history[‘val_acc’], label=’test’)
    cvscores.append(test_acc * 100)

    print(“%.2f%% (+/- %.2f%%)” % (np.mean(cvscores), np.std(cvscores)))
    new_model = models.load_model(‘my_class.h5’)
    _, test_newacc = new_model.evaluate(valid_inputs, valid_targ, verbose=0)
    print(‘Test: %.3f’ % (test_newacc))

  72. Avatar
    Ziko August 3, 2019 at 8:33 pm #

    Hi Jason thx a lot
    I have another question concerning stratified k-fold. I read in some places that unbalanced inputs cause the model to be fitted to produce unbalanced predictions (whether binary or multi-class), so it’s recommended to drop inputs in order to train the model on balanced data. If I understood you correctly, stratified k-fold takes the unbalanced nature of the data into consideration, but I’m not sure which is better: training a stratified k-fold model with unbalanced input that actually represents the population, or balancing the data before training.

  73. Avatar
    Ziko August 4, 2019 at 4:49 pm #

    Hi again, maybe I will try to generalize my question: if I train a model with a certain class proportion, does the model actually capture the proportion of the data and therefore predict better on data with a similar proportion? So if I expect future data to have a different proportion, should I prepare training data with a proportion closer to that of the future data?
    And if the model does capture the training proportion and I have data whose proportion changes with time or for any other reason, should I train it again for the different proportions?

    • Avatar
      Jason Brownlee August 5, 2019 at 6:47 am #

      It tries to.

      Yes, a given dataset should be generally representative of the problem.

  74. Avatar
    crakama August 7, 2019 at 1:57 am #

    Hi Jason. Does the use of “Manual Verification Datasets” help prevent overfitting? Or is k-fold much better?

    • Avatar
      Jason Brownlee August 7, 2019 at 8:02 am #

      It really depends.

      Validation dataset is really good for hyperparameter tuning.

  75. Avatar
    ziko August 25, 2019 at 7:44 pm #

    hello, Jason.
    Using a Keras model I get zero accuracy for a perfectly linear relation of output vs. input. I’m not sure if I interpreted the accuracy wrongly or am doing something wrong with my code; any help will be appreciated.

    I’ve tried adding more layers, more epochs, and so on; nothing changed.

    import numpy as np
    import matplotlib.pyplot as plt
    import tensorflow as tf

    from keras import models
    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    from keras import optimizers
    from sklearn.model_selection import KFold

    from sklearn.preprocessing import MinMaxScaler
    from keras.optimizers import SGD

    inp=np.random.randint(100, 1000000 , size=[siz,1])
    normzer_inp = MinMaxScaler()
    inp_norm = normzer_inp.fit_transform\
    normzer_uop = MinMaxScaler()
    uop_norm = normzer_uop.fit_transform\


    kfold = KFold(n_splits=2, random_state=None, shuffle=False)
    cvscores = []
    opti_SGD = SGD(lr=0.01, momentum=0.9)
    model1 = Sequential()

    for train, test in kfold.split(X, Y):
    model = Sequential()
    model.add(Dense(16, input_dim=X.shape[1], activation=’relu’))
    model.add(Dense(16, activation=’relu’))

    model.add(Dense(1, activation=’linear’))
    model.compile(loss=’mean_squared_error’, optimizer=opti_SGD,\

    history = model.fit(X[train], Y[train], \
    validation_data=(X[test], Y[test]), \
    epochs=10,batch_size=2048, verbose=2)
    _, train_acc = model.evaluate(X[train], Y[train], verbose=0)
    _, test_acc = model.evaluate(X[test], Y[test], verbose=0)
    print(‘Train: %.3f, Test: %.3f’ % (train_acc, test_acc))

    plt.plot(history.history[‘acc’], label=’train’)
    plt.plot(history.history[‘val_acc’], label=’test’)
    cvscores.append(test_acc * 100)

    print(“%.2f%% (+/- %.2f%%)” % (np.mean(cvscores)\
    , np.std(cvscores)))

  76. Avatar
    mustafa mohammed September 1, 2019 at 4:22 am #

    Hello dear
    How can I do k-fold cross-validation on this code?
    model = Sequential()
    model.add(LSTM(150, input_shape=(train_X.shape[1], train_X.shape[2])))#,return_sequences=True
    model.compile(loss=’mae’, optimizer=’adam’)#,metrics=[‘accuracy’])#mean_squared_error

    # Fit the model
    history = model.fit(train_X, train_y, epochs=150,validation_data=(test_X, test_y),batch_size=24,verbose=2,shuffle=False)

    #pyplot.plot(history.history[‘loss’], label=’train’)

  77. Avatar
    mustafa mohammed September 1, 2019 at 9:22 pm #

    hello again

    When adding accuracy, there is no change in the accuracy on the training values, and the accuracy remains zero on the test values. Why?

    # design network
    learning_rate = 0.001
    model = Sequential()
    model.add(LSTM(150, input_shape=(train_X.shape[1], train_X.shape[2])))#,return_sequences=True
    model.compile(loss=’mae’, optimizer=’adam’,metrics=[‘accuracy’])#mean_squared_error

    # Fit the model
    history = model.fit(train_X, train_y, epochs=50,validation_data=(test_X, test_y),batch_size=24,verbose=2,shuffle=False)

  78. Avatar
    mustafa mohammed September 4, 2019 at 2:25 am #

    Thank you very much for the informative answers.
    I have one last request: I want particle swarm optimization (PSO) code in Python to find the best weights and biases, the best number of hidden layers, and the number of nodes in each hidden layer.
    Thank you very much; I am grateful for the publication of science.

    • Avatar
      Jason Brownlee September 4, 2019 at 6:02 am #

      Sorry, I don’t have any tutorials on PSO for neural nets.

  79. Avatar
    Jack September 9, 2019 at 6:20 am #

    Hi Jason, I’m very interested in your StratifiedKFold part, I tried it myself but I got an error, the code is as below:
    from keras.datasets import boston_housing

    (train_data, train_targets), (test_data, test_targets) = boston_housing.load_data()

    # normalize the data
    mean = train_data.mean(axis=0)
    train_data -= mean
    std = train_data.std(axis=0)
    train_data /= std

    test_data -= mean
    test_data /= std

    X_train = train_data
    y_train = train_targets
    X_test = test_data
    y_test = test_targets
    from keras import models
    from keras import layers
    from sklearn.model_selection import StratifiedKFold

    def build_model():
    # Because we will need to instantiate
    # the same model multiple times,
    # we use a function to construct it.
    model = models.Sequential()
    model.add(layers.Dense(64, activation=’relu’,
    model.add(layers.Dense(64, activation=’relu’))
    model.compile(optimizer=’rmsprop’, loss=’mse’, metrics=[‘mae’])
    return model

    model = build_model()
    model.fit(X_train, y_train)

    kfold = StratifiedKFold(n_splits=10)
    for train, test in kfold.split(X_train, y_train):
    model = build_model()
    model.fit(train, test)
    print(model.evaluate(X_test, y_test, verbose = 0))

    And the error is ValueError: Supported target types are: (‘binary’, ‘multiclass’). Got ‘continuous’ instead.
    It seems that this method doesn’t apply for float training set since my X_train looks like:

    array([[-0.27224633, -0.48361547, -0.43576161, …, 1.14850044,
    0.44807713, 0.8252202 ],
    [-0.40342651, 2.99178419, -1.33391162, …, -1.71818909,
    0.43190599, -1.32920239],
    [ 0.1249402 , -0.48361547, 1.0283258 , …, 0.78447637,
    0.22061726, -1.30850006],

    So how am I supposed to do cross validation in deep learning with this kind of data?

    • Avatar
      Jason Brownlee September 9, 2019 at 1:53 pm #

      Sorry to hear that.

      I believe StratifiedKFold is only appropriate for classification predictive modeling problems, not regression problems.
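
      For a continuous target, one option is plain KFold, which does not inspect the target values. The sketch below assumes scikit-learn's LinearRegression and synthetic data as stand-ins for the Keras model and the housing data from the question. Note that the fold indices are used to index the arrays, rather than being passed to fit() directly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic regression data (assumption: 13 input features, continuous target).
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1)

kfold = KFold(n_splits=10, shuffle=True, random_state=1)
mae_scores = []
for train, test in kfold.split(X):
    model = LinearRegression()     # a fresh model for every fold
    model.fit(X[train], y[train])  # train/test are index arrays, not data
    mae_scores.append(mean_absolute_error(y[test], model.predict(X[test])))

print("MAE: %.2f (+/- %.2f)" % (np.mean(mae_scores), np.std(mae_scores)))
```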

  80. Avatar
    David November 5, 2019 at 6:34 am #

    In the last line, you compute numpy.std(cvscores). How do you use this information?

  81. Avatar
    Sana November 28, 2019 at 4:44 am #

    Hi Jason,
    I want to use kfold.split(x,y) but I get this error: ValueError: Found array with dim 3. Estimator expected <= 2. This is because my x.shape=(2000, 1400, 3, 6) and y.shape=(2000, 1400, 3).
    Should I reshape my input data to a 2D array?
    Thank you

  82. Avatar
    Kamal Pandey December 12, 2019 at 12:44 am #

    How can I use k-fold validation with this one? Help is appreciated.

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
    tokenizer = Tokenizer(num_words=MAX_NUM_WORDS)
    X_train = tokenizer.texts_to_sequences(X_train)
    X_test = tokenizer.texts_to_sequences(X_test)
    vocab_size = len(tokenizer.word_index) + 1
    X_train = pad_sequences(X_train, maxlen=MAX_SEQUENCE_LENGTH)
    X_test = pad_sequences(X_test, maxlen=MAX_SEQUENCE_LENGTH)
    def model():
    input_shape = (MAX_SEQUENCE_LENGTH,)
    model_input = Input(shape = input_shape, name = “input”, dtype = ‘int32′)
    embedding = glove_embd(model_input)
    lstm = LSTM(100, dropout=0.3, recurrent_dropout=0.3, name=”lstm”)(embedding)
    model_output = Dense(2, activation=’softmax’, name=”softmax”)(lstm)
    model = Model(inputs=model_input, outputs=model_output)
    return model
    model = model()
    optimizer = ‘adam’,
    metrics = [‘accuracy’])
    history = model.fit(X_train, Y_train, batch_size=1500, epochs=50, verbose=1, validation_data=(X_test, Y_test))
    loss, accuracy = model.evaluate(X_train, Y_train, verbose=False)

  83. Avatar
    Sruthy December 15, 2019 at 6:51 pm #

    HI Jason,

    Thank you for the information. It helped me a lot. I have a question. After k-fold validation, say 5-fold, we will get 5 accuracy values, one for each fold, right? After the whole 5-fold validation, if I need to do another test using outside data, we can call the model to evaluate, right? So I was wondering which trained network I am calling. Is it the one from the last fold, or the averaged model of all 5 folds? Which is the final network?

    Kindly advise me on this.

    Many Thanks,

  84. Avatar
    Mina January 5, 2020 at 3:50 am #

    Hi Jason,
    When I use
    scores = model.evaluate(X[test], Y[test], verbose=0), while metric that has been used in compile is ‘accuracy’, the result that I get is very different from when I compute accuracy for the predicted results. What is the problem?

    • Avatar
      Jason Brownlee January 5, 2020 at 7:07 am #

      Accuracy on the same data via evaluate and via manual calculation should be identical.

      If not, check for a bug in your code.
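
      A small sketch of the manual calculation, assuming a binary problem with sigmoid outputs and a 0.5 threshold (the probabilities below are made-up illustrative values, not model output). One possible source of a mismatch is comparing raw probabilities, rather than thresholded class labels, against the targets.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up targets and predicted probabilities for illustration only.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])
probs = np.array([0.2, 0.8, 0.4, 0.1, 0.9, 0.6, 0.7, 0.3])

# Convert probabilities to class labels before comparing with the targets.
y_pred = (probs > 0.5).astype(int)

manual_accuracy = float(np.mean(y_pred == y_true))      # correct / total
library_accuracy = float(accuracy_score(y_true, y_pred))
print(manual_accuracy, library_accuracy)
```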

  85. Avatar
    Matthew February 16, 2020 at 6:29 am #

    Hey Jason!

    Love the article, I’ve been looking for a Tensorflow CV solution for a long time and this explains it perfectly. I do have a question though. Not looking for you to correct my code, just maybe give your insight into what’s going on here.

    I’m using almost an identical code to yours above for my dataset (I’ll show code below), but I’m running into an issue where after each training run of 200 epochs, the model gets closer and closer to 100% accuracy? I thought maybe my test dataset was remaining the same and the model was being fed a new training dataset fold after each iteration, but I checked and the test dataset does change each time, so I’m at a loss. My code and output looks as below:

    Any ideas? Love the site! Thanks again!

    • Avatar
      Jason Brownlee February 17, 2020 at 7:37 am #

      Yes, you must re-define the model for each fold of the CV. Otherwise the model just continues learning from the last fold.

      • Avatar
        Matthew February 17, 2020 at 4:45 pm #

        Right on, thanks so much for taking the time! And keep up the great work! We really appreciate it.

  86. Avatar
    AGGELOS PAPOUTSIS February 20, 2020 at 9:00 pm #

    hi jason,

    I am a little confused about your implementation. You say in the description that stratified k-fold splits the training data into k folds. But then you use:

    for train, test in kfold.split(X, Y)

    So why do you pass the label Y?

    • Avatar
      Jason Brownlee February 21, 2020 at 8:21 am #

      So the folds are stratified by the class label.
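
      A short sketch of what that means, assuming a toy dataset with 80 negative and 20 positive labels (the data is illustrative only). The labels are what allow StratifiedKFold to keep the class ratio roughly the same in every fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy data: 100 rows, 20% of them in the positive class.
X = np.arange(100).reshape(-1, 1)
Y = np.array([0] * 80 + [1] * 20)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
fold_ratios = []
for train, test in kfold.split(X, Y):
    # Fraction of positive labels in this test fold.
    fold_ratios.append(float(Y[test].mean()))

print(fold_ratios)
```

      Each test fold should hold about the same share of positives as the full dataset, which a plain random split does not guarantee.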

  87. Avatar
    aggelos February 20, 2020 at 9:02 pm #

    hi jason,

    Do you have any examples for LOO with LSTM in Keras?

  88. Avatar
    Mira March 14, 2020 at 8:21 am #

    Hi, which accuracy and loss should I look at to get a good evaluation of the predictions?

    • Avatar
      Jason Brownlee March 14, 2020 at 9:53 am #

      I recommend tuning model performance based on loss.

      I recommend choosing a model based on out of sample performance using a metric that best captures your project goals.

  89. Avatar
    Bevan Smith April 16, 2020 at 7:56 pm #

    Thanks Jason, very helpful!

  90. Avatar
    Manohar May 28, 2020 at 5:01 am #

    Does placing

    model = Sequential()

    and recompiling not reset the model again?

    • Avatar
      Jason Brownlee May 28, 2020 at 6:21 am #

      It redefines the model.

      • Avatar
        Manohar May 28, 2020 at 5:12 pm #

        I thought we want to actually keep the same model with its weights etc. and retrain it.

  91. Avatar
    MD MAHMUDUL HASAN June 26, 2020 at 9:39 pm #

    Hi Jason, Very helpful tutorial. Are there any ways out there to calculate specificity in case of cross-validation?

  92. Avatar
    Ali July 26, 2020 at 7:57 pm #

    Hi Jason
    thanks for your tutorials.
    How can I use the above model to calculate the confusion matrix, F1-score, recall, and precision?

  93. Avatar
    ali July 27, 2020 at 11:17 pm #

    Thank you Jason, but I meant: how can I calculate the confusion matrix, etc., after 10-fold cross-validation?
    I have two more questions:
    The following is my deep model:
    1. Every time I run it, I get a different accuracy. What should I do to keep the accuracy value constant?
    2. The accuracy and loss on the training and validation sets differ greatly from each other. I think overfitting is happening; I even used dropout, but I did not succeed. My sample size is 8000. Is it because of a lack of sample data, or is there another reason? Is there any way to reduce overfitting?

    # my input data is (8000, 100)

    MAX_NB_WORDS = 30000
    nb_words = max(MAX_NB_WORDS, len(tokenizer.word_index))
    model = Sequential()

    model.add(Dense(1, activation = ‘sigmoid’))

    # Run LSTM Model
    batch = 64
    epoch = 500
    LSTM_model = model.fit(X_train, Y_train, batch_size=batch, epochs=epoch,verbose = 1,shuffle = True, validation_split=0.1)

    test_loss , test_acc = model.evaluate(X_test, Y_test)
    pred = model.predict(X_test)

    Y_pred = []
    for p in pred:
    if p > 0.5:
    class_names = [‘Negative’, ‘Positive’]
    print(classification_report(Y_test.to_list(), Y_pred, target_names = class_names))
    cm = confusion_matrix(Y_test.to_list(),Y_pred)

  94. Avatar
    Jim August 14, 2020 at 12:06 am #

    Hi Jason. Thank you for your help on cross-validation in Keras. I would like to do stratified validation with an LSTM for text classification (sentiment analysis). Is it possible? If yes, can you send me a link to a GitHub repository or something else? I can’t find a good tutorial on the Internet. Thank you.

  95. Avatar
    Anurag Maji September 8, 2020 at 4:39 am #

    Hi Jason,
    Loved your work. I am currently working on multi-label classification for an audio tagging problem. To improve the performance I am planning to apply k-fold, but is it possible to apply stratified k-fold to multi-label data easily? Or, even without stratification, will k-fold help me get good results?

    Thanks in advance

    • Avatar
      Jason Brownlee September 8, 2020 at 6:53 am #


      Not sure stratified k-fold CV is aware of multi-label problems. It is designed for multi-class problems as far as I know.

      Perhaps check the doco or test?

  96. Avatar
    yongkai LIU December 20, 2020 at 3:00 pm #

    Where do I put the validation dataset in cross-validation?

    • Avatar
      Jason Brownlee December 21, 2020 at 6:31 am #

      Within each fold of cross-validation, you can split the training portion into train and validation sets and use the validation set to tune the model hyperparameters or for early stopping during training.
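
      A minimal sketch of that nesting, assuming synthetic data and an 80/20 inner split (both are illustrative choices). Model fitting is left as a comment, since only the index handling is the point here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=200, n_features=8, random_state=9)

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=9)
split_sizes = []
for train, test in kfold.split(X, y):
    # Carve a validation set out of this fold's training portion only.
    X_train, X_val, y_train, y_val = train_test_split(
        X[train], y[train], test_size=0.2, stratify=y[train], random_state=9)
    # Fit on (X_train, y_train), tune or early-stop on (X_val, y_val),
    # then score once on the untouched (X[test], y[test]).
    split_sizes.append((len(X_train), len(X_val), len(test)))

print(split_sizes)
```

      With Keras, the validation pair would typically be passed as validation_data to fit(), while the fold's test rows are kept for the final evaluate() call.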

  97. Avatar
    Speedster February 11, 2021 at 4:21 am #


    I’m trying to fit a neural network onto a dataset (regression) and I want to know whether my steps are correct to get a good and unbiased model

    1) Hyperparameter tuning based on 5-fold cross validation. I’m fitting on the training data and validate on my validation data while training. After training I’m calculating my metrics on the validation data. (I do not have a testing set)
    I use Early Stopping on my loss (not validation loss) to save some time.

    2) Train ALL of my data on the optimal model from the hyperparameter study.

    I’m wondering if this approach is okay, since I do not have a testing set. On the other hand I’ve already found a model which does not overfit and generalizes well. So the additional data I have should only improve my model.

    I am a little confused.

    Thanks for your help.

    • Avatar
      Jason Brownlee February 11, 2021 at 5:57 am #

      There is no idea of “objectively correct”. Choose a process that gives you confidence you have a robust model for your dataset.

  98. Avatar
    Tom June 2, 2021 at 4:20 am #

    Hi Jason,

    Since for each epoch, we will have different validation accuracy what would be your suggestion to report the final result for publication.
    For example, I used 70% of my data for training and 30% for validation.
    Validation accuracy for 100 epochs started from 62% (epoch 1) and reached 91% (epoch 100).
    How can I report a single number in my paper for my model accuracy?
    Should I average the accuracy of 100 epochs?

    Thank you.

    • Avatar
      Jason Brownlee June 2, 2021 at 5:47 am #

      You can use any evaluation procedure you want as long as it is clearly stated, consistent in comparison of methods, and reproducible.

  99. Avatar
    NORAH October 19, 2021 at 3:43 am #

    Why do I get this error?

    TypeError Traceback (most recent call last)
    11 print(train, test)
    12 # Fit the model
    —> 13 model.fit(X[train], Y[train], epochs=20, batch_size=7, verbose=1)
    14 # evaluate the model
    15 scores = model.evaluate(X[test], Y[test], verbose=1)

    TypeError: only integer scalar arrays can be converted to a scalar index

  100. Avatar
    John January 24, 2022 at 5:25 am #

    Hello Dr. Brownlee! Excellent tutorial…One question only…

    When it runs this command:

    print(“%s: %.2f%%” % (model.metrics_names[1], scores[1]*100))

    I get this error:

    TypeError: ‘float’ object is not subscriptable

    How can I fix that?

  101. Avatar
    Josh March 17, 2022 at 2:32 pm #


    From your article, I was not able to have a firm understanding in just one part.
    If I have a program for categorical crossentropy for CNN using Tensorflow, and for the code

    model.compile(loss=’categorical_crossentropy’, metrics=[‘accuracy’])
    model.fit(xTrain, yTrain, epochs=100)
    loss, accuracy = model.evaluate(xTest, yTest)

    Can I ask you how “accuracy”, which is returned by evaluate(), is calculated?
    I would like to know whether it is the fraction of correct predictions (total number of correct predictions / total number of samples).

    However, I was not able to find how accuracy from evaluate() from Tensorflow is calculated.
    If my opinion is correct, could I ask you which Tensorflow website I could refer to so that I could be sure? Thank you!
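
    As a rough illustration of the calculation described above, assuming one-hot labels and softmax outputs, categorical accuracy counts a sample as correct when the index of the largest predicted probability matches the index of the true class. The function below is our own helper with made-up data, not the Keras source.

```python
def categorical_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class (argmax) matches the true class."""
    def argmax(row):
        return max(range(len(row)), key=row.__getitem__)
    correct = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Made-up one-hot labels and predicted class probabilities for four samples.
y_true = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.4, 0.3], [0.2, 0.5, 0.3]]

acc = categorical_accuracy(y_true, y_pred)
print(acc)  # 3 of the 4 samples are classified correctly
```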

  102. Avatar
    Tarun July 12, 2022 at 11:59 pm #

    How can I compute the train and test MAPE, as well as the train and test median absolute error, for an MLP / CNN / LSTM?

    Please let me know.
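
    Both quantities depend only on the true targets and the predictions, so the same calculation applies to an MLP, CNN, or LSTM once predictions are available. A minimal sketch follows, assuming the made-up lists stand in for targets and model.predict(...) outputs and that no true value is zero; the function names are our own.

```python
from statistics import median

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; assumes no true value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def median_absolute_error(y_true, y_pred):
    """Median of the absolute differences between targets and predictions."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))

# Made-up values standing in for targets and predictions on each split.
y_train_true, y_train_pred = [100.0, 200.0, 400.0], [110.0, 180.0, 400.0]
y_test_true, y_test_pred = [50.0, 80.0], [45.0, 100.0]

train_mape = mape(y_train_true, y_train_pred)
test_mape = mape(y_test_true, y_test_pred)
train_medae = median_absolute_error(y_train_true, y_train_pred)
test_medae = median_absolute_error(y_test_true, y_test_pred)
print(train_mape, test_mape, train_medae, test_medae)
```

    Computing each metric separately on the train and test predictions gives the four numbers asked about.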

  103. Avatar
    Yasir December 12, 2022 at 10:34 am #

    Hi Jason, I have a question.
    Can we follow these steps?
    1- Split the data into train and test sets
    2- Apply k-fold cross-validation to the train data (so the train data is divided into train and validation folds)
    3- Find the accuracy for k-fold cross-validation
    4- Find the accuracy on the test data
    5- Finally, compare the two accuracies to check for overfitting

    Is this approach right or not?

    thank you very much
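
    The five steps above can be sketched in plain Python. This is only an illustration of the workflow under strong assumptions: the labels are made up, features are omitted, and fit_majority is a toy stand-in for training a real Keras model. All function names are hypothetical.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle positions 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_majority(labels):
    """Toy stand-in for model training: remember the most common label."""
    return max(set(labels), key=labels.count)

def accuracy(model_label, labels):
    return sum(y == model_label for y in labels) / len(labels)

# Made-up labels; the toy model ignores features, so none are defined.
labels = [0] * 14 + [1] * 6
all_idx = list(range(len(labels)))
random.Random(1).shuffle(all_idx)

# Step 1: hold out a test set that cross-validation never sees.
test_idx, train_idx = all_idx[:5], all_idx[5:]

# Steps 2-3: k-fold cross-validation inside the training portion only.
folds = k_fold_indices(len(train_idx), k=3)
cv_scores = []
for held_out in range(len(folds)):
    val_ids = [train_idx[i] for i in folds[held_out]]
    fit_ids = [train_idx[i] for f, fold in enumerate(folds) if f != held_out for i in fold]
    model = fit_majority([labels[i] for i in fit_ids])
    cv_scores.append(accuracy(model, [labels[i] for i in val_ids]))
cv_accuracy = sum(cv_scores) / len(cv_scores)

# Step 4: refit on all training data and score once on the held-out test set.
final_model = fit_majority([labels[i] for i in train_idx])
test_accuracy = accuracy(final_model, [labels[i] for i in test_idx])

# Step 5: a large gap between the two estimates would be a warning sign.
print(cv_accuracy, test_accuracy)
```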

  104. Avatar
    Omar March 3, 2023 at 9:50 pm #

    Hi Jason, thank you for the thread. I have a question: how does Keras validation_split prevent information leakage?

    Since we need to scale the training data before fitting the model, how would using validation_split keep us from leaking information?

    From what I have learned, we should fit the scaler only on the training set and then apply the fitted scaler to the test set. In this case, I believe we are treating the validation set as a “test set” during training to improve the model’s performance, hence the confusion. If we scale the entire training set, the validation set is no longer unseen data.

    Thank you.
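
    One way to address the concern above, sketched with made-up single-feature data: validation_split holds out the last fraction of the rows passed to fit(), so scaling the whole array beforehand lets the held-out rows influence the scaling statistics. Splitting manually and fitting the scaling on the training rows only avoids that. The variable names are hypothetical, and the plain-Python standardization stands in for a fitted scaler object.

```python
from statistics import mean, pstdev

# Made-up single-feature data; the last rows play the role of the validation set.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 100.0, 120.0]
validation_split = 0.2

# Hold out the last fraction of rows, as validation_split does.
split_at = len(values) - int(len(values) * validation_split)
train_raw, val_raw = values[:split_at], values[split_at:]

# Fit the scaling statistics on the training rows only, then apply them to both parts.
mu, sigma = mean(train_raw), pstdev(train_raw)
train_scaled = [(v - mu) / sigma for v in train_raw]
val_scaled = [(v - mu) / sigma for v in val_raw]

# For contrast: statistics computed on all rows are shifted by the held-out values.
leaky_mu = mean(values)
print(mu, leaky_mu)
```

    The separately scaled parts can then be passed explicitly, for example model.fit(X_train, y_train, validation_data=(X_val, y_val)), instead of using validation_split.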

  105. Avatar
    Yeon T April 28, 2023 at 6:52 pm #

    Hey! Why do you multiply the accuracy score by 100? And does it make any sense to evaluate normalized data? I really struggle to understand that…

  106. Avatar
    Michio Suginoo October 13, 2023 at 2:43 am #

    Hi Jason, thanks for a great post.
    I have a question regarding Stratified K-fold.
    Why did you choose to use scikit-learn’s function for that?
    Doesn’t Keras have any built-in function to do this?
    If it does, can you point me to a good resource for learning how to do that with Keras’s built-in function?

