How to Load Large Datasets From Directories for Deep Learning in Keras

There are conventions for storing and structuring your image dataset on disk that make it fast and efficient to load when training and evaluating deep learning models.

Once structured, you can use tools like the ImageDataGenerator class in the Keras deep learning library to automatically load your train, test, and validation datasets. In addition, the generator will progressively load the images in your dataset, allowing you to work with both small and very large datasets containing thousands or millions of images that may not fit into system memory.

In this tutorial, you will discover how to structure an image dataset and how to load it progressively when fitting and evaluating a deep learning model.

After completing this tutorial, you will know:

  • How to organize train, test, and validation image datasets into a consistent directory structure.
  • How to use the ImageDataGenerator class to progressively load the images for a given dataset.
  • How to use a prepared data generator to train, evaluate, and make predictions with a deep learning model.

Let’s get started.

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Dataset Directory Structure
  2. Example Dataset Structure
  3. How to Progressively Load Images

Dataset Directory Structure

There is a standard way to lay out your image data for modeling.

After you have collected your images, you must sort them first by dataset, such as train, test, and validation, and second by their class.

For example, imagine an image classification problem where we wish to classify photos of cars based on their color, e.g. red cars, blue cars, etc.

First, we have a data/ directory where we will store all of the image data.

Next, we will have a data/train/ directory for the training dataset and a data/test/ for the holdout test dataset. We may also have a data/validation/ for a validation dataset during training.

So far, we have a data/ directory containing three subdirectories: data/train/, data/test/, and data/validation/.

Under each of the dataset directories, we will have subdirectories, one for each class where the actual image files will be placed.

For example, if we have a binary classification task for classifying photos of cars as either a red car or a blue car, we would have two classes, ‘red‘ and ‘blue‘, and therefore two class directories under each dataset directory.

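For example, the training dataset would contain data/train/red/ and data/train/blue/, and the test and validation directories would each contain the same two class directories.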

Images of red cars would then be placed in the appropriate class directory.

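For example, a training photo of a red car would be stored under data/train/red/, and a test photo of a red car under data/test/red/.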

Remember, we are not placing the same files under the red/ and blue/ directories; instead, there are different photos of red cars and blue cars respectively.

Also recall that we require different photos in the train, test, and validation datasets.

The filenames used for the actual images often do not matter as we will load all images with given file extensions.

A good naming convention, if you have the ability to rename files consistently, is to use some name followed by a number with zero padding, e.g. image0001.jpg if you have thousands of images for a class.
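As a small illustration of this naming convention, the sketch below generates zero-padded file names. The helper name padded_name and the width of four digits are assumptions chosen to match the image0001.jpg example; they are not part of Keras.

```python
def padded_name(index, prefix="image", width=4, ext=".jpg"):
    """Return a file name such as image0001.jpg for index 1."""
    return f"{prefix}{index:0{width}d}{ext}"


# With zero padding, sorting names alphabetically matches numerical order.
names = [padded_name(i) for i in (1, 2, 10, 100)]
print(names)
```

Without the padding, a name such as image10.jpg would sort before image2.jpg, which makes listings harder to read.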

Example Dataset Structure

We can make the image dataset structure concrete with an example.

Imagine we are classifying photographs of cars, as we discussed in the previous section. Specifically, a binary classification problem with red cars and blue cars.

We must create the directory structure outlined in the previous section, specifically the red/ and blue/ class directories under each of data/train/, data/test/, and data/validation/.

Let’s actually create these directories.
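One way to create the directories is with a short Python script. This is a minimal sketch using only the standard library; the function name create_dataset_dirs and its default arguments are assumptions made for this example.

```python
import os


def create_dataset_dirs(root="data",
                        subsets=("train", "test", "validation"),
                        classes=("red", "blue")):
    """Create root/<subset>/<class>/ for every subset and class."""
    created = []
    for subset in subsets:
        for label in classes:
            path = os.path.join(root, subset, label)
            # exist_ok=True makes the script safe to run more than once.
            os.makedirs(path, exist_ok=True)
            created.append(path)
    return created
```

Calling create_dataset_dirs() from the current working directory would create the six class directories under data/.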

We can also put some photos in the directories.

You can use the creative commons image search to find some images with a permissive license that you can download and use for this example.

I will use two images:

Red Car, by Dennis Jarvis

Blue Car, by Bill Smith

Download the photos to your current working directory and save the photo of the red car as ‘red_car_01.jpg‘ and the photo of the blue car as ‘blue_car_01.jpg‘.

We must have different photos for each of the train, test, and validation datasets.

In the interest of keeping this tutorial focused, we will re-use the same image files in each of the three datasets but pretend they are different photographs.

Place copies of the ‘red_car_01.jpg‘ file in data/train/red/, data/test/red/, and data/validation/red/ directories.

Now place copies of the ‘blue_car_01.jpg‘ file in data/train/blue/, data/test/blue/, and data/validation/blue/ directories.
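The copies can be made by hand or with a few lines of Python. The sketch below assumes the two photos are in the current working directory under the names given above; the helper copy_into_subsets is a hypothetical name used only for this example.

```python
import os
import shutil


def copy_into_subsets(photo, label, root="data",
                      subsets=("train", "test", "validation")):
    """Copy one photo into the class directory of every dataset subset."""
    copies = []
    for subset in subsets:
        target_dir = os.path.join(root, subset, label)
        os.makedirs(target_dir, exist_ok=True)
        # shutil.copy returns the path of the newly created file.
        copies.append(shutil.copy(photo, target_dir))
    return copies
```

For this tutorial, the calls would be copy_into_subsets('red_car_01.jpg', 'red') and copy_into_subsets('blue_car_01.jpg', 'blue').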

We now have a very basic dataset layout: each of data/train/, data/test/, and data/validation/ contains a red/ directory holding red_car_01.jpg and a blue/ directory holding blue_car_01.jpg.

Below is a screenshot of the directory structure, taken from the Finder window on macOS.

Screenshot of Image Dataset Directory and File Structure

Now that we have a basic directory structure, let’s practice loading image data from file for use with modeling.

How to Progressively Load Images

It is possible to write code to manually load image data and return data ready for modeling.

This would include walking the directory structure for a dataset, loading image data, and returning the input (pixel arrays) and output (class integer).
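To make the idea concrete, here is a rough sketch of what the directory-walking part of such code might look like. It only collects file paths and class integers and yields them in batches; decoding the pixels is left out because it would need an image library. The function names, the list of file extensions, and the choice to number classes in sorted order are all assumptions of this sketch.

```python
import os


def list_labelled_files(directory, extensions=(".jpg", ".jpeg", ".png")):
    """Return (path, class_index) pairs and the sorted class names."""
    class_names = sorted(
        name for name in os.listdir(directory)
        if os.path.isdir(os.path.join(directory, name))
    )
    samples = []
    for index, name in enumerate(class_names):
        class_dir = os.path.join(directory, name)
        for filename in sorted(os.listdir(class_dir)):
            if filename.lower().endswith(extensions):
                samples.append((os.path.join(class_dir, filename), index))
    return samples, class_names


def iter_batches(samples, batch_size=32):
    """Yield successive lists of (path, class_index) pairs."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]
```

Because iter_batches is a generator, only one batch of file references is produced at a time, which is the essence of loading a dataset progressively.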

Thankfully, we don’t need to write this code. Instead, we can use the ImageDataGenerator class provided by Keras.

The main benefit of using this class to load the data is that images are loaded for a single dataset in batches, meaning that it can be used for loading both small datasets as well as very large image datasets with thousands or millions of images.

Instead of loading all images into memory, it will load just enough images into memory for the current and perhaps the next few mini-batches when training and evaluating a deep learning model. I refer to this as progressive loading, as the dataset is progressively loaded from file, retrieving just enough data for what is needed immediately.

Two additional benefits of using the ImageDataGenerator class are that it can automatically scale the pixel values of images and automatically generate augmented versions of images. We will leave these topics for discussion in another tutorial and instead focus on how to use the ImageDataGenerator class to load image data from file.

The pattern for using the ImageDataGenerator class is as follows:

  1. Construct and configure an instance of the ImageDataGenerator class.
  2. Retrieve an iterator by calling the flow_from_directory() function.
  3. Use the iterator in the training or evaluation of a model.

Let’s take a closer look at each step.

The constructor for the ImageDataGenerator contains many arguments to specify how to manipulate the image data after it is loaded, including pixel scaling and data augmentation. We do not need any of these features at this stage, so configuring the ImageDataGenerator is easy.

Next, an iterator is required to progressively load images for a single dataset.

This requires calling the flow_from_directory() function and specifying the dataset directory, such as the train, test, or validation directory.

The function also allows you to configure more details related to the loading of images. Of note is the ‘target_size‘ argument that allows you to load all images to a specific size, which is often required when modeling. The function defaults to square images with the size (256, 256).

The function also allows you to specify the type of classification task via the ‘class_mode‘ argument, specifically whether it is binary classification (‘binary‘) or multi-class classification (‘categorical‘).

The default ‘batch_size‘ is 32, which means that 32 randomly selected images from across the classes in the dataset will be returned in each batch when training. Larger or smaller batches may be desired. You may also want to return batches in a deterministic order when evaluating a model, which you can do by setting ‘shuffle‘ to ‘False‘.
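The first two steps of the pattern can be sketched as follows, with the arguments discussed above written out explicitly. This assumes a Keras installation that still provides ImageDataGenerator (recent TensorFlow releases mark it as deprecated); the wrapper function make_train_iterator and the data/train/ path are assumptions of this example.

```python
def make_train_iterator(directory="data/train/"):
    # Imported inside the function so this file can be imported without Keras.
    from keras.preprocessing.image import ImageDataGenerator

    # Step 1: a generator with the default configuration
    # (no pixel scaling, no augmentation).
    datagen = ImageDataGenerator()
    # Step 2: an iterator that loads images from one dataset directory.
    return datagen.flow_from_directory(
        directory,
        target_size=(256, 256),  # the default, written out for clarity
        class_mode="binary",     # use "categorical" for more than two classes
        batch_size=32,           # the default
        shuffle=True,            # set to False for a deterministic order
    )
```

The values shown for target_size, batch_size, and shuffle are the defaults, so omitting them would give the same behavior.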

There are many other options, and I encourage you to review the API documentation.

We can use the same ImageDataGenerator to prepare separate iterators for separate dataset directories. This is useful if we would like the same pixel scaling applied to multiple datasets (e.g. train, test, etc.).

Once the iterators have been prepared, we can use them when fitting and evaluating a deep learning model.

For example, fitting a model with a data generator can be achieved by calling the fit_generator() function on the model and passing the training iterator (train_it). The validation iterator (val_it) can be specified when calling this function via the ‘validation_data‘ argument.

The ‘steps_per_epoch‘ argument must be specified for the training iterator in order to define how many batches of images make up a single epoch.

For example, if you have 1,000 images in the training dataset (across all classes) and a batch size of 64, then the steps_per_epoch would be 16, that is, 1000/64 = 15.625 rounded up so that the final, partially filled batch is included.

Similarly, if a validation iterator is used, then the ‘validation_steps‘ argument must also be specified to indicate the number of batches in the validation dataset that make up one epoch.
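The number of steps can be worked out with a short calculation. This is a minimal sketch; the helper name steps_for is an assumption for this example, and rounding up is chosen so that a final partial batch is not skipped.

```python
import math


def steps_for(num_images, batch_size):
    """Number of batches needed to cover every image once."""
    return math.ceil(num_images / batch_size)


# 1000 / 64 = 15.625, which rounds up to 16 batches.
print(steps_for(1000, 64))
```

The same calculation applies to the validation and test datasets, using their own image counts.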

Once the model is fit, it can be evaluated on a test dataset using the evaluate_generator() function and passing in the test iterator (test_it). The ‘steps‘ argument defines the number of batches of samples to step through when evaluating the model before stopping.

Finally, if you want to use your fit model for making predictions on a very large dataset, you can create an iterator for that dataset as well (e.g. predict_it) and call the predict_generator() function on the model.
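Putting these calls together, the third step of the pattern might look like the sketch below. It assumes that model is an already defined and compiled Keras model and that the four iterators were created with flow_from_directory(); the function name train_evaluate_predict is hypothetical. Note that recent TensorFlow releases deprecate fit_generator(), evaluate_generator(), and predict_generator() in favor of fit(), evaluate(), and predict(), which accept the same iterators.

```python
def train_evaluate_predict(model, train_it, val_it, test_it, predict_it,
                           epochs=1):
    """Fit, evaluate, and predict using prepared directory iterators."""
    # len() of a Keras directory iterator is its number of batches per pass.
    model.fit_generator(
        train_it,
        steps_per_epoch=len(train_it),
        epochs=epochs,
        validation_data=val_it,
        validation_steps=len(val_it),
    )
    # Evaluate on the holdout test dataset.
    loss = model.evaluate_generator(test_it, steps=len(test_it))
    # Make predictions for every batch in the prediction dataset.
    yhat = model.predict_generator(predict_it, steps=len(predict_it))
    return loss, yhat
```

For evaluation and prediction, the iterators would normally be created with shuffle set to False so that results line up with the files on disk.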

Let’s use our small dataset defined in the previous section to demonstrate how to define an ImageDataGenerator instance and prepare the dataset iterators.

A complete example is listed below.
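The original listing is not reproduced here, so the following is a reconstruction sketch rather than the exact code. It assumes the data/ directory prepared earlier and a Keras version that still provides ImageDataGenerator; the wrapper function load_first_batch is an assumption of this sketch.

```python
import os


def load_first_batch(data_dir="data"):
    # Imported here so that the file can be imported without Keras installed.
    from keras.preprocessing.image import ImageDataGenerator

    # create a generator with the default configuration
    datagen = ImageDataGenerator()
    # prepare an iterator for each dataset
    train_it = datagen.flow_from_directory(
        os.path.join(data_dir, "train"), class_mode="binary")
    val_it = datagen.flow_from_directory(
        os.path.join(data_dir, "validation"), class_mode="binary")
    test_it = datagen.flow_from_directory(
        os.path.join(data_dir, "test"), class_mode="binary")
    # retrieve one batch of images and labels from the training iterator
    batchX, batchy = next(train_it)
    print("Batch shape=%s, min=%.3f, max=%.3f"
          % (batchX.shape, batchX.min(), batchX.max()))
    return train_it, val_it, test_it
```

Calling load_first_batch() from the directory that contains data/ runs the example.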

Running the example first creates an instance of the ImageDataGenerator with all default configuration.

Next, three iterators are created, one for each of the train, validation, and test binary classification datasets. As each iterator is created, we can see debug messages reporting the number of images and classes discovered and prepared.

Finally, we test out the train iterator that would be used to fit a model. The first batch of images is retrieved and we can confirm that the batch contains two images, as only two images were available. We can also confirm that the images were loaded and forced to the square dimensions of 256 rows and 256 columns of pixels and the pixel data was not scaled and remains in the range [0, 255].

Summary

In this tutorial, you discovered how to structure an image dataset and how to load it progressively when fitting and evaluating a deep learning model.

Specifically, you learned:

  • How to organize train, test, and validation image datasets into a consistent directory structure.
  • How to use the ImageDataGenerator class to progressively load the images for a given dataset.
  • How to use a prepared data generator to train, evaluate, and make predictions with a deep learning model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

32 Responses to How to Load Large Datasets From Directories for Deep Learning in Keras

  1. Tony Holdroyd April 12, 2019 at 6:31 am #

    Really clear and very useful tutorial as always Jason.
    I was wondering what you thought about using Keras from within TensorFlow, i.e using tf.keras.whatever? Thanks

    • Jason Brownlee April 12, 2019 at 7:59 am #

      Thanks Tony!

      No opinion at this stage. I may cover it later in the year once TF 2.0 is finalized.

      Same result at the end of the day, although more confusing for developers because TF gives you so many ways to do the same thing.

  2. Xu Zhang April 12, 2019 at 10:16 am #

    Thank you Jason. Your tutorials are always clear and very useful.

    If my datasets are not images but large, how can I load my data progressively? Many thanks

    • Jason Brownlee April 12, 2019 at 2:42 pm #

      You can write a custom data generator to load and yield data in batches.

      I give an example in this post for text+images:
      https://machinelearningmastery.com/develop-a-deep-learning-caption-generation-model-in-python/

      • Xu Zhang April 17, 2019 at 4:49 am #

        Thank you Jason.
        I read the link you suggested, but still have some questions.

        1. Normally, when we train our data, we need to shuffle them before training. There is no problem if we can load our data into our memory. I don’t know when you did shuffle when using data_generator. Do I need to shuffle my data after I collected my data?
        2. Is it possible to change data into .h5 files, then load into the memory one by one?

        Many thanks.

        • Jason Brownlee April 17, 2019 at 7:04 am #

          Yes, it is a good idea to try and draw the samples randomly.

          Sure.

          • Xu Zhang April 18, 2019 at 6:18 am #

            Thank you, Jason.

            I couldn’t find where and when you did shuffle.

            In addition, Every time, after I posted a comment, I have to go to your blog where I posted it again to look for if there is your reply. Is it possible to receive an email about your reply or a notification? Many thanks.

          • Jason Brownlee April 18, 2019 at 8:56 am #

            You can implement a shuffle as part of your data generator. I don’t have an example.

            Thanks for the suggestion about email replies, I’ll look into it.

  3. Paulo Henrique Zen Messerschmidt April 13, 2019 at 10:31 am #

    Jason, thank you for this great tutorial.

    I’m trying to solve a problem in my model which seems almost exactly your example, except that i’m rescaling the image pixels by 1./255.

    In my model, i’m using the test set to evaluate my model. But each time i run:

    model.evaluate_generator(test_it, steps=test.sample // batch size=32) the function returns different scores. I’ve already tried to set shuffle=False in test_it, but doesn’t solve the problem.

    I’m using checkpointer (callbacks) to save the best params (weights) based on validation_loss. After, i load the weights to the model and applied the evaluate_generator as i said. Am i doing something?

    Ps.: Sorry, english isn’t my first language.

    Many thanks!!

    • Jason Brownlee April 13, 2019 at 1:49 pm #

      The model will return different scores because the model is different each time it is run – by design.

      Neural networks are stochastic learning algorithms.

      This will help:
      https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code

      • Paulo Henrique Zen Messerschmidt April 14, 2019 at 2:20 am #

        I guess I was not very clear on my question. In fact, when I referred to running the model.evaluate_generator several times, I was referring only to this line of code, keeping the network trained (same weights). But, following your recommendations i set the test generator as:

        test_generator = datagen_test.flow_from_directory(
            'content/cell_images/test',  # test folder path
            target_size=(150, 150),  # all images will be resized to 150x150
            batch_size=1,
            class_mode='categorical',
            shuffle=False)

        And the steps parameter on model.evaluate_generator with the test_generator.samples:

        score = model.evaluate_generator (test_generator, test_generator.samples)

        Apparently it worked: the test accuracy values no longer change if I run this line of code several times for the same neural network weights.

        Again, many thanks for your response Jason, it helped me a lot.

        • Jason Brownlee April 14, 2019 at 5:51 am #

          Well done, I’m happy to hear that.

          • touati May 18, 2019 at 7:55 am #

            Hi ? Great job ! but if i do annotation , what about classification which the one is the better for car make model classification ?

          • Jason Brownlee May 19, 2019 at 7:48 am #

            I recommend testing a suite of models in order to discover what works best for your specific dataset.

  4. Jacob Rose June 3, 2019 at 3:19 pm #

    Hi Jason,

    This was a super clear and useful tutorial, but I seem to be struggling to find a solution to my slightly adjacent problem!

    I have all of the data stored in class-based subdirectories, such as “.\data\flower”, “.\data\cat”, etc, where flower and cat are two different classes.

    However, while I want to load the images into my model at run-time rather than hold a large number in memory, I would really not like to hard-code my train/validation/test splits without first going through a range of possible splits. If I were to do it exactly as you describe here, I would have to create a copy of the entire data set for each different split I test.

    Do you know if there is an optimized way of doing this that doesnt require defining a custom generator? Ideally something that can take a list of filenames with their directories and distribute them to train/val/testing lists respectively, then feed each list into a tensorflow or keras generator at runtime to actually load the images.

    Please let me know if I should clarify any of that. Thanks!

    • Jason Brownlee June 4, 2019 at 7:43 am #

      Excellent question Jacob.

      One idea would be to write a script to create the validation dataset on disk by moving files around, then create one ImageDataGenerator instance for train and one for val for the two different directories.

      Another idea would be to create a custom data generator function/functions that use whatever arguments you like to split the images into train/val.

      I hope that gives you some ideas.

  5. Srinivas June 4, 2019 at 7:10 am #

    I have trained my cnn model and now I want test my model now, could you please provide me a testing code for the same.

  6. Bisrat June 5, 2019 at 6:55 am #

    Hi Jason,
    Thank you for the tutorial, you always describe things in a simple way.
    Do you have tutorial on how to use TFRecords for keras ?

    thanks.

  7. Abinash Kumar Chaudhary June 5, 2019 at 6:50 pm #

    Thanks Jason,
    I have separate test and train sets and I have loaded it successfully using the iterator. Please Help me in defining the model. Its a multi class classification of digits.

  8. Heidi Hardner June 26, 2019 at 1:21 am #

    I am using the save_to_dir() in flow_from_directory to look at what got generated by the generator and I’m confused by the results. I get widely varying numbers of images going into the folder. Trying all different batch and steps values just making me more confused. In your example of 1000 images with batch size 64 and steps per epoch 16 do you see 1000 or 1024 images if you save_to_dir?

    Not sure if I’m not understanding how batch_size,etc. actually generate sets of images or whether save_to_dir actually doesn’t do what I expected (show me all the images used) or whether I’m not getting expected results for some reason.

    Examples (using small folders of images to try to understand what it does…):
    I have 120 images across 6 classes, so 20 each.
    batch_size 6, steps_per_epoch 20 yields 186 images, not 120.
    batch_size 32, steps_per_epoch 4 yields 360 images, not 128
    batch_size 32, steps_per_epoch 8 yields 576 images, not 256, not double what I got with 4.
    batch_size 12, steps_per_epoch 10 yields 252 images
    etc.

    Where it yields more images than I started with they are flipped or stretched as defined in the ImageDataGenerator() and presumably that is the intent of augmenting – you could use several different modified copies of the same image to get more variety. I just don’t see the batch_sizes and steps making sense with images generated…

    Also like in your example I have a validation generator with no stretching or flipping or zooming and for that I really just do want to use the images once I think but I don’t see how to get that result. choosing batch_size of 1 and validation_steps equal to the number of images results in save_to_dir generally saving more images (copies of the intial images):

    60 total validation images :
    batch_size 1, 60 validation_steps, 71 images saved by save_to_dir
    batch_size 12, 5 validation_steps, 156 images saved

    everything I read about this say batch_size*steps= number of images but I can’t seem to make sense of what happens.

    • Jason Brownlee June 26, 2019 at 6:44 am #

      Yes, I believe it prepares more than are needed, e.g. in a queue to ensure the computation is efficient.

      • Heidi Hardner June 27, 2019 at 12:24 am #

        Thanks! I guess that is the best outcome, all is OK and I should just look in the directory to generally see what the stretching and such did, not get wrapped up in the quantities. This has been driving me crazy.

        • Jason Brownlee June 27, 2019 at 7:55 am #

          It is a really good idea to inspect the augmentation that is performed to confirm it makes sense. It is so easy to just run code without thinking hard about it.

  9. Swati Verma October 11, 2019 at 1:17 pm #

    Hello Jason,

    Thanks for nice tutorial. I have a question I have .7z large data file which contains images and I am having problem to load image data via .7z file. Any help? Thanks!!

  10. Lakshay Chhabra October 17, 2019 at 8:13 pm #

    How to plot confusion matrix with this structure?

  11. Evan Zamir October 22, 2019 at 1:48 am #

    I have used this method to load images but it seems to be very inefficient in terms of utilizing the GPU. My understanding is that you can feed a TF DataSet generator to Keras and gain much more efficiency doing so. Unfortunately I haven’t been able to figure out precisely how to do this. Neither the Keras nor TF documentation has such an example and every blog post or SO answer I’ve seen uses the precanned MNIST or CIFAR datasets which aren’t stored on the file system. It would be great to see a post that shows how to load ImageNet files into Keras. If I ever figure it out I will write such a post! haha

    • Jason Brownlee October 22, 2019 at 5:56 am #

      Great suggestion, thanks Evan.

      For the mean time, try larger batches or a custom data generator that loads a ton more data into memory.
