
Image Datasets for Practicing Machine Learning in OpenCV

At the very start of your machine learning journey, publicly available datasets spare you the work of creating datasets yourself and let you focus on learning to use the machine learning algorithms. It also helps if the datasets are moderately sized and do not require too much pre-processing, so that you can practice using the algorithms sooner before moving on to more challenging problems.

The two datasets that we will be looking at are the simpler digits dataset that is provided with OpenCV, and the more challenging but widely used CIFAR-10 dataset. We will be using either of these two datasets during our journey through OpenCV’s machine learning algorithms.

In this tutorial, you are going to learn how to download and extract the OpenCV digits and CIFAR-10 datasets, for practicing machine learning in OpenCV. 

After completing this tutorial, you will know:

  • How to download and extract the OpenCV digits dataset. 
  • How to download and extract the CIFAR-10 dataset without necessarily relying on other Python packages (such as TensorFlow).

Let’s get started. 


Tutorial Overview

This tutorial is divided into three parts; they are:

  • The Digits Dataset
  • The CIFAR-10 Dataset
  • Loading the Datasets

The Digits Dataset

OpenCV provides the image digits.png, a ‘collage’ of 20$\times$20 pixel sub-images, each featuring a handwritten digit from 0 to 9, which may be split up to create a dataset. In total, the digits image contains 5,000 handwritten digits.

The digits dataset provided by OpenCV is not necessarily representative of the real-life challenges that come with more complex datasets, primarily because its image content features very limited variation. However, its simplicity and ease-of-use will permit us to test several machine learning algorithms quickly, at a low pre-processing and computational cost. 

To extract the dataset from the full digits image, our first step is to split it into the many sub-images that make it up. For this purpose, let’s create the following split_images function:

The split_images function takes as input the path to the full image, together with the pixel size of the sub-images. Since we are working with square sub-images, we shall be denoting their size by a single dimension, which is equal to 20. 

The function subsequently applies the OpenCV imread method to load a grayscale version of the image into a NumPy array. The hsplit and vsplit methods are then used to split the NumPy array horizontally and vertically, respectively. 

The array of sub-images that the split_images function returns has shape (50, 100, 20, 20): the 2000$\times$1000 pixel digits image yields 50 rows and 100 columns of 20$\times$20 sub-images.

Once we have extracted the array of sub-images, we can proceed to partition it into a training set and a testing set. We will also need to create the ground truth labels for both splits of data, to be used during the training process and for the evaluation of the test results.

The following split_data function serves these purposes:
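As the original listing is missing here, the sketch below shows one way a split_data function along these lines could be written. The argument names are assumptions, as is the choice to return the labels as column vectors. The labelling also assumes that each digit occupies a contiguous band of rows in the collage (five rows per digit in digits.png), so that flattening the array row by row keeps images of the same digit together.

```python
import numpy as np


def split_data(sub_imgs, ratio):
    """Partition sub-images column-wise into flattened train/test sets with labels."""
    img_size = sub_imgs.shape[2]

    # Column index at which the training portion ends
    partition = int(sub_imgs.shape[1] * ratio)

    # First `partition` columns go to training, the remaining columns to testing
    train = sub_imgs[:, :partition, :, :]
    test = sub_imgs[:, partition:, :, :]

    # Flatten every sub-image into one row of img_size * img_size pixels
    train_imgs = train.reshape(-1, img_size ** 2)
    test_imgs = test.reshape(-1, img_size ** 2)

    # Labels 0..9, each repeated once per image of that digit in the split
    digits = np.arange(10)
    train_labels = np.repeat(digits, train_imgs.shape[0] // digits.size)[:, np.newaxis]
    test_labels = np.repeat(digits, test_imgs.shape[0] // digits.size)[:, np.newaxis]

    return train_imgs, train_labels, test_imgs, test_labels
```

With the (50, 100, 20, 20) array and a ratio of 0.8, this would place the first 80 columns (4,000 images) in the training set and the last 20 columns (1,000 images) in the testing set.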

The split_data function takes the array of sub-images as input, as well as the split ratio for the training portion of the dataset. The function then proceeds to compute the partition value that divides the array of sub-images along its columns into a training set and a testing set. This partition value is then used to allocate the first set of columns to the training data, and the remaining set of columns to the testing data. 

To visualize this partitioning on the digits.png image, this would appear as follows:

Partitioning the sub-images into a training dataset and a testing dataset

You may also note that we are flattening every 20$\times$20 sub-image into a one-dimensional vector of 400 pixels such that, in the arrays containing the training and testing images, every row now stores a flattened version of a 20$\times$20 pixel image.

The final part of the split_data function creates ground truth labels with values ranging from 0 to 9, and repeats these values according to how many training and testing images we have available.

The CIFAR-10 Dataset

The CIFAR-10 dataset is not provided with OpenCV, but we shall be considering it because it represents real-world challenges better than OpenCV’s digits dataset.

The CIFAR-10 dataset consists of a total of 60,000 32$\times$32 RGB images. It features a variety of images belonging to 10 different classes, such as airplane, cat, and ship. The dataset files are already split into 5 pickle files that contain 10,000 training images and labels each, plus an additional pickle file that contains 10,000 testing images and labels.

Let’s go ahead and download the CIFAR-10 dataset for Python from this link (note: the reason for not using TensorFlow/Keras to do so, is to show how we can work without relying on additional Python packages if need be). Take note of the path on your hard disk to which you have saved and extracted the dataset. 

The following code loads the dataset files and returns the training and testing images and labels:
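The original listing is not available here, so the following is a hedged sketch of such a loader. The function name load_images and its path argument are hypothetical. The sketch assumes that path points to the extracted folder containing the files data_batch_1 to data_batch_5 and test_batch, and that each file unpickles to a dictionary with b"data" and b"labels" entries, as in the Python version of the dataset.

```python
import os
import pickle

import numpy as np


def load_images(path):
    """Load the CIFAR-10 Python batches stored in the directory `path`."""
    train_files = [os.path.join(path, f"data_batch_{i}") for i in range(1, 6)]
    test_file = os.path.join(path, "test_batch")

    def load_batch(file_name):
        # The batches were pickled under Python 2, hence encoding="bytes";
        # the dictionary keys then come back as bytes objects
        with open(file_name, "rb") as f:
            batch = pickle.load(f, encoding="bytes")
        return np.array(batch[b"data"]), np.array(batch[b"labels"])

    # Stack the five training batches into single image and label arrays
    train_batches = [load_batch(f) for f in train_files]
    train_imgs = np.concatenate([imgs for imgs, _ in train_batches])
    train_labels = np.concatenate([labels for _, labels in train_batches])

    test_imgs, test_labels = load_batch(test_file)

    return train_imgs, train_labels, test_imgs, test_labels
```

Each row of the returned image arrays holds one flattened image of 3,072 values (32$\times$32 pixels in 3 channels), which mirrors the flattened layout used for the digits dataset.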

It is important to keep in mind that the trade-off of testing out different models on a larger and more varied dataset such as CIFAR-10, rather than a simpler one such as the digits dataset, is that training on the former might be more time-consuming.

Loading the Datasets

Let’s try calling the functions that we have created above. 

I have separated the code belonging to the digits dataset from the code belonging to the CIFAR-10 dataset, placing each in its own Python script:

Note: Do not forget to change the paths in the code above to where you have saved your data files. 
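Once the arrays are loaded, one quick way to check them is to turn a single flattened row back into an image that OpenCV can display or save. The helper below is a hypothetical illustration and not part of the original code. It assumes the CIFAR-10 row layout of 1,024 red values, followed by 1,024 green and 1,024 blue values, each stored row by row.

```python
import numpy as np


def cifar_row_to_bgr(row):
    """Turn one flattened CIFAR-10 row (3072 values) into a 32x32x3 BGR image."""
    # Stored layout: all red values, then all green, then all blue
    channels_first = np.asarray(row, dtype=np.uint8).reshape(3, 32, 32)

    # Move the channel axis last to obtain height x width x RGB
    rgb = channels_first.transpose(1, 2, 0)

    # Reverse the channel order, since OpenCV works with BGR images,
    # and return a contiguous copy rather than a strided view
    return np.ascontiguousarray(rgb[:, :, ::-1])
```

For the digits dataset the equivalent step is simply row.reshape(20, 20), since those images have a single channel.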

In the subsequent tutorials, we shall see how to use these datasets with different machine learning techniques, starting with how to convert the dataset images into feature vectors as one of the pre-processing steps before using them for machine learning.


Summary

In this tutorial, you learned how to download and extract the OpenCV digits and CIFAR-10 datasets, for practicing machine learning in OpenCV.

Specifically, you learned:

  • How to download and extract the OpenCV digits dataset. 
  • How to download and extract the CIFAR-10 dataset without necessarily relying on other Python packages (such as TensorFlow).

Do you have any questions?
Ask your questions in the comments below, and I will do my best to answer.
