Training a PyTorch Model with DataLoader and Dataset

When you build and train a PyTorch deep learning model, you can provide the training data in several different ways. Ultimately, a PyTorch model works like a function that takes a PyTorch tensor and returns another tensor. You have a lot of freedom in how you obtain the input tensors. Probably the easiest way is to prepare one large tensor holding the entire dataset and extract a small batch from it in each training step. But you will see that using the DataLoader can save you a few lines of code when dealing with data.

In this post, you will see how you can use the Dataset and DataLoader classes in PyTorch. After finishing this post, you will know:

  • How to create and use DataLoader to train your PyTorch model
  • How to use the Dataset class to generate data on the fly

Let’s get started.

Overview

This post is divided into three parts; they are:

  • What is DataLoader?
  • Using DataLoader in a Training Loop
  • Create Data Iterator using Dataset Class

What is DataLoader?

To train a deep learning model, you need data. Usually, data is available as a dataset, which contains many data samples or instances. You can ask the model to take one sample at a time, but usually you would let the model process one batch of several samples. You may create a batch by extracting a slice from the dataset, using the slicing syntax on the tensor. For better training quality, you may also want to shuffle the entire dataset in each epoch so that no two batches are the same across the entire training loop. Sometimes, you may use data augmentation to manually introduce more variance to the data. This is common for image-related tasks, in which you can randomly tilt or zoom the image a bit to generate many data samples from a few images.
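
To make the manual approach concrete, here is a minimal sketch of shuffling and slicing by hand. It assumes a small random tensor stands in for a real dataset; the names `X`, `y`, and `batches_seen` are chosen only for illustration.

```python
import torch

# Synthetic stand-in for a real dataset: 10 samples, 3 features each
torch.manual_seed(0)
X = torch.randn(10, 3)
y = torch.randint(0, 2, (10, 1)).float()

batch_size = 4
n_epochs = 2
batches_seen = []

for epoch in range(n_epochs):
    # Shuffle by drawing a fresh random permutation of the row indices
    perm = torch.randperm(len(X))
    X_shuffled, y_shuffled = X[perm], y[perm]
    for start in range(0, len(X), batch_size):
        # Slicing syntax extracts one batch; the last batch may be smaller
        X_batch = X_shuffled[start:start + batch_size]
        y_batch = y_shuffled[start:start + batch_size]
        batches_seen.append((X_batch.shape, y_batch.shape))
        print(epoch, X_batch.shape, y_batch.shape)
```

Even in this toy form, the index bookkeeping has to be written and kept correct by hand, and augmentation would add more code on top.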

You can imagine how much code it would take to do all of this. But it is much easier with the DataLoader.

The following is an example of how to create a DataLoader and take a batch from it. In this example, the sonar dataset is used; it is converted into PyTorch tensors and then passed on to DataLoader:
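
A minimal sketch of this step is given here, under the assumption that random tensors with the sonar dataset's shape (208 samples, 60 features, one binary target) stand in for the real data. Loading and encoding the actual CSV file is left out.

```python
import torch
from torch.utils.data import DataLoader

# Assumption: random stand-in with the sonar dataset's shape.
# Replace these with tensors built from the real data.
torch.manual_seed(42)
X = torch.rand(208, 60)
y = torch.randint(0, 2, (208, 1)).float()

# DataLoader accepts any object that supports len() and integer indexing,
# so a list of (features, target) pairs is enough
loader = DataLoader(list(zip(X, y)), shuffle=True, batch_size=16)

# Read one batch and stop
for X_batch, y_batch in loader:
    print(X_batch.shape, y_batch.shape)
    break
```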

You can see from the output of the above that X_batch and y_batch are PyTorch tensors. The loader is an instance of the DataLoader class, which works like an iterable. Each time you read from it, you get a batch of features and targets from the original dataset.

When you create a DataLoader instance, you need to provide a list of sample pairs. Each sample pair is one data sample consisting of the features and the corresponding target. A list is required because DataLoader expects to use len() to find the total size of the dataset and array indexing to retrieve a particular sample. The batch size is a parameter to DataLoader so it knows how to create a batch from the entire dataset. For training, you should almost always use shuffle=True so that the samples are shuffled every time you load the data. This is useful because in each epoch you read every batch once. When you proceed from one epoch to the next, DataLoader knows you have depleted all the batches, so it re-shuffles and you get a new combination of samples.
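
The re-shuffling behavior can be observed with a small experiment. The sketch below assumes ten toy samples whose feature is simply the sample index, so the order in which samples arrive is easy to read off. The names `pairs` and `epoch_orders` are illustrative only.

```python
import torch
from torch.utils.data import DataLoader

# Ten toy samples: the feature is the sample index, the target is ten times that
pairs = [(torch.tensor([float(i)]), torch.tensor([10.0 * i])) for i in range(10)]

loader = DataLoader(pairs, batch_size=4, shuffle=True)

epoch_orders = []
for epoch in range(3):
    order = []
    for X_batch, y_batch in loader:
        # Features and targets stay paired within each batch
        assert torch.equal(y_batch, 10 * X_batch)
        order.extend(int(v) for v in X_batch.flatten())
    epoch_orders.append(order)
    print(f"epoch {epoch}: {order}")
```

Every epoch visits each of the ten samples exactly once, and the printed order will typically differ from one epoch to the next.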

Using DataLoader in a Training Loop

The following is an example to make use of DataLoader in a training loop:
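
A minimal sketch of such a loop is shown here. It assumes synthetic binary classification data and a small hypothetical network; the layer sizes, learning rate, and number of epochs are placeholders rather than recommended settings.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader

torch.manual_seed(0)
# Assumption: synthetic data standing in for a real training set
X_train = torch.rand(160, 60)
y_train = (X_train.sum(dim=1, keepdim=True) > 30).float()

loader = DataLoader(list(zip(X_train, y_train)), shuffle=True, batch_size=16)

# Hypothetical small network mapping 60 features to one probability
model = nn.Sequential(
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid(),
)
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

n_epochs = 5
epoch_losses = []
model.train()
for epoch in range(n_epochs):
    total = 0.0
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)          # forward pass on one batch
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()            # clear gradients from the previous step
        loss.backward()                  # backpropagate
        optimizer.step()                 # update the weights
        total += loss.item() * len(X_batch)
    epoch_losses.append(total / len(X_train))
    print(f"epoch {epoch}: mean loss {epoch_losses[-1]:.4f}")
```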

You can see that once you have created the DataLoader instance, the training loop becomes simpler. In the above, only the training set is packaged with a DataLoader because you need to loop through it in batches. You can also create a DataLoader for the test set and use it for model evaluation, but since the accuracy is computed over the entire test set rather than per batch, the benefit of DataLoader is not significant.
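
To illustrate the point about evaluation, the sketch below computes accuracy both ways: once on the whole test tensor and once by accumulating correct predictions over batches from a DataLoader. It assumes a synthetic test set and an untrained hypothetical model, so the accuracy value itself is meaningless; only the agreement between the two computations matters.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

torch.manual_seed(0)
# Assumption: synthetic test set and an untrained model, for illustration only
X_test = torch.rand(50, 60)
y_test = torch.randint(0, 2, (50, 1)).float()
model = nn.Sequential(nn.Linear(60, 1), nn.Sigmoid())

model.eval()
with torch.no_grad():
    # Option 1: feed the entire test set at once
    acc_full = (model(X_test).round() == y_test).float().mean().item()

    # Option 2: iterate in batches and count correct predictions
    test_loader = DataLoader(list(zip(X_test, y_test)), batch_size=16, shuffle=False)
    correct = 0
    for X_batch, y_batch in test_loader:
        correct += (model(X_batch).round() == y_batch).sum().item()
    acc_batched = correct / len(X_test)

print(f"full: {acc_full:.4f}, batched: {acc_batched:.4f}")
```

The batched version needs an explicit counter, which is why passing the tensors directly is often the shorter choice when the test set fits in memory.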

Putting everything together, below is the complete code.

Create Data Iterator using Dataset Class

In PyTorch, there is a Dataset class that can be tightly coupled with the DataLoader class. Recall that DataLoader expects its first argument to work with len() and with array indexing. The Dataset class is a base class for this. You may want to use the Dataset class when some special handling is needed before you can get a data sample. For example, the data may have to be read from a database or disk, and you only want to keep a few samples in memory rather than prefetch everything. Another example is real-time preprocessing of data, such as the random augmentation that is common in image tasks.

To use the Dataset class, you just subclass it and implement two member functions, __len__() and __getitem__(). Below is an example:
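
A minimal sketch of such a subclass follows. The class name SonarDataset matches the name used later in this post; the random tensors passed to it are an assumption standing in for the real sonar data.

```python
import torch
from torch.utils.data import Dataset

class SonarDataset(Dataset):
    def __init__(self, X, y):
        # Keep features and targets as float32 tensors
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.float32)

    def __len__(self):
        # Total number of samples
        return len(self.X)

    def __getitem__(self, idx):
        # Return one (features, target) pair
        return self.X[idx], self.y[idx]

# Assumption: synthetic stand-in data with the sonar dataset's shape
torch.manual_seed(0)
dataset = SonarDataset(torch.rand(208, 60), torch.randint(0, 2, (208, 1)))
features, target = dataset[0]
print(len(dataset), features.shape, target.shape)
```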

This is not the most powerful way to use Dataset, but it is simple enough to demonstrate how it works. With this, you can create a DataLoader and use it for model training. Modifying the previous example, you have the following:
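
A compact sketch of this combination is given here, again assuming synthetic data, a hand-made train/test split, and a small hypothetical model. The Dataset subclass is repeated so that the block runs on its own.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

class SonarDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

torch.manual_seed(0)
# Assumption: synthetic data, split by hand into training and test portions
X = torch.rand(200, 60)
y = (X.mean(dim=1, keepdim=True) > 0.5).float()
X_train, y_train, X_test, y_test = X[:160], y[:160], X[160:], y[160:]

# The Dataset instance takes the place of the list of pairs
dataset = SonarDataset(X_train, y_train)
loader = DataLoader(dataset, shuffle=True, batch_size=16)

# Hypothetical small model and placeholder hyperparameters
model = nn.Sequential(nn.Linear(60, 30), nn.ReLU(), nn.Linear(30, 1), nn.Sigmoid())
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(5):
    for X_batch, y_batch in loader:
        loss = loss_fn(model(X_batch), y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# The test set is still used as plain tensors
model.eval()
with torch.no_grad():
    accuracy = (model(X_test).round() == y_test).float().mean().item()
print(f"test accuracy: {accuracy:.2f}")
```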

You set up dataset as an instance of SonarDataset, in which you implemented the __len__() and __getitem__() functions. It is used in place of the list in the previous example to set up the DataLoader instance. Afterward, everything in the training loop is the same. Note that you still use PyTorch tensors directly for the test set in the example.

In the __getitem__() function, you take an integer that works like an array index and return a pair: the features and the target. You can implement anything in this function: run some code to generate a synthetic data sample, read data on the fly from the internet, or add random variations to the data. You will also find it useful when you cannot keep the entire dataset in memory, since you can load only the data samples that you need.
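
As one illustration of generating data inside __getitem__(), the sketch below builds each sample only when it is requested and adds fresh random noise on every access. The class NoisySineDataset and its parameters are hypothetical and not part of PyTorch.

```python
import math
import torch
from torch.utils.data import Dataset, DataLoader

class NoisySineDataset(Dataset):
    """Hypothetical dataset: each sample is created only when requested."""

    def __init__(self, n_samples, n_points=20, noise=0.1):
        self.n_samples = n_samples
        self.n_points = n_points
        self.noise = noise

    def __len__(self):
        return self.n_samples

    def __getitem__(self, idx):
        # Deterministic part: a sine wave whose frequency depends on the index
        freq = 1.0 + (idx % 5)
        t = torch.linspace(0, 1, self.n_points)
        clean = torch.sin(2 * math.pi * freq * t)
        # Random part: new noise on every access, similar to an augmentation
        features = clean + self.noise * torch.randn(self.n_points)
        target = torch.tensor([freq])
        return features, target

dataset = NoisySineDataset(n_samples=100)
loader = DataLoader(dataset, batch_size=8, shuffle=True)
X_batch, y_batch = next(iter(loader))
print(X_batch.shape, y_batch.shape)
```

Nothing is stored beyond three numbers, yet the DataLoader can batch and shuffle it like any other dataset. Reading a file or a database row inside __getitem__() would follow the same pattern.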

In fact, since you created a PyTorch dataset, you don’t need to use scikit-learn to split the data into a training set and a test set. In the torch.utils.data submodule, there is a function random_split() that works with the Dataset class for the same purpose. A full example is below:
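
A shortened sketch of the splitting step is given here, leaving out the model and training loop, which are unchanged. It assumes synthetic data and a 70/30 split; both are placeholders.

```python
import torch
from torch.utils.data import Dataset, DataLoader, random_split, default_collate

class SonarDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.as_tensor(X, dtype=torch.float32)
        self.y = torch.as_tensor(y, dtype=torch.float32)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

torch.manual_seed(0)
# Assumption: synthetic stand-in data
dataset = SonarDataset(torch.rand(200, 60), torch.randint(0, 2, (200, 1)))

# Split into 70% training and 30% test using explicit lengths
n_train = int(0.7 * len(dataset))
trainset, testset = random_split(dataset, [n_train, len(dataset) - n_train])

# The training subset goes into a DataLoader as before
loader = DataLoader(trainset, shuffle=True, batch_size=16)

# testset is a Subset, not a tensor; collect its samples and
# let default_collate stack them into two tensors
X_test, y_test = default_collate([testset[i] for i in range(len(testset))])
print(len(trainset), len(testset), X_test.shape, y_test.shape)
```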

It is very similar to the previous example. Beware that the PyTorch model still needs a tensor as input, not a Dataset. Hence in the above, you need to use the default_collate() function to collect samples from a dataset into tensors.

Summary

In this post, you learned how to use DataLoader to create shuffled batches of data and how to use Dataset to provide data samples. Specifically, you learned:

  • DataLoader as a convenient way of providing batches of data to the training loop
  • How to use Dataset to produce data samples
  • How to combine Dataset and DataLoader to generate batches of data on the fly for model training

5 Responses to Training a PyTorch Model with DataLoader and Dataset

  1. Aditya February 26, 2023 at 12:32 am #

    Hi, is shuffling appropriate for forecasting problems

    • Adrian Tam March 15, 2023 at 5:43 am #

      Usually no. It sounds like you’re talking about a time series problem and we do not want to lose the time ordering information. Therefore, shuffling is not recommended. But you can transform a time series into windows and shuffle the windows. Hope that helps.

  2. Matthew Avaylon August 8, 2023 at 2:52 am #

    In the intro you mentioned that pytorch models can handle taking a large tensor of data. Does that mean I can load in the MNIST dataset, set a x_train and y_train of all the training data as tensors and train like this?

    for epoch in range(20):
        print("epoch:" + str(epoch))
        model.train() # puts the model in training mode
        y_pred = model(X_train) # Forward pass
        loss_calc = loss_func(y_pred, y_train)
        optimizer.zero_grad()
        loss_calc.backward()
        optimizer.step()

    I’m a little confused on what dataloader actually does. The loader iterates over the data in batches and will feed the batch to the model. Is that saying that its loading these batches into memory where each batch can be thought of a small tensor of data that is being fed into the model? Much like how we can feed a whole tensor of data in my opening statement, this is feeding a tensor of a batch.

  3. Peggy June 13, 2024 at 12:08 pm #

    For Multi-Instance Learning (MIL), my dataset includes unique IDs, features, and labels for training.

    For prediction, I need to provide the dataset with unique IDs and features, but without labels.

    Therefore, may I ask should modify my dataset class to handle data without labels during prediction?

    Thanks!

    • James Carmichael June 14, 2024 at 6:48 am #

      Hi Peggy…Yes, you should modify your dataset class to handle data without labels during prediction in Multi-Instance Learning (MIL). Typically, this involves creating a dataset class that can manage both training (with labels) and prediction (without labels) scenarios.

      Here’s a general approach to modifying your dataset class:

      ### 1. Define the Dataset Class
      You can create a dataset class that accepts data in both labeled and unlabeled forms. This class should be able to distinguish whether it’s being used for training or prediction based on the presence of labels.

      ### 2. Handling Different Scenarios
      You can add a parameter to indicate whether the dataset includes labels or not. If labels are not provided, the class should handle the data accordingly during prediction.

      ### Example in PyTorch

      Here’s a basic example in PyTorch to illustrate this:

      import torch
      from torch.utils.data import Dataset

      class MILDataset(Dataset):
          def __init__(self, data, labels=None, mode='train'):
              """
              Args:
                  data (list or array-like): List of features or instances.
                  labels (list or array-like, optional): List of labels corresponding to the data. Default is None.
                  mode (str): Either 'train' or 'predict' to indicate the mode of operation. Default is 'train'.
              """
              self.data = data
              self.labels = labels
              self.mode = mode

          def __len__(self):
              return len(self.data)

          def __getitem__(self, idx):
              if torch.is_tensor(idx):
                  idx = idx.tolist()

              sample = self.data[idx]

              if self.mode == 'train':
                  if self.labels is None:
                      raise ValueError("Labels must be provided in training mode.")
                  label = self.labels[idx]
                  return sample, label
              elif self.mode == 'predict':
                  return sample
              else:
                  raise ValueError("Mode should be either 'train' or 'predict'.")

      # Example usage:
      # For training
      train_data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
      train_labels = [0, 1, 0]
      train_dataset = MILDataset(data=train_data, labels=train_labels, mode='train')

      # For prediction
      predict_data = [[0.7, 0.8], [0.9, 1.0]]
      predict_dataset = MILDataset(data=predict_data, mode='predict')

      ### Explanation
      – **Initialization (__init__ method)**:
      – The data parameter holds the features.
      – The labels parameter is optional and only required in training mode.
      – The mode parameter specifies whether the dataset is for training or prediction.

      – **Length (__len__ method)**:
      – Returns the number of instances in the dataset.

      – **Get Item (__getitem__ method)**:
      – If in training mode ('train'), it returns a tuple of (sample, label).
      – If in prediction mode ('predict'), it returns only the sample (feature vector).

      ### Using the Dataset with DataLoader
      You can use this dataset class with PyTorch’s DataLoader for both training and prediction:

      from torch.utils.data import DataLoader

      # For training
      train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

      # For prediction
      predict_loader = DataLoader(predict_dataset, batch_size=2, shuffle=False)

      This structure allows your dataset class to be flexible and handle both training (with labels) and prediction (without labels) scenarios efficiently.
