When you build and train a PyTorch deep learning model, you can provide the training data in several different ways. Ultimately, a PyTorch model works like a function that takes a PyTorch tensor and returns another tensor. You have a lot of freedom in how to get the input tensors. Probably the easiest way is to prepare a large tensor of the entire dataset and extract a small batch from it in each training step. But you will see that using the DataLoader can save you a few lines of code in dealing with data.
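For illustration, a minimal sketch of that manual approach might look like the following; the X and y tensors here are random placeholders standing in for your real data:

```python
import torch

# placeholders for a full dataset held in memory as tensors
X = torch.randn(208, 60)                    # features
y = torch.randint(0, 2, (208, 1)).float()   # targets

batch_size = 16
for i in range(0, len(X), batch_size):
    X_batch = X[i:i+batch_size]   # slice a batch of features
    y_batch = y[i:i+batch_size]   # and the matching targets
    ...                           # feed the batch to the model here
```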
In this post, you will see how you can use the Dataset and DataLoader classes in PyTorch. After finishing this post, you will learn:
- How to create and use DataLoader to train your PyTorch model
- How to use the Dataset class to generate data on the fly
Kick-start your project with my book Deep Learning with PyTorch. It provides self-study tutorials with working code.
Let’s get started.
Overview
This post is divided into three parts; they are:
- What is DataLoader?
- Using DataLoader in a Training Loop
- Create Data Iterator using Dataset Class
What is DataLoader?
To train a deep learning model, you need data. Usually data is available as a dataset. In a dataset, there are a lot of data samples or instances. You can ask the model to take one sample at a time, but usually you would let the model process one batch of several samples. You may create a batch by extracting a slice from the dataset, using the slicing syntax on the tensor. For better quality of training, you may also want to shuffle the entire dataset in each epoch so that no two batches are the same across the entire training loop. Sometimes, you may introduce data augmentation to manually add more variance to the data. This is common for image-related tasks, in which you can randomly tilt or zoom an image a bit to generate many data samples from a few images.
You can imagine there can be a lot of code to write to do all this. But it is much easier with the DataLoader.
The following is an example of how to create a DataLoader and take a batch from it. In this example, the sonar dataset is used and, ultimately, it is converted into PyTorch tensors and passed on to the DataLoader:
```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import LabelEncoder

# Read data, convert to NumPy arrays
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60].values
y = data.iloc[:, 60].values

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)

# convert into PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# create DataLoader, then take one batch
loader = DataLoader(list(zip(X, y)), shuffle=True, batch_size=16)
for X_batch, y_batch in loader:
    print(X_batch, y_batch)
    break
```
You can see from the output above that X_batch and y_batch are PyTorch tensors. The loader is an instance of the DataLoader class, which works like an iterable. Each time you read from it, you get a batch of features and targets from the original dataset.
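As a quick check, you can also print the shapes instead of the full tensors; this small sketch assumes the loader created above:

```python
# each batch carries 16 samples of 60 features, plus 16 targets
for X_batch, y_batch in loader:
    print(X_batch.shape)  # torch.Size([16, 60])
    print(y_batch.shape)  # torch.Size([16, 1])
    break
```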
When you create a DataLoader instance, you need to provide a list of sample pairs. Each sample pair is one data sample of features and the corresponding target. A list is required because DataLoader expects to use len() to find the total size of the dataset and to use an array index to retrieve a particular sample. The batch size is a parameter to DataLoader so it knows how to create a batch from the entire dataset. You should almost always use shuffle=True so that every time you load the data, the samples are shuffled. It is useful for training because in each epoch, you are going to read every batch once. When you proceed from one epoch to another, as DataLoader knows you have depleted all the batches, it will re-shuffle so you get a new combination of samples.
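In fact, anything that supports len() and integer indexing will do, not just a plain list. For example, a short sketch using TensorDataset from torch.utils.data (an alternative not used in the examples of this post) would be:

```python
from torch.utils.data import TensorDataset, DataLoader

# TensorDataset wraps the feature and target tensors and supports
# len() and integer indexing, just like the list of (X, y) pairs
dataset = TensorDataset(X, y)
loader = DataLoader(dataset, shuffle=True, batch_size=16)
```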
Using DataLoader in a Training Loop
The following is an example of how to use DataLoader in a training loop:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

# train-test split for evaluation of the model
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True)

# set up DataLoader for training set
loader = DataLoader(list(zip(X_train, y_train)), shuffle=True, batch_size=16)

# create model
model = nn.Sequential(
    nn.Linear(60, 60),
    nn.ReLU(),
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid()
)

# Train the model
n_epochs = 200
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# evaluate accuracy after training
model.eval()
y_pred = model(X_test)
acc = (y_pred.round() == y_test).float().mean()
acc = float(acc)
print("Model accuracy: %.2f%%" % (acc*100))
```
You can see that once you have created the DataLoader instance, the training loop can hardly be easier. In the above, only the training set is packaged with a DataLoader because you need to loop through it in batches. You can also create a DataLoader for the test set and use it for model evaluation, but since the accuracy is computed over the entire test set rather than in batches, the benefit of DataLoader is not significant.
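Should you nevertheless want batched evaluation (for example, when the test set does not fit in memory), a sketch along the following lines would work; the testloader here is an assumption, not part of the example above:

```python
# hypothetical batched evaluation over a test DataLoader
testloader = DataLoader(list(zip(X_test, y_test)), shuffle=False, batch_size=16)

model.eval()
correct = 0
total = 0
with torch.no_grad():
    for X_batch, y_batch in testloader:
        y_pred = model(X_batch)
        correct += (y_pred.round() == y_batch).float().sum()
        total += len(y_batch)
print("Model accuracy: %.2f%%" % (float(correct) / total * 100))
```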
Putting everything together, below is the complete code.
```python
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Read data, convert to NumPy arrays
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60].values
y = data.iloc[:, 60].values

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)

# convert into PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# train-test split for evaluation of the model
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True)

# set up DataLoader for training set
loader = DataLoader(list(zip(X_train, y_train)), shuffle=True, batch_size=16)

# create model
model = nn.Sequential(
    nn.Linear(60, 60),
    nn.ReLU(),
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid()
)

# Train the model
n_epochs = 200
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# evaluate accuracy after training
model.eval()
y_pred = model(X_test)
acc = (y_pred.round() == y_test).float().mean()
acc = float(acc)
print("Model accuracy: %.2f%%" % (acc*100))
```
Create Data Iterator using Dataset Class
In PyTorch, there is a Dataset class that can be tightly coupled with the DataLoader class. Recall that DataLoader expects its first argument to work with len() and with array indexing. The Dataset class is a base class for this purpose. The reason you may want to use the Dataset class is that some special handling is needed before you can get the data sample. For example, data may need to be read from a database or disk, and you only want to keep a few samples in memory rather than prefetch everything. Another example is to perform real-time preprocessing of data, such as the random augmentation that is common in image tasks.
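As a preview of the lazy-loading case, a Dataset might keep only file paths in memory and read each sample from disk on demand. The sketch below assumes hypothetical per-sample .npz files containing "features" and "target" arrays:

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class LazyDiskDataset(Dataset):
    def __init__(self, sample_paths):
        # only the list of file paths is kept in memory
        self.sample_paths = sample_paths

    def __len__(self):
        return len(self.sample_paths)

    def __getitem__(self, idx):
        # load a single sample from disk only when it is requested
        record = np.load(self.sample_paths[idx])  # assumed .npz layout
        features = torch.tensor(record["features"], dtype=torch.float32)
        target = torch.tensor(record["target"], dtype=torch.float32)
        return features, target
```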
To use the Dataset class, you just subclass it and implement two member functions. Below is an example:
```python
from torch.utils.data import Dataset

class SonarDataset(Dataset):
    def __init__(self, X, y):
        # convert into PyTorch tensors and remember them
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        # this should return the size of the dataset
        return len(self.X)

    def __getitem__(self, idx):
        # this should return one sample from the dataset
        features = self.X[idx]
        target = self.y[idx]
        return features, target
```
This is not the most powerful way to use Dataset, but it is simple enough to demonstrate how it works. With this, you can create a DataLoader and use it for model training. Modifying the previous example, you have the following:
```python
...
# set up DataLoader for training set
dataset = SonarDataset(X_train, y_train)
loader = DataLoader(dataset, shuffle=True, batch_size=16)

# create model
model = nn.Sequential(
    nn.Linear(60, 60),
    nn.ReLU(),
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid()
)

# Train the model
n_epochs = 200
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# evaluate accuracy after training
model.eval()
y_pred = model(torch.tensor(X_test, dtype=torch.float32))
y_test = torch.tensor(y_test, dtype=torch.float32)
acc = (y_pred.round() == y_test).float().mean()
acc = float(acc)
print("Model accuracy: %.2f%%" % (acc*100))
```
You set up dataset as an instance of SonarDataset, in which you implemented the __len__() and __getitem__() functions. This is used in place of the list in the previous example to set up the DataLoader instance. Afterward, everything is the same in the training loop. Note that you still use PyTorch tensors directly for the test set in the example.
In the __getitem__() function, you take an integer that works like an array index and return a pair, the features and the target. You can implement anything in this function: run some code to generate a synthetic data sample, read data on the fly from the internet, or add random variations to the data. You will also find it useful in situations where you cannot keep the entire dataset in memory, so that you can load only the data samples you need.
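For example, a sketch of on-the-fly augmentation for this dataset could add a little Gaussian noise every time a sample is fetched; the noise scale of 0.01 is an arbitrary assumption:

```python
class NoisySonarDataset(SonarDataset):
    def __getitem__(self, idx):
        features, target = super().__getitem__(idx)
        # perturb the features slightly so each epoch sees different samples
        features = features + 0.01 * torch.randn_like(features)
        return features, target
```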
In fact, since you created a PyTorch dataset, you don't need to use scikit-learn to split the data into a training set and a test set. In the torch.utils.data submodule, there is a function random_split() that works with the Dataset class for the same purpose. A full example is below:
```python
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split, default_collate
from sklearn.preprocessing import LabelEncoder

# Read data, convert to NumPy arrays
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60].values
y = data.iloc[:, 60].values

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y).reshape(-1, 1)

class SonarDataset(Dataset):
    def __init__(self, X, y):
        # convert into PyTorch tensors and remember them
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        # this should return the size of the dataset
        return len(self.X)

    def __getitem__(self, idx):
        # this should return one sample from the dataset
        features = self.X[idx]
        target = self.y[idx]
        return features, target

# set up DataLoader for data set
dataset = SonarDataset(X, y)
trainset, testset = random_split(dataset, [0.7, 0.3])
loader = DataLoader(trainset, shuffle=True, batch_size=16)

# create model
model = nn.Sequential(
    nn.Linear(60, 60),
    nn.ReLU(),
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid()
)

# Train the model
n_epochs = 200
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# create one test tensor from the testset
X_test, y_test = default_collate(testset)
model.eval()
y_pred = model(X_test)
acc = (y_pred.round() == y_test).float().mean()
acc = float(acc)
print("Model accuracy: %.2f%%" % (acc*100))
```
It is very similar to the example you had before. Beware that the PyTorch model still needs a tensor as input, not a Dataset. Hence, in the above, you need to use the default_collate() function to collect samples from the test set into tensors.
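To make the role of default_collate() concrete, here is a tiny standalone sketch showing how it stacks a list of (features, target) pairs into batched tensors:

```python
import torch
from torch.utils.data import default_collate

samples = [
    (torch.tensor([1.0, 2.0]), torch.tensor([0.0])),
    (torch.tensor([3.0, 4.0]), torch.tensor([1.0])),
]
features, targets = default_collate(samples)
print(features.shape)  # torch.Size([2, 2])
print(targets.shape)   # torch.Size([2, 1])
```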
Further Readings
This section provides more resources on the topic if you are looking to go deeper.
- torch.utils.data from PyTorch documentation
- Datasets and DataLoaders from PyTorch tutorial
Summary
In this post, you learned how to use DataLoader to create shuffled batches of data and how to use Dataset to provide data samples. Specifically, you learned:
- DataLoader as a convenient way of providing batches of data to the training loop
- How to use Dataset to produce data samples
- How to combine Dataset and DataLoader to generate batches of data on the fly for model training
Hi, is shuffling appropriate for forecasting problems?
Usually no. It sounds like you’re talking about a time series problem and we do not want to lose the time ordering information. Therefore, shuffling is not recommended. But you can transform a time series into windows and shuffle the windows. Hope that helps.
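For illustration, a sketch of the windowing idea might look like this; the series, window length, and batch size are placeholder assumptions:

```python
import torch
from torch.utils.data import DataLoader

series = torch.arange(100, dtype=torch.float32)  # placeholder time series
window = 10

# each sample is (past values in the window, next value to predict)
samples = [(series[i:i+window], series[i+window:i+window+1])
           for i in range(len(series) - window)]

# the windows can be shuffled safely because order is preserved inside each window
loader = DataLoader(samples, shuffle=True, batch_size=16)
```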
In the intro you mentioned that PyTorch models can handle taking a large tensor of data. Does that mean I can load in the MNIST dataset, set X_train and y_train as tensors of all the training data, and train like this?
```python
for epoch in range(20):
    print("epoch:" + str(epoch))
    model.train()  # puts the model in training mode
    y_pred = model(X_train)  # Forward pass
    loss_calc = loss_func(y_pred, y_train)
    optimizer.zero_grad()
    loss_calc.backward()
    optimizer.step()
```
I’m a little confused about what DataLoader actually does. The loader iterates over the data in batches and feeds each batch to the model. Is that saying that it’s loading these batches into memory, where each batch can be thought of as a small tensor of data being fed into the model? Much like how we can feed a whole tensor of data in my opening statement, this is feeding a tensor of one batch.
For Multi-Instance Learning (MIL), my dataset includes unique IDs, features, and labels for training.
For prediction, I need to provide the dataset with unique IDs and features, but without labels.
Therefore, may I ask whether I should modify my dataset class to handle data without labels during prediction?
Thanks!
Hi Peggy…Yes, you should modify your dataset class to handle data without labels during prediction in Multi-Instance Learning (MIL). Typically, this involves creating a dataset class that can manage both training (with labels) and prediction (without labels) scenarios.
Here’s a general approach to modifying your dataset class:
### 1. Define the Dataset Class
You can create a dataset class that accepts data in both labeled and unlabeled forms. This class should be able to distinguish whether it’s being used for training or prediction based on the presence of labels.
### 2. Handling Different Scenarios
You can add a parameter to indicate whether the dataset includes labels or not. If labels are not provided, the class should handle the data accordingly during prediction.
### Example in PyTorch
Here’s a basic example in PyTorch to illustrate this:
```python
import torch
from torch.utils.data import Dataset

class MILDataset(Dataset):
    def __init__(self, data, labels=None, mode='train'):
        """
        Args:
            data (list or array-like): List of features or instances.
            labels (list or array-like, optional): List of labels corresponding to the data. Default is None.
            mode (str): Either 'train' or 'predict' to indicate the mode of operation. Default is 'train'.
        """
        self.data = data
        self.labels = labels
        self.mode = mode

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        sample = self.data[idx]
        if self.mode == 'train':
            if self.labels is None:
                raise ValueError("Labels must be provided in training mode.")
            label = self.labels[idx]
            return sample, label
        elif self.mode == 'predict':
            return sample
        else:
            raise ValueError("Mode should be either 'train' or 'predict'.")

# Example usage:
# For training
train_data = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
train_labels = [0, 1, 0]
train_dataset = MILDataset(data=train_data, labels=train_labels, mode='train')

# For prediction
predict_data = [[0.7, 0.8], [0.9, 1.0]]
predict_dataset = MILDataset(data=predict_data, mode='predict')
```
### Explanation
- **Initialization (__init__ method)**:
  - The data parameter holds the features.
  - The labels parameter is optional and only required in training mode.
  - The mode parameter specifies whether the dataset is for training or prediction.
- **Length (__len__ method)**:
  - Returns the number of instances in the dataset.
- **Get Item (__getitem__ method)**:
  - If in training mode ('train'), it returns a tuple of (sample, label).
  - If in prediction mode ('predict'), it returns only the sample (feature vector).
### Using the Dataset with DataLoader
You can use this dataset class with PyTorch’s DataLoader for both training and prediction:
```python
from torch.utils.data import DataLoader

# For training
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=True)

# For prediction
predict_loader = DataLoader(predict_dataset, batch_size=2, shuffle=False)
```
This structure allows your dataset class to be flexible and handle both training (with labels) and prediction (without labels) scenarios efficiently.