Last Updated on April 8, 2023
When you build and train a PyTorch deep learning model, you can provide the training data in several different ways. Ultimately, a PyTorch model works like a function that takes a PyTorch tensor and returns another tensor. You have a lot of freedom in how to get the input tensors. Probably the easiest way is to prepare a large tensor of the entire dataset and extract a small batch from it in each training step. But you will see that using the DataLoader can save you a few lines of code in dealing with data.
In this post, you will see how you can use the Dataset and DataLoader classes in PyTorch. After finishing this post, you will learn:
- How to create and use a DataLoader to train your PyTorch model
- How to use the Dataset class to generate data on the fly
Kick-start your project with my book Deep Learning with PyTorch. It provides self-study tutorials with working code.
Let’s get started.

Training a PyTorch Model with DataLoader and Dataset
Photo by Emmanuel Appiah. Some rights reserved.
Overview
This post is divided into three parts; they are:
- What is DataLoader?
- Using DataLoader in a Training Loop
- Create Data Iterator using Dataset Class
What is DataLoader?
To train a deep learning model, you need data. Usually data is available as a dataset, which contains many data samples or instances. You can ask the model to take one sample at a time, but usually you let the model process one batch of several samples. You may create a batch by extracting a slice from the dataset, using the slicing syntax on the tensor. For better quality of training, you may also want to shuffle the entire dataset on each epoch so that no two batches are the same over the entire training loop. Sometimes, you may introduce data augmentation to manually add more variance to the data. This is common for image-related tasks, where you can randomly tilt or zoom an image a bit to generate many data samples from a few images.
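To make this concrete, below is a minimal sketch of what manual batching and shuffling could look like without any helper class. The tensor shapes and data here are made up purely for illustration and are not part of the post's example:

```python
import torch

# purely synthetic stand-ins for a dataset of 208 samples with 60 features each
X = torch.randn(208, 60)
y = torch.randint(0, 2, (208, 1)).float()

batch_size = 16
for epoch in range(5):
    # re-shuffle the entire dataset at the start of each epoch
    perm = torch.randperm(len(X))
    X, y = X[perm], y[perm]
    # extract one batch at a time using the slicing syntax
    for start in range(0, len(X), batch_size):
        X_batch = X[start:start + batch_size]
        y_batch = y[start:start + batch_size]
        ...  # forward pass, loss, and backward pass would go here
```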
You can imagine there can be a lot of code to write to do all of this. But it is much easier with the DataLoader.

The following is an example of how to create a DataLoader and take a batch from it. In this example, the sonar dataset is used; ultimately, it is converted into PyTorch tensors and passed on to the DataLoader:
```python
import pandas as pd
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import LabelEncoder

# Read data, convert to NumPy arrays
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60].values
y = data.iloc[:, 60].values

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)

# convert into PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# create DataLoader, then take one batch
loader = DataLoader(list(zip(X, y)), shuffle=True, batch_size=16)
for X_batch, y_batch in loader:
    print(X_batch, y_batch)
    break
```
You can see from the output above that X_batch and y_batch are PyTorch tensors. The loader is an instance of the DataLoader class, which works like an iterable. Each time you read from it, you get a batch of features and targets drawn from the original dataset.
When you create a DataLoader instance, you need to provide a list of sample pairs. Each sample pair is one data sample consisting of the features and the corresponding target. A list is required because DataLoader expects to use len() to find the total size of the dataset and array indexing to retrieve a particular sample. The batch size is a parameter to DataLoader, so it knows how to create a batch from the entire dataset. You should almost always use shuffle=True, so that every time you load the data, the samples are shuffled. It is useful for training because in each epoch you are going to read every batch once. When you proceed from one epoch to another, DataLoader knows you have depleted all the batches, so it re-shuffles and you get a new combination of samples.
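As a quick illustration of that re-shuffling behavior, here is a small synthetic sketch (not part of the sonar example); printing the batches over two passes shows a different sample order each time:

```python
import torch
from torch.utils.data import DataLoader

# a tiny synthetic dataset of eight (feature, target) pairs
samples = [(torch.tensor([float(i)]), torch.tensor([i % 2.0])) for i in range(8)]
loader = DataLoader(samples, shuffle=True, batch_size=4)

# each complete pass over the loader is one epoch; the order of samples
# differs between the two passes because the loader re-shuffles each time
for epoch in range(2):
    for X_batch, y_batch in loader:
        print(epoch, X_batch.flatten().tolist())
```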
Want to Get Started With Deep Learning with PyTorch?
Take my free email crash course now (with sample code).
Click to sign-up and also get a free PDF Ebook version of the course.
Using DataLoader in a Training Loop
The following is an example of using DataLoader in a training loop:
```python
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split

# train-test split for evaluation of the model
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True)

# set up DataLoader for training set
loader = DataLoader(list(zip(X_train, y_train)), shuffle=True, batch_size=16)

# create model
model = nn.Sequential(
    nn.Linear(60, 60),
    nn.ReLU(),
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid()
)

# Train the model
n_epochs = 200
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# evaluate accuracy after training
model.eval()
y_pred = model(X_test)
acc = (y_pred.round() == y_test).float().mean()
acc = float(acc)
print("Model accuracy: %.2f%%" % (acc*100))
```
You can see that once you have created the DataLoader instance, the training loop becomes much simpler. In the above, only the training set is packaged with a DataLoader because you need to loop through it in batches. You can also create a DataLoader for the test set and use it for model evaluation, but since the accuracy is computed over the entire test set at once rather than batch by batch, the benefit of a DataLoader there is not significant.
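If you do want to evaluate in batches, for example when the test set is too large to score in one forward pass, a possible sketch looks like the following; the accumulation logic here is my own assumption and not part of the post's example:

```python
# hypothetical batched evaluation with a test-set DataLoader
testloader = DataLoader(list(zip(X_test, y_test)), shuffle=False, batch_size=16)

model.eval()
correct = 0
total = 0
with torch.no_grad():
    for X_batch, y_batch in testloader:
        y_pred = model(X_batch)
        correct += float((y_pred.round() == y_batch).float().sum())
        total += len(y_batch)
print("Model accuracy: %.2f%%" % (correct / total * 100))
```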
Putting everything together, below is the complete code.
```python
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Read data, convert to NumPy arrays
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60].values
y = data.iloc[:, 60].values

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y)

# convert into PyTorch tensors
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).reshape(-1, 1)

# train-test split for evaluation of the model
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True)

# set up DataLoader for training set
loader = DataLoader(list(zip(X_train, y_train)), shuffle=True, batch_size=16)

# create model
model = nn.Sequential(
    nn.Linear(60, 60),
    nn.ReLU(),
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid()
)

# Train the model
n_epochs = 200
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# evaluate accuracy after training
model.eval()
y_pred = model(X_test)
acc = (y_pred.round() == y_test).float().mean()
acc = float(acc)
print("Model accuracy: %.2f%%" % (acc*100))
```
Create Data Iterator using Dataset Class
In PyTorch, there is a Dataset class that can be tightly coupled with the DataLoader class. Recall that DataLoader expects its first argument to work with len() and with array indexing. The Dataset class is a base class for this. The reason you may want to use the Dataset class is that some special handling is needed before you can get the data sample. For example, data may be read from a database or from disk, and you may only want to keep a few samples in memory rather than prefetching everything. Another example is performing real-time preprocessing of data, such as the random augmentation that is common in image tasks.
To use the Dataset class, you just subclass it and implement two member functions. Below is an example:
```python
import torch
from torch.utils.data import Dataset

class SonarDataset(Dataset):
    def __init__(self, X, y):
        # convert into PyTorch tensors and remember them
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        # this should return the size of the dataset
        return len(self.X)

    def __getitem__(self, idx):
        # this should return one sample from the dataset
        features = self.X[idx]
        target = self.y[idx]
        return features, target
```
This is not the most powerful way to use Dataset, but it is simple enough to demonstrate how it works. With this, you can create a DataLoader and use it for model training. Modifying the previous example, you have the following:
```python
...

# set up DataLoader for training set
dataset = SonarDataset(X_train, y_train)
loader = DataLoader(dataset, shuffle=True, batch_size=16)

# create model
model = nn.Sequential(
    nn.Linear(60, 60),
    nn.ReLU(),
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid()
)

# Train the model
n_epochs = 200
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# evaluate accuracy after training
model.eval()
y_pred = model(torch.tensor(X_test, dtype=torch.float32))
y_test = torch.tensor(y_test, dtype=torch.float32)
acc = (y_pred.round() == y_test).float().mean()
acc = float(acc)
print("Model accuracy: %.2f%%" % (acc*100))
```
You set up dataset as an instance of SonarDataset, in which you implemented the __len__() and __getitem__() functions. It is used in place of the list in the previous example to set up the DataLoader instance. Afterward, everything in the training loop stays the same. Note that you still use PyTorch tensors directly for the test set in this example.
In the __getitem__() function, you take an integer that works like an array index and return a pair: the features and the target. You can implement anything in this function: run some code to generate a synthetic data sample, read data on the fly from the internet, or add random variations to the data. You will also find it useful when you cannot keep the entire dataset in memory, so that you can load only the data samples you need.
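For instance, here is a hedged sketch of how __getitem__() could add a small random perturbation each time a sample is drawn; the class name and noise level are assumptions for illustration only, not part of the post's example:

```python
import torch
from torch.utils.data import Dataset

# a hypothetical variant of SonarDataset that perturbs the features a little
# every time a sample is read; the noise level is an arbitrary choice
class NoisySonarDataset(Dataset):
    def __init__(self, X, y, noise=0.01):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)
        self.noise = noise

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        # a fresh random perturbation is drawn on every read, so each epoch
        # sees a slightly different version of the same sample
        features = self.X[idx] + self.noise * torch.randn_like(self.X[idx])
        return features, self.y[idx]
```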
In fact, since you created a PyTorch dataset, you don't need scikit-learn to split the data into a training set and a test set. In the torch.utils.data submodule, there is a function random_split() that works with the Dataset class for the same purpose. A full example is below:
```python
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader, random_split, default_collate
from sklearn.preprocessing import LabelEncoder

# Read data, convert to NumPy arrays
data = pd.read_csv("sonar.csv", header=None)
X = data.iloc[:, 0:60].values
y = data.iloc[:, 60].values

# encode class values as integers
encoder = LabelEncoder()
encoder.fit(y)
y = encoder.transform(y).reshape(-1, 1)

class SonarDataset(Dataset):
    def __init__(self, X, y):
        # convert into PyTorch tensors and remember them
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.float32)

    def __len__(self):
        # this should return the size of the dataset
        return len(self.X)

    def __getitem__(self, idx):
        # this should return one sample from the dataset
        features = self.X[idx]
        target = self.y[idx]
        return features, target

# set up DataLoader for data set
dataset = SonarDataset(X, y)
trainset, testset = random_split(dataset, [0.7, 0.3])
loader = DataLoader(trainset, shuffle=True, batch_size=16)

# create model
model = nn.Sequential(
    nn.Linear(60, 60),
    nn.ReLU(),
    nn.Linear(60, 30),
    nn.ReLU(),
    nn.Linear(30, 1),
    nn.Sigmoid()
)

# Train the model
n_epochs = 200
loss_fn = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

model.train()
for epoch in range(n_epochs):
    for X_batch, y_batch in loader:
        y_pred = model(X_batch)
        loss = loss_fn(y_pred, y_batch)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# create one test tensor from the testset
X_test, y_test = default_collate(testset)
model.eval()
y_pred = model(X_test)
acc = (y_pred.round() == y_test).float().mean()
acc = float(acc)
print("Model accuracy: %.2f%%" % (acc*100))
```
It is very similar to the example you saw before. Beware that the PyTorch model still needs a tensor as input, not a Dataset. Hence, in the above, you need to use the default_collate() function to collect samples from the dataset into tensors.
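An equivalent way to get the same test tensors, if you prefer to stay with DataLoader, is sketched below; this is an alternative I am suggesting, not what the example above uses:

```python
# one possible alternative to default_collate(): take a single batch
# covering the whole test set from a DataLoader
testloader = DataLoader(testset, shuffle=False, batch_size=len(testset))
X_test, y_test = next(iter(testloader))
```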
Further Readings
This section provides more resources on the topic if you are looking to go deeper.
- torch.utils.data from PyTorch documentation
- Datasets and DataLoaders from PyTorch tutorial
Summary
In this post, you learned how to use DataLoader to create shuffled batches of data and how to use Dataset to provide data samples. Specifically, you learned:

- DataLoader as a convenient way of providing batches of data to the training loop
- How to use Dataset to produce data samples
- How to combine Dataset and DataLoader to generate batches of data on the fly for model training
Hi, is shuffling appropriate for forecasting problems?
Usually no. It sounds like you’re talking about a time series problem and we do not want to lose the time ordering information. Therefore, shuffling is not recommended. But you can transform a time series into windows and shuffle the windows. Hope that helps.
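For instance, a rough sketch of that windowing idea (the series, window length, and batch size here are made-up values for illustration):

```python
import torch
from torch.utils.data import DataLoader

series = torch.arange(100, dtype=torch.float32)  # a toy time series
window = 10  # an arbitrary window length

# each sample keeps its internal time ordering: a window of past values
# as features and the value that follows as the target
samples = [(series[i:i + window], series[i + window:i + window + 1])
           for i in range(len(series) - window)]

# shuffling the windows is fine because the ordering inside each window stays intact
loader = DataLoader(samples, shuffle=True, batch_size=16)
```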
In the intro you mentioned that pytorch models can handle taking a large tensor of data. Does that mean I can load in the MNIST dataset, set a x_train and y_train of all the training data as tensors and train like this?
```python
for epoch in range(20):
    print("epoch:" + str(epoch))
    model.train()  # puts the model in training mode
    y_pred = model(X_train)  # Forward pass
    loss_calc = loss_func(y_pred, y_train)
    optimizer.zero_grad()
    loss_calc.backward()
    optimizer.step()
```
I'm a little confused about what DataLoader actually does. The loader iterates over the data in batches and feeds each batch to the model. Is that saying that it's loading these batches into memory, where each batch can be thought of as a small tensor of data that is being fed into the model? Much like how we can feed a whole tensor of data in my opening statement, this is feeding a tensor of one batch.