You must know how to load data before you can use it to train a machine learning model.
When starting out, it is a good idea to stick with small in-memory datasets using standard file formats like comma separated value (.csv).
In this tutorial you will discover how to load your data in Python from scratch, including:
- How to load a CSV file.
- How to convert strings from a file to floating point numbers.
- How to convert class values from a file to integers.
Let’s get started.
- Update Nov/2016: Added an improved data loading function to skip empty lines.
- Update Aug/2018: Tested and updated to work with Python 3.6.

How to Load Machine Learning Data From Scratch In Python
Photo by Amanda B, some rights reserved.
Description
Comma Separated Values
The standard file format for small datasets is Comma Separated Values or CSV.
In its simplest form, a CSV file consists of rows of data. Each row is divided into columns using a comma (“,”).
You can learn more about the CSV file format in RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files.
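To make the format concrete, here is a minimal sketch that parses a few lines of CSV text held in memory. The sample text is made up for illustration, and io.StringIO is used only as a stand-in for a real file.

```python
from csv import reader
from io import StringIO

# Made-up sample text: two rows with three columns each.
# The quoted field in the second row contains a comma, which RFC 4180 allows.
sample = '6,148,72\n1,"85,5",66\n'

# reader() splits each line into a list of string values
rows = list(reader(StringIO(sample)))
for row in rows:
    print(row)
# ['6', '148', '72']
# ['1', '85,5', '66']
```

Note that every value comes back as a string, including the ones that look like numbers.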
In this tutorial, we are going to practice loading two different standard machine learning datasets in CSV format.
Pima Indians Diabetes Dataset
The first is the Pima Indians diabetes dataset. It contains 768 rows and 9 columns.
All of the values in the file are numeric, specifically floating point values. We will learn how to load the file first, then later how to convert the loaded strings to numeric values.
Iris Flower Species Dataset
The second dataset we will work with is the iris flowers dataset.
It contains 150 rows and 5 columns. The first 4 columns are numeric. It is different in that the class value (final column) is a string, indicating a species of flower. We will learn how to convert the numeric columns from strings to numbers and how to convert the flower species string into an integer that we can use consistently.
Tutorial
This tutorial is divided into 3 parts:
- Load a file.
- Load a file and convert Strings to Floats.
- Load a file and convert Strings to Integers.
These steps will provide the foundations you need to handle loading your own data.
1. Load CSV File
The first step is to load the CSV file.
We will use the csv module that is a part of the standard library.
The reader() function in the csv module takes a file as an argument.
We will create a function called load_csv() to wrap this behavior that will take a filename and return our dataset. We will represent the loaded dataset as a list of lists. The first list is a list of observations or rows, and the second list is the list of column values for a given row.
Below is the complete function for loading a CSV file.
from csv import reader

# Load a CSV file
def load_csv(filename):
    file = open(filename, "r")
    lines = reader(file)
    dataset = list(lines)
    return dataset
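As a rough illustration of the list-of-lists layout, the sketch below builds a tiny dataset by hand and picks out a row, a single value and a column. The values are hypothetical and are not loaded from any file.

```python
# Hypothetical in-memory dataset in the same list-of-lists layout
dataset = [
    ['6', '148', '72'],
    ['1', '85', '66'],
    ['8', '183', '64'],
]

# The outer list holds rows; each inner list holds the column values of one row
first_row = dataset[0]
value = dataset[1][2]
second_column = [row[1] for row in dataset]

print(len(dataset), len(dataset[0]))  # 3 3
print(first_row)                      # ['6', '148', '72']
print(value)                          # 66
print(second_column)                  # ['148', '85', '183']
```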
We can test this function by loading the Pima Indians dataset. Download the dataset and place it in the current working directory with the name pima-indians-diabetes.csv. Open the file and delete any empty lines at the bottom.
Taking a peek at the first 5 rows of the raw data file we can see the following:
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
The data is numeric and separated by commas, and we can expect the rest of the file to follow the same pattern.
Let’s use the new function and load the dataset. Once loaded we can report some simple details such as the number of rows and columns loaded.
Putting all of this together, we get the following:
from csv import reader

# Load a CSV file
def load_csv(filename):
    file = open(filename, "r")
    lines = reader(file)
    dataset = list(lines)
    return dataset

# Load dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
Running this example we see:
Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns
A limitation of this function is that it will load empty lines from data files and add them to our list of rows. We can overcome this by adding rows of data one at a time to our dataset and skipping empty rows.
Below is the updated example with this new improved version of the load_csv() function.
# Example of loading Pima Indians CSV dataset
from csv import reader

# Load a CSV file
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            # skip empty lines
            if not row:
                continue
            dataset.append(row)
    return dataset

# Load dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
Running this example we see:
Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns
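To see why skipping empty rows matters, the small sketch below reads made-up CSV text that contains blank lines. It assumes io.StringIO as a stand-in for a file; the variable names are for illustration only.

```python
from csv import reader
from io import StringIO

# Made-up file contents with a blank line in the middle and one at the end
sample = '6,148,72\n\n1,85,66\n\n'

# Without filtering, each blank line becomes an empty list
raw_rows = list(reader(StringIO(sample)))
# Keep only the rows that contain at least one value
kept_rows = [row for row in raw_rows if row]

print(raw_rows)   # [['6', '148', '72'], [], ['1', '85', '66'], []]
print(kept_rows)  # [['6', '148', '72'], ['1', '85', '66']]
```

An empty list left in the dataset has no columns, so any later code that indexes a column of that row, such as row[0], would raise an IndexError.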
2. Convert String to Floats
Most, if not all, machine learning algorithms prefer to work with numbers.
Specifically, floating point numbers are preferred.
Our code for loading a CSV file returns a dataset as a list of lists, but each value is a string. We can see this if we print out one record from the dataset:
print(dataset[0])
This produces output like:
['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']
We can write a small function to convert specific columns of our loaded dataset to floating point values.
Below is this function, called str_column_to_float(). It converts a given column in the dataset to floating point values, taking care to strip any whitespace from each value before making the conversion.
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())
We can test this function by combining it with our load CSV function above, and convert all of the numeric data in the Pima Indians dataset to floating point values.
The complete example is below.
from csv import reader

# Load a CSV file, skipping empty lines
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Load pima-indians-diabetes dataset
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
print(dataset[0])
# convert string columns to float
for i in range(len(dataset[0])):
    str_column_to_float(dataset, i)
print(dataset[0])
Running this example we see the first row of the dataset printed both before and after the conversion. We can see that the values in each column have been converted from strings to numbers.
Loaded data file pima-indians-diabetes.csv with 768 rows and 9 columns
['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']
[6.0, 148.0, 72.0, 35.0, 0.0, 33.6, 0.627, 50.0, 1.0]
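The conversion above raises a ValueError as soon as it meets a value that float() cannot parse. As one possible way to inspect a column before converting it, the sketch below collects the offending values instead of stopping at the first one. The helper name find_non_numeric() and the sample data are assumptions made for illustration; they are not part of the tutorial's code.

```python
# Hypothetical helper: report values in a column that float() cannot parse
def find_non_numeric(dataset, column):
    problems = []
    for i, row in enumerate(dataset):
        try:
            float(row[column].strip())
        except ValueError:
            # remember the row index and the raw value that failed
            problems.append((i, row[column]))
    return problems

# Made-up dataset: the second column has a placeholder and an empty value
dataset = [['6', '148'], ['1', ' 85 '], ['8', '?'], ['2', '']]
problems = find_non_numeric(dataset, 1)
print(problems)  # [(2, '?'), (3, '')]
```

Checking a column this way makes it easier to decide whether to fix the file, drop the rows, or substitute a value before calling str_column_to_float().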
3. Convert String to Integers
The iris flowers dataset is like the Pima Indians dataset, in that the columns contain numeric data.
The difference is the final column, traditionally used to hold the outcome or value to be predicted for a given row. The final column in the iris flowers data is the iris flower species as a string.
Download the dataset and place it in the current working directory with the file name iris.csv. Open the file and delete any empty lines at the bottom.
For example, below are the first 5 rows of the raw dataset.
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
Some machine learning algorithms prefer all values to be numeric, including the outcome or predicted value.
We can convert the class value in the iris flowers dataset to an integer by creating a map.
- First, we locate all of the unique class values, which happen to be: Iris-setosa, Iris-versicolor and Iris-virginica.
- Next, we assign an integer value to each, such as: 0, 1 and 2.
- Finally, we replace all occurrences of class string values with their corresponding integer values.
Below is a function to do just that called str_column_to_int(). Like the previously introduced str_column_to_float() it operates on a single column in the dataset.
# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup
We can test this new function together with the previous two functions for loading a CSV file and converting columns to floating point values. The str_column_to_int() function also returns the dictionary mapping of class values to integer values, in case any users downstream want to convert predictions back to string values.
The example below loads the iris dataset, then converts the first 4 columns to floats and the final column to integer values.
from csv import reader

# Load a CSV file, skipping empty lines
def load_csv(filename):
    dataset = list()
    with open(filename, 'r') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert string column to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    unique = set(class_values)
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Load iris dataset
filename = 'iris.csv'
dataset = load_csv(filename)
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
print(dataset[0])
# convert string columns to float
for i in range(4):
    str_column_to_float(dataset, i)
# convert class column to int
lookup = str_column_to_int(dataset, 4)
print(dataset[0])
print(lookup)
Running this example produces the output below.
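We can see the first row of the dataset before and after the data type conversions. We can also see the dictionary mapping of class values to integers. Because the unique class values are collected in a set, the integer assigned to each class may differ from run to run.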
Loaded data file iris.csv with 150 rows and 5 columns
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
[5.1, 3.5, 1.4, 0.2, 1]
{'Iris-virginica': 0, 'Iris-setosa': 1, 'Iris-versicolor': 2}
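The returned mapping can be inverted to turn integer predictions back into class names. The sketch below shows one way to do that, along with a hypothetical variant, str_column_to_int_sorted(), that sorts the unique values first so the same class always gets the same integer. The variant and the sample predictions are assumptions for illustration, not part of the tutorial's code.

```python
# Mapping as printed in the example output above
lookup = {'Iris-virginica': 0, 'Iris-setosa': 1, 'Iris-versicolor': 2}

# Invert the mapping: integer -> class name
inverse = {index: name for name, index in lookup.items()}

# Made-up integer predictions converted back to strings
predictions = [1, 2, 0]
labels = [inverse[p] for p in predictions]
print(labels)  # ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

# Hypothetical variant: sort the unique values for a repeatable mapping
def str_column_to_int_sorted(dataset, column):
    unique = sorted(set(row[column] for row in dataset))
    mapping = {value: i for i, value in enumerate(unique)}
    for row in dataset:
        row[column] = mapping[row[column]]
    return mapping

rows = [[5.1, 'Iris-setosa'], [6.3, 'Iris-virginica'], [7.0, 'Iris-versicolor']]
sorted_lookup = str_column_to_int_sorted(rows, 1)
print(sorted_lookup)  # {'Iris-setosa': 0, 'Iris-versicolor': 1, 'Iris-virginica': 2}
```

Sorting costs little on a small dataset and makes results easier to compare across runs.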
Extensions
You learned how to load CSV files and perform basic data conversions.
Data loading can be a difficult task given the variety of data cleaning and conversion that may be required from problem to problem.
There are many extensions you could make so that these examples are more robust to new and different data files. Below are just a few ideas you can consider researching and implementing yourself:
- Detect and remove empty lines at the top or bottom of the file.
- Detect and handle missing values in a column.
- Detect and handle rows that do not match expectations for the rest of the file.
- Support for other delimiters such as “|” (pipe) or white space.
- Support more efficient data structures such as arrays.
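As a starting point for two of these ideas, the sketch below accepts a different delimiter and flags rows whose length does not match the first row. The function names load_csv_rows() and rows_with_wrong_length() and the sample text are hypothetical; the delimiter argument is a standard option of csv.reader().

```python
from csv import reader
from io import StringIO

# Hypothetical loader: reads from an open file object with a chosen delimiter
def load_csv_rows(file, delimiter=','):
    dataset = []
    for row in reader(file, delimiter=delimiter):
        # skip empty lines
        if not row:
            continue
        dataset.append(row)
    return dataset

# Hypothetical check: indexes of rows whose length differs from the first row
def rows_with_wrong_length(dataset):
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]

# Made-up semicolon-separated text with one short row and one blank line
sample = StringIO('7;0.27;0.36\n6.3;0.3\n\n8.1;0.28;0.4\n')
dataset = load_csv_rows(sample, delimiter=';')
bad = rows_with_wrong_length(dataset)

print(dataset)  # [['7', '0.27', '0.36'], ['6.3', '0.3'], ['8.1', '0.28', '0.4']]
print(bad)      # [1]
```

Taking an open file object rather than a filename is a design choice made here so the sketch can run on in-memory text; a filename-based version would wrap the same loop in a with open(...) block.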
Two libraries you may wish to use in practice for loading CSV data are NumPy and Pandas.
NumPy offers the loadtxt() function for loading data files as NumPy arrays. Pandas offers the read_csv() function, which provides a lot of flexibility regarding data types, file headers and more.
Review
In this tutorial, you discovered how you can load your machine learning data from scratch in Python.
Specifically, you learned:
- How to load a CSV file into memory.
- How to convert string values to floating point values.
- How to convert a string class value into an integer encoding.
Do you have any questions about loading machine learning data or about this post?
Ask your question in the comments and I will do my best to answer.
Many thanks, could you please show us the best way to save a NumPy array and Pandas data as a CSV file?
Sorry, I do not have an example of saving data to a CSV.
Could you show us how to load our own image data to replace the mnist data for convolutional neural network in your Deep Learning with Python?
Hi Harold, great question. I will make time to prepare an example soon.
Thank you so much
Jason, I have a question. In the case that the dataset is not nicely organized into columns and rows (like how Iris, Pima D.S. are) but rather a random dump of strings, how can I convert the dataset so that the machine learning algorithms can recognize it? For example, I am trying to use machine learning algorithms to classify different malware log files. However, the log file has bunch of strings, symbols, as well as numbers. They are usually a haphazard collection of random queries (strings) that cannot be organized into columns, such as “sepal-length” or “class”, like above. Do you have any recommendations for me? I don’t know who else to ask..Thank you for your time.
Hi Matt, great question.
Maybe you can use feature engineering to pick out details and represent each observation using a row of binary or other variables:
https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
Maybe you can use natural language processing methods to work with the text directly or in a projected form.
I hope that helps as a start.
You’re welcome Matt.
Hi Jason
thanks for the examples:
could you help me with this?
when I run this section:
# convert string columns to float
for i in range(4):
    str_column_to_float(dataset, i)
# convert class column to int
lookup = str_column_to_int(dataset, 4)
print(dataset[0])
print(lookup)
I get this error:
IndexError Traceback (most recent call last)
in ()
1 # convert string columns to float
2 for i in range(4):
—-> 3 str_column_to_float(dataset, i)
4 # convert class column to int
5 lookup = str_column_to_int(dataset, 4)
in str_column_to_float(dataset, column)
2 def str_column_to_float(dataset, column):
3 for row in dataset:
—-> 4 row[column] = float(row[column].strip())
IndexError: list index out of range
How do i fix that?
Thanks
Are you using Python 2.7?
No I’m using Python 3.5
The example was developed for Python 2.7, I hope to update it for Python 3 soon.
Thanks
I’ll keep trying to figure it out.
Hi Jason,
I have defined a load_csv function And typed following statements,
filename = 'pima-indians-diabetes.csv'
dataset = load_csv(filename)
I get the following error, Please help me.
Syntax Error: invalid syntax
It sounds like a problem with your code.
Perhaps you have extra spaces?
Try running the code directly on the Python interpreter.
python2 import1.py
['5.1', '3.5', '1.4', '0.2', 'Iris-setosa']
Traceback (most recent call last):
  File "import1.py", line 32, in
    str_column_to_float(dataset, i)
  File "import1.py", line 12, in str_column_to_float
    row[column] = float(row[column].strip())
IndexError: list index out of range
why am I getting this error even though I’m using python2?
Perhaps double check that you copied all of the code without error?
hi Jason,
Thanks for your awesome post! Never fail to amaze me 🙂
Anyway, I’m having trouble converting my dataset that contained numeric and string values.
My dataset looks like this:
Timestamp ID Length Data
0.0000022 02a1 8 05 20 ea 0a 20 1a 00 7f
This is unsupervised learning and planning to adopt LSTM sequence classification in detecting anomalies in the dataset. No target variable. So when I want to convert them into floats, I got this error:
"ValueError: could not convert string to float: '05 20 ea 0a 20 1a 00 7f'"
Is it because of spaces between them? Even the ‘ID’ rows were not converted to floats. I’m using Keras, my script looks like this:
>>> dataset = data.values
>>> dataset = dataset.astype(
>>> dataset = dataset.astype(‘float32’)
Thank you, I’m new to machine learning. Looking forward to your reply, Jason.
You will need to encode the text to numbers. You can use an integer encoder and/or a one hot encoder for label data or a bag of words or word embedding for real text data.
I have examples of all of these on the blog, try the search as a first step.
Thank you Jason for your quick response. I’ve found several of your topics in this blog concerning converting into floats stuff like you said. I will let you know once it’s working.
Thanks!
You’re welcome.
Hi Jason,
while running this code I’m getting an error
row[column] = float(row[column].strip())
ValueError: could not convert string to float: '7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6'
Could you please let me know the exact problem and how I can rectify the issue.
Looks like you have a semicolon separator instead of the expected comma.
Perhaps double check you have the right data file?
By the way, when printing the result of the CSV read, the command should be:
print('Loaded data file {0} with {1} rows and {2} columns'.format(filename, len(dataset), len(dataset[0])))
, not:
print('Loaded data file {0} with {1} rows and {2} columns').format(filename, len(dataset), len(dataset[0]))
The .format() applies to the string, not the print() function.
Thanks.
I did as follows and converted my data from string. But then when I apply other functions mentioned in the first tutorial page such as dataset.describe and dataset.head it says list is object and cannot be called.
How do I go about this?
Thanks
If you are new to Python, perhaps start with scikit-learn and Pandas instead of coding algorithms from scratch.
Here is a good place to start:
https://machinelearningmastery.com/start-here/#python
Hello Jason,
Great tutorial!
When I run the script convert strings to float I get the error message:
Error: iterator should return strings, not bytes (did you open the file in text mode?)
Am I doing something incorrect?
Hope to hear from you, thanks!
Regards, Rutger
Perhaps try opening the file in text mode on your system?
https://docs.python.org/3/library/functions.html#open
Hello Jason!
after this conversion (e.g. str to int), is the data ready for spot check algorithms etc etc or not?
Do i have to do anything else before or after the conversion? like OneHot encoding..
Note that I have a dataset similar with iris dataset, but with one integer column and three string columns.
thank you in advance!
It depends on the data. E.g. some data may require you to encode categorical variables first, and you may need to spot-check data preparation methods in addition to algorithms.
Hello to everyone!!
I have a dataset with 3991 rows and 8 Columns with different data types as seen below:
15; 1215; FALSE; feed; 1; TRUE; TRUE; monument; attraction (CSV example)
and when I run the code to convert them, I am getting this:
Loaded data file INSTA POSTS.csv with 3991 rows and 1 columns.
Why does it display only 1 column while it has 8 columns?
I used your code several times and it worked great on similar datasets!
I don’t understand what happens now!
Any idea?
thank you in advance !
Faiy
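CSV stands for comma separated values, so the loader expects commas, not semicolons. Because your file uses semicolons, each line is read as a single column. Either convert the file to use commas or pass delimiter=';' to reader().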