Last Updated on August 21, 2019
You must be able to load your data before you can start your machine learning project.
The most common format for machine learning data is CSV files. There are a number of ways to load a CSV file in Python.
In this post you will discover the different ways that you can use to load your machine learning data in Python.
Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Update March/2017: Changed loading from binary ('rb') to text ('rt').
- Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.
- Update March/2018: Updated NumPy load from URL example to work with Python 3.

How To Load Machine Learning Data in Python
Photo by Ann Larie Valentine, some rights reserved.
Considerations When Loading CSV Data
There are a number of considerations when loading your machine learning data from CSV files.
For reference, you can learn a lot about the expectations for CSV files by reviewing the CSV request for comment titled Common Format and MIME Type for Comma-Separated Values (CSV) Files.
CSV File Header
Does your data have a file header?
If so this can help in automatically assigning names to each column of data. If not, you may need to name your attributes manually.
Either way, you should explicitly specify whether or not your CSV file has a file header when loading your data.
Comments
Does your data have comments?
Comments in a CSV file are indicated by a hash (“#”) at the start of a line.
If you have comments in your file, depending on the method used to load your data, you may need to indicate whether or not to expect comments and the character to expect to signify a comment line.
Delimiter
The standard delimiter that separates fields is the comma (",") character.
Your file could use a different delimiter, like a tab ("\t"), in which case you must specify it explicitly.
Quotes
Sometimes field values contain spaces. In these CSV files the values are often quoted.
The default quote character is the double quotation mark ("). Other characters can be used, and you must specify the quote character used in your file.
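Putting these considerations together, below is a minimal sketch of how they map to loading options, using the pandas.read_csv() function covered later in this post. The file name, delimiter, header and comment character here are assumptions for illustration only.
# Minimal sketch only: 'data.tsv' is a hypothetical tab-delimited file with a
# header row, '#' comment lines and double-quoted field values.
import pandas
data = pandas.read_csv('data.tsv', sep='\t', header=0, comment='#', quotechar='"')
print(data.shape)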
Need help with Machine Learning in Python?
Take my free 2-week email course and discover data prep, algorithms and more (with code).
Click to sign-up now and also get a free PDF Ebook version of the course.
Machine Learning Data Loading Recipes
Each recipe is standalone.
This means that you can copy and paste it into your project and use it immediately.
If you have any questions about these recipes or suggested improvements, please leave a comment and I will do my best to answer.
Load CSV with Python Standard Library
The Python standard library provides the csv module and its reader() function, which can be used to load CSV files.
Once loaded, you can convert the CSV data to a NumPy array and use it for machine learning.
For example, you can download the Pima Indians dataset into your local directory (download from here).
All fields are numeric and there is no header line. Running the recipe below will load the CSV file and convert it to a NumPy array.
# Load CSV (using python)
import csv
import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
reader = csv.reader(raw_data, delimiter=',', quoting=csv.QUOTE_NONE)
x = list(reader)
data = numpy.array(x).astype('float')
print(data.shape)
The example loads an object that can iterate over each row of the data and can easily be converted into a NumPy array. Running the example prints the shape of the array.
(768, 9)
For more information on the csv.reader() function, see CSV File Reading and Writing in the Python API documentation.
Load CSV File With NumPy
You can load your CSV data using NumPy and the numpy.loadtxt() function.
This function assumes no header row and all data has the same format. The example below assumes that the file pima-indians-diabetes.data.csv is in your current working directory.
# Load CSV
import numpy
filename = 'pima-indians-diabetes.data.csv'
raw_data = open(filename, 'rt')
data = numpy.loadtxt(raw_data, delimiter=",")
print(data.shape)
Running the example will load the file as a numpy.ndarray and print the shape of the data:
(768, 9)
This example can be modified to load the same dataset directly from a URL as follows:
Note: This example assumes you are using Python 3.
# Load CSV from URL using NumPy
from numpy import loadtxt
from urllib.request import urlopen
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
raw_data = urlopen(url)
dataset = loadtxt(raw_data, delimiter=",")
print(dataset.shape)
Again, running the example produces the same resulting shape of the data.
(768, 9)
For more information on the numpy.loadtxt() function see the API documentation (version 1.10 of numpy).
Load CSV File With Pandas
You can load your CSV data using Pandas and the pandas.read_csv() function.
This function is very flexible and is perhaps my recommended approach for loading your machine learning data. The function returns a pandas.DataFrame that you can immediately start summarizing and plotting.
The example below assumes that the ‘pima-indians-diabetes.data.csv‘ file is in the current working directory.
# Load CSV using Pandas
import pandas
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(filename, names=names)
print(data.shape)
Note that in this example we explicitly specify the names of each attribute to the DataFrame. Running the example displays the shape of the data:
(768, 9)
We can also modify this example to load CSV data directly from a URL.
# Load CSV using Pandas from URL
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
print(data.shape)
Again, running the example downloads the CSV file, parses it and displays the shape of the loaded DataFrame.
(768, 9)
To learn more about the pandas.read_csv() function you can refer to the API documentation.
Summary
In this post you discovered how to load your machine learning data in Python.
You learned three specific techniques that you can use:
- Load CSV with Python Standard Library.
- Load CSV File With NumPy.
- Load CSV File With Pandas.
Your action step for this post is to type or copy-and-paste each recipe and get familiar with the different ways that you can load machine learning data in Python.
Do you have any questions about loading machine learning data in Python or about this post? Ask your question in the comments and I will do my best to answer it.
Hi!
What is meant here in the section "Load CSV with Python Standard Library" by "You can download the Pima Indians dataset into your local directory"?
Where is my local directory?
I tried several ways, but it did not work
It means to download the CSV file to the directory where you are writing Python code. Your project’s current working directory.
Thank you, I got it now!
thanks buddy
You’re welcome.
thx
You are very welcome Anon!
For those using Anaconda, you can launch a Jupyter notebook and upload the data to the notebook, that being your working directory.
Thank you David for your recommendation!
hi
How can I load a video dataset in Python, without TensorFlow, Keras, …?
I googled “python load video” and found this:
http://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_gui/py_video_display/py_video_display.html
Is it possible to store the dataset in E drive while my python files are in C drive?
I don’t think Python cares where you store files.
Hello,
I want to keep only two columns from a CSV file and use those numbers, as x-y points, for a k-means implementation that I am doing.
What I do now to generate my points is this:
points = np.vstack(((np.random.randn(150, 2) * 0.75 + np.array([1, 0])),
                    (np.random.randn(50, 2) * 0.25 + np.array([-0.5, 0.5])),
                    (np.random.randn(50, 2) * 0.5 + np.array([-0.5, -0.5]))))
But I want to apply my code on actual data.
Any help?
Sorry, I don’t have any kmeans tutorials in Python. I may not be the best person to give you advice.
I don't want anything about k-means; I have the code, computations and all, sorted out. I just want some help with the CSV files.
Thank you for explaining how to load data in detail.
They work perfectly.
I’m glad to hear it!
I’m glad it helped Steve.
Thank you very much… really helpful…
I’m glad to hear that Fawad.
How do I load a text attribute? I got an error saying: could not convert string to float: b'Iris-setosa'
You will need to load the data using Pandas then convert it to numbers.
I give examples of this.
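For example, here is a minimal sketch along those lines; the file name and the iris-style layout (string class labels in the last column) are assumptions for illustration:
# Minimal sketch: load with Pandas, then encode the string class labels as integers.
# 'iris.data.csv' is a hypothetical file name used for illustration.
import pandas
data = pandas.read_csv('iris.data.csv', header=None)
last = data.columns[-1]
codes, labels = pandas.factorize(data[last])  # e.g. 'Iris-setosa' -> 0
data[last] = codes
print(data.head())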
I was just wondering what the best practices are for converting something in a Relational Database model to an optimal ML format for fields that could be redundant. Ideally the export would be in CSV, but I know it won’t be as simple as an export every time. Hopefully simple example to illustrate my question: Say I have a table where I attribute things to an animal. The structure could be set up similarly to this:
ID, Animal, Color,Continent
1,Zebra,Black,Africa
2,Zebra,White,Africa
With the goal of being able to say “If the color is black and white and lives in Africa, it’s probably a zebra.” …so each line represents the animal with a single color associated with it, and other fields as well. Would this type of format be a best practice to feed into the model as is? Or, would it make more sense to concatenate the colors into one line with a delimiter? In other words, it may not always be a 1:1 relationship, and in cases where the dataset is like that, what’s the best way of formatting?
Thanks for your time.
Great question. There are no hard rules. Generally, I would recommend exploring as many representations as you can think of and seeing what works best.
This post might help to give you some ideas:
https://machinelearningmastery.mystagingwebsite.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
Can you tell me how to select features from a CSV file?
Load the file and use feature selection algorithms:
https://machinelearningmastery.mystagingwebsite.com/feature-selection-in-python-with-scikit-learn/
Hey,
I am trying to load some line-separated data.
name:disha
gender:female
majors:computer science
name:
gender:
majors:
Any advice on this?
Ouch, looks like you might need to write some custom code to load each “line” or entity.
Can you tell me how to load a CSV file and apply feature selection methods? Can you post code for the grey wolf optimizer algorithm?
Yes, see this post:
https://machinelearningmastery.mystagingwebsite.com/feature-selection-in-python-with-scikit-learn/
I have loaded the data into a numpy array. What is the next thing that I should do to train my model?
Follow this process:
https://machinelearningmastery.mystagingwebsite.com/start-here/#process
Hey,
I want to use the KDD Cup 99 dataset for an intrusion detection project. The dataset consists of string and numerical data. So should I convert the entire dataset into numeric data, or should I use it as it is?
Eventually all data will need to be numbers.
Hey Jason,
I have a dataset in CSV which has a header, and all the columns have different datatypes.
Which one would be better to use in this scenario: loadtxt() or genfromtxt()?
Also, is there any major performance difference in these 2 methods?
Use whatever you can; consider benchmarking the approaches with your data if speed is an issue.
I got a ValueError: could not convert string to float
while reading this data :
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data
Can you please tell me where I am going wrong?
You might have some “?” values. Convert them to 0 or nan first.
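For example, a minimal sketch with Pandas, treating "?" as missing on load (the URL is the one from the question above):
# Minimal sketch: read '?' as NaN, then fill (or drop) the missing values.
import pandas
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data'
data = pandas.read_csv(url, header=None, na_values='?')
data = data.fillna(0)  # or data.dropna()
print(data.shape)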
filename = 'C:\Users\user\Desktop\python.data.csv'
raw_data = open(filename, 'rt')
names = ['pixle1', 'pixle2', 'pixle3', 'pixle4', 'pixle5', 'pixle6', 'pixle7', 'pixle8', 'pixle9', 'pixle10', 'pixle11', 'pixle12', 'pixle13', 'pixle14', 'pixle15', 'pixle16', 'pixle17', 'pixle18', 'pixle19', 'pixle20', 'pixle21', 'pixle22', 'pixle23', 'pixle24', 'pixle25', 'pixle26', 'pixle27', 'pixle28', 'pixle29', 'pixle30', 'class']
data = numpy.loadtxt(raw_data, names= names)
Well done!
I have multiple csv files of varying sizes that I want to use for training my neural network. I have around 1000 files ranging from about 15000 to 65000 rows of data. After I preprocess some of this data, one csv may be around 65000 rows by 20 columns array. My computer starts running out of memory very quickly on just 1 of the 65000 by 20 arrays, so I cannot combine all the 1000 files into one large csv file. Is there a way using keras to load one of the csv files, have the model learn on that data, then load the next file, have the file learn on that, and so on? Is there a better way to learn on so much data?
I have a few ideas here:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/how-to-i-work-with-a-very-large-dataset
I have multiple (200) CSV files and label files that contain 200 rows as output. I want to train, but I am unable to load the dataset.
You may have to write some custom code to load each CSV in turn, e.g. in a loop over the files in the directory.
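For example, a minimal sketch of such a loop, assuming the files sit in a hypothetical 'data/' directory and share the same structure:
# Minimal sketch: load each CSV in the directory in turn and stack the rows.
import glob
import pandas
frames = [pandas.read_csv(path, header=None) for path in glob.glob('data/*.csv')]
data = pandas.concat(frames, ignore_index=True)
print(data.shape)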
I got the error:
Traceback (most recent call last):
File "sum.py", line 8, in
data= numpy.array(x).astype(float)
ValueError: setting an array element with a sequence.
why?
It suggests that x is not an array or a list.
Hello,
I have a dataset which contains numbers like this: 3,6e+12, 2.5e-3…
when reading this dataset as a CSV file, I get the error: “Value error: cannot convert string to float”
Any solution please??
The numbers are in scientific notation and will be read correctly.
Perhaps there are other non-number fields in the file?
No, there aren't, and the error says: "cannot convert string to float in 3.6e+12"
thank you
That is surprising, perhaps try a different approach to loading, e.g. numpy or pandas?
Perhaps try posting to stackoverflow?
I’ll try ,
thanks
Sir,
Suppose I have 3 CSV files, each having a particular attribute in it, so that a single row across the 3 CSV files corresponds to a particular feature instance. At loading time, can I load all the CSV files together and convert each row into a numpy array?
thanks
I recommend loading all of the data into memory and then concatenating the numpy arrays (e.g. with hstack).
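For example, a minimal sketch assuming three hypothetical CSV files whose rows line up, one attribute per file:
# Minimal sketch: load each file, then join the columns side by side with hstack.
import numpy
a = numpy.loadtxt('feature1.csv', delimiter=',', ndmin=2)
b = numpy.loadtxt('feature2.csv', delimiter=',', ndmin=2)
c = numpy.loadtxt('feature3.csv', delimiter=',', ndmin=2)
data = numpy.hstack((a, b, c))
print(data.shape)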
If I have a dataset with a .data file extension, how can I deal with it in Python?
please help
Perhaps use a text editor to open it and confirm it is in CSV format, then open it in Python as though it were a CSV file.
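For example, a minimal sketch, assuming the .data file turns out to be comma-separated (the file name is hypothetical):
# Minimal sketch: a '.data' file that is really CSV loads like any other CSV file.
import pandas
data = pandas.read_csv('dataset.data', header=None)
print(data.shape)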
I copy your codes as follows:
# Load CSV using NumPy
# You can load your CSV data using NumPy and the numpy.loadtxt() function.
import numpy
filename = 'pima-indians-diabetes.csv'
raw_data = open(filename, 'rt')
data = numpy.loadtxt(raw_data, delimiter=",")
print(data.shape)
===============
However, I got an error message
ValueError Traceback (most recent call last)
in
5 filename = 'pima-indians-diabetes.csv'
6 raw_data = open(filename, 'rt')
----> 7 data = numpy.loadtxt(raw_data, delimiter=",")
8 print(data.shape)
~\Anaconda3\lib\site-packages\numpy\lib\npyio.py in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack, ndmin, encoding)
1099 # converting the data
1100 X = None
-> 1101 for x in read_data(_loadtxt_chunksize):
1102 if X is None:
1103 X = np.array(x, dtype)
~\Anaconda3\lib\site-packages\numpy\lib\npyio.py in read_data(chunk_size)
1026
1027 # Convert each value according to its column and store
-> 1028 items = [conv(val) for (conv, val) in zip(converters, vals)]
1029
1030 # Then pack it according to the dtype's nesting
~\Anaconda3\lib\site-packages\numpy\lib\npyio.py in (.0)
1026
1027 # Convert each value according to its column and store
-> 1028 items = [conv(val) for (conv, val) in zip(converters, vals)]
1029
1030 # Then pack it according to the dtype's nesting
~\Anaconda3\lib\site-packages\numpy\lib\npyio.py in floatconv(x)
744 if '0x' in x:
745 return float.fromhex(x)
--> 746 return float(x)
747
748 typ = dtype.type
ValueError: could not convert string to float: 'Pregnancies'
========
I do not know what is wrong.
I’m sorry to hear that, I have some suggestions for you here:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
How do I load the dataset from the working directory into Colab?
Sorry, I have not used Colab.
When I click the "update: download from here" link to download the CSV file, it takes me to a white page with numbers on the left side, which look to be the data. How do I get / download this data into a CSV file? Thanks!
Here is the direct link:
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
Thank you!
Hi Jason,
I hope you can help me with the following preprocessed dataset.txt file. How can I load this dataset in Python? It contains a total of 54,256 rows and 28 columns. Can I use Pandas?
[0.08148002361739815, 3.446134970078908e-05, 4.747197881944017e-05, 0.0034219001610305954, 0.047596616392169624, 0.11278174138979659, 0.0011501307441196414, 1.0, 0.09648950774661698, 0.09152382450070766, 0.0032736389720705384, 0.02231715511892242, 0.0, -1.0, 0.0, -1.0, -1.0, -1.0, 0.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, -1.0, 1.0, -1.0]
[0.0816768352686479, 2.929466010613462e-05, 1.2086789450560964e-06, 0.6246987951807229, 0.04743433880824845, 0.11350265074251698, 0.0011614423285977043, 1.0, 0.0965330892767645, 0.0914339631118999, 0.003190342698832632, 0.022268885790504313, 0.0, -1.0, 0.0, -1.0, -1.0, -1.0, 0.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, -1.0, 1.0, -1.0]
[0.08226727022239716, 2.987144231823633e-05, 2.2329338947249727e-06, 0.047448165869218496, 0.04753095407349041, 0.11459941368369171, 0.0011702815567795678, 1.0, 0.0969906953433135, 0.09170354727832318, 0.003358412434012629, 0.022329898179060795, 0.0, -1.0, 0.0, -1.0, -1.0, -1.0, 0.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, -1.0, 1.0, -1.0]
...
You can load it as a dataframe or a numpy array directly.
What problem are you having exactly?
When I try to load it as a numpy array it returns the list again
I am using the following code after loading the dataset.txt file into memory:
import numpy as np
dataset = load_doc('dataset.txt')
x= np.asarray(dataset)
print (x)
Try:
print(type(x))
Thank you so much!
So my last question (hopefully) is that I have the dataset, the labels and a list of 28 titles for the columns. I am trying to load them in python so I can split them and create my training and testing datasets. I am not sure what to do with the titles. Do I need to load them as well?
You can use the column heading as the first line in the CSV file and load them automatically with pandas.
Alternately, you can specify them as the columns in python, if needed.
Or discard them completely.
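For example, a minimal sketch of the second option, assuming the file is plain comma-separated values and using stand-in column titles:
# Minimal sketch: supply the 28 column titles explicitly when loading.
# The file name and the generated titles are placeholders for illustration.
import pandas
titles = ['col%d' % i for i in range(28)]  # replace with your 28 titles
data = pandas.read_csv('dataset.csv', header=None, names=titles)
print(data.shape)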
Hi,
I am new.
Please help me convert an image dataset to CSV.
You don't; instead, you load images as arrays:
https://machinelearningmastery.mystagingwebsite.com/how-to-load-and-manipulate-images-for-deep-learning-in-python-with-pil-pillow/
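For example, a minimal sketch with Pillow and NumPy; the file name is hypothetical:
# Minimal sketch: open an image file and convert it to a NumPy array of pixels.
from PIL import Image
import numpy
image = Image.open('photo.jpg')
pixels = numpy.asarray(image)
print(pixels.shape)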
How can I load data from parser?
from parser import load_data #dataloading
I don’t understand, sorry. Perhaps try posting to stackoverflow?
Hi Jason, the dataset has been removed from the above link. The whole of your book is based on that dataset, so please provide us with the dataset; it would make it easier for us to understand the concepts from your book.
Thank You
I provided an updated link directly in the post, here it is again:
https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv
Sir, please help me.
I just want to know
how to classify categorical images with the SVM and KNN algorithms using Python.
Perhaps start here:
https://machinelearningmastery.mystagingwebsite.com/spot-check-classification-machine-learning-algorithms-python-scikit-learn/
Hello,
Thank you so much for all the great tutorials. I would like to use a multivariate time series dataset, and first I need to put it into a format similar to the load_basic_motion data in Python. I have several text files, each representing one feature, and each file has time series data for each observation. Do you have any suggestions for preparing the data in the required format?
Thanks!
Perhaps this tutorial will provide a useful starting point and can be adapted to your needs:
https://machinelearningmastery.mystagingwebsite.com/how-to-model-human-activity-from-smartphone-data/
Hello,
I successfully loaded my CSV dataset. It's basically a letter dataset, and now I want to train a model in Python with this loaded dataset so that I can use it to recognise words later. Can you help me with this?
thank you
Yes, you can get started with text data in Python here:
https://machinelearningmastery.mystagingwebsite.com/start-here/#nlp
Hi Jason,
One question here: may I know how I can load my non-CSV data (a normal text file instead) in Spyder/Python without converting it to a CSV dataset?
Yes, you can customize the call to the read_csv() function for your dataset.
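For example, a minimal sketch assuming a hypothetical whitespace-separated text file:
# Minimal sketch: read_csv() can parse other delimiters, here any run of whitespace.
import pandas
data = pandas.read_csv('data.txt', sep=r'\s+', header=None)
print(data.shape)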
X = list(map(lambda x: np.array(x), X))
X = list(map(lambda x: x.reshape(1, x.shape[0], x.shape[1]), X))
y = np.expand_dims(y, axis=-1)
I used a TCN model. When I run it, I get this error: Index out of Range. Please help me figure out how to solve this error. I also searched on Stack Overflow but did not find anything.
This is a common question that I answer here:
https://machinelearningmastery.mystagingwebsite.com/faq/single-faq/can-you-read-review-or-debug-my-code
Thanks for this nice article. I want to know: if we have a digit classification problem and the last column contains the class, then how do I load and print the digits, ignoring the last column?
I tried it and it is showing:
ValueError: cannot reshape array of size 257 into shape (16,16)
This tutorial will show you how to load and show image data:
https://machinelearningmastery.mystagingwebsite.com/how-to-load-and-manipulate-images-for-deep-learning-in-python-with-pil-pillow/
Thanks. But the pixels of the image are in CSV format, and the last column of the dataset contains the label, which I want to ignore. The dataset I am using is usps.csv, to classify digits. Thanks in advance.
That is very strange. Typically pixels are stored in an image format.
I’m not sure I have a tutorial that can help directly, you may have to write some custom code to load the CSV and convert it to an appropriate 3d numpy array.
Hi. I got my work done by keeping the data from the CSV in numpy arrays and then slicing the array. However, your tutorials are very nice and helpful. Thanks.
Well done!
Thanks 🙂
You’re welcome.
Dear Jason,
How can I load a .rek dataset in Python? Please comment if possible. Thanks.
I am not familiar with that file type, sorry.
Thanks Jason
You’re welcome.
How do I load an image dataset in Python code?
Perhaps start here:
https://machinelearningmastery.mystagingwebsite.com/how-to-load-and-manipulate-images-for-deep-learning-in-python-with-pil-pillow/
And here:
https://machinelearningmastery.mystagingwebsite.com/how-to-load-convert-and-save-images-with-the-keras-api/
Hi Jason, I am a fresher with no experience. How can I learn data science? Can you suggest a roadmap for me? That would be helpful.
Right here:
https://machinelearningmastery.mystagingwebsite.com/start-here/
Hey Jason,
I actually want to use some specific columns of a CSV file when loading the data into a machine learning model. Can you help me out?
Yes, load the data as normal, then select the columns you want to use, or delete the columns you do not want to use.
If you are new to numpy arrays, this will help:
https://machinelearningmastery.mystagingwebsite.com/gentle-introduction-n-dimensional-arrays-python-numpy/
And this:
https://machinelearningmastery.mystagingwebsite.com/index-slice-reshape-numpy-arrays-machine-learning-python/
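For example, a minimal sketch of selecting specific columns at load time; the file name and column indices are assumptions for illustration:
# Minimal sketch: keep only the columns you want using the usecols argument.
import pandas
data = pandas.read_csv('signals.csv', usecols=[0, 2, 5])
print(data.shape)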
Actually, the dataset I am using has data for two types of signals. I don't want to delete the columns. I want to use the columns of one type of signal in one model and the other type in a second model.
Please do tell me if you can help me out.
Thank you though.
You can use the ColumnTransformer, for an example see this tutorial:
https://machinelearningmastery.mystagingwebsite.com/columntransformer-for-numerical-and-categorical-data/
Hi!! Is it possible to cluster similar rows of a CSV file (2 columns) together using NLP? If yes, could you please point me to a post to help with the code?
Yes, sorry, I don’t have an example of clustering for text data.
If there are 9 variables in the dataset,
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
then when selecting the X array it should be
X = array[:,1:8]
and
Y = array[:,9].
Can you explain why you have used
X = array[:,0:8]
Y = array[:,8]?
Hi… The tutorial is for illustrative purposes. Have you reviewed the contents of the X and Y variables after executing the original code? Please let us know your thoughts.
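For reference, a minimal sketch on a toy array showing why the zero-based slices select the intended columns:
# NumPy indexing is zero-based and slice ends are exclusive, so with 9 columns
# array[:, 0:8] selects the 8 input columns (indices 0 to 7) and array[:, 8]
# selects the 9th column (the class).
import numpy
array = numpy.arange(18).reshape(2, 9)  # toy array with 2 rows and 9 columns
X = array[:, 0:8]
Y = array[:, 8]
print(X.shape, Y.shape)  # (2, 8) (2,)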