How To Load Machine Learning Data in Python

You must be able to load your data before you can start your machine learning project.

The most common format for machine learning data is CSV files. There are a number of ways to load a CSV file in Python.

In this post you will discover the different ways you can load your machine learning data in Python.

Let’s get started.

  • Update March/2017: Changed loading from binary (‘rb’) to text (‘rt’) mode.
  • Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.
  • Update March/2018: Updated NumPy load from URL example to work with Python 3.

Photo by Ann Larie Valentine, some rights reserved.

Considerations When Loading CSV Data

There are a number of considerations when loading your machine learning data from CSV files.

For reference, you can learn a lot about the expectations for CSV files by reviewing RFC 4180, the Request for Comments titled Common Format and MIME Type for Comma-Separated Values (CSV) Files.

CSV File Header

Does your data have a file header?

If so, this can help in automatically assigning names to each column of data. If not, you may need to name your attributes manually.

Either way, you should explicitly specify whether or not your CSV file has a file header when loading your data.
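
As a minimal sketch of stating the header explicitly, the snippet below uses pandas.read_csv() on two small in-memory files. The sample rows and the column names (age, mass, label) are made up for illustration and are not taken from any dataset in this post.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data: the same two rows with and without a header line.
with_header = "age,mass,label\n50,33.6,1\n31,26.6,0\n"
without_header = "50,33.6,1\n31,26.6,0\n"

# header=0 tells pandas that the first line holds the column names.
df_named = pd.read_csv(StringIO(with_header), header=0)

# header=None says there is no header line, so the names are supplied manually.
df_manual = pd.read_csv(
    StringIO(without_header), header=None, names=["age", "mass", "label"]
)

print(df_named.shape, list(df_named.columns))
print(df_manual.shape, list(df_manual.columns))
```

Both calls should produce the same two-row, three-column table. The only difference is where the column names come from.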

Comments

Does your data have comments?

By convention, comments in a CSV file are indicated by a hash (“#”) at the start of a line, although the CSV specification itself does not define comments.

If you have comments in your file, depending on the method used to load your data, you may need to indicate whether or not to expect comments and the character to expect to signify a comment line.
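
Here is a small sketch of how that can look with numpy.loadtxt(), which takes a comments argument. The sample text is invented for illustration.

```python
from io import StringIO

import numpy as np

# Hypothetical sample data with two comment lines mixed in with the records.
text = "# glucose,mass\n148,33.6\n# a note between records\n85,26.6\n"

# comments="#" is already the loadtxt default; it is spelled out here to
# make the expectation explicit. Lines starting with "#" are skipped.
data = np.loadtxt(StringIO(text), delimiter=",", comments="#")
print(data.shape)
```

Only the two data records are kept. pandas.read_csv() offers a comparable comment argument if you load with Pandas instead.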

Delimiter

The standard delimiter that separates field values is the comma (“,”) character.

Your file could use a different delimiter like tab (“\t”) in which case you must specify it explicitly.
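
For instance, a tab-separated file could be read with the standard library roughly as follows. This is a sketch using made-up sample rows held in memory.

```python
import csv
from io import StringIO

# Hypothetical tab-separated sample data.
text = "148\t33.6\t1\n85\t26.6\t0\n"

# delimiter="\t" overrides the default comma.
rows = list(csv.reader(StringIO(text), delimiter="\t"))
print(rows)
```

Each row comes back as a list of strings, split on tabs rather than commas.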

Quotes

Sometimes field values can have spaces. In these CSV files the values are often quoted.

The default quote character is the double quotation mark (“"”). Other characters can be used, and you must specify the quote character used in your file.
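
A quick sketch of a non-default quote character, again with invented sample rows: the values below are wrapped in single quotes, and one of them contains a comma that must not be treated as a delimiter.

```python
import csv
from io import StringIO

# Hypothetical sample data quoted with single quotes; one value holds a comma.
text = "'San Jose, CA',148\n'New York',85\n"

# quotechar="'" replaces the default double quotation mark.
rows = list(csv.reader(StringIO(text), quotechar="'"))
print(rows)
```

With the quote character set, the comma inside the first value stays part of the field.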

Machine Learning Data Loading Recipes

Each recipe is standalone.

This means that you can copy and paste it into your project and use it immediately.

If you have any questions about these recipes or suggested improvements, please leave a comment and I will do my best to answer.

Load CSV with Python Standard Library

The Python standard library provides the csv module and its reader() function, which can be used to load CSV files.

Once loaded, you can convert the CSV data to a NumPy array and use it for machine learning.

For example, you can download the Pima Indians dataset into your local directory (update: download from here). All fields are numeric and there is no header line. Running the recipe below will load the CSV file and convert it to a NumPy array.
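
A minimal sketch of such a recipe is shown here. To keep it runnable anywhere, it first writes a three-record stand-in file in the same nine-column layout; the helper name load_csv() and the stand-in file name are my own choices for illustration. Point load_csv() at your downloaded pima-indians-diabetes.data.csv to load the real data.

```python
import csv
import os
import tempfile

import numpy as np


def load_csv(filename):
    """Load an all-numeric, header-less CSV file into a NumPy array."""
    # newline="" is what the csv module documentation recommends for open().
    with open(filename, "rt", newline="") as raw_data:
        reader = csv.reader(raw_data, delimiter=",", quoting=csv.QUOTE_NONE)
        rows = list(reader)
    # Every value is read as a string, so convert the whole array to float.
    return np.array(rows).astype(float)


# Stand-in file with three records so the sketch runs without a download.
sample = (
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)
path = os.path.join(tempfile.mkdtemp(), "pima-sample.csv")
with open(path, "wt") as handle:
    handle.write(sample)

data = load_csv(path)
print(data.shape)
```

The stand-in file gives an array of 3 rows by 9 columns. The full dataset has the same 9 columns and many more rows.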

The example loads an object that can iterate over each row of the data and can easily be converted into a NumPy array. Running the example prints the shape of the array.

For more information on the csv.reader() function, see CSV File Reading and Writing in the Python API documentation.

Load CSV File With NumPy

You can load your CSV data using NumPy and the numpy.loadtxt() function.

This function assumes no header row and all data has the same format. The example below assumes that the file pima-indians-diabetes.data.csv is in your current working directory.
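
As a sketch, the call can be as short as the one below. The snippet writes a small stand-in file first so that it runs without the download; the stand-in file name is an assumption for illustration. Replace path with pima-indians-diabetes.data.csv to load the real file.

```python
import os
import tempfile

import numpy as np

# Stand-in file with three records in the same nine-column layout.
sample = (
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)
path = os.path.join(tempfile.mkdtemp(), "pima-sample.csv")
with open(path, "wt") as handle:
    handle.write(sample)

# loadtxt parses every field as a float by default.
data = np.loadtxt(path, delimiter=",")
print(type(data).__name__, data.shape)
```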

Running the example will load the file as a numpy.ndarray and print the shape of the data. For the full Pima Indians dataset this is (768, 9).

This example can be modified to load the same dataset directly from a URL as follows:

Note: This example assumes you are using Python 3.
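
One way to sketch this in Python 3 is with urllib.request.urlopen(). The helper name load_from_url() is hypothetical, and because the dataset URL is not reproduced here, the snippet demonstrates the call on a local stand-in file addressed by a file:// URL. Pass the dataset's real URL instead when you use it.

```python
import tempfile
from io import StringIO
from pathlib import Path
from urllib.request import urlopen

import numpy as np


def load_from_url(url):
    """Download an all-numeric CSV file and parse it with numpy.loadtxt()."""
    with urlopen(url) as response:
        # The response body is bytes in Python 3, so decode it to text first.
        text = response.read().decode("utf-8")
    return np.loadtxt(StringIO(text), delimiter=",")


# Stand-in for a real download: a local sample file reached via a file:// URL.
sample_path = Path(tempfile.mkdtemp()) / "pima-sample.csv"
sample_path.write_text(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)

data = load_from_url(sample_path.as_uri())
print(data.shape)
```

Decoding the bytes before parsing is a deliberate choice in this sketch, since it avoids relying on how a particular NumPy version treats binary file objects.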

Again, running the example produces the same resulting shape of the data.

For more information on the numpy.loadtxt() function see the API documentation (version 1.10 of numpy).

Load CSV File With Pandas

You can load your CSV data using Pandas and the pandas.read_csv() function.

This function is very flexible and is my recommended approach for loading your machine learning data. The function returns a pandas.DataFrame that you can immediately start summarizing and plotting.

The example below assumes that the ‘pima-indians-diabetes.data.csv’ file is in the current working directory.
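
A sketch of the Pandas version follows. The short column names are assumed abbreviations for the nine attributes, and the snippet writes a three-record stand-in file so that it runs without the download. Swap path for the real file name to load the full dataset.

```python
import os
import tempfile

import pandas as pd

# Assumed short names for the nine columns; the file itself has no header.
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Stand-in file with three records in the same nine-column layout.
sample = (
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)
path = os.path.join(tempfile.mkdtemp(), "pima-sample.csv")
with open(path, "wt") as handle:
    handle.write(sample)

# Passing names with no header argument makes pandas treat line 1 as data.
data = pd.read_csv(path, names=names)
print(data.shape)
```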

Note that in this example we explicitly specify the name of each attribute for the DataFrame. Running the example displays the shape of the data.

We can also modify this example to load CSV data directly from a URL.
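
As a rough sketch, pandas.read_csv() accepts a URL string in place of a file name. The dataset URL is not reproduced here, so the snippet uses a file:// URL pointing at a local stand-in file; the column names are the same assumed abbreviations as before.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumed short names for the nine columns.
names = ["preg", "plas", "pres", "skin", "test", "mass", "pedi", "age", "class"]

# Stand-in for a remote file: a local sample addressed by a file:// URL.
sample_path = Path(tempfile.mkdtemp()) / "pima-sample.csv"
sample_path.write_text(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
)
url = sample_path.as_uri()

# read_csv fetches the URL and parses the content in one call.
data = pd.read_csv(url, names=names)
print(data.shape)
```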

Again, running the example downloads the CSV file, parses it and displays the shape of the loaded DataFrame.

To learn more about the pandas.read_csv() function you can refer to the API documentation.

Summary

In this post you discovered how to load your machine learning data in Python.

You learned three specific techniques that you can use:

  • Load CSV with Python Standard Library.
  • Load CSV File With NumPy.
  • Load CSV File With Pandas.

Your action step for this post is to type or copy-and-paste each recipe and get familiar with the different ways that you can load machine learning data in Python.

Do you have any questions about loading machine learning data in Python or about this post? Ask your question in the comments and I will do my best to answer it.

41 Responses to How To Load Machine Learning Data in Python

  1. ML704 January 17, 2017 at 7:17 pm #

    Hi!
    What is meant here in section Load CSV with Python Standard Library. You can download the Pima Indians dataset into your local directory.
    Where is my local directory?
    I tried several ways, but it did not work

    • Jason Brownlee January 18, 2017 at 10:13 am #

      It means to download the CSV file to the directory where you are writing Python code. Your project’s current working directory.

      • ML704 January 18, 2017 at 2:56 pm #

        Thank you, I got it now!

  2. ruby July 17, 2017 at 2:19 pm #

    hi
    how can load video dataset in python?? without tensorflow, keras, …

  3. constantine July 30, 2017 at 4:23 am #

    Hello,

    I want to keep from a CSV file only two columns and use these numbers, as x-y points, for a k-means implementation that I am doing.

    What I do now to generate my points is this:
    ” points = np.vstack(((np.random.randn(150, 2) * 0.75 + np.array([1, 0])),
    (np.random.randn(50, 2) * 0.25 + np.array([-0.5, 0.5])),
    (np.random.randn(50, 2) * 0.5 + np.array([-0.5, -0.5])))) “,
    but I want to apply my code on actual data.

    Any help?

    • Jason Brownlee July 30, 2017 at 7:52 am #

      Sorry, I don’t have any kmeans tutorials in Python. I may not be the best person to give you advice.

      • constantine July 30, 2017 at 7:51 pm #

        I don’t want anything about k-means, I have the code -computations and all- sorted out. I just want some help with the CSV files.

  4. Steve August 3, 2017 at 11:54 am #

    Thank you for explaining how to load data in detail.

  5. Fawad August 8, 2017 at 6:20 pm #

    Thanks you very much…really helpful…

  6. komal September 5, 2017 at 7:18 pm #

    how to load text attribute ? I got error saying could not convert string to float: b’Iris-setosa’

    • Jason Brownlee September 7, 2017 at 12:43 pm #

      You will need to load the data using Pandas then convert it to numbers.

      I give examples of this.

  7. R October 10, 2017 at 3:21 am #

    I was just wondering what the best practices are for converting something in a Relational Database model to an optimal ML format for fields that could be redundant. Ideally the export would be in CSV, but I know it won’t be as simple as an export every time. Hopefully simple example to illustrate my question: Say I have a table where I attribute things to an animal. The structure could be set up similarly to this:
    ID, Animal, Color,Continent
    1,Zebra,Black,Africa
    2,Zebra,White,Africa
    With the goal of being able to say “If the color is black and white and lives in Africa, it’s probably a zebra.” …so each line represents the animal with a single color associated with it, and other fields as well. Would this type of format be a best practice to feed into the model as is? Or, would it make more sense to concatenate the colors into one line with a delimiter? In other words, it may not always be a 1:1 relationship, and in cases where the dataset is like that, what’s the best way of formatting?
    Thanks for your time.

  8. Hemalatha S November 17, 2017 at 6:52 pm #

    can you tell me how to select features from a csv file

  9. Disha Umarwani November 28, 2017 at 12:41 pm #

    Hey,
    I am trying to load a line separated data.
    name:disha
    gender:female
    majors:computer science

    name:
    gender:
    majors:

    Any advice on this?

    • Jason Brownlee November 29, 2017 at 8:13 am #

      Ouch, looks like you might need to write some custom code to load each “line” or entity.

  10. Hemalatha S December 1, 2017 at 2:17 am #

    can you tell me how to load a csv file and apply feature selection methods?? can you post code for grey wolf optimizer algorithm??

  11. fxdingscxr January 17, 2018 at 4:42 pm #

    I have loaded the data into numpy array. What is the next thing that i should do to train my model?

  12. Ajinkya January 30, 2018 at 6:29 pm #

    Hey,
    I want to use KDD cup 99 dataset for the intrusion detection project. The dataset consist of String & numerical data. So should I convert entire dataset into numeric data or should I use it as it is?

  13. Bipin February 2, 2018 at 5:11 pm #

    Hey Jason,
    I have a dataset in csv which has header and all the columns have different datatype,
    which one would be better to use in this scenario: loadtxt() or genfromtxt().
    Also, is there any major performance difference in these 2 methods?

    • Jason Brownlee February 3, 2018 at 8:34 am #

      Use whatever you can, consider benchmarking the approaches with your data if speed is an issue.

  14. ML Beginer February 15, 2018 at 3:41 pm #

    I got a ValueError: could not convert string to float
    while reading this data :

    http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data

    Can you please reply where I am doing wrong?

    • Jason Brownlee February 16, 2018 at 8:31 am #

      You might have some “?” values. Convert them to 0 or nan first.

  15. ro May 8, 2018 at 4:25 am #

    filename = ‘C:\Users\user\Desktop\python.data.csv’
    raw_data = open(filename, ‘rt’)
    names = [‘pixle1’, ‘pixle2’, ‘pixle3’, ‘pixle4’, ‘pixle5’, ‘pixle6’, ‘pixle7’, ‘pixle8’, ‘pixle9’, ‘pixle10’, ‘pixle11’, ‘pixle12’, ‘pixle13’, ‘pixle14’, ‘pixle15’, ‘pixle16’, ‘pixle17’, ‘pixle18’, ‘pixle19’, ‘pixle20’, ‘pixle21’, ‘pixle22’, ‘pixle23’, ‘pixle24’, ‘pixle25’, ‘pixle26’, ‘pixle27’, ‘pixle28’, ‘pixle29’, ‘pixle30’, ‘class’]
    data = numpy.loadtxt(raw_data, names= names)

  16. AJS June 1, 2018 at 1:22 pm #

    I have multiple csv files of varying sizes that I want to use for training my neural network. I have around 1000 files ranging from about 15000 to 65000 rows of data. After I preprocess some of this data, one csv may be around 65000 rows by 20 columns array. My computer starts running out of memory very quickly on just 1 of the 65000 by 20 arrays, so I cannot combine all the 1000 files into one large csv file. Is there a way using keras to load one of the csv files, have the model learn on that data, then load the next file, have the file learn on that, and so on? Is there a better way to learn on so much data?

  17. Hemant June 17, 2018 at 2:32 pm #

    I have multiple 200 CSV files and labels files that contains 200 rows as output. I want to train, but unable to load the dataset

    • Jason Brownlee June 18, 2018 at 6:39 am #

      You may have to write some custom code to load each CSV in turn. E.g. in a loop over the files in the directory.

  18. Aman July 12, 2018 at 4:10 am #

    I got the error:

    Traceback (most recent call last):
    File “sum.py”, line 8, in <module>
    data= numpy.array(x).astype(float)
    ValueError: setting an array element with a sequence.

    why?
