How to Load Machine Learning Data From Scratch In Python

You must know how to load data before you can use it to train a machine learning model.

When starting out, it is a good idea to stick with small in-memory datasets using standard file formats like comma-separated values (.csv).

In this tutorial, you will discover how to load your data in Python from scratch, including:

  • How to load a CSV file.
  • How to convert strings from a file to floating point numbers.
  • How to convert class values from a file to integers.

Let’s get started.

  • Update Nov/2016: Added an improved data loading function to skip empty lines.
  • Update Aug/2018: Tested and updated to work with Python 3.6.

Description

Comma Separated Values

The standard file format for small datasets is Comma Separated Values or CSV.

In its simplest form, a CSV file is composed of rows of data. Each row is divided into columns using a comma (“,”).

You can learn more about the CSV file format in RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files.

In this tutorial, we are going to practice loading two different standard machine learning datasets in CSV format.

Pima Indians Diabetes Dataset

The first is the Pima Indians diabetes dataset. It contains 768 rows and 9 columns.

All of the values in the file are numeric, a mix of integer and floating point values that we will treat as floating point. We will learn how to load the file first, then later how to convert the loaded strings to numeric values.

Iris Flower Species Dataset

The second dataset we will work with is the iris flowers dataset.

It contains 150 rows and 5 columns. The first 4 columns are numeric. It is different in that the class value (final column) is a string, indicating a species of flower. We will learn how to convert the numeric columns from strings to numbers and how to convert the flower species string into an integer that we can use consistently.

Tutorial

This tutorial is divided into 3 parts:

  1. Load a file.
  2. Load a file and convert strings to floats.
  3. Load a file and convert strings to integers.

These steps will provide the foundations you need to handle loading your own data.

1. Load CSV File

The first step is to load the CSV file.

We will use the csv module, which is part of the standard library.

The reader() function in the csv module takes an open file as an argument and returns an iterator over its rows, with each row given as a list of strings.

We will create a function called load_csv() to wrap this behavior that will take a filename and return our dataset. We will represent the loaded dataset as a list of lists. The outer list is the list of observations or rows, and each inner list holds the column values for a given row.

Below is the complete function for loading a CSV file.
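A minimal sketch of such a function, assuming the behavior described above (open the file, pass it to reader(), and collect the rows in a list). Opening the file with newline='' follows the recommendation in the csv module documentation and is a choice made for this sketch.

```python
from csv import reader

# Load a CSV file and return it as a list of rows,
# where each row is a list of string values.
def load_csv(filename):
    # newline='' lets the csv module handle line endings itself
    with open(filename, 'r', newline='') as file:
        lines = reader(file)
        dataset = list(lines)
    return dataset
```

Note that every value comes back as a string, and any empty line in the file comes back as an empty list.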

We can test this function by loading the Pima Indians dataset. Download the dataset and place it in the current working directory with the name pima-indians-diabetes.csv. Open the file and delete any empty lines at the bottom.

Taking a peek at the first 5 rows of the raw data file we can see the following:

The data is numeric and separated by commas, and we can expect the rest of the file to follow the same pattern.
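One way to take this kind of peek from Python rather than a text editor is sketched below. The peek() helper is a hypothetical name introduced here for illustration; it simply returns the first few raw lines without parsing them.

```python
from itertools import islice

# Hypothetical helper: return the first n raw lines of a text file.
def peek(filename, n=5):
    with open(filename, 'r') as file:
        # islice stops after n lines, so the whole file is never read
        return [line.rstrip('\n') for line in islice(file, n)]
```

Calling peek('pima-indians-diabetes.csv') once the file is downloaded gives the first 5 lines as strings, which can be printed one per line.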

Let’s use the new function and load the dataset. Once loaded we can report some simple details such as the number of rows and columns loaded.

Putting all of this together, we get the following:
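A sketch of the complete example, assuming the load_csv() function described above. The describe() helper and the check that the file exists are additions for illustration, so the snippet can be run safely before the dataset has been downloaded.

```python
from csv import reader
import os

# Load a CSV file as a list of rows (each row a list of strings)
def load_csv(filename):
    with open(filename, 'r', newline='') as file:
        return list(reader(file))

# Hypothetical helper: load a file and summarize its shape
def describe(filename):
    dataset = load_csv(filename)
    return 'Loaded data file {0} with {1} rows and {2} columns'.format(
        filename, len(dataset), len(dataset[0]))

filename = 'pima-indians-diabetes.csv'
# Only report on the dataset if it is in the current working directory
if os.path.exists(filename):
    print(describe(filename))
```

The column count is taken from the first row only, so it assumes the first row is not empty.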

Running this example reports the number of rows and columns loaded, which for this dataset should be 768 rows and 9 columns.

A limitation of this function is that it will load empty lines from data files and add them to our list of rows. We can overcome this by adding rows of data one at a time to our dataset and skipping empty rows.

Below is the updated example with this new improved version of the load_csv() function.
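A sketch of this improved version, assuming the approach described above: rows are appended one at a time and any empty row is skipped. The csv reader yields an empty list for a blank line, which is what the check relies on.

```python
from csv import reader

# Load a CSV file, skipping any empty lines
def load_csv(filename):
    dataset = list()
    with open(filename, 'r', newline='') as file:
        csv_reader = reader(file)
        for row in csv_reader:
            # a blank line is parsed as an empty list
            if not row:
                continue
            dataset.append(row)
    return dataset
```

With this version, trailing blank lines in the data file no longer need to be deleted by hand.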

Running this example reports the same number of rows and columns, this time whether or not the file contains empty lines.

2. Convert String to Floats

Most, if not all, machine learning algorithms prefer to work with numbers.

Specifically, floating point numbers are preferred.

Our code for loading a CSV file returns a dataset as a list of lists, but each value is a string. We can see this if we print out one record from the dataset:

This produces output like:
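A small sketch that shows the effect without needing the download: it parses a single sample line in the same shape as the Pima Indians file (an illustrative input supplied here, read from memory rather than from disk) and prints the resulting record.

```python
from csv import reader
from io import StringIO

# One sample line in the same comma-separated shape as the data file
sample = StringIO('6,148,72,35,0,33.6,0.627,50,1\n')
dataset = list(reader(sample))

print(dataset[0])
# ['6', '148', '72', '35', '0', '33.6', '0.627', '50', '1']
```

The quotes around each value in the printed list show that the values are still strings.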

We can write a small function to convert specific columns of our loaded dataset to floating point values.

Below is this function called str_column_to_float(). It converts a given column in the dataset to floating point values, taking care to strip any whitespace from each value before making the conversion.
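A sketch of this function, assuming the dataset is the list of lists produced by load_csv(). It modifies the dataset in place and returns nothing.

```python
# Convert one column of string values to floats, in place
def str_column_to_float(dataset, column):
    for row in dataset:
        # strip() removes surrounding whitespace before conversion
        row[column] = float(row[column].strip())
```

For example, with dataset = [[' 1.5 ', 'a'], ['2', 'b']], calling str_column_to_float(dataset, 0) leaves the first column as 1.5 and 2.0 and the second column untouched. A value that is not a valid number raises a ValueError.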

We can test this function by combining it with our load_csv() function above and converting all of the numeric data in the Pima Indians dataset to floating point values.

The complete example is below.
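A sketch of the complete example under the same assumptions as before. The load_numeric_csv() helper and the file-existence check are additions for illustration; the loading and conversion functions follow the descriptions above.

```python
from csv import reader
import os

# Load a CSV file, skipping any empty lines
def load_csv(filename):
    dataset = list()
    with open(filename, 'r', newline='') as file:
        for row in reader(file):
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert one column of string values to floats, in place
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Hypothetical helper: load a file and convert every column to float
def load_numeric_csv(filename):
    dataset = load_csv(filename)
    for i in range(len(dataset[0])):
        str_column_to_float(dataset, i)
    return dataset

filename = 'pima-indians-diabetes.csv'
# Only run against the dataset if it has been downloaded
if os.path.exists(filename):
    print(load_csv(filename)[0])          # first row before conversion
    print(load_numeric_csv(filename)[0])  # first row after conversion
```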

Running this example we see the first row of the dataset printed both before and after the conversion. We can see that the values in each column have been converted from strings to numbers.

3. Convert String to Integers

The iris flowers dataset is like the Pima Indians dataset, in that the columns contain numeric data.

The difference is the final column, traditionally used to hold the outcome or value to be predicted for a given row. The final column in the iris flowers data is the iris flower species as a string.

Download the dataset and place it in the current working directory with the file name iris.csv. Open the file and delete any empty lines at the bottom.

For example, below are the first 5 rows of the raw dataset.

Some machine learning algorithms prefer all values to be numeric, including the outcome or predicted value.

We can convert the class value in the iris flowers dataset to an integer by creating a map.

  1. First, we locate all of the unique class values, which happen to be: Iris-setosa, Iris-versicolor and Iris-virginica.
  2. Next, we assign an integer value to each, such as: 0, 1 and 2.
  3. Finally, we replace all occurrences of class string values with their corresponding integer values.

Below is a function to do just that called str_column_to_int(). Like the previously introduced str_column_to_float() it operates on a single column in the dataset.
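A sketch of such a function, following the three steps listed above. Sorting the unique class values before numbering them is a choice made for this sketch so that the mapping is the same on every run; numbering an unsorted set would also work but the integers could differ between runs.

```python
# Convert one column of string class values to integers, in place,
# and return the mapping from class value to integer
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]
    # step 1: find the unique class values (sorted for a stable order)
    unique = sorted(set(class_values))
    # step 2: assign an integer to each class value
    lookup = dict()
    for i, value in enumerate(unique):
        lookup[value] = i
    # step 3: replace each class string with its integer
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup
```

With the three iris species, this sketch maps Iris-setosa to 0, Iris-versicolor to 1 and Iris-virginica to 2.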

The str_column_to_int() function also returns the dictionary mapping of class values to integer values, in case any users downstream want to convert predictions back to string values again.

We can test this new function together with the previous two functions for loading a CSV file and converting columns to floating point values.

The example below loads the iris dataset, then converts the first 4 columns to floats and the final column to integer values.
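A sketch of this example, treating every column except the last as numeric and the last as the class value. The load_iris() helper, the sorted class order and the file-existence check are assumptions made for illustration.

```python
from csv import reader
import os

# Load a CSV file, skipping any empty lines
def load_csv(filename):
    dataset = list()
    with open(filename, 'r', newline='') as file:
        for row in reader(file):
            if not row:
                continue
            dataset.append(row)
    return dataset

# Convert one column of string values to floats, in place
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# Convert one column of class strings to integers, return the mapping
def str_column_to_int(dataset, column):
    unique = sorted(set(row[column] for row in dataset))
    lookup = {value: i for i, value in enumerate(unique)}
    for row in dataset:
        row[column] = lookup[row[column]]
    return lookup

# Hypothetical helper: numeric columns to float, final column to int
def load_iris(filename):
    dataset = load_csv(filename)
    class_column = len(dataset[0]) - 1
    for i in range(class_column):
        str_column_to_float(dataset, i)
    lookup = str_column_to_int(dataset, class_column)
    return dataset, lookup

filename = 'iris.csv'
# Only run against the dataset if it has been downloaded
if os.path.exists(filename):
    print(load_csv(filename)[0])  # first row before conversion
    dataset, lookup = load_iris(filename)
    print(dataset[0])             # first row after conversion
    print(lookup)                 # class value to integer mapping
```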

Running this example produces the output below.

We can see the first row of the dataset before and after the data type conversions. We can also see the dictionary mapping of class values to integers.

Extensions

You learned how to load CSV files and perform basic data conversions.

Data loading can be a difficult task given the variety of data cleaning and conversion that may be required from problem to problem.

There are many extensions you could add to make these examples more robust to new and different data files. Below are just a few ideas you can consider researching and implementing yourself:

  • Detect and remove empty lines at the top or bottom of the file.
  • Detect and handle missing values in a column.
  • Detect and handle rows that do not match expectations for the rest of the file.
  • Support for other delimiters such as “|” (pipe) or white space.
  • Support more efficient data structures such as arrays.
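As one illustration of these ideas, the sketch below adds support for other delimiters by passing the delimiter keyword argument through to the csv module's reader(). The extra load_csv() parameter is an assumption made for this sketch, not part of the function used earlier in the tutorial.

```python
from csv import reader

# Load a CSV-like file with a configurable single-character delimiter
def load_csv(filename, delimiter=','):
    dataset = list()
    with open(filename, 'r', newline='') as file:
        for row in reader(file, delimiter=delimiter):
            # skip empty lines, as before
            if not row:
                continue
            dataset.append(row)
    return dataset
```

A semicolon-separated file could then be loaded with load_csv(filename, delimiter=';') and a pipe-separated file with delimiter='|'. Values keep any surrounding spaces, so stripping whitespace during conversion still matters.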

Two libraries you may wish to use in practice for loading CSV data are NumPy and Pandas.

NumPy offers the loadtxt() function for loading data files as NumPy arrays. Pandas offers the read_csv() function, which provides a lot of flexibility regarding data types, file headers and more.

Review

In this tutorial, you discovered how you can load your machine learning data from scratch in Python.

Specifically, you learned:

  • How to load a CSV file into memory.
  • How to convert string values to floating point values.
  • How to convert a string class value into an integer encoding.

Do you have any questions about loading machine learning data or about this post?
Ask your question in the comments and I will do my best to answer.

33 Responses to How to Load Machine Learning Data From Scratch In Python

  1. SalemAmeen October 12, 2016 at 11:34 am

    Many thanks, could you please show us the best way to save a NumPy array and pandas data as a CSV file?

    • Jason Brownlee October 13, 2016 at 8:33 am

      Sorry, I do not have an example of saving data to a CSV.

  2. Harold October 18, 2016 at 2:06 pm

    Could you show us how to load our own image data to replace the mnist data for convolutional neural network in your Deep Learning with Python?

    • Jason Brownlee October 19, 2016 at 9:14 am

      Hi Harold, great question. I will make time to prepare an example soon.

  3. Matt Jang March 20, 2017 at 12:13 pm

    Thank you so much

    • Matt Jang March 20, 2017 at 3:38 pm

      Jason, I have a question. In the case that the dataset is not nicely organized into columns and rows (like how Iris, Pima D.S. are) but rather a random dump of strings, how can I convert the dataset so that the machine learning algorithms can recognize it? For example, I am trying to use machine learning algorithms to classify different malware log files. However, the log file has bunch of strings, symbols, as well as numbers. They are usually a haphazard collection of random queries (strings) that cannot be organized into columns, such as “sepal-length” or “class”, like above. Do you have any recommendations for me? I don’t know who else to ask..Thank you for your time.

    • Jason Brownlee March 21, 2017 at 8:36 am

      You’re welcome Matt.

  4. Ray T April 20, 2017 at 12:42 am

    Hi Jason
    thanks for the examples:

    could you help me with this?
    when I run this section:
    # convert string columns to float
    for i in range(4):
        str_column_to_float(dataset, i)
    # convert class column to int
    lookup = str_column_to_int(dataset, 4)
    print(dataset[0])
    print(lookup)

    I get this error:

    IndexError Traceback (most recent call last)
    in ()
    1 # convert string columns to float
    2 for i in range(4):
    —-> 3 str_column_to_float(dataset, i)
    4 # convert class column to int
    5 lookup = str_column_to_int(dataset, 4)

    in str_column_to_float(dataset, column)
    2 def str_column_to_float(dataset, column):
    3 for row in dataset:
    —-> 4 row[column] = float(row[column].strip())

    IndexError: list index out of range

    How do i fix that?
    Thanks

    • Jason Brownlee April 20, 2017 at 9:28 am

      Are you using Python 2.7?

      • Ray April 21, 2017 at 1:12 am

        No I’m using Python 3.5

        • Jason Brownlee April 21, 2017 at 8:38 am

          The example was developed for Python 2.7, I hope to update it for Python 3 soon.

          • Ray April 22, 2017 at 2:12 am

            Thanks
            I’ll keep trying to figure it out.

  5. RATNA NITIN PATIL July 25, 2017 at 2:52 am

    Hi Jason,
    I have defined a load_csv function And typed following statements,

    filename = ‘pima-indians-diabetes.csv’
    dataset = load_csv(filename)

    I get the following error, Please help me.

    Syntax Error: invalid syntax

    • Jason Brownlee July 25, 2017 at 9:46 am

      It sounds like a problem with your code.

      Perhaps you have extra spaces?

      Try running the code directly on the Python interpreter.

  6. gauri September 29, 2017 at 4:43 pm

    python2 import1.py
    [‘5.1’, ‘3.5’, ‘1.4’, ‘0.2’, ‘Iris-setosa’]
    Traceback (most recent call last):
    File “import1.py”, line 32, in
    str_column_to_float(dataset, i)
    File “import1.py”, line 12, in str_column_to_float
    row[column] = float(row[column].strip())
    IndexError: list index out of range

    why am I getting this error even though I’m using python2?

    • Jason Brownlee September 30, 2017 at 7:37 am

      Perhaps double check that you copied all of the code without error?

  7. Anna November 16, 2017 at 4:51 pm

    hi Jason,

    Thanks for your awesome post! Never fail to amaze me 🙂

    Anyway, I’m having trouble converting my dataset that contained numeric and string values.
    My dataset looks like this:

    Timestamp ID Length Data
    0.0000022 02a1 8 05 20 ea 0a 20 1a 00 7f

    This is unsupervised learning and planning to adopt LSTM sequence classification in detecting anomalies in the dataset. No target variable. So when I want to convert them into floats, I got this error:

    “ValueError: could not convert string to float: ’05 20 ea 0a 20 1a 00 7f'”

    Is it because of spaces between them? Even the ‘ID’ rows were not converted to floats. I’m using Keras, my script looks like this:

    >>> dataset = data.values
    >>> dataset = dataset.astype(
    >>> dataset = dataset.astype(‘float32’)

    Thank you, I’m newbie new in machine learning. Looking forward for your reply Jason.

    • Jason Brownlee November 17, 2017 at 9:22 am

      You will need to encode the text to numbers. You can use an integer encoder and/or a one hot encoder for label data or a bag of words or word embedding for real text data.

      I have examples of all of these on the blog, try the search as a first step.

      • Anna November 17, 2017 at 2:39 pm

        Thank you Jason for your quick response. I’ve found several of your topics in this blog concerning converting into floats stuff like you said. I will let you know once it’s working.

        Thanks!

  8. Milan Modak December 27, 2017 at 3:32 am

    Hi Jason,

    while running this code I’m getting an error

    row[column] = float(row[column].strip())
    ValueError: could not convert string to float: ‘7;0.27;0.36;20.7;0.045;45;170;1.001;3;0.45;8.8;6’

    Could you please let me know the exact problem and how I can rectify the issue.

    • Jason Brownlee December 27, 2017 at 5:21 am

      Looks like you have a semicolon separator instead of the expected comma.

      Perhaps double check you have the right data file?

  9. Nikhil Woodruff April 26, 2018 at 12:16 am

    By the way, when printing result of the csv read in, the command should be:

    print(‘Loaded data file {0} with {1} rows and {2} columns’.format(filename, len(dataset), len(dataset[0])))

    , not:

    print(‘Loaded data file {0} with {1} rows and {2} columns’).format(filename, len(dataset), len(dataset[0]))

    The .format() applies to the string, not the print() function.

  10. Akhil Menon May 6, 2018 at 10:47 am

    I did as follows and converted my data from string. But then when I apply other functions mentioned in the first tutorial page such as dataset.describe and dataset.head it says list is object and cannot be called.
    How do I go about this?

    Thanks

  11. Rutger March 10, 2019 at 8:37 pm

    Hello Jason,

    Great tutorial!
    When I run the script convert strings to float I get the error message:
    Error: iterator should return strings, not bytes (did you open the file in text mode?)
    Am I doing something incorrect?
    Hope to hear from you, thanks!

    Regards, Rutger

  12. Sofia May 5, 2021 at 6:53 pm

    Hello Jason!

    after this conversion (e.g. str to int), is the data ready for spot check algorithms etc etc or not?

    Do i have to do anything else before or after the conversion? like OneHot encoding..

    Note that I have a dataset similar with iris dataset, but with one integer column and three string columns.

    thank you in advance!

    • Jason Brownlee May 6, 2021 at 5:42 am

      It depends on the data. E.g. some data may require you to encode categorical variables first, and you may need to spot-check data preparation methods in addition to algorithms.

  13. Faiy V. November 23, 2021 at 2:09 am

    Hello to everyone!!

    I have a dataset with 3991 rows and 8 Columns with different data types as seen below:

    15; 1215; FALSE; feed; 1; TRUE; TRUE; monument; attraction (CSV example)

    and when I run the code to convert them, I am getting this:

    Loaded data file INSTA POSTS.csv with 3991 rows and 1 columns.

    Why does it display only 1 column while it has 8 columns?

    I used your code several times and it worked great on similar datasets!

    I don’t understand what happens now!

    Any idea?

    thank you in advance !

    Faiy

    • Adrian Tam November 23, 2021 at 1:38 pm

      CSV stands for comma separated values – you should not use semicolon
