7 Ways to Handle Large Data Files for Machine Learning

Exploring and applying machine learning algorithms to datasets that are too large to fit into memory is pretty common.

This leads to questions like:

  • How do I load my multiple gigabyte data file?
  • Algorithms crash when I try to run my dataset; what should I do?
  • Can you help me with out-of-memory errors?

In this post, I want to offer some common suggestions you may want to consider.

Photo by Gareth Thompson, some rights reserved.

1. Allocate More Memory

Some machine learning tools or libraries may be limited by a default memory configuration.

Check if you can re-configure your tool or library to allocate more memory.

A good example is Weka, where you can increase the memory as a parameter when starting the application.

2. Work with a Smaller Sample

Are you sure you need to work with all of the data?

Take a smaller sample of your data, such as the first 1,000 or 100,000 rows, or a random subset of rows. Use this smaller sample to work through your problem before fitting a final model on all of your data (using progressive data loading techniques).
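
For example, with Pandas you might load only the first rows of the file, or keep a random fraction of rows while reading. This is only a sketch; the file name data.csv is a placeholder:

    import random
    import pandas as pd

    # Option 1: load only the first 100,000 rows.
    sample = pd.read_csv("data.csv", nrows=100000)

    # Option 2: keep a random ~1% of rows while reading (the header row is kept).
    sample = pd.read_csv("data.csv", header=0,
                         skiprows=lambda i: i > 0 and random.random() > 0.01)
    print(sample.shape)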

I think this is a good practice in general for machine learning to give you quick spot-checks of algorithms and turnaround of results.

You may also consider performing a sensitivity analysis of the amount of data used to fit one algorithm compared to the model skill. Perhaps there is a natural point of diminishing returns that you can use as a heuristic size of your smaller sample.
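
As a rough sketch of such a sensitivity analysis (using a synthetic dataset and an arbitrary iterative model purely for illustration), you could fit the same algorithm on increasing amounts of data and watch how the test score changes:

    from sklearn.datasets import make_classification
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # A synthetic stand-in for a large dataset.
    X, y = make_classification(n_samples=200000, n_features=20, random_state=1)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

    # Fit on progressively larger samples and watch for diminishing returns.
    for n in [1000, 10000, 50000, 100000]:
        model = SGDClassifier(random_state=1)
        model.fit(X_train[:n], y_train[:n])
        print(n, accuracy_score(y_test, model.predict(X_test)))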

3. Use a Computer with More Memory

Do you have to work on your computer?

Perhaps you can get access to a much larger computer with an order of magnitude more memory.

For example, a good option is to rent compute time on a cloud service like Amazon Web Services, which offers machines with tens of gigabytes of RAM for less than a US dollar per hour.

I have found this approach very useful in the past.

4. Change the Data Format

Is your data stored in raw ASCII text, like a CSV file?

Perhaps you can speed up data loading and use less memory by using another data format. A good example is a binary format like GRIB, NetCDF, or HDF.

There are many command line tools that you can use to transform one data format into another that do not require the entire dataset to be loaded into memory.

Using another format may allow you to store the data in a more compact form that saves memory, such as 2-byte integers or 4-byte floats.
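
As a minimal sketch (the file names are placeholders, and pandas needs the PyTables package installed for HDF5 support), you could convert a CSV to HDF5 and downcast the floating point columns along the way:

    import pandas as pd

    df = pd.read_csv("data.csv")

    # Store 8-byte floats as 4-byte floats to halve their memory footprint.
    float_cols = df.select_dtypes(include=["float64"]).columns
    df[float_cols] = df[float_cols].astype("float32")

    df.to_hdf("data.h5", key="data", mode="w")

    # Later, reloading the binary file avoids re-parsing the CSV text.
    df = pd.read_hdf("data.h5", key="data")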

5. Stream Data or Use Progressive Loading

Does all of the data need to be in memory at the same time?

Perhaps you can use code or a library to stream or progressively load data as-needed into memory for training.

This may require algorithms that can learn iteratively using optimization techniques such as stochastic gradient descent, rather than algorithms that require all of the data in memory to perform matrix operations, such as some implementations of linear and logistic regression.

For example, the Keras deep learning library offers this feature for progressively loading image files via the flow_from_directory method on its image data generator.
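
A minimal sketch of this might look like the following, assuming the images are organized into one sub-directory per class under data/train/ (the directory layout, image size, and the commented-out training call are illustrative only):

    from keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(rescale=1.0 / 255)
    train_gen = datagen.flow_from_directory(
        "data/train/",
        target_size=(224, 224),
        batch_size=32,
        class_mode="categorical")

    # Batches of images are read from disk only as they are requested, e.g.:
    # model.fit_generator(train_gen, steps_per_epoch=len(train_gen), epochs=10)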

Another example is the Pandas library, which can load large CSV files in chunks.
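
Combined with an algorithm that supports iterative learning, a sketch might look like this (the file name, chunk size, target column name, and class labels are all assumptions):

    import pandas as pd
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier()
    for chunk in pd.read_csv("large_file.csv", chunksize=100000):
        X = chunk.drop(columns=["target"]).values
        y = chunk["target"].values
        # The full set of class labels must be declared for partial_fit.
        model.partial_fit(X, y, classes=[0, 1])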

6. Use a Relational Database

Relational databases provide a standard way of storing and accessing very large datasets.

Internally, the data is stored on disk, can be progressively loaded in batches, and can be queried using a standard query language (SQL).

Free open source database tools like MySQL or Postgres can be used, and most (all?) programming languages and many machine learning tools can connect directly to relational databases. You can also use a lightweight approach, such as SQLite.
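
For example, using the SQLite module from the Python standard library, you could pull rows back in batches rather than all at once (the database file and table name here are made up for illustration):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("data.db")
    # Iterate over the result set in batches of 50,000 rows.
    for batch in pd.read_sql_query("SELECT * FROM observations", conn, chunksize=50000):
        print(batch.shape)  # fit or update a model on each batch here
    conn.close()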

I have found this approach to be very effective in the past for very large tabular datasets.

Again, you may need to use algorithms that can handle iterative learning.

7. Use a Big Data Platform

In some cases, you may need to resort to a big data platform.

That is, a platform designed for handling very large datasets that allows you to run data transforms and machine learning algorithms on top of it.

Two good examples are Hadoop with the Mahout machine learning library and Spark with the MLlib library.
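
As a rough sketch with PySpark's DataFrame-based MLlib API (the file path and column names are hypothetical, and a working Spark installation is assumed):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("large-data").getOrCreate()
    df = spark.read.csv("hdfs:///data/large_file.csv", header=True, inferSchema=True)

    # Assemble the input columns into the single vector column MLlib expects.
    assembler = VectorAssembler(inputCols=["x1", "x2", "x3"], outputCol="features")
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(assembler.transform(df))
    spark.stop()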

I do believe that this is a last resort when you have exhausted the above options, if only for the additional hardware and software complexity this brings to your machine learning project.

Nevertheless, there are problems where the data is very large and the previous options will not cut it.

Summary

In this post, you discovered a number of tactics that you can use when dealing with very large data files for machine learning.

Are there other methods that you know about or have tried?
Share them in the comments below.

Have you tried any of these methods?
Let me know in the comments.

54 Responses to 7 Ways to Handle Large Data Files for Machine Learning

  1. Chris May 29, 2017 at 6:59 pm #

    If the raw data is separated by line breaks, such as CSV, EDIFACT, etc., then there is a feature in almost every language I am aware of that will read only one line at a time using a stream. That is typically how any (buzzword alert) big data solution does it under the hood; there is nothing magic, hard, or revolutionary about it, and you’ll find pretty much any simple GitHub repo doing it if it reads files.
    Any beginner coder should encounter this, and universities should absolutely be teaching such a basic concept in any computer science related degree where you are required to read from a file.

    Just thought I’d shed some light on this fact: the 7 ways are actually 7 things that, if you see them as an example in blog posts, should make you immediately leave the site and never return.

    • Jason Brownlee June 2, 2017 at 12:24 pm #

      Thanks for the input Chris, there are a lot of different types of machine learning practitioners out there.

    • MicrobicTiger June 2, 2017 at 1:24 pm #

      Hi Chris,

      What if your data were geographic points with values, each line represented a different point and you were looking to recognize patterns across clusters of points with varying cluster geometries? How would line by line source data reading help me here?

    • David Severson March 10, 2019 at 1:32 pm #

      Sorry for the late reply. You are correct that files can be processed one line at a time. However, various algos need to reference almost any other piece of data in the set, or maybe massive pieces of intermediate data created in the process. As a result, it is much more difficult to reduce those algos’ memory requirements. Off the top of my head, they do this in code by using some of these techniques:

      Sometimes multiple passes through the data on disk
      Sometimes the intermediate data is small enough to squeeze into memory
      Sometimes none of the above are possible and the algo must go to disk to reference a data element, so they put that data in some sort of indexed structure, like HBase does
      Sometimes they can do it in pieces and use something like gradient descent as noted here

      It is a serious engineering problem when the size of the training set gets too large and the algo can’t progressively process it as you propose here.

      • Jason Brownlee March 11, 2019 at 6:46 am #

        Yes, it’s a totally different beast!

      • malla pavani March 10, 2020 at 4:45 pm #

        Hi, do you know how to preprocess the reuters-50-50 dataset? If you know, please help me.

  2. felipe almeida May 30, 2017 at 4:43 pm #

    Some of the tips are a little bit obvious but overall it’s good. You could give more examples in each topic, such as “use file format X for cases like Y”.. Also, you could mention things like using stochastic gradient descent or other kinds of online learning, where you feed the examples one at a time.

  3. felipe almeida May 30, 2017 at 4:45 pm #

    Oh yeah, you could also mention using sparse (rather than dense) matrices, as they take much less space and some algorithms (like SVM) can handle sparse feature matrices directly. Here’s a link explaining that for sklearn.

    • Jason Brownlee June 2, 2017 at 12:34 pm #

      Great suggestion.

    • Hesam March 9, 2021 at 8:25 pm #

      I can’t reach df.to_sparse with the latest version of pandas; would you help me with that?

  4. Peter Marelas May 30, 2017 at 9:29 pm #

    A few things I would suggest if you are a python user.

    For out-of-core pre-processing:

    – Transform the data using a dask dataframe or array (it can read various formats, CSV, etc)
    – Once you are done save the dask dataframe or array to a parquet file for future out-of-core pre-processing (see pyarrow)

    For in-memory processing:

    – Use smaller data types where you can, i.e. int8, float16, etc.
    – If it still doesn’t fit in-memory convert the dask dataframe to a sparse pandas dataframe

    For Big Data try Greenplum (free) https://greenplum.org/. It is a derivative of Postgres. Benefit being queries are processed across cores in parallel. Also has a mature machine learning plugin called MADlib.

  5. Lee Zee June 20, 2017 at 4:37 am #

    Can feature selection applications identify features that are comprised of parts of multiple columns in a large dataset? Or will each identified predictive feature be restricted to data from a single column of data?

    • Jason Brownlee June 20, 2017 at 6:42 am #

      Often they focus on single columns. Perhaps you can dip into research and find some more complex methods.

  6. Dan August 14, 2017 at 11:58 am #

    Hi Jason,

    I have encountered a problem when using NLTK to analyze text in a Hadoop/Spark environment: the NLTK data (corpora) can’t be found on each worker node (I only downloaded the NLTK data on one node, and I can’t download it on every worker node due to access limitations).

    Could you give me some suggestions about how to conduct NLP analysis with NLTK data on each worker node without downloading the NLTK data on each worker node?

    Thanks in advance.

    • Jason Brownlee August 15, 2017 at 6:28 am #

      Sorry, I have not used NLTK in Hadoop/Spark, I cannot give you good advice.

  7. Azhaar Hussain August 30, 2017 at 5:43 am #

    Hi Jason,

    I wanted to understand how we put machine learning to use in practice. Let’s say I have developed a model for prediction and it works; how do I put this in production?

    Thanks,
    Azhaar

  8. debraj August 31, 2017 at 7:23 pm #

    I have log data with a lot of procedures in it. How can I apply ML to new log data to predict which procedures are followed?

  9. Daniel November 20, 2017 at 3:56 am #

    Hi Jason,

    Thanks for this great post !!!
    I am curious why you said a big data platform such as Hadoop or Spark is the last resort. What’s the reason?

    Thank you,

    Daniel

    • Jason Brownlee November 20, 2017 at 10:21 am #

      It is a lot of overhead to bring to the table only when it is really needed, e.g. only when you exhaust other options and you truly need a big data platform.

      I was not trying to offend. If you’re doing hadoop all day and want to run small data through it, then by all means.

    • David Severson March 10, 2019 at 1:40 pm #

      I want to second the commenter here. Some companies have these capabilities already set up and may already have much of the data in question. So Spark may be the first place to go after an easy run on your desktop doesn’t pan out. Forcing an underpowered desktop to process the problem may take considerably longer.

      Also, spinning up EMR with or without spark on AWS can be pretty quick if you don’t have in-house stuff running.

      Too many business problems are getting too big for our laptops not to have some resources prepped and ready to go.

  10. oksana December 3, 2017 at 4:58 am #

    Hello Dr. Brownlee,
    Your recommendation No. 6, Use a Relational Database, is what I tried to do using R and failed (not enough knowledge). I was able to connect to the SQL db and see certain columns of interest, but was unable to extract the data I needed. Is there a resource you recommend that I read? Any recommendation is highly appreciated.
    thank you!

    • Jason Brownlee December 3, 2017 at 5:28 am #

      I have done this myself, but it was years ago. Sorry, I don’t have a good resource to recommend other than some google searching.

  11. Liam January 24, 2018 at 10:49 pm #

    Awesome. Thank you for posting your thoughts about this
    machine learning problem. 🙂

    It helped me a lot for my current project.

    Thank you

    Liam

  12. Bhavika Panara May 22, 2018 at 10:22 pm #

    Hi, Jason Brownlee

    I want to feed a very large image dataset, which has 1,200,000 images and 15,000 classes, to a convolutional neural network. But I am not able to feed all of the images to the CNN, even though I have a GTX 1080 Ti 11 GB GPU and 32 GB of CPU RAM.

    How can I train my model on this very large image dataset with my limited computing resources?

    Is there any technique available so I can train my model using multiple chunks of images?

  13. Anjali Batra August 28, 2018 at 8:20 pm #

    “Increase the memory as a parameter”: the link provided in the first point doesn’t work anymore. Can you please suggest some other URL for the same purpose?

    • Jason Brownlee August 29, 2018 at 8:10 am #

      You can increase the memory for a Java application by adding the -Xmx flag, for example for 8GB use -Xmx8000m

  14. Sintyadi Thong April 23, 2019 at 2:07 pm #

    Hi, Jason!
    It is a very good article.

    I am wondering, let’s say I have 300Million rows of data.
    Is it legit to use bootstrapping with samples from the 300 million rows?

    let’s say I use sample with replacement to get a bag of 30 million rows, and create N bags of it. From each of them, I run a model..

    so somewhat like bagging, but the number of rows in each bag is less than the actual number of rows.

    Is it possible and logically correct?

    Thanks!

    • Jason Brownlee April 23, 2019 at 2:33 pm #

      Perhaps. It depends on how sensitive your model is to the amount of data, and how much time/compute you have available.

  15. Eric Ngo May 26, 2019 at 6:56 am #

    Hi Jason,
    I have about 960 .csv files, where each .csv file contains the speech/voice of a person, and 120 transcripts. Should I concatenate the 960 .csv files into a single file?

    • Jason Brownlee May 27, 2019 at 6:35 am #

      Perhaps, it really depends on how you intend to model the problem.

  16. Ashley July 3, 2019 at 9:16 am #

    What if I am using Orange? I am using the software for financial analysis because I am not a programmer and cannot code. Can Orange handle large sets of data?

  17. shivan mohammed March 31, 2020 at 5:43 am #

    Hello sir,
    Is it possible to use a 1 GB dataset (2000 .dicom files) for deep learning? And can I reduce the number of epochs from 20 to 10 in order to get high accuracy?
    Thanks.

  18. shivan mohammed April 4, 2020 at 12:45 am #

    I obtain a high accuracy, but I have doubts about it.

  19. Cornelius February 23, 2021 at 11:30 pm #

    Jason,

    I want to add that there is a kernel level library that turns disk into memory.

    https://github.com/mimgrund/rambrain

    Modern C++ is more like Python these days. I think this may be helpful for some people.

  20. Denver January 26, 2022 at 4:26 am #

    Hi Jason,

    We are using machine learning models and storing the output in an S3 bucket as CSV files. But every new day our bucket gets bigger than before. Recently our S3 bucket was 90 megabytes, but one day it will be terabytes, so how can we store the data in a lightweight way? Any methods to share with us?

    • James Carmichael January 26, 2022 at 11:05 am #

      I recommend running large models or long-running experiments on a server.

      I recommend only using your workstation for small experiments and for figuring out what large experiments to run. I talk more about this approach here:

      Machine Learning Development Environment
      I recommend using Amazon EC2 service as it provides access to Linux-based servers with lots of RAM, lots of CPU cores, and lots of GPU cores (for deep learning).

      You can learn how to setup an EC2 instance for machine learning in these posts:

      How To Develop and Evaluate Large Deep Learning Models with Keras on Amazon Web Services
      How to Train XGBoost Models in the Cloud with Amazon Web Services
      You can learn useful commands when working on the server instance in this post:

      10 Command Line Recipes for Deep Learning on Amazon Web Services

  21. Cansu March 22, 2023 at 1:19 am #

    I always look at your posts first whenever I have a question; they are all very clear. Thank you so much.
    Can you please share a post that shows how to implement neural networks for image classification with MLlib?

    • James Carmichael March 22, 2023 at 9:58 am #

      You are very welcome Cansu! We appreciate your suggestions!
