How to Save a NumPy Array to File for Machine Learning

Developing machine learning models in Python often requires the use of NumPy arrays.

NumPy arrays are efficient data structures for working with data in Python. Machine learning models, like those in the scikit-learn library, and deep learning models, like those in the Keras library, expect input data as NumPy arrays and return predictions as NumPy arrays.

As such, it is common to need to save NumPy arrays to file.

For example, you may prepare your data with transforms like scaling and need to save it to file for later use. You may also use a model to make predictions and need to save the predictions to file for later use.

In this tutorial, you will discover how to save your NumPy arrays to file.

After completing this tutorial, you will know:

  • How to save NumPy arrays to CSV formatted files.
  • How to save NumPy arrays to NPY formatted files.
  • How to save NumPy arrays to compressed NPZ formatted files.

Let’s get started.

[Image: Photo by Chris Combe, some rights reserved.]

Tutorial Overview

This tutorial is divided into three parts; they are:

  1. Save NumPy Array to .CSV File (ASCII)
  2. Save NumPy Array to .NPY File (binary)
  3. Save NumPy Array to .NPZ File (compressed)

1. Save NumPy Array to .CSV File (ASCII)

The most common file format for storing numerical data in files is the comma-separated values format, or CSV for short.

It is most likely that your training data and input data to your models are stored in CSV files.

It can be convenient to save data to CSV files, such as the predictions from a model.

You can save your NumPy arrays to CSV files using the savetxt() function. This function takes a filename and array as arguments and saves the array into CSV format.

You must also specify the delimiter, which is the character used to separate values in the file. The savetxt() function separates values with a space by default, so set the "delimiter" argument to a comma to produce a CSV file.

1.1 Example of Saving a NumPy Array to CSV File

The example below demonstrates how to save a single NumPy array to CSV format.
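A minimal sketch of such an example is given here. The array values (the integers 0 to 9) are an assumption chosen for illustration; any single-row array would work the same way.

```python
# save numpy array as csv file
from numpy import asarray
from numpy import savetxt

# define data: one row with ten columns (illustrative values)
data = asarray([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
# save to csv file, separating values with commas
savetxt('data.csv', data, delimiter=',')
```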

Running the example will define a NumPy array and save it to the file 'data.csv'.

The array has a single row of data with 10 columns. We would expect this data to be saved to a CSV file as a single row of data.

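After running the example, we can inspect the contents of 'data.csv'.

We should see a single line of ten comma-separated values, each written in scientific notation with 18 decimal places (the default '%.18e' format used by savetxt()).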

We can see that the data is correctly saved as a single row and that the floating point numbers in the array were saved with full precision.

1.2 Example of Loading a NumPy Array from CSV File

We can load this data later as a NumPy array using the loadtxt() function and specify the filename and the same comma delimiter.

The complete example is listed below.
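A minimal sketch follows. So that it runs on its own, it first re-creates 'data.csv' with the same illustrative values (0 to 9, an assumption); in practice the file would already exist from the earlier step.

```python
# load numpy array from csv file
from numpy import asarray
from numpy import savetxt
from numpy import loadtxt

# re-create the file so this sketch is self-contained (illustrative values)
savetxt('data.csv', asarray([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]), delimiter=',')
# load the array, using the same delimiter that was used when saving
data = loadtxt('data.csv', delimiter=',')
# print the array
print(data)
```

Note that loadtxt() returns floating point values by default, and that a file with a single row is loaded as a one-dimensional array. Passing ndmin=2 to loadtxt() keeps the two-dimensional shape if that is needed.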

Running the example loads the data from the CSV file and prints the contents, matching our single row with 10 columns defined in the previous example.

2. Save NumPy Array to .NPY File (binary)

Sometimes we have a lot of data in NumPy arrays that we wish to save efficiently, but which we only need to use in another Python program.

Therefore, we can save the NumPy arrays into a native binary format that is efficient to both save and load.

This is common for input data that has been prepared, such as transformed data, that will need to be used as the basis for testing a range of machine learning models in the future or running many experiments.

The .npy file format is appropriate for this use case and is referred to as simply "NumPy format".

This can be achieved using the save() NumPy function and specifying the filename and the array that is to be saved.

2.1 Example of Saving a NumPy Array to NPY File

The example below defines our two-dimensional NumPy array and saves it to a .npy file.
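A minimal sketch of this example, again assuming the illustrative one-row array of the integers 0 to 9:

```python
# save numpy array as npy file
from numpy import asarray
from numpy import save

# define data: a two-dimensional array with one row (illustrative values)
data = asarray([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
# save to npy file
save('data.npy', data)
```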

After running the example, you will see a new file in the directory with the name 'data.npy'.

You cannot inspect the contents of this file directly with your text editor because it is in binary format.

2.2 Example of Loading a NumPy Array from NPY File

You can load this file as a NumPy array later using the load() function.

The complete example is listed below.
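A minimal sketch follows. It first re-creates 'data.npy' with illustrative values (an assumption) so that it runs on its own; normally the file would already exist.

```python
# load numpy array from npy file
from numpy import asarray
from numpy import save
from numpy import load

# re-create the file so this sketch is self-contained (illustrative values)
save('data.npy', asarray([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]))
# load the array; shape and data type are restored from the file
data = load('data.npy')
# print the array
print(data)
```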

Running the example will load the file and print the contents, confirming both that it was loaded correctly and that the content matches what we expect in the same two-dimensional format.

3. Save NumPy Array to .NPZ File (compressed)

Sometimes, we prepare data for modeling that needs to be reused across multiple experiments, but the data is large.

This might be pre-processed NumPy arrays like a corpus of text (integers) or a collection of rescaled image data (pixels). In these cases, it is desirable to save the data to file in a compressed format.

Depending on the data, this can allow gigabytes of data to be reduced to hundreds of megabytes and makes it easier to transfer the data to other servers or cloud computing instances for long algorithm runs.

The .npz file format is appropriate for this case and supports a compressed version of the native NumPy file format.

The savez_compressed() NumPy function allows multiple NumPy arrays to be saved to a single compressed .npz file.

3.1 Example of Saving a NumPy Array to NPZ File

We can use this function to save our single NumPy array to a compressed file.

The complete example is listed below.
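A minimal sketch, assuming the same illustrative one-row array used for the other formats:

```python
# save numpy array as npz file
from numpy import asarray
from numpy import savez_compressed

# define data (illustrative values)
data = asarray([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
# save to a compressed npz file
savez_compressed('data.npz', data)
```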

Running the example defines the array and saves it into a file in compressed NumPy format with the name 'data.npz'.

As with the .npy format, we cannot inspect the contents of the saved file with a text editor because the file format is binary.

3.2 Example of Loading a NumPy Array from NPZ File

We can load this file later using the same load() function from the previous section.

In this case, the savez_compressed() function supports saving multiple arrays to a single file. Therefore, the load() function may load multiple arrays.

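The loaded arrays are returned from the load() function in a dict-like object (an NpzFile). Arrays that were passed to savez_compressed() as positional arguments are stored under the names 'arr_0' for the first array, 'arr_1' for the second, and so on.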
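To see which names a given .npz file contains, the object returned by load() exposes a "files" attribute. The sketch below illustrates this; the file names 'unnamed.npz' and 'named.npz' and the keyword names X and y are hypothetical, chosen only for the example.

```python
# inspect the names of the arrays stored in npz files
from numpy import arange
from numpy import asarray
from numpy import load
from numpy import savez_compressed

# two small illustrative arrays
features = arange(6).reshape(2, 3)
labels = asarray([0, 1])

# positional arguments are stored under automatic names
savez_compressed('unnamed.npz', features, labels)
# keyword arguments set the stored names (X and y are hypothetical names)
savez_compressed('named.npz', X=features, y=labels)

# list the stored names without guessing them
with load('unnamed.npz') as archive:
    unnamed_keys = list(archive.files)
with load('named.npz') as archive:
    named_keys = list(archive.files)
    X = archive['X']

print(unnamed_keys)
print(named_keys)
```

Naming the arrays with keyword arguments makes the file easier to interpret later than relying on the automatic 'arr_0', 'arr_1' names.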

The complete example of loading our single array is listed below.
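A minimal sketch follows. It first re-creates 'data.npz' with illustrative values (an assumption) so that it runs on its own; normally the file would already exist.

```python
# load numpy array from npz file
from numpy import asarray
from numpy import savez_compressed
from numpy import load

# re-create the file so this sketch is self-contained (illustrative values)
savez_compressed('data.npz', asarray([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]))
# load the dict-like collection of arrays
dict_data = load('data.npz')
# extract the first (and only) array
data = dict_data['arr_0']
# print the array
print(data)
```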

Running the example loads the compressed NumPy file that contains a dict-like collection of arrays, then extracts the first array that we saved (we only saved one), then prints the contents, confirming that the values and the shape of the array match what we saved in the first place.

Summary

In this tutorial, you discovered how to save your NumPy arrays to file.

Specifically, you learned:

  • How to save NumPy arrays to CSV formatted files.
  • How to save NumPy arrays to NPY formatted files.
  • How to save NumPy arrays to compressed NPZ formatted files.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

40 Responses to How to Save a NumPy Array to File for Machine Learning

  1. Eric November 13, 2019 at 8:03 am

    Very interesting. Is there a difference in performance among them? Especially between CSV and NPY? Unless there’s one, using the portable CSV might be more convenient.

    I think that for fast file systems NPY should be faster than NPZ, but on very large arrays and slow file systems NPZ could sometimes be faster.

    • Jason Brownlee November 13, 2019 at 1:44 pm

      Thanks!

      Good question. I don’t have good stats on performance comparisons, although working with 10/100MB of random floats in an array would give results quickly.

      My expectation is that getting data into RAM fast, e.g. compressed would have the best performance.

      I use NPY and NPZ a lot myself.

  2. Deep Learner November 14, 2019 at 11:07 pm

    Hi Jason,

    Thanks for the post, a very useful feature, it is a good complement to another good post on this very site which deals with models: https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/

  3. Darmawan Utomo November 15, 2019 at 7:18 pm

    Hi Jason,

    It seems that savetxt is only for 1D or 2D.
    But no problem with npy and npz.

    Thank you.

  4. araya November 19, 2019 at 12:11 am

    please would you mind to give me the book regarding this title

    • Jason Brownlee November 19, 2019 at 7:42 am

      A book on saving NumPy arrays?

      What additional problems are you having exactly?

  5. Anirban Ray November 26, 2019 at 12:18 am

    Hi!

    Can you please tell me whether it is possible to append to a .npy file?

    For example, suppose I have an numpy array x, and stored it in x.npy. If I now want to append a few elements to it, do I have to load it, append, and then save again? Or, is there a way to append directly to the x.npy file without loading it?

    Thanks for this and all the other great articles.

    • Jason Brownlee November 26, 2019 at 6:06 am

      Maybe.

      Instead, I would recommend loading it into memory, append to the array, then save it again.

      • Eva February 12, 2020 at 11:40 pm

        Great article, as always Dr. Jason!
        However, if the data is too large to fit in RAM, then loading the .npy file into memory, appending to the array, then saving it again would not be possible, I think.
        How to store very large data to .npy file then??

  6. Mona January 11, 2020 at 8:03 am

    How do I know what arr_0 is in any arbitrary npz file that I load?

    data = dict_data['arr_0']

    • Jason Brownlee January 11, 2020 at 8:14 am

      Arrays are loaded in the same order that they were saved.

  7. Esha January 11, 2020 at 10:41 pm

    After creating a .npz file in Google Colab and saving it in Google Drive, how can I append anything to the saved .npz file?

    • Jason Brownlee January 12, 2020 at 8:02 am

      You can load the array, concat the array and re-save it.

  8. Akil R March 8, 2020 at 5:38 am

    I tried this to dump a large list of arrays, I wasn’t successful. Process gets killed after long wait.

  9. Mahshad March 31, 2020 at 10:04 am

    Thank you Jason.

    I have saved an image as an array into a csv file but when I tried to display the image from the saved array it doesn’t show the picture.

    This is while I can show the image from the array I made from the same picture (without saving as csv) but when it reads from csv file it doesn’t work.
    I think it’s because of those dots that comes after each number in the array:

    [0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]

    Do u know how can I get rid of those dots when reading from csv?

    • Jason Brownlee March 31, 2020 at 1:34 pm

      Yes, you may have to change the shape of the array and scale the pixel values before displaying them using matplotlib.

      I do have examples of this on the blog.

  10. Rucha May 2, 2020 at 10:49 pm

    This is for a 1D array. How do I convert a 2D array to .csv format?

    • Jason Brownlee May 3, 2020 at 6:10 am

      The code is identical.

      • Gill May 10, 2020 at 10:46 pm

        How do you save a 3D array to a .csv file, then?

        • Jason Brownlee May 11, 2020 at 5:59 am

          The same as a 2d array. E.g. call the same functions to save and load it.

          • Gill May 11, 2020 at 6:03 pm

            When I run this for my array, I get this error message: "Expected 1D or 2D array, got 4D array instead". So that actually means I have a 4D array, and so my question comes up again: what is the best way to save it? This is my code for saving it so far:
            # save numpy array as csv file
            from numpy import asarray
            from numpy import savetxt
            # define data
            nieuwe_array = asarray([[nieuwe_array]])
            # save to csv file
            savetxt('nieuwe_array.csv', nieuwe_array, delimiter=',')

          • Jason Brownlee May 12, 2020 at 6:41 am

            That is surprising.

            As far as I know, saving an array is agnostic to the size and dimensionality of the array.
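Note that savetxt() itself accepts only 1D or 2D arrays, which is what the error message quoted above reports. One possible workaround, sketched here under the assumption that the original shape is known when the file is loaded again, is to flatten the extra dimensions into columns before saving and to restore the shape after loading. The file name 'data_3d.csv' and the array contents are hypothetical.

```python
# save a 3D array to csv by flattening it to 2D, then restore the shape on load
from numpy import arange
from numpy import loadtxt
from numpy import savetxt

# illustrative 3D array with shape (2, 3, 4)
data = arange(24).reshape(2, 3, 4)
# keep the first axis as rows and flatten the remaining axes into columns
flat = data.reshape(data.shape[0], -1)
# save the 2D view to csv file
savetxt('data_3d.csv', flat, delimiter=',')
# load the 2D data and restore the original (known) shape
loaded = loadtxt('data_3d.csv', delimiter=',').reshape(data.shape)
print(loaded.shape)
```

The .npy and .npz formats store the shape alongside the data, so they avoid this extra bookkeeping for arrays with more than two dimensions.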

  11. Souvik Mukherjee October 3, 2020 at 7:15 am

    Jason,
    Your tutorials rock!! I have enjoyed and benefited from a number of them. If there’s a place to provide a recommendation for your work, just let me know.
    Thanks.
    Souvik

    • Jason Brownlee October 3, 2020 at 7:59 am

      Thanks!

      Yes, anything you can do to spread the word on social media helps.

  12. Imdadul Haque October 4, 2020 at 3:34 am

    Is it possible to load image Dataset then convert the image dataset into csv file?

  13. JoAnn Alvarez November 19, 2020 at 6:35 am

    I’m wondering whether np.save, the np.savez_compressed, or some other method (joblib, json) would be best for my situation. The array is about 150 GB in memory.

    First I tried json, but it exceeded my memory:

    with open('X_train_list.json', 'w') as file_handle:
        json.dump(X_train.tolist(), file_handle)

    I have X_train (a np array), and I first converted it using X_train.tolist(), and then used json.dump(). I think converting it to a list and/or using json.dump() saved copies in memory before writing.

    My first priority is not exceeding my memory. Then other things to consider are: speed of writing/reading, file size, universality of file type (for example, is it going to easily break if I use a new version of dill, can other programs open it).

    Do you have any advice? Much appreciated for this tutorial.

    • Jason Brownlee November 19, 2020 at 7:55 am

      That is large!

      My advice is to trial a few approaches and discover which meets your requirements. Maybe do a little research into specialized methods for managing large data, e.g. memory mapped files.
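As a rough illustration of the memory-mapped approach mentioned above, the sketch below writes an .npy file in chunks and later reads part of it without loading the whole array into RAM. The file name 'big.npy', the small shape, the chunk size and the random data are all assumptions for the example; a real data set would be far larger and filled from its own source.

```python
# write and read an .npy file through a memory map, one chunk at a time
import numpy as np
from numpy.lib.format import open_memmap

# small hypothetical shape and chunk size so the sketch runs quickly
shape = (1000, 8)
chunk = 250

# create the .npy file on disk and get a writable memory-mapped array
out = open_memmap('big.npy', mode='w+', dtype=np.float32, shape=shape)
for start in range(0, shape[0], chunk):
    stop = min(start + chunk, shape[0])
    # only this chunk needs to be held in memory at once
    out[start:stop] = np.random.rand(stop - start, shape[1])
# make sure the data is written to disk, then release the map
out.flush()
del out

# later: open the file without reading it all into memory
view = np.load('big.npy', mmap_mode='r')
print(view.shape, view.dtype)
# slicing reads only the requested rows from disk
first_rows = np.asarray(view[:2])
```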

  14. Razi December 15, 2020 at 4:48 pm

    hello!
    I want to save my all images (genuine and forgery) paths with their labels 1 as genuine and 0 as forgery, after comparison of genuine and forgery third column would be labels again which would show its genuine or forgery in the form 0 or 1, in the txt or csv file.
    I am trying to do that but i couldn’t, Could you help me?

    I want the below format,

    Path1 Path2 labels
    E/img.jpg 1 E/img.jpg 1 1
    E/img.jpg 1 E/img.jpg 0 0

    • Jason Brownlee December 16, 2020 at 7:44 am

      Perhaps construct your data as an array in memory first, then save the array to file as a CSV.

  15. Mayank Mishra March 5, 2021 at 4:44 pm

    I ran a code to store my images (2868 of them) as an array for an image classification task. The shape of the array is (2868, 224, 224, 3). I saved the array using numpy.save(). But later as I try to load the array, it gives the following error, “cannot reshape array of size 92437951 into shape (2868,224,224,3)”. How to overcome this issue? Does numpy.load() not work for multidimensional arrays?

    • Jason Brownlee March 6, 2021 at 5:14 am

      That is very odd.

      Perhaps try posting your code and error message to stackoverflow.com

  16. Debajyoti Ghosh March 7, 2021 at 6:57 pm

    How do I read a NumPy matrix from a CSV file, perform operations on it, and access its elements?
    Say I want to find each row's max.

  17. Ankit April 3, 2021 at 4:32 pm

    Hi, Jason could you please tell me how can I store the frame pixel values of a bunch of videos into a NumPy .npz file for training the model. And I am using LSTM for predicting some classes in all the videos, but it always shows a shape error when I use the LSTM layer. Can you please help me out?
