Real-world data often has missing values.
Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.
Handling missing data is important as many machine learning algorithms do not support data with missing values.
In this tutorial, you will discover how to handle missing data for machine learning with Python.
Specifically, after completing this tutorial you will know:
- How to mark invalid or corrupt values as missing in your dataset.
- How to remove rows with missing data from your dataset.
- How to impute missing values with mean values in your dataset.
Let’s get started.
Note: The examples in this post assume that you have Python 2 or 3 with Pandas, NumPy and Scikit-Learn installed, specifically scikit-learn version 0.18 or higher.
- Update March/2018: Added alternate link to download the dataset as the original appears to have been taken down.

How to Handle Missing Values with Python
Photo by CoCreatr, some rights reserved.
Overview
This tutorial is divided into 6 parts:
- Pima Indians Diabetes Dataset: where we look at a dataset that has known missing values.
- Mark Missing Values: where we learn how to mark missing values in a dataset.
- Missing Values Cause Problems: where we see how a machine learning algorithm can fail when the dataset contains missing values.
- Remove Rows With Missing Values: where we see how to remove rows that contain missing values.
- Impute Missing Values: where we replace missing values with sensible values.
- Algorithms that Support Missing Values: where we learn about algorithms that support missing values.
First, let’s take a look at our sample dataset with missing values.
1. Pima Indians Diabetes Dataset
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:
- 0. Number of times pregnant.
- 1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- 2. Diastolic blood pressure (mm Hg).
- 3. Triceps skinfold thickness (mm).
- 4. 2-Hour serum insulin (mu U/ml).
- 5. Body mass index (weight in kg/(height in m)^2).
- 6. Diabetes pedigree function.
- 7. Age (years).
- 8. Class variable (0 or 1).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.
A sample of the first 5 rows is listed below.
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
This dataset is known to have missing values.
Specifically, there are missing observations for some columns that are marked as a zero value.
We can corroborate this from the definitions of those columns and the domain knowledge that a zero value is invalid for these measures, e.g. a body mass index or blood pressure of zero is invalid.
Download the dataset from here and save it to your current working directory with the file name pima-indians-diabetes.csv (update: download from here).
2. Mark Missing Values
In this section, we will look at how we can identify and mark values as missing.
We can use plots and summary statistics to help identify missing or corrupt data.
We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.
from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
print(dataset.describe())
Running this example produces the following output:
                0           1           2           3           4           5  \
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578
std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000
25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000
50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000
75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000

                6           7           8
count  768.000000  768.000000  768.000000
mean     0.471876   33.240885    0.348958
std      0.331329   11.760232    0.476951
min      0.078000   21.000000    0.000000
25%      0.243750   24.000000    0.000000
50%      0.372500   29.000000    0.000000
75%      0.626250   41.000000    1.000000
max      2.420000   81.000000    1.000000
This is useful.
We can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.
Specifically, the following columns have an invalid zero minimum value:
- 1: Plasma glucose concentration
- 2: Diastolic blood pressure
- 3: Triceps skinfold thickness
- 4: 2-Hour serum insulin
- 5: Body mass index
Let's confirm this by looking at the raw data; the example below prints the first 20 rows of data.
from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# print the first 20 rows of data
print(dataset.head(20))
Running the example, we can clearly see 0 values in the columns 2, 3, 4, and 5.
     0    1   2   3    4     5      6   7  8
0    6  148  72  35    0  33.6  0.627  50  1
1    1   85  66  29    0  26.6  0.351  31  0
2    8  183  64   0    0  23.3  0.672  32  1
3    1   89  66  23   94  28.1  0.167  21  0
4    0  137  40  35  168  43.1  2.288  33  1
5    5  116  74   0    0  25.6  0.201  30  0
6    3   78  50  32   88  31.0  0.248  26  1
7   10  115   0   0    0  35.3  0.134  29  0
8    2  197  70  45  543  30.5  0.158  53  1
9    8  125  96   0    0   0.0  0.232  54  1
10   4  110  92   0    0  37.6  0.191  30  0
11  10  168  74   0    0  38.0  0.537  34  1
12  10  139  80   0    0  27.1  1.441  57  0
13   1  189  60  23  846  30.1  0.398  59  1
14   5  166  72  19  175  25.8  0.587  51  1
15   7  100   0   0    0  30.0  0.484  32  1
16   0  118  84  47  230  45.8  0.551  31  1
17   7  107  74   0    0  29.6  0.254  31  1
18   1  103  30  38   83  43.3  0.183  33  0
19   1  115  70  30   96  34.6  0.529  32  1
We can get a count of the number of missing values in each of these columns. We can do this by marking all of the values in the subset of the DataFrame we are interested in that have zero values as True. We can then count the number of True values in each column.
from pandas import read_csv
dataset = read_csv('pima-indians-diabetes.csv', header=None)
print((dataset[[1,2,3,4,5]] == 0).sum())
Running the example prints the following output:
1      5
2     35
3    227
4    374
5     11
We can see that columns 1, 2 and 5 have just a few zero values, whereas columns 3 and 4 show a lot more, nearly half of the rows.
This highlights that different “missing value” strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.
In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.
NaN values are ignored by operations like sum, count, etc.
We can mark values as NaN easily with the Pandas DataFrame by using the replace() function on a subset of the columns we are interested in.
After we have marked the missing values, we can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.
from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# count the number of NaN values in each column
print(dataset.isnull().sum())
Running the example prints the number of missing values in each column. We can see that columns 1 to 5 have the same number of missing values as the zero values identified above. This is a sign that we have marked the identified missing values correctly.
0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
This is a useful summary. I always like to look at the actual data though, to confirm that I have not fooled myself.
Below is the same example, except we print the first 20 rows of data.
from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# print the first 20 rows of data
print(dataset.head(20))
Running the example, we can clearly see NaN values in the columns 2, 3, 4 and 5. There are only 5 missing values in column 1, so it is not surprising we did not see an example in the first 20 rows.
It is clear from the raw data that marking the missing values had the intended effect.
     0      1     2     3      4     5      6   7  8
0    6  148.0  72.0  35.0    NaN  33.6  0.627  50  1
1    1   85.0  66.0  29.0    NaN  26.6  0.351  31  0
2    8  183.0  64.0   NaN    NaN  23.3  0.672  32  1
3    1   89.0  66.0  23.0   94.0  28.1  0.167  21  0
4    0  137.0  40.0  35.0  168.0  43.1  2.288  33  1
5    5  116.0  74.0   NaN    NaN  25.6  0.201  30  0
6    3   78.0  50.0  32.0   88.0  31.0  0.248  26  1
7   10  115.0   NaN   NaN    NaN  35.3  0.134  29  0
8    2  197.0  70.0  45.0  543.0  30.5  0.158  53  1
9    8  125.0  96.0   NaN    NaN   NaN  0.232  54  1
10   4  110.0  92.0   NaN    NaN  37.6  0.191  30  0
11  10  168.0  74.0   NaN    NaN  38.0  0.537  34  1
12  10  139.0  80.0   NaN    NaN  27.1  1.441  57  0
13   1  189.0  60.0  23.0  846.0  30.1  0.398  59  1
14   5  166.0  72.0  19.0  175.0  25.8  0.587  51  1
15   7  100.0   NaN   NaN    NaN  30.0  0.484  32  1
16   0  118.0  84.0  47.0  230.0  45.8  0.551  31  1
17   7  107.0  74.0   NaN    NaN  29.6  0.254  31  1
18   1  103.0  30.0  38.0   83.0  43.3  0.183  33  0
19   1  115.0  70.0  30.0   96.0  34.6  0.529  32  1
Before we look at handling missing values, let’s first demonstrate that having missing values in a dataset can cause problems.
3. Missing Values Cause Problems
Having missing values in a dataset can cause errors with some machine learning algorithms.
In this section, we will try to evaluate the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values.
This is an algorithm that does not work when there are missing values in the dataset.
The example below marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross validation and prints the mean accuracy.
from pandas import read_csv
import numpy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())
Running the example results in an error, as follows:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
This is as we expect.
We are prevented from evaluating an LDA algorithm (and other algorithms) on the dataset with missing values.
Now, we can look at methods to handle the missing values.
4. Remove Rows With Missing Values
The simplest strategy for handling missing data is to remove records that contain a missing value.
We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.
Pandas provides the dropna() function that can be used to drop either columns or rows with missing data. We can use dropna() to remove all rows with missing data, as follows:
from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# drop rows with missing values
dataset.dropna(inplace=True)
# summarize the number of rows and columns in the dataset
print(dataset.shape)
Running this example, we can see that the number of rows has been aggressively cut from 768 in the original dataset to 392 with all rows containing a NaN removed.
(392, 9)
We now have a dataset that we could use to evaluate an algorithm sensitive to missing values like LDA.
from pandas import read_csv
import numpy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# drop rows with missing values
dataset.dropna(inplace=True)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())
The example runs successfully and prints the accuracy of the model.
0.78582892934
Removing rows with missing values can be too limiting on some predictive modeling problems; an alternative is to impute missing values.
5. Impute Missing Values
Imputing refers to using a model to replace missing values.
There are many options we could consider when replacing a missing value, for example:
- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- A mean, median or mode value for the column.
- A value estimated by another predictive model.
Any imputing performed on the training dataset will have to be performed on new data in the future when predictions are needed from the finalized model. This needs to be taken into consideration when choosing how to impute the missing values.
For example, if you choose to impute with mean column values, these mean column values will need to be stored to file for later use on new data that has missing values.
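Below is a minimal sketch of that idea; the split into a "training" portion and a "new data" portion is an illustrative assumption for the sketch, not part of the worked example above.

from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# illustrative split: treat the first 500 rows as training data, the rest as new data
train, new_data = dataset.iloc[:500], dataset.iloc[500:]
# calculate the fill values on the training data only
column_means = train.mean()
# the stored means could also be written to file, e.g. column_means.to_csv('means.csv')
# apply the same stored means to both the training data and the new data
train = train.fillna(column_means)
new_data = new_data.fillna(column_means)
print(new_data.isnull().sum())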
Pandas provides the fillna() function for replacing missing values with a specific value.
For example, we can use fillna() to replace missing values with the mean value for each column, as follows:
from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)
# count the number of NaN values in each column
print(dataset.isnull().sum())
Running the example provides a count of the number of missing values in each column, showing zero missing values.
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
The scikit-learn library provides the Imputer() pre-processing class that can be used to replace missing values.
It is a flexible class that allows you to specify the value to replace (it can be something other than NaN) and the technique used to replace it (such as mean, median, or mode). The Imputer class operates directly on the NumPy array instead of the DataFrame.
The example below uses the Imputer class to replace missing values with the mean of each column then prints the number of NaN values in the transformed matrix.
from pandas import read_csv
from sklearn.preprocessing import Imputer
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# fill missing values with mean column values
values = dataset.values
imputer = Imputer()
transformed_values = imputer.fit_transform(values)
# count the number of NaN values in each column
print(numpy.isnan(transformed_values).sum())
Running the example shows that all NaN values were imputed successfully.
0
In either case, we can train algorithms that are sensitive to NaN values on the transformed dataset, such as LDA.
The example below shows the LDA algorithm trained on the Imputer-transformed dataset.
from pandas import read_csv
import numpy
from sklearn.preprocessing import Imputer
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# fill missing values with mean column values
imputer = Imputer()
transformed_X = imputer.fit_transform(X)
# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=3, random_state=7)
result = cross_val_score(model, transformed_X, y, cv=kfold, scoring='accuracy')
print(result.mean())
Running the example prints the accuracy of LDA on the transformed dataset.
0.766927083333
Try replacing the missing values with other values and see if you can lift the performance of the model.
Maybe missing values have meaning in the data.
Next we will look at using algorithms that treat missing values as just another value when modeling.
6. Algorithms that Support Missing Values
Not all algorithms fail when there is missing data.
There are algorithms that can be made robust to missing data, such as k-Nearest Neighbors that can ignore a column from a distance measure when a value is missing.
There are also algorithms that can use the missing value as a unique and different value when building the predictive model, such as classification and regression trees.
Sadly, the scikit-learn implementations of decision trees and k-Nearest Neighbors are not robust to missing values, although adding this support is being considered.
Nevertheless, this remains an option if you use another algorithm implementation (such as XGBoost) or develop your own implementation.
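As a rough sketch of that option, assuming the xgboost package is installed (it is not part of the scikit-learn code used above), XGBoost can be evaluated directly on the data with the NaN markers left in place:

from pandas import read_csv
import numpy
from xgboost import XGBClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN, but do not impute or drop them
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# xgboost handles missing (NaN) input values internally when building its trees
model = XGBClassifier()
kfold = KFold(n_splits=3, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())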
Summary
In this tutorial, you discovered how to handle machine learning data that contains missing values.
Specifically, you learned:
- How to mark missing values in a dataset as numpy.nan.
- How to remove rows from the dataset that contain missing values.
- How to replace missing values with sensible values.
Do you have any questions about handling missing values?
Ask your questions in the comments and I will do my best to answer.
Fancy impute is a library I've turned to for imputation:
https://github.com/hammerlab/fancyimpute
Also missingno is great for visualizations!
https://github.com/ResidentMario/missingno
Thanks for the tip Mike.
please tell me about how to impute median using one dataset
please tell me, in case use Fancy impute library, how to predict for X_test?
Thanks for pointing on interesting problem. I would love another one about how to deal with categorical attributes in Python.
And dear reader, please never ever remove rows with missing values. It changes the distribution of your data and your analyses may become worthless. Learn from mistakes of others and don’t repeat them 🙂
Thanks Jozo.
This post will help with categorical input data:
http://machinelearningmastery.com/data-preparation-gradient-boosting-xgboost-python/
Super duper! Thanks for writing! Would it have been worth mentioning interpolate of Pandas? http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html
Thanks Tommy.
Hi Jason,
I was just wondering if there is a way to use a different imputation strategy for each column. Say, for a categorical feature you want to impute using the mode but for a continuous attribute, you want to impute using mean.
Yes, try lots of techniques, go with whatever results in the most accurate models.
thanks for your tutorial sir.
I would also seek help from you for multi label classification of a textual data , if possible.
For example, categorizing a twitter post as related to sports, business , tech , or others.
Sure, see this post:
http://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
Hello Mr. Brownlee. Thank you so much for your post.
Do you know any approach to recognize the pattern of missing data? I mean, I am interested in discovering the pattern of missing data on a time series data.
The database is historical data of a chemical process. I think I should apply some pattern recognition approach columnwise because each column represents a process variable and the value coming from a transmitter.
My goal is to predict if the missing data is due to a mechanical fault, a deviation in the registration process, or any other cause. Then I should apply some kind of filling method if it is required.
Have you any advice? Thanks in advance
I would invert the problem and model the series of missing data and mark all data you do have with a special value “0” and all missing instances as “1”.
Great problem!
Let me know how you go.
You helped me keep my sanity. THANK YOU!!
I’m really glad to hear that Patricia!
How to know whether to apply mean or to replace it with mode?
Try both and see what results in the most skillful models.
Hi Sachin,
Mode is effected by outliers whereas Mean is less effected by outliers.
Please correct me if i am wrong@Jason
I think you meant “Median” is not affected by outliers. “Mode” is just the most common value.
If I have a 11×11 table and there are 20 missing values in there, is there a way for me to make a code that creates a list after identifying these values?
Let us say that the first column got names and the first row has Day 1 to 10. Some of the names does not show up all of the days and therefore there are missing gaps. I put this table into the code and rather than reading the table I get a list with:
Name, day 2, day 5, day 7
Name, Day 1, day 6
I understand that this could take some time to answer, but if you are able to just tell me that this is possible and maybe know of good place to start on how to start on this project that would be of great help!
Sure, if the missing values are marked with a nan or similar, you can retrieve rows with missing values using Pandas.
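For example, a minimal sketch using the same dataset as the post, after the missing values have been marked as NaN:

from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# select only the rows that contain at least one missing value
rows_with_missing = dataset[dataset.isnull().any(axis=1)]
print(rows_with_missing)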
can we code our own algorithms to impute the missing values????
if it is possible then how can i implement it??
Yes.
You can write some if-statements and fill in the n/a values in the Pandas dataframe.
I would recommend using statistics or a model as well and compare results.
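A rough sketch of that idea is below; the rule and the fill values are arbitrary placeholders for illustration only, not recommendations.

from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# custom rule: fill missing serum insulin (column 4) with a different value per class label
for i in dataset.index:
    if numpy.isnan(dataset.at[i, 4]):
        dataset.at[i, 4] = 100.0 if dataset.at[i, 8] == 1 else 80.0
print(dataset[4].isnull().sum())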
Hi Jason,
I am trying to prepare data for the TITANIC dataset. One of the columns is CABIN which has values like ‘A22′,’B56’ and so on. This column has maximum number of missing values. First I thought to delete this column but I think this could be an important variable for predicting survivors.
I am trying to find a strategy to fill these null values. Is there a way to fill alphanumeric blank values?
If there is no automatic way, I was thinking of fill these records based on Name, number of sibling, parent child and class columns. E.g. for a missing value, try to see if there are any relatives and use their cabin number to replace missing value.
Similar case is for AGE column which is missing. Any thoughts?
Sounds like a categorical variable. You could encode them as integers. You could also assign an “unknown” integer value (e.g. -999) for the missing value.
Perhaps you can develop a model to predict the cabin number from other details and see if that is skilful.
Good day, I ran this file code pd.read_csv(r'C:\Users\Public\Documents\SP_dow_Hist_stock.csv',sep=',').pct_change(252)
and it gave me missing values (NAN) of return of stock. How do I resolve it.
pd.read_csv(r'C:\Users\Public\Documents\SP_dow_Hist_stock.csv',sep=',')
Out[5]:
Unnamed: 0 S&P500 Dow Jones
0 Date close Close
1 1-Jan-17 2,275.12 24719.22
2 1-Jan-16 1,918.60 19762.60
3 1-Jan-15 2,028.18 17425.03
4 1-Jan-14 1,822.36 17823.07
5 1-Jan-13 1,480.40 16576.66
6 1-Jan-12 1,300.58 13104.14
7 1-Jan-11 1,282.62 12217.56
8 1-Jan-10 1,123.58 11577.51
9 1-Jan-09 865.58 10428.05
10 1-Jan-08 1,378.76 8776.39
11 1-Jan-07 1,424.16 13264.82
12 1-Jan-06 1,278.73 12463.15
13 1-Jan-05 1,181.41 10717.50
14 1-Jan-04 1,132.52 10783.01
15 1-Jan-03 895.84 10453.92
16 1-Jan-02 1,140.21 8341.63
17 1-Jan-01 1,335.63 10021.57
18 1-Jan-00 1,425.59 10787.99
19 1-Jan-99 1,248.77 11497.12
20 1-Jan-98 963.36 9181.43
21 1-Jan-97 766.22 7908.25
22 1-Jan-96 614.42 6448.27
23 1-Jan-95 465.25 5117.12
24 1-Jan-94 472.99 3834.44
25 1-Jan-93 435.23 3754.09
26 1-Jan-92 416.08 3301.11
27 1-Jan-91 325.49 3168.83
28 1-Jan-90 339.97 2633.66
29 1-Jan-89 285.4 2753.20
.. … … …
68 1-Jan-50 16.88 235.42
69 1-Jan-49 15.36 200.52
70 1-Jan-48 14.83 177.30
71 1-Jan-47 15.21 181.16
72 1-Jan-46 18.02 177.20
73 1-Jan-45 13.49 192.91
74 1-Jan-44 11.85 151.93
75 1-Jan-43 10.09 135.89
76 1-Jan-42 8.93 119.40
77 1-Jan-41 10.55 110.96
78 1-Jan-40 12.3 131.13
79 1-Jan-39 12.5 149.99
80 1-Jan-38 11.31 154.36
81 1-Jan-37 17.59 120.85
82 1-Jan-36 13.76 179.90
83 1-Jan-35 9.26 144.13
84 1-Jan-34 10.54 104.04
85 1-Jan-33 7.09 98.67
86 1-Jan-32 8.3 60.26
87 1-Jan-31 15.98 77.90
88 1-Jan-30 21.71 164.58
89 1-Jan-29 24.86 248.48
90 1-Jan-28 17.53 300.00
91 1-Jan-27 13.4 200.70
92 1-Jan-26 12.65 157.20
93 1-Jan-25 10.58 156.66
94 1-Jan-24 8.83 120.51
95 1-Jan-23 8.9 95.52
96 1-Jan-22 7.3 98.17
97 1-Jan-21 7.11 80.80
[98 rows x 3 columns]
pd.read_csv(r'C:\Users\Public\Documents\SP_dow_Hist_stock.csv',sep=',').pct_change(251)
Out[7]:
Unnamed: 0 S&P500 Dow Jones
0         NaN      NaN        NaN
1         NaN      NaN        NaN
2         NaN      NaN        NaN
..        ...      ...        ...
96        NaN      NaN        NaN
97        NaN      NaN        NaN
[98 rows x 3 columns]
Perhaps post your code and issue to stackoverflow?
Hi Jason,
Thanks for your valuable writing.
I have one question :-
We can also replace NaN values with Pandas fillna() function. In my opinion this is more versatile than Imputer class because in a single statement we can take different strategies on different column.
df.fillna({'A':df['A'].mean(),'B':0,'C':df['C'].min(),'D':3})
What is your opinion? Is there any performance difference between two?
Great tip.
No idea, on the performance difference.
Is there a recommended ratio on the number of NaN values to valid values , when any corrective action like imputing can be taken?
If we have a column with most of the values as null, then it would be better off to ignore that column altogether for feature?
No, it is problem specific. Perhaps run some experiments to see how sensitive the model is to missing values.
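One such experiment might look like the sketch below; the 50% threshold is an arbitrary assumption for illustration:

from pandas import read_csv
import numpy
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
# drop any column where more than half of the values are missing
threshold = 0.5
reduced = dataset.loc[:, dataset.isnull().mean() <= threshold]
print(reduced.shape)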
Hi Jason,
Thanks for this post, I wanted to ask, how do we impute missing text values in a column which has either text labels or blanks.
Good question, I’m not sure off hand. Perhaps start with simple masking of missing values.
To fill the nan for a categorical column
df = df.fillna(df['column'].value_counts().index[0])
This fills the missing values in all columns with the most frequent categorical value
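A variation on that tip, as a rough sketch, fills each column with its own most frequent value; the small DataFrame here is hypothetical, purely for illustration:

import pandas as pd
import numpy as np
# hypothetical frame with a categorical and a numeric column containing missing values
df = pd.DataFrame({'color': ['red', 'blue', np.nan, 'red'], 'size': [1.0, np.nan, 3.0, 4.0]})
# fill each column with that column's own most frequent value (its mode)
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode()[0])
print(df)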
Thanks a lot Jason ! but I have a little question, how about if we want to replace missing values with the mean of each ROW not column ? how to do that ? if you have any clue, please tell me.. Thank you again Jason.
Why would you do this?
numpy.mean() allows you to specify the axis on which to calculate the mean. It will do it for you.
Hi Jason,
I wanted to ask you how you would deal with missing timestamps (date-time values), which are one set of predictor variables in a classification problem. Would you flag and mark them as missing or impute them as the mode of the rest of the timestamps?
Here are some ideas:
https://machinelearningmastery.com/handle-missing-timesteps-sequence-prediction-problems-python/
Hi Jason,
A big fan of yours.
I have a question about imputing missing numerical values. I don’t really want to remove them and I want to impute them to a value that is like Nan but a numerical type? Would say coding it to -1 work? (0 is already being used).
I guess I am trying to achieve the same thing as categorising an nan category variable to unknown and creating another feature column to indicate that it is missing.
Thanks,
NaN is a numerical type. It is a valid float.
You could use -999 or whatever you like.
Be careful that your model can support them, or normalize values prior to modeling.
Hello Jason,
You mentioned this here: “if you choose to impute with mean column values, these mean column values will need to be stored to file for later use on new data that has missing values.”, but I wanted to ask:
Would imputing the data before creating the training and test set (with the data set’s mean) cause data leakage? What would be the best approach to tackle missing data within the data pipeline for a machine learning project.
Let’s say I’m imputing by filling in with the mean. For the model tuning am I imputing values in the test set with the training set’s mean?
Yes. You want to calculate the value to impute from train and apply to test.
The sklearn library has an imputer you can use in a pipeline:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
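A minimal sketch of that idea, using the same (pre-0.20) Imputer class as in the post; the exact pipeline setup here is illustrative:

from pandas import read_csv
import numpy
from sklearn.preprocessing import Imputer
from sklearn.pipeline import Pipeline
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# within each cross-validation split, the imputer is fit on the training folds only,
# so the test fold is filled with means learned from the training folds (no leakage)
pipeline = Pipeline([('imputer', Imputer()), ('lda', LinearDiscriminantAnalysis())])
kfold = KFold(n_splits=3, random_state=7)
result = cross_val_score(pipeline, X, y, cv=kfold, scoring='accuracy')
print(result.mean())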
Hi Jason,
Thanks again for that huge nice article!
is there a neat way to clean away all those rows that happen to be filled with text (i.e. strings) in a certain column, i.e. List.ImportantColumn .
This destroys my plotting with “could not convert string to float”
Thanks already in advance!
Yes, you can remove or replace those values with simple NumPy array indexing.
For example, if you have ‘?’ you can do:
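A hypothetical sketch of what that might look like, assuming the '?' markers are read in as strings (the file name is made up for illustration):

from pandas import read_csv
import numpy
# hypothetical file where missing entries are marked with '?'
dataset = read_csv('data-with-question-marks.csv', header=None)
values = dataset.values
# use array indexing to replace the '?' markers with NaN, then convert to floats
values[values == '?'] = numpy.NaN
values = values.astype('float32')
print(numpy.isnan(values).sum())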
Hi Jason,
I tried using this dropna to delete the entire row that has missing values in my dataset and after which the isnull().sum() on the dataset also showed zero null values. But the problem arises when i run an algorithm and i am getting an error.
Error : Input contains NaN, infinity or a value too large for dtype(‘float64’)
This clearly shows there still exists some null values.
How do i proceed with this thanks in advance
Perhaps print the contents of the prepared data to confirm that the nans were indeed removed?
Hi Jason,
Thanks for this post, I’m using CNN for regression and after data normalization I found some NaN values on training samples. How can I use imputer to fill missing values in the data after normalization.
Does the above tutorial not help?
should I apply Imputer function for both training and testing dataset?
Yes, but if the imputer has to learn/estimate, it should be developed from the training data and applied to the train and test sets, in order to avoid data leakage.
I feel that Imputer remove the Nan values and doesn’t replace them. For example the vector features length in my case is 14 and there are 2 Nan values after applying Imputer function the vector length is 12. This means the 2 Nan values are removed. However I used the following setting:
imputer = Imputer(missing_values=np.nan, strategy='mean', axis=0)
I don’t know what is happening in your case, perhaps post/search on stackoverflow?
You mean I should fit it on training data then applied to the train and test sets as follow :
imputer = Imputer(strategy='mean', axis=0)
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)
Looks good.
Thanks for this post!!!
A dataSet having more than 4000 rows and rows can be groupby their 1st columns and let there is many columns (assume 20 columns) and few columns(let 14 columns) contains NaN(missing value).
How we populate NaN with mean of their corresponding columns by iterative method(using groupby, transform and apply) .
Sorry, I don’t understand. Perhaps you can elaborate your question?
actually i want to fill missing value in each column. Value is the mean of corresponding column. Is there any iterative method?
What do you mean by iterative method?
After replacing zeroes, can I save it as a new data set?
Yes, call to_csv() on the dataframe.
what does this mean?
It is a function, learn more here:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html
import numpy as np
import pandas as pd
mydata = pd.read_csv('diabetes.csv', header=None)
mydata.head(20)
0 1 2 3 4 5 6 7 8
0 Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
1 6 148 72 35 0 33.6 0.627 50 1
2 1 85 66 29 0 26.6 0.351 31 0
3 8 183 64 0 0 23.3 0.672 32 1
4 1 89 66 23 94 28.1 0.167 21 0
5 0 137 40 35 168 43.1 2.288 33 1
print((mydata[0] == 0).sum()) — for any column it always shows 0
0 >>>>>>>…. any thing wrong here ?
whereas i have 0’s in dataset
0 Pregnancies
1 6
2 1
3 8
4 1
5 0>>>>>>>>>
6 5
7 3
8 10
9 2
10 8
11 4
12 10
13 10
14 1
15 5
16 7
17 0 >>>>>>
Perhaps post your code and issue to stackoverflow?