4 Automatic Outlier Detection Algorithms in Python

By Jason Brownlee on August 17, 2020 in Data Preparation

The presence of outliers in a classification or regression dataset can result in a poor fit and lower predictive modeling performance.

Identifying and removing outliers is challenging with simple statistical methods for most machine learning datasets given the large number of input variables. Instead, automatic outlier detection methods can be used in the modeling pipeline and compared, just like other data preparation transforms that may be applied to the dataset.

In this tutorial, you will discover how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

After completing this tutorial, you will know:

Automatic outlier detection models provide an alternative to statistical techniques for datasets with a large number of input variables that have complex and unknown inter-relationships.
How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.

Let’s get started.

Model-Based Outlier Detection and Removal in Python
Photo by Zoltán Vörös, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Outlier Detection and Removal
Dataset and Performance Baseline
1. House Price Regression Dataset
2. Baseline Model Performance
Automatic Outlier Detection
1. Isolation Forest
2. Minimum Covariance Determinant
3. Local Outlier Factor
4. One-Class SVM

Outlier Detection and Removal

Outliers are observations in a dataset that don’t fit in some way.

Perhaps the most common or familiar type of outlier is an observation that is far from the rest of the observations or from the center of mass of the observations.

This is easy to understand when we have one or two variables and we can visualize the data as a histogram or scatter plot, although it becomes very challenging when we have many input variables defining a high-dimensional input feature space.

In this case, simple statistical methods for identifying outliers can break down, such as methods that use standard deviations or the interquartile range.
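For a single variable, the interquartile range rule mentioned above can be sketched as follows. The data is synthetic, with three extreme values planted by hand, and the 1.5 multiplier is the conventional choice; both are assumptions made for illustration and are not part of the dataset used later in this tutorial.

```python
# univariate outlier detection with the interquartile range (IQR) rule
import numpy as np

# synthetic data: 1,000 Gaussian values plus three planted extreme values (an assumption)
rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(loc=50.0, scale=5.0, size=1000), [5.0, 120.0, 150.0]])

# calculate the quartiles and the interquartile range
q25, q75 = np.percentile(data, [25, 75])
iqr = q75 - q25

# values more than 1.5 * IQR outside the quartiles are flagged as outliers
lower, upper = q25 - 1.5 * iqr, q75 + 1.5 * iqr
is_outlier = (data < lower) | (data > upper)
print('Bounds: %.3f to %.3f' % (lower, upper))
print('Outliers flagged: %d of %d' % (is_outlier.sum(), data.shape[0]))
```

The rule looks at one column at a time, which is why it struggles when an example is only unusual in the combination of many input variables.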

It can be important to identify and remove outliers from data when training machine learning algorithms for predictive modeling.

Outliers can skew statistical measures and data distributions, providing a misleading representation of the underlying data and relationships. Removing outliers from training data prior to modeling can result in a better fit of the data and, in turn, more skillful predictions.

Thankfully, there are a variety of automatic model-based methods for identifying outliers in input data. Importantly, each method approaches the definition of an outlier in slightly different ways, providing alternate approaches to preparing a training dataset that can be evaluated and compared, just like any other data preparation step in a modeling pipeline.

Before we dive into automatic outlier detection methods, let’s first select a standard machine learning dataset that we can use as the basis for our investigation.

Dataset and Performance Baseline

In this section, we will first select a standard machine learning dataset and establish a baseline in performance on this dataset.

This will provide the context for exploring the outlier identification and removal method of data preparation in the next section.

House Price Regression Dataset

We will use the house price regression dataset.

This dataset has 13 input variables that describe the properties of the house and suburb and requires the prediction of the median value of houses in the suburb in thousands of dollars.

You can learn more about the dataset here:

No need to download the dataset as we will download it automatically as part of our worked examples.

Open the dataset and review the raw data. The first few rows of data are listed below.

We can see that it is a regression predictive modeling problem with numerical input variables, each of which has different scales.

0.00632,18.00,2.310,0,0.5380,6.5750,65.20,4.0900,1,296.0,15.30,396.90,4.98,24.00
0.02731,0.00,7.070,0,0.4690,6.4210,78.90,4.9671,2,242.0,17.80,396.90,9.14,21.60
0.02729,0.00,7.070,0,0.4690,7.1850,61.10,4.9671,2,242.0,17.80,392.83,4.03,34.70
0.03237,0.00,2.180,0,0.4580,6.9980,45.80,6.0622,3,222.0,18.70,394.63,2.94,33.40
0.06905,0.00,2.180,0,0.4580,7.1470,54.20,6.0622,3,222.0,18.70,396.90,5.33,36.20
...

The dataset has many numerical input variables that have unknown and complex relationships. We don’t know that outliers exist in this dataset, although we may guess that some outliers may be present.

The example below loads the dataset and splits it into the input and output columns, splits it into train and test datasets, then summarizes the shapes of the data arrays.

# load and summarize the dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# summarize the shape of the dataset
print(X.shape, y.shape)
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the train and test sets
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example, we can see that the dataset was loaded correctly and that there are 506 rows of data with 13 input variables and a single target variable.

The dataset is split into train and test sets with 339 rows used for model training and 167 for model evaluation.

(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)

Next, let’s evaluate a model on this dataset and establish a baseline in performance.

Baseline Model Performance

It is a regression predictive modeling problem, meaning that we will be predicting a numeric value. All input variables are also numeric.

In this case, we will fit a linear regression algorithm and evaluate model performance by training the model on the training dataset, making predictions on the test dataset, and evaluating the predictions using the mean absolute error (MAE).
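As a quick illustration of the metric, the MAE is the mean of the absolute differences between the expected and predicted values. The sketch below calculates it by hand and with scikit-learn; the four expected and predicted values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# calculate mean absolute error by hand and with scikit-learn
import numpy as np
from sklearn.metrics import mean_absolute_error

# hypothetical expected and predicted values (not taken from the housing dataset)
y_true = np.array([24.0, 21.6, 34.7, 33.4])
y_pred = np.array([25.0, 20.6, 30.7, 35.4])

# absolute errors are 1, 1, 4 and 2, so the mean is 2
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)
print('Manual MAE: %.3f, scikit-learn MAE: %.3f' % (mae_manual, mae_sklearn))
```

Because the errors are not squared, a single large error does not dominate the score, and the result is in the same units as the target variable.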

The complete example of evaluating a linear regression model on the dataset is listed below.

# evaluate model on the raw dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the model achieved a MAE of about 3.417. This provides a baseline in performance to which we can compare different outlier identification and removal procedures.

MAE: 3.417

Next, we can try removing outliers from the training dataset.

Automatic Outlier Detection

The scikit-learn library provides a number of built-in automatic methods for identifying outliers in data.

In this section, we will review four methods and compare their performance on the house price dataset.

Each method will be defined, then fit on the training dataset. The fit model will then predict which examples in the training dataset are outliers and which are not (so-called inliers). The outliers will then be removed from the training dataset, then the model will be fit on the remaining examples and evaluated on the entire test dataset.

It would be invalid to fit the outlier detection method on the entire dataset (the training and test sets combined), as this would result in data leakage. That is, the outlier detection model would have access to data (or information about the data) in the test set, which is not used to train the predictive model. This may result in an optimistic estimate of model performance.
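The same rule applies when using cross-validation instead of a single train/test split: the outlier detector should be fit on the training folds only. The sketch below runs the loop manually. It uses synthetic data from make_regression() in place of the housing dataset, and the choice of isolation forest with a contamination of 0.1 and five folds is an assumption for illustration.

```python
# manual cross-validation with outlier removal applied to the training folds only
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# synthetic regression data standing in for the housing dataset (an assumption)
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=1)

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    X_train, X_test = X[train_ix], X[test_ix]
    y_train, y_test = y[train_ix], y[test_ix]
    # fit the outlier detector on the training fold only
    iso = IsolationForest(contamination=0.1, random_state=1)
    mask = iso.fit_predict(X_train) != -1
    # fit the model on the inliers and evaluate on the untouched test fold
    model = LinearRegression()
    model.fit(X_train[mask], y_train[mask])
    yhat = model.predict(X_test)
    scores.append(mean_absolute_error(y_test, yhat))
print('MAE: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
```

The test fold is never shown to the detector and no rows are removed from it, so each fold's score reflects performance on unfiltered data.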

We could attempt to detect outliers on “new data” such as the test set prior to making a prediction, but then what do we do if outliers are detected?

One approach might be to return a “None” indicating that the model is unable to make a prediction on those outlier cases. This might be an interesting extension to explore that may be appropriate for your project.
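A minimal sketch of that idea is listed below. The predict_or_none() function is a hypothetical helper written for this example, not part of scikit-learn, and the data is synthetic. The detector fit on the training dataset is reused to screen new rows before a prediction is returned.

```python
# return None instead of a prediction for new rows flagged as outliers
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression

# synthetic regression data standing in for the housing dataset (an assumption)
X, y = make_regression(n_samples=500, n_features=13, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit the detector on the training dataset and the model on the training inliers
iso = IsolationForest(contamination=0.1, random_state=1)
mask = iso.fit_predict(X_train) != -1
model = LinearRegression()
model.fit(X_train[mask], y_train[mask])

# hypothetical helper: a prediction for inliers, None for outliers
def predict_or_none(detector, model, X_new):
    flags = detector.predict(X_new)
    yhat = model.predict(X_new)
    return [None if flag == -1 else float(value) for flag, value in zip(flags, yhat)]

# screen and predict the new data
predictions = predict_or_none(iso, model, X_test)
n_skipped = sum(p is None for p in predictions)
print('Predictions: %d, skipped as outliers: %d' % (len(predictions), n_skipped))
```

Any performance estimate made this way should also report how many rows were skipped, as a model that declines to predict on hard cases will look better than one that does not.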

Isolation Forest

Isolation Forest, or iForest for short, is a tree-based anomaly detection algorithm.

It is based on modeling the normal data in such a way as to isolate anomalies that are both few in number and different in the feature space.

… our proposed method takes advantage of two anomalies’ quantitative properties: i) they are the minority consisting of fewer instances and ii) they have attribute-values that are very different from those of normal instances.

— Isolation Forest, 2008.

The scikit-learn library provides an implementation of Isolation Forest in the IsolationForest class.

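Perhaps the most important hyperparameter in the model is the “contamination” argument, which is used to help estimate the number of outliers in the dataset. It can be set to a value greater than 0.0 and up to 0.5. It defaulted to 0.1 in older versions of scikit-learn and defaults to 'auto' since version 0.22, so we set it to 0.1 explicitly.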

...
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
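To get a feel for the “contamination” argument, the sketch below counts how many rows are flagged for a few different values. It uses synthetic data with the same number of rows and columns as our training dataset, and the specific values tried are an assumption for illustration.

```python
# count the outliers flagged by isolation forest for different contamination values
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest

# synthetic data standing in for the training dataset (an assumption)
X_train, _ = make_regression(n_samples=339, n_features=13, random_state=1)

counts = {}
for c in [0.01, 0.05, 0.1, 0.2]:
    # the same seed is used each time so only the threshold changes
    iso = IsolationForest(contamination=c, random_state=1)
    yhat = iso.fit_predict(X_train)
    counts[c] = int((yhat == -1).sum())
    print('contamination=%.2f, outliers=%d' % (c, counts[c]))
```

The argument sets the threshold on the anomaly scores, so the number of rows flagged is roughly the contamination value multiplied by the number of rows, whether or not those rows are true outliers.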

Once identified, we can remove the outliers from the training dataset.

...
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]

Tying this together, the complete example of evaluating the linear model on the housing dataset with outliers identified and removed with isolation forest is listed below.

# evaluate model performance with outliers removed using isolation forest
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import IsolationForest
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
iso = IsolationForest(contamination=0.1)
yhat = iso.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

In this case, we can see that the model identified and removed 34 outliers and achieved a MAE of about 3.189, an improvement over the baseline that achieved a score of about 3.417.

(339, 13) (339,)
(305, 13) (305,)
MAE: 3.189
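Isolation forest is a stochastic algorithm, so the rows removed and the resulting MAE can change from run to run. One way to get a more stable estimate is to repeat the procedure with different seeds and average the scores. The sketch below does this on synthetic data, which stands in for the housing dataset; the use of ten repeats is an assumption for illustration.

```python
# repeat isolation forest outlier removal with different seeds and average the MAE
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# synthetic regression data standing in for the housing dataset (an assumption)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

scores = []
for seed in range(10):
    # a different seed gives a different forest and possibly different outliers
    iso = IsolationForest(contamination=0.1, random_state=seed)
    mask = iso.fit_predict(X_train) != -1
    # fit on the inliers and evaluate on the entire test dataset
    model = LinearRegression()
    model.fit(X_train[mask], y_train[mask])
    scores.append(mean_absolute_error(y_test, model.predict(X_test)))
print('Mean MAE: %.3f, Std: %.3f' % (np.mean(scores), np.std(scores)))
```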

Minimum Covariance Determinant

If the input variables have a Gaussian distribution, then simple statistical methods can be used to detect outliers.

For example, if the dataset has two input variables and both are Gaussian, then the feature space forms a multi-dimensional Gaussian and knowledge of this distribution can be used to identify values far from the distribution.

This approach can be generalized by defining a hypersphere (ellipsoid) that covers the normal data, and data that falls outside this shape is considered an outlier. An efficient implementation of this technique for multivariate data is known as the Minimum Covariance Determinant, or MCD for short.

The Minimum Covariance Determinant (MCD) method is a highly robust estimator of multivariate location and scatter, for which a fast algorithm is available. […] It also serves as a convenient and efficient tool for outlier detection.

— Minimum Covariance Determinant and Extensions, 2017.

The scikit-learn library provides access to this method via the EllipticEnvelope class.

It provides the “contamination” argument that defines the expected ratio of outliers to be observed in practice. In this case, we will set it to a value of 0.01, found with a little trial and error.

...
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
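Under the hood, the fit model measures how far each row is from a robust estimate of the center of the data using the Mahalanobis distance. The sketch below inspects these distances via the mahalanobis() function on the fit EllipticEnvelope. The data is a synthetic Gaussian cloud with five outliers planted by hand, and the contamination of 0.02 is chosen to suit that synthetic data; both are assumptions for illustration.

```python
# inspect the robust Mahalanobis distances behind the elliptic envelope
import numpy as np
from sklearn.covariance import EllipticEnvelope

# synthetic Gaussian data with five planted outliers (an assumption)
rng = np.random.default_rng(1)
X_inliers = rng.normal(size=(300, 3))
X_outliers = rng.normal(loc=8.0, size=(5, 3))
X_train = np.vstack([X_inliers, X_outliers])

# fit the minimum covariance determinant model and label each row
ee = EllipticEnvelope(contamination=0.02, random_state=1)
yhat = ee.fit_predict(X_train)

# squared Mahalanobis distance of each row from the robust center
dist = ee.mahalanobis(X_train)
print('Outliers flagged: %d' % (yhat == -1).sum())
print('Mean squared distance of inliers: %.3f' % dist[yhat == 1].mean())
print('Mean squared distance of outliers: %.3f' % dist[yhat == -1].mean())
```

Rows with the largest distances fall outside the fitted ellipsoid and are labeled -1.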

Once identified, the outliers can be removed from the training dataset as we did in the prior example.

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the elliptical envelope (minimum covariance determinant) method is listed below.

# evaluate model performance with outliers removed using elliptical envelope
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.covariance import EllipticEnvelope
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
ee = EllipticEnvelope(contamination=0.01)
yhat = ee.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

In this case, we can see that the elliptical envelope method identified and removed only 4 outliers, resulting in a drop in MAE from 3.417 with the baseline to 3.388.

(339, 13) (339,)
(335, 13) (335,)
MAE: 3.388

Local Outlier Factor

A simple approach to identifying outliers is to locate those examples that are far from the other examples in the feature space.

This can work well for feature spaces with low dimensionality (few features), although it can become less reliable as the number of features is increased, referred to as the curse of dimensionality.

The local outlier factor, or LOF for short, is a technique that attempts to harness the idea of nearest neighbors for outlier detection. Each example is assigned a score indicating how isolated it is, or how likely it is to be an outlier, based on the density of its local neighborhood. Those examples with the largest scores are more likely to be outliers.

We introduce a local outlier factor (LOF) for each object in the dataset, indicating its degree of outlier-ness.

— LOF: Identifying Density-based Local Outliers, 2000.

The scikit-learn library provides an implementation of this approach in the LocalOutlierFactor class.

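The model provides the “contamination” argument, which is the expected proportion of outliers in the dataset. It defaulted to 0.1 in older versions of scikit-learn and defaults to 'auto' since version 0.22. Pass contamination=0.1 explicitly to flag about 10 percent of the training examples regardless of the version.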

...
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
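The scores themselves are available on the fit model via the negative_outlier_factor_ attribute. The sketch below inspects them on synthetic data made up of one dense cluster and three isolated points planted by hand. The data and the contamination of 0.05 are assumptions for illustration.

```python
# inspect local outlier factor scores
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# synthetic data: a dense cluster plus three isolated points (an assumption)
rng = np.random.default_rng(1)
X_cluster = rng.normal(size=(200, 2))
X_isolated = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])
X_train = np.vstack([X_cluster, X_isolated])

# fit the model and label each row as an inlier (1) or outlier (-1)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
yhat = lof.fit_predict(X_train)

# scores are negated, so the most negative values are the strongest outliers
scores = lof.negative_outlier_factor_
worst = np.argsort(scores)[:3]
print('Outliers flagged: %d' % (yhat == -1).sum())
print('Most isolated rows:', sorted(worst.tolist()))
```

Note that scikit-learn stores the negative of the LOF score, so inliers have values close to -1 and outliers have large negative values.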

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the local outlier factor method is listed below.

# evaluate model performance with outliers removed using local outlier factor
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
lof = LocalOutlierFactor()
yhat = lof.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

In this case, we can see that the local outlier factor method identified and removed 34 outliers, the same number as isolation forest, resulting in a drop in MAE from 3.417 with the baseline to 3.356. Better, but not as good as isolation forest, suggesting a different set of outliers were identified and removed.

(339, 13) (339,)
(305, 13) (305,)
MAE: 3.356
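We can check whether two methods flag the same rows by comparing their masks directly. The sketch below does this for isolation forest and local outlier factor with the same contamination of 0.1. It uses synthetic data in place of the housing training dataset, so the counts will not match the results above.

```python
# compare which training rows isolation forest and local outlier factor flag
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# synthetic data standing in for the training dataset (an assumption)
X_train, _ = make_regression(n_samples=339, n_features=13, random_state=1)

# flag outliers with each method using the same contamination
iso_flags = IsolationForest(contamination=0.1, random_state=1).fit_predict(X_train) == -1
lof_flags = LocalOutlierFactor(contamination=0.1).fit_predict(X_train) == -1

# count the rows flagged by each method and by both
both = np.logical_and(iso_flags, lof_flags).sum()
print('Isolation forest: %d, LOF: %d, both: %d' % (iso_flags.sum(), lof_flags.sum(), both))
```

If the number of rows flagged by both methods is much smaller than the number flagged by either one, the two methods disagree on what counts as an outlier for that dataset.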

One-Class SVM

The support vector machine, or SVM, algorithm developed initially for binary classification can be used for one-class classification.

When modeling one class, the algorithm captures the density of the majority class and classifies examples on the extremes of the density function as outliers. This modification of SVM is referred to as One-Class SVM.

… an algorithm that computes a binary function that is supposed to capture regions in input space where the probability density lives (its support), that is, a function such that most of the data will live in the region where the function is nonzero.

— Estimating the Support of a High-Dimensional Distribution, 2001.

Although SVM is a classification algorithm and One-Class SVM is also a classification algorithm, it can be used to discover outliers in input data for both regression and classification datasets.

The scikit-learn library provides an implementation of one-class SVM in the OneClassSVM class.

The class provides the “nu” argument, which sets an approximate upper bound on the fraction of training examples treated as outliers and defaults to 0.5. In this case, we will set it to 0.01, found with a little trial and error.

...
# identify outliers in the training dataset
ee = OneClassSVM(nu=0.01)
yhat = ee.fit_predict(X_train)
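The effect of the “nu” argument can be explored by counting the rows flagged for a few different values. The sketch below does this on synthetic data with the same shape as our training dataset; the data and the values tried are assumptions for illustration. The counts are approximate, as the solver does not enforce the bound exactly.

```python
# count the outliers flagged by one-class SVM for different values of nu
from sklearn.datasets import make_regression
from sklearn.svm import OneClassSVM

# synthetic data standing in for the training dataset (an assumption)
X_train, _ = make_regression(n_samples=339, n_features=13, random_state=1)

counts = {}
for nu in [0.01, 0.05, 0.1, 0.2]:
    # fit the model and label each row as an inlier (1) or outlier (-1)
    svm = OneClassSVM(nu=nu)
    yhat = svm.fit_predict(X_train)
    counts[nu] = int((yhat == -1).sum())
    print('nu=%.2f, outliers=%d' % (nu, counts[nu]))
```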

Tying this together, the complete example of identifying and removing outliers from the housing dataset using the one class SVM method is listed below.

# evaluate model performance with outliers removed using one class SVM
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import OneClassSVM
from sklearn.metrics import mean_absolute_error
# load the dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
df = read_csv(url, header=None)
# retrieve the array
data = df.values
# split into input and output elements
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize the shape of the training dataset
print(X_train.shape, y_train.shape)
# identify outliers in the training dataset
ee = OneClassSVM(nu=0.01)
yhat = ee.fit_predict(X_train)
# select all rows that are not outliers
mask = yhat != -1
X_train, y_train = X_train[mask, :], y_train[mask]
# summarize the shape of the updated training dataset
print(X_train.shape, y_train.shape)
# fit the model
model = LinearRegression()
model.fit(X_train, y_train)
# evaluate the model
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example fits and evaluates the model, then reports the MAE.

In this case, we can see that only three outliers were identified and removed and the model achieved a MAE of about 3.431, which is not better than the baseline model that achieved 3.417. Perhaps better performance can be achieved with more tuning.

(339, 13) (339,)
(336, 13) (336,)
MAE: 3.431
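One way to approach that tuning without touching the test set is to hold back part of the training dataset as a validation set and choose the value of “nu” with the lowest validation MAE. The sketch below shows the idea on synthetic data. The grid of values and the 70/30 split of the training dataset are assumptions for illustration, not a recommended configuration.

```python
# tune the one-class SVM nu argument using a validation split of the training data
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.svm import OneClassSVM
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# synthetic regression data standing in for the housing dataset (an assumption)
X, y = make_regression(n_samples=506, n_features=13, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# hold back part of the training data for choosing nu, leaving the test set untouched
X_fit, X_val, y_fit, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=1)

results = {}
for nu in [0.01, 0.02, 0.05, 0.1, 0.2]:
    # remove outliers from the fitting portion only
    mask = OneClassSVM(nu=nu).fit_predict(X_fit) != -1
    model = LinearRegression()
    model.fit(X_fit[mask], y_fit[mask])
    # score on the validation portion with no rows removed
    results[nu] = mean_absolute_error(y_val, model.predict(X_val))
    print('nu=%.2f, validation MAE: %.3f' % (nu, results[nu]))
best_nu = min(results, key=results.get)
print('Best nu: %.2f' % best_nu)
```

Once a value is chosen, the detector and model would be refit on the entire training dataset and evaluated once on the test set.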

Summary

In this tutorial, you discovered how to use automatic outlier detection and removal to improve machine learning predictive modeling performance.

Specifically, you learned:

Automatic outlier detection models provide an alternative to statistical techniques for datasets with a large number of input variables that have complex and unknown inter-relationships.
How to correctly apply automatic outlier detection and removal to the training dataset only to avoid data leakage.
How to evaluate and compare predictive modeling pipelines with outliers removed from the training dataset.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

70 Responses to 4 Automatic Outlier Detection Algorithms in Python

Joseph July 8, 2020 at 7:00 pm #

Hi Jason, thanks for one more great article!
My question is about outliers in tree based algorithms (RF, XGboost). Does it really change model outcomes in real life to delete outliers in this case? Findings change over time, that’s why I’ve this question.

- Jason Brownlee July 9, 2020 at 6:39 am #
  
  I think trees are pretty robust to outliers. Test for your dataset.
  
JParzival July 9, 2020 at 10:42 am #

Great article!
Thank you for sharing your experience!

- Jason Brownlee July 9, 2020 at 1:19 pm #
  
  You’re welcome.
  
Nagdev. A July 10, 2020 at 9:39 am #

Two more to the list autoencoders and PCA

- Jason Brownlee July 10, 2020 at 1:47 pm #
  
  For outlier detection? How so?
  
  - Ali November 19, 2020 at 4:20 pm #
    
    Actually, autoencoders can provide best performance for anomaly detection problems followed by PCA.
    
    - Jason Brownlee November 20, 2020 at 6:43 am #
      
      Depends on the specific dataset.
      
    - Ashima Chawla February 22, 2021 at 3:28 am #
      
      absolutely true! I have tried the same and it works pretty well with Autoencoders.
      
  - Ali November 19, 2020 at 4:27 pm #
    
    Both Autoencoder and PCA are dimensionality reduction techniques. Interestingly, during the process of dimensionality reduction outliers are identified.
    
Vishal July 10, 2020 at 10:18 pm #

Hello sir,
It was a great article. Just one doubt:
MCD technique doesn’t perform well when the data has very large dimensions like >1000. In that case, it is a good option to feed the model with principal components of the data. The paper that you mentioned in the link says:

“For large p we can still make a rough estimate of the scatter as follows. First compute the first q < p robust principal components of the data. For this we can use the MCD-based ROBPCA method [53], which requires that the number of components q be set rather low.”

ROBPCA is not available in Python. Can you please tell me what can be done in this case?

Thank you

Reply
- Jason Brownlee July 11, 2020 at 6:13 am #
  
  Great tip, thanks.
  
  Perhaps find a different platform that implements the method?
  Perhaps implement it yourself?
  Perhaps use a different method entirely?
  
  Reply
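
One way to act on "use a different method entirely" is to reduce the dimensionality with ordinary PCA and then fit the MCD-based detector on the components. This is only a sketch: ordinary PCA is not ROBPCA and is not robust, so outliers can influence the components themselves. The synthetic data, the 5 components, and the contamination value are assumptions made for illustration.

```python
# Sketch: ordinary (non-robust) PCA followed by an MCD-based detector.
# Assumption: this swaps plain PCA in for ROBPCA, which is a different,
# robust method; the data and parameter values are illustrative.
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 50))  # stand-in for a wide dataset

# reduce to a few components, then fit the elliptic envelope on them
detector = make_pipeline(
    PCA(n_components=5),
    EllipticEnvelope(contamination=0.05, random_state=1),
)
yhat = detector.fit_predict(X_train)

# keep the rows that were not flagged as outliers
mask = yhat != -1
print(X_train[mask].shape)
```
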
fabou July 10, 2020 at 11:08 pm #

Hi Jason,

as usual great educational article.

How could automatic outlier detection be integrated into a cross-validation loop? Does it have to be part of a pipeline whose steps would be: outlier detection > outlier removal (transformer) > modeling?
In this case, should a specific “outlier remover” transformer be created?

Thanks

Reply
- Jason Brownlee July 11, 2020 at 6:16 am #
  
  You would have to run the CV loop manually and apply the method to the data prior to fitting/evaluating a model or pipeline.
  
  It’s disappointing that sklearn does not support methods in pipelines that add/remove rows. imbalanced-learn can do this kind of thing…
  
  Reply
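
The manual cross-validation loop described in the reply above could look something like the sketch below. The detector is fit on each training fold only and the test fold is left untouched. The synthetic dataset, the linear regression model, and the contamination value are stand-ins chosen for illustration, not the tutorial's exact setup.

```python
# Sketch: manual k-fold CV with outlier removal on the training fold only.
# Assumptions: synthetic data and contamination=0.1 are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

scores = []
kfold = KFold(n_splits=5, shuffle=True, random_state=1)
for train_ix, test_ix in kfold.split(X):
    X_train, y_train = X[train_ix], y[train_ix]
    X_test, y_test = X[test_ix], y[test_ix]
    # fit the detector on the training fold only and keep the inliers
    iso = IsolationForest(contamination=0.1, random_state=1)
    mask = iso.fit_predict(X_train) != -1
    X_train, y_train = X_train[mask], y_train[mask]
    # fit on the filtered training fold, evaluate on the untouched test fold
    model = LinearRegression().fit(X_train, y_train)
    scores.append(mean_absolute_error(y_test, model.predict(X_test)))

print('MAE: %.3f (%.3f)' % (np.mean(scores), np.std(scores)))
```
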
Chayma July 15, 2020 at 1:16 am #

Thank you for the great article.

Which algorithm is the most suitable for outlier detection in time series data?

Reply
- Jason Brownlee July 15, 2020 at 8:27 am #
  
  I don’t know off hand, I hope to write about that topic in the future.
  
  Reply
  - Ray February 21, 2021 at 12:07 pm #
    
    Greetings,
    Thanks for the awesome article. Did you ever get round to doing an article for time-series anomaly detection? I’m learning and would be keen to know your thoughts on it 🙂
    
    Reply
    - Jason Brownlee February 21, 2021 at 12:41 pm #
      
      Not yet, it’s on the TODO.
      
      Reply
Nagesh August 27, 2020 at 12:33 pm #

Thoughts on this one? https://github.com/arundo/adtk

Reply
- Jason Brownlee August 27, 2020 at 1:36 pm #
  
  I’m not familiar with it.
  
  Reply
Allen21 September 1, 2020 at 5:54 am #

If anyone is getting a TypeError with X_train[mask, :], just change it to X_train[mask]. Another great article BTW

Reply
- Jason Brownlee September 1, 2020 at 6:38 am #
  
  Sorry to hear that.
  
  Perhaps these tips will help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
  Reply
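
A likely cause of the error mentioned above is that `X_train` is a pandas DataFrame rather than a NumPy array, although that is an assumption since the comment does not say. The sketch below, using a small made-up DataFrame, shows three ways to select the inlier rows in that case.

```python
# Sketch: selecting rows with a boolean mask when X_train is a DataFrame.
# Assumption: the small DataFrame and mask are made up for illustration.
import numpy as np
import pandas as pd

X_train = pd.DataFrame({'a': [1.0, 2.0, 100.0], 'b': [3.0, 4.0, 5.0]})
mask = np.array([True, True, False])

# NumPy-style X_train[mask, :] is not supported on a DataFrame;
# these alternatives select the same rows
rows_df = X_train[mask]                 # boolean row selection
rows_loc = X_train.loc[mask, :]         # explicit .loc equivalent
rows_np = X_train.to_numpy()[mask, :]   # convert first, then index as in the tutorial

print(rows_df)
```
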
Vidya Manu Shankar September 10, 2020 at 1:51 pm #

Hi Jason .

Thanks for this post.
Couple of questions though:
1. How do we validate the output of the outlier detection algorithms mentioned in this post , whether the marked records are really the outliers ? Through boxplots ?
2. Also , why don’t we include the target variable as input to the outlier algorithms ? I missed this point ….

Reply
- Jason Brownlee September 11, 2020 at 5:48 am #
  
  Thanks.
  
  Good question, you can validate the model by either evaluating predictions on dataset with known outliers or inspecting identified outliers and using a subject matter expert to determine if they are true outliers or not.
  
  The algorithms are one-class algorithms, no target variable is required.
  
  Reply
Vidya September 12, 2020 at 2:38 am #

Awesome , thank you !

Reply
- Jason Brownlee September 12, 2020 at 6:18 am #
  
  You’re welcome.
  
  Reply
arnan maipradit September 13, 2020 at 10:29 am #

Do you have any examples of outlier detection using Q-learning? I have found that Q-learning is mostly used in cases with many actions (a robot moving up, down, left, or right has 4 actions), but outlier detection has only 2 actions (normal behavior and outlier), which makes me wonder whether Q-learning can be used for outlier detection (anomaly detection) or not. If you could make an example or suggest anything, it would be appreciated.

Reply
- Jason Brownlee September 14, 2020 at 6:44 am #
  
  Sorry, I do not have any examples of RL at this stage.
  
  Thanks for the suggestion.
  
  Reply
Arushi Mahajan October 14, 2020 at 1:45 pm #

Hi, amazing tutorial.
Just one question. How can you see all the rows that were dropped?

Reply
- Jason Brownlee October 14, 2020 at 1:51 pm #
  
  What do you mean by “dropped rows”?
  
  Reply
Marlon Lohrbach October 17, 2020 at 3:57 am #

Hey Jason,

I’ve read about hyperparameter tuning of Isolation Forests etc. When none of the models, or removing the detected outliers, really adds value or improves my baseline model’s scores, do you think it makes sense to invest time in hyperparameter tuning of these anomaly detection models?

Plus: From my point of view those outliers seem to be legit to me…

Cheers again,

Marlon

Reply
- Jason Brownlee October 17, 2020 at 6:13 am #
  
  If it is not improving performance, no.
  
  Reply
Mohammad Ali Shahlaei November 30, 2020 at 6:38 am #

I think he meant that the rows were identified as outliers (dropped rows)!

Reply
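
If the question is how to see the rows that were identified as outliers, inverting the inlier mask is one option. The sketch below uses synthetic data with three rows deliberately shifted away from the rest; the data and the contamination value are assumptions for illustration.

```python
# Sketch: inspect the rows flagged as outliers by inverting the mask.
# Assumptions: synthetic data and contamination=0.05 are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 3))
X_train[:3] += 10.0  # shift three rows far from the rest

iso = IsolationForest(contamination=0.05, random_state=1)
yhat = iso.fit_predict(X_train)

mask = yhat != -1                    # True for rows to keep
dropped_ix = np.where(~mask)[0]      # positions of the flagged rows
X_dropped = X_train[~mask]           # the flagged rows themselves
X_kept = X_train[mask]

print('dropped row indexes:', dropped_ix)
print(X_dropped)
```
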
Gabriel December 7, 2020 at 4:06 am #

Thanks for such a great article. I have a question: why don’t we apply the outlier detection algorithm to the whole dataset rather than only the training dataset? Won’t outliers in the test dataset affect the accuracy of the model?

Reply
- Jason Brownlee December 7, 2020 at 6:22 am #
  
  We don’t; the example applies the automatic methods to the training dataset only, which avoids data leakage from the test set.
  
  Reply
Mitra December 20, 2020 at 1:42 am #

Hi sir! Thank you for the amazing content. Just wanted to point out one thing: in the scikit-learn documentation for Isolation Forest, I read that the default value for contamination is no longer 0.1; it has changed to auto. You can correct that part 🙂

Reply
- Jason Brownlee December 20, 2020 at 5:58 am #
  
  You’re welcome.
  
  Thanks for the suggestion.
  
  Reply
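
Because the default for `contamination` can differ between scikit-learn versions, as pointed out above, one hedge is to set it explicitly rather than rely on the default. A short sketch, with 0.1 used only as an example value:

```python
# Sketch: check the default contamination and set it explicitly.
# Assumption: 0.1 is an example value, not a recommendation.
from sklearn.ensemble import IsolationForest

default_model = IsolationForest()
# expected to be 'auto' on recent scikit-learn releases
print(default_model.contamination)

# setting the value explicitly keeps behavior the same across versions
explicit_model = IsolationForest(contamination=0.1)
print(explicit_model.contamination)
```
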
Mitra December 20, 2020 at 4:49 am #

One quick note! In the Minimum Covariance Determinant method, you said we can use this method when our features are Gaussian or Gaussian-like. In the dataset you’re using, the features don’t have such a shape; most of them are skewed. I think we should first apply a transformation (log, Box-Cox, etc.) and then use this method on features with little or no skewness. I’m actually writing a Kaggle kernel on this and would love to hear what you think about it when it’s done!

Reply
- Jason Brownlee December 20, 2020 at 5:59 am #
  
  Generally, I’d recommend evaluating the approach with and without the data prep and use the approach that results in the best performance.
  
  Reply
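
Evaluating the approach with and without the data preparation could be sketched as follows. The lognormal stand-in features, the Yeo-Johnson power transform, and the contamination value are assumptions for illustration; on a real project each variant would be judged by the performance of the downstream model.

```python
# Sketch: MCD-based detection with and without a power transform.
# Assumptions: lognormal stand-in data, Yeo-Johnson transform, and
# contamination=0.01 are illustrative choices.
import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(1)
X_train = rng.lognormal(size=(300, 4))  # right-skewed features

# variant 1: detector fit on the raw, skewed features
ee_raw = EllipticEnvelope(contamination=0.01, random_state=1)
mask_raw = ee_raw.fit_predict(X_train) != -1

# variant 2: make the features more Gaussian-like first
X_trans = PowerTransformer(method='yeo-johnson').fit_transform(X_train)
ee_trans = EllipticEnvelope(contamination=0.01, random_state=1)
mask_trans = ee_trans.fit_predict(X_trans) != -1

print('flagged on raw features:', np.sum(~mask_raw))
print('flagged on transformed features:', np.sum(~mask_trans))
print('flagged by both:', np.sum(~mask_raw & ~mask_trans))
```

Either mask can then be applied to the original training rows before fitting the predictive model.
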
Divya Jami December 29, 2020 at 2:18 am #

Amazing tutorial Sir!
Question- Should we always drop the rows containing outliers? Will outlier imputation work better in some cases?

Reply
- Jason Brownlee December 29, 2020 at 5:15 am #
  
  Thanks.
  
  It is a decision you must make on your prediction project.
  
  Reply
Uzma Naqvi January 25, 2021 at 8:53 pm #

Jason, your effort is appreciated. Can you please tell me whether I can apply outlier detection methods to text data used to classify sentiment? If yes, then how?

Reply
- Jason Brownlee January 26, 2021 at 5:53 am #
  
  Not sure what an “outlier” is when it comes to text, perhaps a rare word – in that case reduce the vocab to the most common words only.
  
  Reply
Sudhir January 28, 2021 at 3:18 am #

One question in IsolationForest:
I did not understand the line22 code: mask = yhat != -1
can you please elaborate.
-Beginner in data science

Reply
- Jason Brownlee January 28, 2021 at 6:07 am #
  
  The mask is an array of true/false values: true where the value in yhat is not -1 (an inlier) and false where it is -1 (an outlier).
  
  Reply
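
A tiny worked example of the mask may help. The `yhat`, `X`, and `y` arrays below are made up for illustration; in the tutorial `yhat` comes from the detector's `fit_predict()` call, where -1 marks an outlier and 1 marks an inlier.

```python
# Sketch: how mask = yhat != -1 selects rows.
# Assumption: yhat, X, and y are small made-up arrays.
import numpy as np

yhat = np.array([1, -1, 1, 1, -1])   # -1 marks an outlier, 1 an inlier
mask = yhat != -1                    # element-wise comparison
print(mask)                          # [ True False  True  True False]

X = np.arange(10).reshape(5, 2)      # 5 rows, 2 columns
y = np.array([10, 20, 30, 40, 50])

# the same mask is applied to inputs and targets so they stay aligned
X_kept, y_kept = X[mask, :], y[mask]
print(X_kept)                        # rows 0, 2 and 3
print(y_kept)                        # [10 30 40]
```
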
Hugo Souza January 31, 2021 at 1:07 pm #

Hi Jason!

First, congrats and thanks for this interesting work!

Just one question: Is it possible to get the accuracy of LOF? More specifically, is it possible with the sklearn LOF library?

Reply
- Jason Brownlee February 1, 2021 at 6:22 am #
  
  Thanks!
  
  Yes, there is an LOF example in the above tutorial that you can use to get started.
  
  Reply
Jhonny February 25, 2021 at 11:40 pm #

Thanks for sharing

Reply
- Jason Brownlee February 26, 2021 at 4:59 am #
  
  You’re welcome!
  
  Reply
Majid April 23, 2021 at 5:20 am #

Hi Jason,

Could you please clarify which scaling (e.g. MinMax) would be more suitable for numerical time series data for anomaly detection with binary classification? Most of the features are close in value (having small differences from each other), and some small abnormal changes make an anomaly in the system.

Thank you so much for your attention and participation.

Reply
- Jason Brownlee April 24, 2021 at 5:11 am #
  
  I recommend testing different methods and use the scaling that results in the best performance of your model on your dataset.
  
  Reply
Johnny May 2, 2021 at 1:17 am #

Hi Jason!

Do you think I should remove outliers before or after transforming the data? Or should I remove outliers before or after dimensionality reduction and feature selection? I know that outliers should be removed after transformation if you have non-parametric data, but does it matter if I use non-parametric outlier detection algorithms?

Reply
- Jason Brownlee May 2, 2021 at 5:33 am #
  
  Probably before, but it depends on the data and the transforms. Test to discover what works best for you!
  
  Reply
JG July 4, 2021 at 10:37 pm #

Hi Jason,

Great and useful tutorial about how to get rid of data outliers.

I share my experiment performing two additional, consecutive changes to your core code:

1º) I apply each outlier method to the whole dataset (X_train + X_test), not only to X_train.
I got a better MAE result of about 2.9 instead of 3.19 using IsolationForest().

2º) Following this implementation, I apply column transformation through the sklearn API ColumnTransformer().
In particular, I apply “MinMaxScaler()” to all X inputs and “StandardScaler” to the Y output (Boston house pricing).
I got an excellent MAE equal to 0.3 for the LOF method (after eliminating only 23 outlier rows, even fewer than other methods such as IsolationForest()).

Thank you very much… you really boost our ML/DL skills! Thank you for your awesome tutorials!

I am very grateful to you.

regards,
JG

Reply
- Jason Brownlee July 5, 2021 at 5:08 am #
  
  Nice work!
  
  Reply
George July 23, 2021 at 7:34 am #

Dear, I can see you are only removing rows from the training dataset (X_train) without the labels (y_train). How are you supposed to remove the same elements from the target variable?

Reply
- Jason Brownlee July 24, 2021 at 5:09 am #
  
  Sorry, I don’t understand your question, perhaps you can rephrase or elaborate.
  
  Outlier removal is about removing rows from the dataset based on the input features.
  
  Reply
- Guest August 2, 2021 at 10:01 pm #
  
  I think he’s asking about how to remove the same rows of training on target. Just look for ‘mask’ in his code and that line marks the position (index is the technical term) of outliers.
  
  Reply
  - Jason Brownlee August 3, 2021 at 4:52 am #
    
    Thanks.
    
    Reply
LL November 8, 2021 at 8:23 pm #

Hi Jason,

Great article, I learned a lot! I have one question. Is it necessary to put these types of outlier methods in a scikit-learn pipeline?

Reply
- Adrian Tam November 14, 2021 at 12:24 pm #
  
  Probably not necessary but you may consider that too. Especially if you think it helps or you have any reason to do that (e.g., in a production system and you don’t want to break a model when the input is erroneous) .
  
  Reply
Manuel M. June 20, 2022 at 10:57 pm #

Good afternoon,

All this content is great! Thank you!

Wondering if you have any suggestions for feature selection when building an outlier detection model?

Thanks

Reply
- James Carmichael June 21, 2022 at 9:33 am #
  
  Hi Manuel…You may find the following resources of interest:
  
  https://ieeexplore.ieee.org/document/7837865
  
  https://ojs.aaai.org/index.php/AAAI/article/view/5755/5611
  
  Reply
Sajad July 2, 2022 at 12:48 am #

Hello Jason,

Amazing article,

I just don’t get how these methods can detect outliers.

Reply
- James Carmichael July 2, 2022 at 9:09 am #
  
  Hi Sajad…the following resources may be of interest to you:
  
  https://machinelearningmastery.com/how-to-use-statistics-to-identify-outliers-in-data/
  
  https://machinelearningmastery.com/anomaly-detection-with-isolation-forest-and-kernel-density-estimation/
  
  Reply
Merve October 21, 2022 at 6:22 am #

Hi Jason,
Thanks for the great article.
I have a question about the Isolation Forest, with this algorithm we can detect outliers based on features, and then we can delete those records, but I wonder how can we replace these records with median/mean instead of deleting them.

Thanks

Reply
- James Carmichael October 21, 2022 at 7:41 am #
  
  Hi Merve…You may find the following of interest:
  
  https://stackoverflow.com/questions/45386955/python-replacing-outliers-values-with-median-values
  
  Reply
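
One possible way to replace rather than delete is sketched below. Isolation Forest flags whole rows, not individual cells, so this crude version overwrites every feature of a flagged row with the column medians computed from the inlier rows. The synthetic data and the contamination value are assumptions for illustration, and whether replacement helps should be tested on your own dataset.

```python
# Sketch: replace flagged rows with inlier column medians instead of deleting.
# Assumptions: synthetic data and contamination=0.05 are illustrative; a
# whole flagged row is replaced because the detector works at row level.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
X_train = rng.normal(size=(100, 3))
X_train[:3] += 10.0  # shift three rows far from the rest

iso = IsolationForest(contamination=0.05, random_state=1)
mask = iso.fit_predict(X_train) != -1   # True for inliers

# column medians computed from the inlier rows only
medians = np.median(X_train[mask], axis=0)

# copy the data and overwrite the flagged rows
X_imputed = X_train.copy()
X_imputed[~mask] = medians

print(X_imputed.shape)
```
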
Prem December 13, 2022 at 9:49 am #

Hi,

May I please know

when to use, Gini Index, KS Statistic and p-value in https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/

Thank you very much

Reply
Ma January 10, 2023 at 1:20 pm #

I appreciate your efforts and your sharing of knowledge sir.
I just have a small, simple query: which generally works better, the automatic outlier detection techniques or the manual technique (IQR)?

Reply
- James Carmichael January 11, 2023 at 8:03 am #
  
  Hi Ma…You are very welcome! Thank you for your feedback and support!
  
  It is difficult in general to predict which method is “best”. We recommend that you investigate both options and compare the performance.
  
  Let us know what you find!
  
  Reply
