How to Identify Outliers in your Data

Bojan Miletic asked a question about outlier detection in datasets when working with machine learning algorithms.

This post is in answer to his question.

If you have a question about machine learning, sign up for the newsletter and reply to an email, or use the contact form and ask. I will answer your question and may even turn it into a blog post.

Kick-start your project with my new book Data Preparation for Machine Learning, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Outliers

Many machine learning algorithms are sensitive to the range and distribution of attribute values in the input data.

Outliers in input data can skew and mislead the training process of machine learning algorithms resulting in longer training times, less accurate models and ultimately poorer results.

Outlier. Photo by Robert S. Donovan, some rights reserved.

Even before predictive models are prepared on training data, outliers can result in misleading representations and in turn misleading interpretations of collected data. Outliers can skew the summary distribution of attribute values in descriptive statistics like mean and standard deviation and in plots such as histograms and scatterplots, compressing the body of the data.
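
As a quick illustration of this effect, the short sketch below compares the mean and standard deviation of a small made-up sample with and without a single extreme value (the numbers are arbitrary):

import numpy as np

# A small made-up sample, then the same sample with one extreme value appended.
clean = np.array([9.8, 10.1, 9.9, 10.2, 10.0, 9.7, 10.3])
with_outlier = np.append(clean, 100.0)

# The single extreme value drags the mean upward and inflates the standard deviation.
print("clean:        mean=%.2f std=%.2f" % (clean.mean(), clean.std()))
print("with outlier: mean=%.2f std=%.2f" % (with_outlier.mean(), with_outlier.std()))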

Finally, outliers can represent examples of data instances that are relevant to the problem such as anomalies in the case of fraud detection and computer security.

Want to Get Started With Data Preparation?

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Outlier Modeling

Outliers are extreme values that fall a long way outside of the other observations. For example, in a normal distribution, outliers may be values on the tails of the distribution.

The process of identifying outliers has many names in data mining and machine learning, such as outlier mining, outlier modeling, novelty detection, and anomaly detection.

In his book Outlier Analysis, Aggarwal provides a useful taxonomy of outlier detection methods, as follows:

  • Extreme Value Analysis: Determine the statistical tails of the underlying distribution of the data. For example, statistical methods like the z-scores on univariate data.
  • Probabilistic and Statistical Models: Determine unlikely instances from a probabilistic model of the data. For example, Gaussian mixture models optimized using expectation-maximization (see the sketch after this list).
  • Linear Models: Projection methods that model the data in lower dimensions using linear correlations. For example, principal component analysis, where data points with large residual errors may be outliers.
  • Proximity-based Models: Data instances that are isolated from the mass of the data as determined by cluster, density or nearest neighbor analysis.
  • Information Theoretic Models: Outliers are detected as data instances that increase the complexity (minimum code length) of the dataset.
  • High-Dimensional Outlier Detection: Methods that search subspaces for outliers, given the breakdown of distance-based measures in higher dimensions (the curse of dimensionality).
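
As one concrete illustration of the probabilistic entry above, the sketch below fits scikit-learn's GaussianMixture to synthetic data and flags the lowest-likelihood instances as outlier candidates. The data, the single mixture component and the 1% cutoff are assumptions made purely for illustration:

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic two-dimensional data: a dense cluster plus a few far-away points.
rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(6, 8, size=(5, 2))])

# Fit a (single-component) Gaussian mixture with expectation-maximization.
gmm = GaussianMixture(n_components=1, random_state=1).fit(X)

# Instances that are unlikely under the model are candidate outliers.
log_likelihood = gmm.score_samples(X)
threshold = np.percentile(log_likelihood, 1)  # flag roughly the least likely 1%
outliers = X[log_likelihood < threshold]
print("flagged %d candidate outliers" % len(outliers))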

Aggarwal comments that the interpretability of an outlier model is critically important. Context or rationale is required around the decision of why a specific data instance is or is not an outlier.

In his contributing chapter to the Data Mining and Knowledge Discovery Handbook, Irad Ben-Gal proposes a taxonomy of outlier models as univariate or multivariate and as parametric or nonparametric. This is a useful way to structure methods based on what is known about the data. For example:

  • Are you concerned with outliers in one or more than one attribute (univariate or multivariate methods)?
  • Can you assume a statistical distribution from which the observations were sampled or not (parametric or nonparametric)?

Get Started

There are many methods and much research put into outlier detection. Start by making some assumptions and design experiments where you can clearly observe the effects of those assumptions against some performance or accuracy measure.

I recommend working through a stepped process: from extreme value analysis, to proximity methods, to projection methods.

Extreme Value Analysis

You do not need to know advanced statistical methods to look for, analyze and filter out outliers from your data. Start out simple with extreme value analysis.

  • Focus on univariate methods
  • Visualize the data using scatterplots, histograms and box and whisker plots and look for extreme values
  • Assume a distribution (Gaussian) and look for values more than 2 or 3 standard deviations from the mean, or more than 1.5 times the interquartile range below the first quartile or above the third quartile (see the sketch after this list)
  • Filter out outlier candidates from the training dataset and assess your model's performance
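
A minimal sketch of these steps on a single made-up attribute, applying the 3-standard-deviation and 1.5 times IQR rules of thumb mentioned above:

import numpy as np

# One univariate attribute with a couple of extreme values mixed in (made-up data).
rng = np.random.RandomState(7)
values = np.append(rng.normal(50, 5, size=500), [95.0, 2.0])

# Rule 1: flag values more than 3 standard deviations from the mean.
z_scores = (values - values.mean()) / values.std()
z_candidates = values[np.abs(z_scores) > 3]

# Rule 2: flag values more than 1.5 * IQR below the first or above the third quartile.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_candidates = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("z-score candidates:", z_candidates)
print("IQR candidates:", iqr_candidates)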

Proximity Methods

Once you have explored the simpler extreme value methods, consider moving on to proximity-based methods.

  • Use clustering methods to identify the natural clusters in the data (such as the k-means algorithm)
  • Identify and mark the cluster centroids
  • Identify data instances that are a fixed distance or percentage distance from the cluster centroids (see the sketch after this list)
  • Filter out outlier candidates from the training dataset and assess your model's performance
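
A rough sketch of this recipe using scikit-learn's k-means on synthetic two-dimensional data; the number of clusters and the 99th-percentile distance cutoff are assumptions you would tune for your own dataset:

import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: two natural clusters plus a couple of isolated points (made-up example).
rng = np.random.RandomState(3)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(10, 1, size=(100, 2)),
               [[5, 25], [-8, 14]]])

# Cluster the data and locate the centroids.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=3).fit(X)
centroids = kmeans.cluster_centers_

# Distance of each instance to its assigned centroid.
distances = np.linalg.norm(X - centroids[kmeans.labels_], axis=1)

# Flag instances beyond a chosen cutoff of the distance distribution (99th percentile here).
threshold = np.percentile(distances, 99)
outliers = X[distances > threshold]
print("flagged %d candidate outliers" % len(outliers))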

Projection Methods

Projection methods are relatively simple to apply and quickly highlight extraneous values.

  • Use projection methods to summarize your data to two dimensions (such as PCA, SOM or Sammon’s mapping)
  • Visualize the mapping and identify outliers by hand (see the sketch after this list)
  • Use proximity measures from projected values or codebook vectors to identify outliers
  • Filter out outlier candidates from the training dataset and assess your model's performance
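
A short sketch of the PCA route on synthetic data; a SOM or Sammon's mapping would slot into the same workflow with a different projection step:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Made-up ten-dimensional data with a handful of anomalous rows appended.
rng = np.random.RandomState(5)
X = np.vstack([rng.normal(0, 1, size=(300, 10)),
               rng.normal(8, 1, size=(4, 10))])

# Summarize the data in two dimensions.
projected = PCA(n_components=2).fit_transform(X)

# Visualize the mapping; points far from the main body are candidates to inspect by hand.
plt.scatter(projected[:, 0], projected[:, 1], s=10)
plt.xlabel("principal component 1")
plt.ylabel("principal component 2")
plt.show()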

Methods Robust to Outliers

An alternative strategy is to move to models that are robust to outliers. There are robust forms of regression that minimize the median squared error rather than the mean (so-called robust regression), but they are more computationally intensive. There are also methods like decision trees that are robust to outliers.
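
As a small illustration, the sketch below compares ordinary least squares with scikit-learn's HuberRegressor, one robust form of regression that down-weights large residuals, on made-up data with a few corrupted targets. It is only one example of a robust estimator, not the specific least-median approach described above:

import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor

# A simple linear relationship with a few corrupted target values (made-up data).
rng = np.random.RandomState(2)
X = np.linspace(0, 10, 100).reshape(-1, 1)
y = 3.0 * X.ravel() + rng.normal(0, 0.5, size=100)
y[-5:] -= 40.0  # inject outliers at one end of the range

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

# The least squares slope is dragged toward the outliers; the robust slope stays close to 3.
print("OLS slope:   %.2f" % ols.coef_[0])
print("Huber slope: %.2f" % huber.coef_[0])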

You could spot check some methods that are robust to outliers. If there are significant model accuracy benefits then there may be an opportunity to model and filter out outliers from your training data.

Resources

There are a lot of webpages that discuss outlier detection, but I recommend reading through a good book on the subject, something more authoritative. Even looking through introductory books on machine learning and data mining won't be that useful to you. For a classical treatment of outliers, look to books written by statisticians; for a modern treatment by the data mining community, see Aggarwal's Outlier Analysis, referenced above.

Get a Handle on Modern Data Preparation!

Data Preparation for Machine Learning

Prepare Your Machine Learning Data in Minutes

...with just a few lines of Python code

Discover how in my new Ebook:
Data Preparation for Machine Learning

It provides self-study tutorials with full working code on:
Feature Selection, RFE, Data Cleaning, Data Transforms, Scaling, Dimensionality Reduction, and much more...

Bring Modern Data Preparation Techniques to
Your Machine Learning Projects


See What's Inside

56 Responses to How to Identify Outliers in your Data

  1. Sandeep Karkhanis, February 7, 2015 at 12:44 am

    great blog, I have few of your mini guides and really love them.
    For a newbie in ML and python your books just cut the crap and help me get started…

    few questions,

    Q1
    Would you consider writing a mini-book actually showing implementation of ANY or ALL of the ways you described below?

    Q2
    imagine if you have ‘n’ numeric predictors, numeric target and each of them have Na’s / Nan’s in the range of 40-60% values…and lots of outliers
    So what approach would you take,
    2.1. Impute the Nan’s first
    2.2. then use your outlier function to remove outliers
    or the other way around?

    I tried using the scikit-learn imputer in step 2.1 above but it didn't work... any suggestions?

    http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer

    • Jason Brownlee, February 7, 2015 at 6:33 am

      Q1: Sure.
      Q2: That is not a lot of data and it may be hard to know the structure of your data. “Many” and “outliers” do not go together. But yes, your approach sounds reasonable.

      Try imputing with a mean, median or knn by hand as a starting point.
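
      A minimal sketch of median imputation with SimpleImputer, the current scikit-learn replacement for the deprecated Imputer class linked above, on a tiny made-up array:

      import numpy as np
      from sklearn.impute import SimpleImputer

      # Tiny made-up feature matrix with missing values.
      X = np.array([[1.0, 2.0],
                    [np.nan, 3.0],
                    [7.0, np.nan],
                    [4.0, 5.0]])

      # Replace missing values with the column median (strategy="mean" works the same way).
      X_imputed = SimpleImputer(strategy="median").fit_transform(X)
      print(X_imputed)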

  2. atul, August 20, 2015 at 7:34 pm

    Hi Jason,

    I have a month-wise data where same months can have multiple entries. The issue is there are outliers only in some months and not all but the data is in millions. Plus there is no way of selectively removing the outliers.
    Which approach do you suggest?

    E.g. a user born in 1984 buys 10 items of different cumulative prices in June 2015, which again get added up in the next month, say July 2015. So he will have 10 entries for June, where the most recent entry should have the maximum amount. The issue is that the data is manually entered by someone, so values are pretty random. I want to select the most logical value in a month for that subscriber. We can straight away remove the outliers to get a proper trend.

  3. Tobi Adeyemi, December 17, 2017 at 3:50 am

    Hi Jason, still waiting for the tutorial on implementation of the outlier detection methods. Furthermore, can you also consider a comprehensive discussion on anomaly detection in time series data.

    Thank you

  4. Ayush Mandloi, January 4, 2018 at 12:55 am

    I am trying to do the Enron dataset problem from Udacity. Please help me with how I should start.

  5. Toni, January 29, 2018 at 4:17 am

    Hi,
    Is output outlier detection proven to improve prediction results? Is it needed at all, or is only input outlier detection needed?

    • Jason Brownlee, January 29, 2018 at 8:17 am

      It is something you can try to see if it lifts model skill on your specific dataset.

      Some algorithms may perform better, such as linear methods.

  6. Jesús Martínez, February 11, 2018 at 3:01 am

    Thanks for the insight about outlier detection. They’re always tricky to deal with!

    Are deep learning algorithms such as Convolutional Neural Networks and Recurrent Neural Networks robust against outliers? Given that one of the biggest advantages of deep neural networks is that they perform their own feature selection under the hood, I'm curious whether they're capable of dealing with outliers on their own as well.

    Thanks in advance for your time!

    • Jason Brownlee, February 11, 2018 at 7:57 am

      Sort of. Clean data is often better if possible.

  7. ratiratana, February 15, 2018 at 11:15 am

    Thank you for the article; it helped me get clearer about the problem of how to manage outliers in a training dataset. I follow your blog on many topics. Your language is easy to read and understand. Thank you so much for your contribution.

  8. Ajay verma, March 6, 2018 at 1:15 pm

    Very nice explanation

  9. Alex, March 15, 2018 at 8:04 pm

    Thanks for such a well-documented procedure.

    Regarding the issue of outliers, from my real experience with real datasets like wind turbines, the instances identified as outliers tend to be the rows that indicate a failure. This means that if you remove them, you are removing the failure patterns (or target labeling) that you want to model.

    This is odd, since I tested removing outliers with univariate methods, PCA and a denoising autoencoder, and all of them are in fact removing a big portion of the failures, which is unwanted behaviour.

    Maybe the origin of this is that the prognosis of wind turbine failures is a very unbalanced problem (commonly around 98% to 1% failures).

    Now I'm filtering by an expert-in-the-field method, that is, ranges manually defined by the expert for each variable that exclude impossible values.

    The real SCADA data is very noisy because the technicians disconnect sensors and work on the turbine several times a year, generating many outliers. There is also some information compression and also a lot of missing data.

    Do you have a suggestion for filtering outliers in a problem like this?

    • Jason Brownlee, March 16, 2018 at 6:16 am

      Perhaps you can codify the expert method using statistics – e.g. probabilistic tolerance intervals:
      https://en.wikipedia.org/wiki/Tolerance_interval

      If this works, try to lift skill at detection using ML methods that use the simple tolerance intervals as inputs as well as other engineered features.

      Does that help?
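
      A very rough sketch of codifying per-variable ranges statistically: a mean plus or minus k standard deviations band per column, used here as a crude stand-in for a proper tolerance interval (which would also account for sample size and coverage). The column names and data are made up:

      import numpy as np
      import pandas as pd

      # Made-up stand-in for per-variable SCADA readings.
      rng = np.random.RandomState(0)
      df = pd.DataFrame({
          "wind_speed": rng.normal(8, 2, size=1000),
          "power_kw": rng.normal(1500, 300, size=1000),
      })

      # Crude interval per variable: mean +/- k standard deviations.
      k = 3.0
      lower = df.mean() - k * df.std()
      upper = df.mean() + k * df.std()

      # Rows outside the interval on any variable are flagged for review
      # (or fed as an engineered feature into a downstream model).
      outside = (df.lt(lower) | df.gt(upper)).any(axis=1)
      print("flagged %d of %d rows" % (outside.sum(), len(df)))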

  10. sangeetha sivakumar, March 22, 2018 at 6:48 pm

    How can I view the data which was removed by the outlier function?

    • Jason Brownlee, March 23, 2018 at 6:04 am

      Perhaps you could save the removed data as part of the filtering process?

  11. Mohammad, May 24, 2018 at 6:45 pm

    Thanks for your great tutorial.

    I would like to know whether these tools are applicable to image data. Assume that I have ~100k images which are used for training a Convolutional Neural Network and which I crawled myself. I am going to remove some images (outliers) which are not related to my specific task. Do these approaches work for my problem? Do you have any ideas for removing outliers in my dataset?

    Does “feature extraction using pretrained CNN + clustering” work for my problem?

    • Jason Brownlee, May 25, 2018 at 9:22 am

      Really good question.

      I’m not sure off hand. Perhaps clustering and distance from centroid would be a good start.

      Hit the literature!

  12. James, June 29, 2018 at 3:40 pm

    Hi Jason,

    thank you for sharing. I understand outliers are effectively ‘relative to’ something. I have little issue where it is relative to the global population, but how do I model anomaly detection where it is relative to an individual’s past behavior? If I were to cluster to detect anomalies, how should I cluster each individual, and optimise the right number of clusters per individual iteratively? How many models would that require?

    stumped,
    James

    • Jason Brownlee, June 30, 2018 at 6:02 am

      There is no one best way James, I’d encourage you to brainstorm a suite of approaches, test each.

      This will help you learn more about the problem and help you zoom into an approach and methods that work best for your specific case.

      Also, skim the literature for more ideas, e.g. scholar.google.com.

  13. Ram, July 7, 2018 at 11:19 pm

    Hi Jason,

    Thanks for sharing the article. I have been working on a bit different dataset which is not binary (0,1) and not continuous. In other words, my CSV file looks like this
    P1 P2 P3 P4 H
    550 200 35.5 2.5 1.6
    553 195 30.5 2.5 1.6
    552 201 35.5 2.5 -2.6
    array=dataset.values
    X = array[:,0:3]
    Y = array[:,3]
    I am trying to train on the dataset and this is the error I am facing: raise ValueError(“Unknown label type: %r” % y_type)
    ValueError: Unknown label type: ‘continuous’
    I tried to rescale the data but the problem still persists. Any help from your side will be highly appreciated.

    • Jason Brownlee, July 8, 2018 at 6:22 am

      Try removing the header line from the file? Or excluding it when loading or just after loading the data.

  14. Vignesh, November 21, 2018 at 7:18 pm

    Hi Jason,

    thanks for the nice post. I have a dataset (40K rows) which contains 4 categorical columns (more than 100 levels for two columns and around 20 levels for the other two columns) and 1 numeric column. If there were only numeric columns it would be easy to detect anomalies with these suggested methods, but with categorical variables I am confused about how to select the right approach. I have tried the Isolation Forest and Local Outlier Factor methods from scikit-learn and detected anomalies with them, but I am not sure how they decided those observations were anomalies. (Manually looking over the outlier data points, they don’t seem anomalous.) Please suggest how to solve this.

    • Jason Brownlee, November 22, 2018 at 6:22 am

      I don’t have material on this topic, I hope to cover anomaly detection in the future.

  15. Ajay, January 31, 2019 at 4:43 pm

    Sir,
    Is outlier detection a separate machine learning technique?

  16. Jaya, July 16, 2019 at 6:19 pm

    Hello Jason

    Can you tell any application of outlier ranking?

  17. Lokesh_jeykar, November 15, 2019 at 5:29 pm

    hi jason,

    I have a doubt about how to detect outliers in multivariate data with 20 features, without using PCA, and as someone who is not an expert in the domain related to the dataset.

    • Jason Brownlee, November 16, 2019 at 7:21 am

      Perhaps try some outlier detection algorithms, e.g. one-class prediction?
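
      One possible concrete form of that suggestion is scikit-learn's OneClassSVM, sketched below on synthetic 20-feature data; the nu value and the data are illustrative assumptions only:

      import numpy as np
      from sklearn.svm import OneClassSVM

      # Made-up data with 20 features; a few rows are shifted far from the rest.
      rng = np.random.RandomState(4)
      X = np.vstack([rng.normal(0, 1, size=(500, 20)),
                     rng.normal(6, 1, size=(5, 20))])

      # nu is roughly the fraction of training points allowed to fall outside the boundary.
      model = OneClassSVM(nu=0.02, gamma="scale").fit(X)
      labels = model.predict(X)  # +1 for inliers, -1 for candidate outliers
      print("flagged %d candidate outliers" % (labels == -1).sum())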

  18. Omar, January 1, 2020 at 7:01 pm

    Dear Jason,

    If I keep outliers in my data after scaling it with scikit-learn’s RobustScaler, I notice that the outliers will have values bigger than 1. Suppose that I don’t want to remove an outlier because it is an important data point.

    Is a neural network OK with some inputs occasionally having values bigger than 1?

    Thank you

  19. Abhi, January 16, 2020 at 8:40 am

    I have minute-by-minute data with the total number of users for each minute. How can I detect rate changes in real time? As of now I am doing it with z-scores and comparing against historical data, but I am getting lots of false positive alerts. I also want to implement the same for multivariate time series. My data looks like the below:

    Time No_of_users
    2020-10-11 19:01:00 176,000
    2020-10-11 19:02:00 178,252

    As of now we are doing this on just one data point but we are thinking of adding more values and correlating it.

    Time No_of_users Total_logging Total_token_request
    2020-10-11 19:01:00 176,000 5000 52000
    2020-10-11 19:02:00 178,252 5638 53949

    Any link or guidance will be helpful.

    Best

    • Jason Brownlee, January 16, 2020 at 1:32 pm

      Sorry, I don’t have examples of anomaly detection in time series. I hope to cover it in the future.

  20. gayathri, June 27, 2020 at 7:15 pm

    Hi Jason, I am sharing my view on identifying outliers. Please feel free to correct me if I am wrong anywhere and share your thoughts.

    Do we need to identify outliers for all types of questions/problems ? No

    1.Regression (how many/much) use cases – Yes
    —–Numeric input – Numeric Output -> uni-variate – Use Extreme Value Analysis (scatter plot, histogram, box plot)
    —–Numeric input – Numeric Output -> multivariate – Use PCA ??

    2.Classification use cases – No
    —–1. In the case of predicting heart disease, every patient’s case is important, so I wouldn’t work on identifying outliers. I will evaluate the accuracy of the model
    —–2.Some Algorithms itself robust to handle outlier , ex- decision tree

    3.Clustering use cases – Yes
    —–Visualize raw data – Extreme Value Analysis -Scatter plot matrix (less number of variables), heat map ?
    —–Evaluate model , visualize result and identify outliers – Proximity-based Models
    —–Cluster in high dimensions – High-Dimensional Outlier Detection

    4.Recommendation use-cases – No (algorithm should be already robust to handle outliers ?)

    5.Text Analytics , Image processing – No ?

    6. Anomaly Detection -Obvious yes, Here the problem stmt itself asks to identify anomaly /outlier

    Note: where i am not certain , i put a question mark

    • Jason Brownlee, June 28, 2020 at 5:47 am

      I think you can have outliers in all data types, and I think it is not intuitive whether they will impact model performance or not.

  21. Lucy, July 2, 2020 at 9:35 pm

    Hi Jason, thanks a lot for the article.

    If I have data with 80 features and 1.5 million values, which method (multivariate, I guess) would be suitable for detecting outliers?

    • Jason Brownlee, July 3, 2020 at 6:15 am

      I recommend testing a suite of methods and discover through careful experiment what works best for your dataset.

      • Lucy, July 3, 2020 at 4:32 pm

        There are also categorical variables in data. Should I include them in multivariate outlier detection process?

  22. Kenneth, July 4, 2020 at 8:16 am

    Outlier detection and imputation, which one should I do first?

    • Jason Brownlee, July 5, 2020 at 6:48 am

      Great question!

      Try both ways and see which results in the best performance.

      I’d probably impute then do outliers.

  23. Tanmay Deshpande, August 14, 2020 at 4:17 am

    Hey Jason!

    For a regression problem, if I have 50 input features and 1 target variable.
    So, for good regression performance,
    Q1] Should we only consider the outlier values of the target variable to be eliminated or should we eliminate the outlier values from other features as well if they are going to be used for prediction purposes ?
    Q2] Should we consider the skewness & kurtosis to be dealt with for categorical features which are encoded?

    • Jason Brownlee, August 14, 2020 at 6:12 am

      It depends on the data and chosen model. Try a suite of transforms and discover what works best on your project.

      • Tanmay Deshpande, August 16, 2020 at 3:07 am

        Ok thanks! Stay Safe!

  24. zulaa, December 13, 2020 at 4:10 pm

    Describe the detailed procedure to identify the outlying patterns?

    • Jason Brownlee, December 14, 2020 at 6:12 am

      The procedure is described in the above tutorial.

  25. Chandan, March 12, 2021 at 5:37 am

    Dear Jason,

    I need your suggestion. I am new to the ML world.

    How do I find outliers in time series data with input variable X and output variable Y?

    Suppose Say
    X, Y
    5, 10
    10, 15
    15, 20
    20, 45

    In this case, it has to detect the 4th case as the outlier.

Trackbacks/Pingbacks

  1. How to Find Outliers in Machine Learning: A Guide - Blog - July 26, 2022

    […] Reference: Machine Learning Mastery […]
