Classification is a predictive modeling problem that involves predicting a class label for a given example.
It is generally assumed that the distribution of examples in the training dataset is even across all of the classes. In practice, this is rarely the case.
Classification predictive modeling problems where the distribution of examples across class labels is not equal (i.e. is skewed) are called “imbalanced classification.”
Typically, a slight imbalance is not a problem and standard machine learning techniques can be used. Where the imbalance is severe, such as a 1:100, 1:1000, or higher ratio of the minority to the majority class, specialized techniques are required.
Specialized techniques are required for severely imbalanced classification problems because most machine learning models used for classification were designed and tested around the assumption that the class distribution is equal. As such, they often perform poorly or give misleading results.
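To make the "misleading results" point concrete, here is a minimal sketch using only the Python standard library. The labels are hypothetical (a made-up dataset with a 1:100 ratio), and the "model" is simply a rule that always predicts the majority class; it is an illustration, not a recommended baseline.

```python
from collections import Counter

# Hypothetical labels with a 1:100 minority-to-majority ratio
y_true = [0] * 10000 + [1] * 100

# A naive "model" that always predicts the most common class
majority = Counter(y_true).most_common(1)[0][0]
y_pred = [majority] * len(y_true)

# Accuracy: fraction of all examples predicted correctly
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: fraction of minority examples that were found
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(t == 1 for t in y_true)

print(f"accuracy: {accuracy:.3f}")
print(f"minority recall: {recall:.3f}")
```

The accuracy looks excellent (about 99 percent) even though not a single minority example is detected, which is why accuracy alone can mislead on a skewed dataset.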
In this tutorial, you will discover the best resources that you can use to get started with imbalanced classification.
After completing this tutorial, you will know:
- The best books on the topic of machine learning for imbalanced classification.
- The best survey papers that introduce the topic of class imbalance.
- The best Python libraries that you can use to develop solutions for your imbalanced dataset.
Kick-start your project with my new book Imbalanced Classification with Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Updated Jan/2021: Updated links for API documentation.
Tutorial Overview
This tutorial is divided into three parts; they are:
- Books on Imbalanced Classification
- Survey Papers on Imbalanced Classification
- Python Libraries for Imbalanced Classification
Books on Imbalanced Classification
Addressing imbalanced classification predictive modeling problems with machine learning is a relatively new area of study.
Nevertheless, given the pervasiveness of imbalanced classification datasets, a few books and book chapters are available on the topic.
In this section, we will take a closer look at the following books on imbalanced classification for machine learning:
- Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
- Learning from Imbalanced Data Sets, 2018.
I will also include the following book that features a dedicated chapter on the topic:
- Applied Predictive Modeling, 2013.
There are two other books I found that are related, but perhaps more tangentially, and I won’t cover them in more detail; they were:
- Statistical Methods for Imbalanced Data in Ecological and Biological Studies, 2019.
- Dealing with Imbalanced and Weakly Labelled Data in Machine Learning using Fuzzy and Rough Set Methods, 2018.
Let’s take a closer look at the books.
Imbalanced Learning: Foundations, Algorithms, and Applications
This book is a collection of papers that form chapters, edited by two academics that have written a lot on the topic: Haibo He and Yunqian Ma.
The book was published in 2013.
The book is designed to bring a postgraduate student or academic up to speed with the field of imbalanced learning. This is a more general field than imbalanced classification, as it includes other problem types where the training dataset may be imbalanced, such as regression and clustering.
Specifically, we define imbalanced learning as the learning process for data representation and information extraction with severe data distribution skews to develop effective decision boundaries to support the decision-making process. The learning process could involve supervised learning, unsupervised learning, semi-supervised learning, or a combination of two or all of them. The task of imbalanced learning could also be applied to regression, classification, or clustering tasks.
— Pages 1-2, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.
It provides an excellent starting point for a practitioner to get an overview of the field and the techniques.
The table of contents for this book is listed below.
- 1. Introduction
- 2. Foundations of Imbalanced Learning
- 3. Imbalanced Datasets: From Sampling to Classifiers
- 4. Ensemble Methods for Class Imbalance Learning
- 5. Class Imbalance Learning Methods for Support Vector Machines
- 6. Class Imbalance and Active Learning
- 7. Nonstationary Stream Data Learning with Imbalanced Class Distribution
- 8. Assessment Metrics for Imbalanced Learning
Learn more about the book here.
Learning from Imbalanced Data Sets
This book is also a collection of papers on the topic of machine learning for imbalanced datasets, although it feels more cohesive than the previous book “Imbalanced Learning.”
The book was written or edited by a long list of academics (Alberto Fernández, Salvador García, Mikel Galar, Ronaldo Prati, Bartosz Krawczyk, and Francisco Herrera) and was published in 2018.
Similar to the previous book, this book is designed to bring postgraduate students and engineers up to speed with the field of machine learning for imbalanced datasets.
The intended audience of this book are developers and engineers aiming to apply imbalance-learning techniques to solve different kinds of real-world problems, as well as researchers and students needing a comprehensive review on techniques, methodologies, and tools for learning from imbalanced data.
— Page viii, Learning from Imbalanced Data Sets, 2018.
The book reads as more systematic (e.g. working through a project end-to-end) and more practical than the previous book, which reads as more academic (focused on pet methods or subfields). I would recommend buying both together if you have the budget.
The table of contents for this book is listed below.
- 1. Introduction to KDD and Data Science
- 2. Foundations on Imbalanced Classification
- 3. Performance Measures
- 4. Cost-Sensitive Learning
- 5. Data Level Preprocessing Methods
- 6. Algorithm-Level Approaches
- 7. Ensemble Learning
- 8. Imbalanced Classification with Multiple Classes
- 9. Dimensionality Reduction for Imbalanced Learning
- 10. Data Intrinsic Characteristics
- 11. Learning from Imbalanced Data Streams
- 12. Non-classical Imbalanced Classification Problems
- 13. Imbalanced Classification for Big Data
- 14. Software and Libraries for Imbalanced Classification
Learn more about the book here.
Applied Predictive Modeling
This is one of my favorite handbooks for applied machine learning, written by Max Kuhn and Kjell Johnson and focused on R.
The book was published in 2013, but the general advice is probably timeless.
- Applied Predictive Modeling, 2013.
Although the whole book is a great read, it has one chapter dedicated to the problem of imbalanced classification.
- Chapter 16: Remedies for Severe Class Imbalance
The approach to the chapter is a case study on a “Caravan Policy Ownership” dataset. The authors work through this problem to demonstrate a suite of different practical techniques for handling a severe class imbalance.
This chapter is required reading for a practical demonstration of how to work through a real-world imbalanced dataset using modern methods.
The sections of this chapter are as follows:
- 16.1 Case Study: Predicting Caravan Policy Ownership
- 16.2 The Effect of Class Imbalance
- 16.3 Model Tuning
- 16.4 Alternate Cutoffs
- 16.5 Adjusting Prior Probabilities
- 16.6 Unequal Case Weights
- 16.7 Sampling Methods
- 16.8 Cost-Sensitive Training
- 16.9 Computing
Learn more about the book here.
Survey Papers on Imbalanced Classification
There are thousands of publications on machine learning methods for imbalanced classification and related problems and techniques.
Instead of enumerating the best papers in the field, in this section, we will take a look at some of the best survey papers.
A survey paper is a paper that gives a broad overview of the field and position of the techniques in the field and how they might relate to each other. They are designed to help newcomers to the field, such as postgraduate students and engineers, get up-to-speed rapidly.
As a practitioner, reading a survey paper may be more efficient than skimming books on the topic.
There are many great survey papers to choose from; my recommended favorites are as follows:
- Learning From Imbalanced Data: Open Challenges And Future Directions, Bartosz Krawczyk, 2016.
- A Survey of Predictive Modelling under Imbalanced Distributions, Paula Branco, Luis Torgo, and Rita Ribeiro, 2015.
- Classification Of Imbalanced Data: A Review, Yanmin Sun, Andrew Wong, Mohamed Kamel, 2009.
- Learning from Imbalanced Data, Haibo He and Edwardo Garcia, 2009.
I also recommend empirical study papers, that is, papers that compare one or more standard techniques on a suite of standard machine learning datasets. In this case, the techniques are designed to address the imbalanced class distribution and the standard datasets have a skewed class distribution.
These papers quickly flush out what methods work (or are popular) and what datasets are useful as benchmarks.
Some examples of good papers of this type include:
- An Insight Into Classification With Imbalanced Data: Empirical Results And Current Trends On Using Data Intrinsic Characteristics, 2013.
- A Study Of The Behavior Of Several Methods For Balancing Machine Learning Training Data, 2004.
Python Libraries for Imbalanced Classification
Python has rapidly become the preferred programming language for applied machine learning.
Scikit-Learn Library
The go-to library for machine learning in Python is scikit-learn, which provides data preparation, machine learning algorithms, and model evaluation schemes, among other techniques.
Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language.
— Scikit-learn: Machine Learning in Python, 2011.
Although not designed around the problem of imbalanced classification, the scikit-learn library does provide some tools for handling imbalanced datasets, such as:
- Support for a range of metrics, e.g. ROC AUC and precision/recall, F1, Brier Score and more.
- Support for class weighting, e.g. Decision Trees, SVM and more.
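As a rough sketch of how these two features fit together, the example below fits a class-weighted logistic regression and reports several of the metrics listed above. The synthetic dataset, its roughly 1:100 class ratio, the choice of LogisticRegression, and all parameter values are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    brier_score_loss,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

# Synthetic binary dataset with roughly a 1:100 class ratio (assumed for illustration)
X, y = make_classification(
    n_samples=10000, n_features=10, weights=[0.99], flip_y=0, random_state=1
)

# Stratify so both splits keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1
)

# class_weight="balanced" weights errors inversely to class frequency
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)               # crisp class labels
y_prob = model.predict_proba(X_test)[:, 1]   # probability of the minority class

scores = {
    "roc_auc": roc_auc_score(y_test, y_prob),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "brier": brier_score_loss(y_test, y_prob),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Note that class weighting shifts the predicted probabilities toward the minority class, so the Brier score is included here only to show the call, not as a fair measure of calibration for this model.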
Imbalanced-Learn Library
A project related to scikit-learn dedicated to the problem of imbalanced classification is called imbalanced-learn.
It provides techniques that can be used for imbalanced classification in conjunction with the scikit-learn library, allowing learning algorithms and model evaluation techniques to be shared between the libraries.
imbalanced-learn is an open-source python toolbox aiming at providing a wide range of methods to cope with the problem of imbalanced dataset frequently encountered in machine learning and pattern recognition.
— Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning, 2016.
The library focuses on providing oversampling and undersampling techniques to make the class distribution more equal in a training dataset prior to fitting a given machine learning model.
For more on imbalanced-learn, see:
- imbalanced-learn Project, GitHub.
- imbalanced-learn Documentation.
- Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning, 2016.
Summary
In this tutorial, you discovered the best resources that you can use to get started with imbalanced classification.
Specifically, you learned:
- The best books on the topic of machine learning for imbalanced classification.
- The best survey papers that introduce the topic of class imbalance.
- The best Python libraries that you can use to develop solutions for your imbalanced dataset.
Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.
It would be nice to see an example with code, since this is a problem we come across every day.
Thanks.
Yes, I have many examples with code scheduled.
Hi
Just out of my curiosity, have you already read all these books and papers?
Hahha, yes.
I read them in preparation for my next book on imbalanced classification scheduled for release in Jan 2020.
Thanks for sharing this related work on class-imbalance learning. We are looking forward to your new book on how to deal with this problem using a new strategy.
You’re welcome.
Thanks for sharing, this is what I’m looking for, can’t wait for your forthcoming tutorial
Thanks!
Do you have any tutorial for a fuzzy logic classifier?
Not at this stage.
Thanks for sharing, could you recommend a public dataset related to telco retention/propensity to play with all these techniques?
This will help you locate a dataset:
https://machinelearningmastery.com/faq/single-faq/where-can-i-get-a-dataset-on-___