8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset

Has this happened to you?

You are working on your dataset. You create a classification model and get 90% accuracy immediately. “Fantastic” you think. You dive a little deeper and discover that 90% of the data belongs to one class. Damn!

This is an example of an imbalanced dataset and the frustrating results it can cause.

In this post you will discover the tactics that you can use to deliver great results on machine learning datasets with imbalanced data.

Class imbalance: find some balance in your machine learning (photo by MichaEli, some rights reserved).

Coming To Grips With Imbalanced Data

I get emails about class imbalance all the time, for example:

I have a binary classification problem and one class is present with a 60:1 ratio in my training set. I used logistic regression and the result seems to just ignore one class.

And this:

I am working on a classification model. In my dataset I have three different labels to be classified, let them be A, B and C. But in the training dataset I have A with 70% of the volume, B with 25% and C with 5%. Most of the time my results are overfit to A. Can you please suggest how I can solve this problem?

I write long lists of techniques to try and think about the best ways to get past this problem. I finally took the advice of one of my students:

Perhaps one of your upcoming blog posts could address the problem of training a model to perform against highly imbalanced data, and outline some techniques and expectations.

Frustration!

Imbalanced data can cause you a lot of frustration.

You feel very frustrated when you discover that your data has imbalanced classes and that all of the great results you thought you were getting turn out to be a lie.

The next wave of frustration hits when the books, articles and blog posts don’t seem to give you good advice about handling the imbalance in your data.

Relax, there are many options and we’re going to go through them all. It is possible: you can build predictive models for imbalanced data.

What is Imbalanced Data?

Imbalanced data typically refers to classification problems where the classes are not represented equally.

For example, you may have a 2-class (binary) classification problem with 100 instances (rows). A total of 80 instances are labeled with Class-1 and the remaining 20 instances are labeled with Class-2.

This is an imbalanced dataset and the ratio of Class-1 to Class-2 instances is 80:20 or more concisely 4:1.

You can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Most techniques can be used on either.

The remainder of this discussion will assume a two-class classification problem because it is easier to think about and describe.

Imbalance is Common

Most classification datasets do not have an exactly equal number of instances in each class, but a small difference often does not matter.

There are problems where a class imbalance is not just common, it is expected. For example, datasets that characterize fraudulent transactions are imbalanced. The vast majority of the transactions will be in the “Not-Fraud” class and a very small minority will be in the “Fraud” class.

Another example is customer churn datasets, where the vast majority of customers stay with the service (the “No-Churn” class) and a small minority cancel their subscription (the “Churn” class).

When there is a modest class imbalance like 4:1 in the example above, it can cause problems.

Accuracy Paradox

The accuracy paradox is the name for the exact situation in the introduction to this post.

It is the case where your accuracy measures tell the story that you have excellent accuracy (such as 90%), but the accuracy is only reflecting the underlying class distribution.

It is very common, because classification accuracy is often the first measure we use when evaluating models on our classification problems.

Put it All On Red!

What is going on in our models when we train on an imbalanced dataset?

As you might have guessed, the reason we get 90% accuracy on an imbalanced dataset (with 90% of the instances in Class-1) is because our models look at the data and cleverly decide that the best thing to do is to always predict “Class-1” and achieve high accuracy.

This is best seen when using a simple rule-based algorithm. If you print out the rules in the final model, you will see that it is very likely predicting one class regardless of the data it is asked to predict.
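
As a quick sketch of this behavior, scikit-learn’s DummyClassifier reproduces exactly this “always predict the majority class” strategy; the data below is made up for illustration:

```python
# A toy illustration (made-up data): a model that always predicts the
# majority class scores 90% accuracy on a 90:10 dataset without learning
# anything at all.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

X = np.random.rand(100, 5)         # features are ignored by this model
y = np.array([0] * 90 + [1] * 10)  # 90% Class-1 (0), 10% Class-2 (1)

model = DummyClassifier(strategy="most_frequent")
model.fit(X, y)
print(accuracy_score(y, model.predict(X)))  # 0.9
```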

8 Tactics To Combat Imbalanced Training Data

We now understand what class imbalance is and why it provides misleading classification accuracy.

So what are our options?

1) Can You Collect More Data?

You might think it’s silly, but collecting more data is almost always overlooked.

Can you collect more data? Take a second and think about whether you are able to gather more data on your problem.

A larger dataset might expose a different and perhaps more balanced perspective on the classes.

More examples of the minority class may be useful later when we look at resampling your dataset.

2) Try Changing Your Performance Metric

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.

There are metrics that have been designed to tell you a more truthful story when working with imbalanced classes.

I give more advice on selecting different performance measures in my post “Classification Accuracy is Not Enough: More Performance Measures You Can Use“.

In that post I look at an imbalanced dataset that characterizes the recurrence of breast cancer in patients.

From that post, I recommend looking at the following performance measures that can give more insight into the accuracy of the model than traditional classification accuracy:

  • Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
  • Precision: A measure of a classifier’s exactness.
  • Recall: A measure of a classifier’s completeness.
  • F1 Score (or F-score): The harmonic mean of precision and recall.

I would also advise you to take a look at the following:

  • Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
  • ROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity, and models can be chosen based on how they trade off these values at different decision thresholds.

You can learn a lot more about using ROC Curves to compare classification accuracy in our post “Assessing and Comparing Classifier Performance with ROC Curves“.

Still not sure? Start with kappa; it will give you a better idea of what is going on than classification accuracy.
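
All of these measures are available in scikit-learn; here is a minimal sketch with made-up predictions for illustration:

```python
# A minimal sketch (made-up predictions) of the measures above.
from sklearn.metrics import (confusion_matrix, precision_score, recall_score,
                             f1_score, cohen_kappa_score, roc_auc_score)

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # imbalanced ground truth
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # hard class predictions
y_prob = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.45, 0.9]  # predicted P(class=1)

print(confusion_matrix(y_true, y_pred))   # correct predictions on the diagonal
print(precision_score(y_true, y_pred))    # exactness of the positive predictions
print(recall_score(y_true, y_pred))       # completeness on the positive class
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(cohen_kappa_score(y_true, y_pred))  # accuracy normalized by class imbalance
print(roc_auc_score(y_true, y_prob))      # area under the ROC curve
```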

3) Try Resampling Your Dataset

You can change the dataset that you use to build your predictive model to have more balanced data.

This change is called sampling your dataset, and there are two main methods that you can use to even up the classes:

  1. You can add copies of instances from the under-represented class, called over-sampling (or more formally, sampling with replacement), or
  2. You can delete instances from the over-represented class, called under-sampling.

These approaches are often very easy to implement and fast to run. They are an excellent starting point.

In fact, I would advise you to always try both approaches on all of your imbalanced datasets, just to see if it gives you a boost in your preferred accuracy measures.

You can learn a little more in the Wikipedia article titled “Oversampling and undersampling in data analysis“.

Some Rules of Thumb

  • Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more).
  • Consider testing over-sampling when you don’t have a lot of data (tens of thousands of records or less).
  • Consider testing random and non-random (e.g. stratified) sampling schemes.
  • Consider testing different resampled ratios (e.g. you don’t have to target a 1:1 ratio in a binary classification problem, try other ratios).
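
Here is a minimal sketch of both methods using scikit-learn’s resample utility, with synthetic data standing in for your own:

```python
# A minimal sketch of random over- and under-sampling with scikit-learn's
# resample utility; the data here is synthetic and stands in for your own.
import numpy as np
from sklearn.utils import resample

X = np.random.rand(100, 3)
y = np.array([0] * 80 + [1] * 20)  # 4:1 imbalance
X_min, X_maj = X[y == 1], X[y == 0]

# Over-sampling: copy minority instances with replacement up to the majority size
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=42)

# Under-sampling: drop majority instances down to the minority size
X_maj_down = resample(X_maj, replace=False, n_samples=len(X_min), random_state=42)

# Recombine into a balanced dataset (over-sampled version shown)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))
```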

4) Try Generating Synthetic Samples

A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.

You could sample them empirically within your dataset or you could use a method like Naive Bayes that can sample each attribute independently when run in reverse. You will have more and different data, but the non-linear relationships between the attributes may not be preserved.

There are systematic algorithms that you can use to generate synthetic samples. The most popular of such algorithms is called SMOTE or the Synthetic Minority Over-sampling Technique.

As its name suggests, SMOTE is an over-sampling method. It works by creating synthetic samples from the minority class instead of creating copies. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighboring instances.

To learn more about SMOTE, see the original 2002 paper titled “SMOTE: Synthetic Minority Over-sampling Technique“.

There are a number of implementations of the SMOTE algorithm, for example:

  • In Python, take a look at the “UnbalancedDataset” module. It provides a number of implementations of SMOTE as well as various other resampling techniques that you could try.
  • In R, the DMwR package provides an implementation of SMOTE.
  • In Weka, you can use the SMOTE supervised filter.
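
Here is a minimal sketch of applying SMOTE in Python. It assumes a recent version of the imbalanced-learn package, which to my knowledge grew out of the UnbalancedDataset module listed above, and uses synthetic data for illustration:

```python
# A minimal sketch of SMOTE, assuming a recent version of the
# imbalanced-learn package (the successor to UnbalancedDataset) is installed.
import numpy as np
from imblearn.over_sampling import SMOTE

X = np.random.rand(100, 3)
y = np.array([0] * 80 + [1] * 20)  # 4:1 imbalance

# Each synthetic instance is interpolated between a minority instance
# and one of its nearest minority-class neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(np.bincount(y_res))  # e.g. [80 80] -- the classes are now balanced
```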

5) Try Different Algorithms

As always, I strongly advise you to not use your favorite algorithm on every problem. You should at least be spot-checking a variety of different types of algorithms on a given problem.

For more on spot-checking algorithms, see my post “Why you should be Spot-Checking Algorithms on your Machine Learning Problems”.

That being said, decision trees often perform well on imbalanced datasets. The splitting rules that look at the class variable in the creation of the trees can force both classes to be addressed.

If in doubt, try a few popular decision tree algorithms like C4.5, C5.0, CART, and Random Forest.

For some example R code using decision trees, see my post titled “Non-Linear Classification in R with Decision Trees“.

For an example of using CART in Python and scikit-learn, see my post titled “Get Your Hands Dirty With Scikit-Learn Now“.
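
As a minimal sketch, here is how you might spot-check two tree-based algorithms with scikit-learn on synthetic imbalanced data, scoring with F1 rather than accuracy:

```python
# A minimal sketch of spot-checking tree-based algorithms on synthetic
# imbalanced data, scored with F1 rather than raw accuracy.
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # scikit-learn's CART
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.rand(200, 5)
y = np.array([0] * 160 + [1] * 40)  # 4:1 imbalance

for name, model in [("CART", DecisionTreeClassifier()),
                    ("Random Forest", RandomForestClassifier())]:
    scores = cross_val_score(model, X, y, cv=5, scoring="f1")
    print(name, scores.mean())
```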

6) Try Penalized Models

You can use the same algorithms but give them a different perspective on the problem.

Penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class.

Often the handling of class penalties or weights is specialized to the learning algorithm. There are penalized versions of algorithms such as penalized-SVM and penalized-LDA.

It is also possible to have generic frameworks for penalized models. For example, Weka has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for misclassification.

Using penalization is desirable if you are locked into a specific algorithm and are unable to resample, or if you’re getting poor results. It provides yet another way to “balance” the classes. Setting up the penalty matrix can be complex. You will very likely have to try a variety of penalty schemes and see what works best for your problem.
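
As a minimal sketch, many scikit-learn classifiers expose a class_weight argument that implements this idea; the data and weights below are made up for illustration:

```python
# A minimal sketch (synthetic data) of a penalized model via scikit-learn's
# class_weight argument, which scales the misclassification cost per class.
import numpy as np
from sklearn.svm import SVC

X = np.random.rand(100, 3)
y = np.array([0] * 80 + [1] * 20)  # 4:1 imbalance

# "balanced" weights each class inversely proportional to its frequency;
# an explicit dict such as {0: 1, 1: 4} expresses a custom penalty scheme.
model = SVC(class_weight="balanced")
model.fit(X, y)
```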

7) Try a Different Perspective

There are fields of study dedicated to imbalanced datasets. They have their own algorithms, measures and terminology.

Taking a look and thinking about your problem from these perspectives can sometimes shake loose some ideas.

Two you might like to consider are anomaly detection and change detection.

Anomaly detection is the detection of rare events. This might be a machine malfunction indicated through its vibrations or malicious activity by a program indicated by its sequence of system calls. The events are rare when compared to normal operation.

This shift in thinking considers the minority class as the outlier class, which might help you think of new ways to separate and classify samples.

Change detection is similar to anomaly detection except rather than looking for an anomaly it is looking for a change or difference. This might be a change in behavior of a user as observed by usage patterns or bank transactions.

Both of these shifts take a more real-time stance to the classification problem that might give you some new ways of thinking about your problem and maybe some more techniques to try.
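
As a minimal sketch of this framing, scikit-learn’s OneClassSVM can be fit on the majority class only and then asked to flag rare events as outliers (synthetic data for illustration):

```python
# A minimal sketch of the anomaly-detection framing: fit a one-class model
# on the majority ("normal") class only and flag outliers as the rare class.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_normal = rng.normal(0, 1, size=(200, 2))  # majority class only
X_rare = rng.normal(4, 1, size=(10, 2))     # rare events, unseen during fitting

model = OneClassSVM(nu=0.05).fit(X_normal)
print(model.predict(X_rare))  # -1 marks predicted outliers, +1 inliers
```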

8) Try Getting Creative

Really climb inside your problem and think about how to break it down into smaller problems that are more tractable.

For inspiration, take a look at the very creative answers on Quora in response to the question “In classification, how do you handle an unbalanced training set?“

For example:

Decompose your larger class into smaller number of other classes…

…use a One Class Classifier… (e.g. treat like outlier detection)

…resampling the unbalanced training set into not one balanced set, but several. Running an ensemble of classifiers on these sets could produce a much better result than one classifier alone

These are just a few of the interesting and creative ideas you could try.
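
As a minimal sketch of that last idea, with synthetic data and an arbitrary choice of ensemble size and base learner:

```python
# A minimal sketch (synthetic data, arbitrary ensemble size and base learner)
# of the last idea above: several balanced subsets, one classifier per subset,
# combined by majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

X = np.random.rand(500, 4)
y = np.array([0] * 400 + [1] * 100)  # 4:1 imbalance
X_min, X_maj = X[y == 1], X[y == 0]

rng = np.random.RandomState(42)
models = []
for _ in range(5):
    # All minority instances plus a fresh random sample of the majority class
    idx = rng.choice(len(X_maj), size=len(X_min), replace=False)
    X_bal = np.vstack([X_maj[idx], X_min])
    y_bal = np.array([0] * len(X_min) + [1] * len(X_min))
    models.append(DecisionTreeClassifier().fit(X_bal, y_bal))

votes = np.mean([m.predict(X) for m in models], axis=0)
y_pred = (votes >= 0.5).astype(int)  # majority vote across the ensemble
```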

For more ideas, check out these comments on the reddit post “Classification when 80% of my training set is of one class“.

Pick a Method and Take Action

You do not need to be an algorithm wizard or a statistician to build accurate and reliable models from imbalanced datasets.

We have covered a number of techniques that you can use to model an imbalanced dataset.

Hopefully there are one or two that you can take off the shelf and apply immediately, for example changing your accuracy metric and resampling your dataset. Both are fast and will have an impact straight away.

Which method are you going to try?

A Final Word, Start Small

Remember that we cannot know which approach is going to best serve you and the dataset you are working on.

You can use some expert heuristics to pick this method or that, but in the end, the best advice I can give you is to “become the scientist” and empirically test each method and select the one that gives you the best results.

Start small and build upon what you learn.

Want More? Further Reading…

There are resources on class imbalance if you know where to look, but they are few and far between.

I’ve looked, and the following are what I think is the cream of the crop. If you’d like to dive deeper into some of the academic literature on dealing with class imbalance, check out some of the links below.

Books

Papers

Did you find this post useful? Still have questions?

Leave a comment and let me know about your problem and any questions you still have about handling imbalanced classes.

74 Responses to 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset

  1. Sebastian Raschka August 26, 2015 at 2:47 am #

    Interesting survey! Maybe it would be worthwhile to mention semi-supervised techniques to utilize unlabeled data? There are many different approaches, if you are interested, check out this nice survey: X. Zhu, “Semi-Supervised Learning Literature Survey,” Technical Report 1530, Univ. of Wisconsin-Madison, 2006.

    Transfer learning can also be interesting in the context of class imbalances, for using unlabeled target data as a regularization term to learn a discriminative subspace that can generalize to the target domain: Si S, Tao D, Geng B. Bregman divergence-based regularization for transfer subspace learning. IEEE Trans on Knowledge and Data Engineering 2010;22:929–42.

    Or for the very extreme cases, a 1-class SVM 😛 Scholkopf B, Platt JC, Shawe-Taylor JC, Smola AJ, Williamson RC. Estimating the support of a high-dimensional distribution. Neural Computation 2001;13:1443–71.

  2. Igor Vieira October 29, 2015 at 8:26 am #

    Great post!!

  3. Jingchen November 16, 2015 at 11:26 am #

    Hi Jason, this is a very helpful post. Saved me a lot of time checking detailed solutions, and it’s eye-opening.

    Thanks

  4. Haifeng Liu November 21, 2015 at 1:54 pm #

    This is really a great and helpful post!

  5. Parinya.hi January 13, 2016 at 3:47 pm #

    Hi Jason, really cool and helpful post. I have tried some, but my case seems quite difficult because most of the predictor values are flags (0 or 1).

  6. Abraham B.Gabriel January 18, 2016 at 1:07 pm #

    Hi Jason. You just saved a life (quite literally). Thanks a lot for the article.

  7. Vered Shwartz February 1, 2016 at 7:52 am #

    Thanks for a very helpful post! I have an unbalanced dataset (50:1 negative-positive ratio) and I’ve tried some of the techniques you discussed. I was wondering about subsampling –
    if I train my model over all the positive instances and an equal number of negative instances, there is a lot of unused data. Is there a reasonable way that I can perform several iterations, each with different negative instances, and combine the results into one model? Thanks in advance.

  8. Jason Brownlee February 3, 2016 at 8:50 pm #

    Great and relevant post: Dealing with imbalanced data: undersampling, oversampling and proper cross-validation , by Marco Altini.

  9. Adeyemi February 4, 2016 at 8:43 am #

    Great. I want to try the SMOTE with Weka, is there any simple sample tutorial to use the SMOTE supervised filter? I need guidance.

  10. Matthew R. Versaggi April 11, 2016 at 9:29 pm #

    Fantastic, just what the doctor ordered !

    Thank you for your efforts, it’s enabling the advancing of the field …

  11. Ella April 23, 2016 at 5:52 am #

    Thank you, very helpful post.
    I am using the “UnbalancedDataset” module in Python to perform over-sampling with synthetic data generation (SMOTE/Random). I am wondering if there is any smart way to find the best ratio for over-sampling?

    I have a binary classification problem with imbalanced data with the rate of 1/5.

    • Jason Brownlee July 8, 2016 at 7:11 am #

      Generally, I would advise systematic experimentation to discover good or best configuration for your problem.

  12. Bar Geron June 4, 2016 at 10:26 pm #

    Hi,

    Great article!

    It will be much appreciated if you can help with the following question:

    I’ve used the over-sampling approach and changed the ratio of my binary target value from 1:10 to 1:1.
    The problem is that I still don’t know how to check the model performance at the ratio of 1:10.
    How do I know what the gap of impact will be between the real world and the 1:1 ratio?

    • Jason Brownlee June 14, 2016 at 8:26 am #

      A good idea would be to hold back a validation dataset, say split the dataset in half.

      Try various rebalancing methods and modeling algorithms with cross validation, then use the held back dataset to confirm any findings translate to a sample of what the actual data will look like in practice.

      • Hua Yang July 8, 2016 at 5:03 am #

        Hi Jason,
        I have the same question as Bar Geron.
        What did you mean by saying “then use the held back dataset to confirm any findings translate to a sample of what the actual data will look like in practice”?

        Could you please explain it in more detail?

        Thank you!

        • Jason Brownlee July 8, 2016 at 7:13 am #

          I meant that you can use cross validation on the rebalanced dataset to estimate the performance of models on unseen data.

          You can then build a final model and evaluate its performance on the held out dataset.

          This will allow you to see whether findings from resampling during cross validation translate over to “unseen” data.

  13. David F July 1, 2016 at 1:26 am #

    Pretty useful article. Thank you very much!

  14. Kaustubh Patil July 3, 2016 at 4:03 am #

    Another tactic is to change the decision threshold on the posterior probability. We have shown that this works particularly well with bagging ensembles, which are known to give good posterior estimates. See “Reviving Threshold-Moving: a Simple Plug-in Bagging Ensemble for Binary and Multiclass Imbalanced Data” http://arxiv.org/abs/1606.08698.

    Disclaimer: I am the last author of this paper

  15. Manish July 11, 2016 at 2:40 am #

    Great article! Very helpful.

  16. ankita July 22, 2016 at 10:42 am #

    Sir, in the future, which issues related to classification problems can be solved?

  17. RCB August 3, 2016 at 1:13 am #

    I consider this a non-issue. There’s no statistical method or machine learning algorithm I know of that requires balanced data classes. Furthermore, if *reality is unbalanced*, then you want your algorithm to learn that!

    Consider the problem of trying to predict two outcomes, one of which is much more common than the other. Suppose there is a region in feature space in which the two classes very strongly overlap. Then the prediction in this region will depend on the frequency of each class that falls in this region in the training set. If you’ve “balanced” the data by hugely biasing it toward the rare class, then your model will predict something like 50% probability of each, when the truth is probably very different.

    The problem, IMO, isn’t unbalance. The world is unbalanced. The problem is that rare classes are poorly represented unless the datasets are quite large. In other words, it’s a sample size problem. A lot of the difficulty can be cleared up (as the author points out) by looking at false positive and false negative rates, not just generic “accuracy”.

    • Jason Brownlee August 3, 2016 at 8:17 am #

      Thought provoking perspective RCB, thanks for sharing.

      I have to disagree that this is a non-issue in practice. At least for me, I almost always seem to get better results when I “handle” the class imbalance.

      As a test, grab an unbalanced dataset from the UCI ML repo and do some small experiments. I think you will quickly realize that you need to change up your performance measure and start exploring resampling or instance weighting methods to get any kind of traction on the problem.

      In fact, this might make a nice blog post tutorial, stay tuned.

      • RCB August 5, 2016 at 8:20 am #

        At the end of the day, performance is what matters, so I won’t be so foolish as to take a hard-line stance. But I’m inclined to think that there is always a better alternative to “rebalancing” the data, i.e. one should never have to do it, in theory.

        Your model is doing its best to minimize the loss function you specify. If this is just classification accuracy, then it’s quite plausible that the best classifier is one that always picks the vastly-more-common class. What this is telling you is that the model has not seen enough examples of the rare class to be able to distinguish them from the common class. Failing that, it simply says “forget it: just always predict the most common class!” If you’re only interested in 1-0 classification accuracy, then that is the best model, period, given the loss function and dataset you provided.

        Now, if you find yourself thinking that this is a very unsatisfactory outcome, ask yourself why! Probably it’s because misclassification of the rare class is a lot worse than the alternative. i.e., false negatives are a lot worse than false positives. Perhaps you are diagnosing cancers, or catching failed products. Well, clearly this problem is solved by choosing a more appropriate loss function – not biasing the data! Just make the “cost” of a false negative much greater than the cost of a false positive. This will give you a cost function that better represents your priorities, while still maintaining a realistic dataset. Rebalancing does neither!

        Also: By hugely rebalancing (read: hugely biasing) the model, you are training on a dataset that will be vastly different from the dataset it will actually be tested on, once implemented. That strikes me as a bad thing. Train it for the real world.

        IMO.

  18. Chris John August 4, 2016 at 8:04 pm #

    Thanks for this!

  19. Simon August 5, 2016 at 9:32 pm #

    Hi Jason, can windowing a long time series be used as a sampling technique? That way, we get many smaller segments of the same time series, and if we label them up the same, we can consider them as larger data to extract features from, can we not? In that case, what criteria should we look at? Long range correlations? I know that the statistics can change, so usually a non-stationary time series can be changed to a stationary time series either through filtering or some sort of background levelling (to level the trend). Am I thinking in the right directions?
    The second part of my question is, if we do not go for sampling methods and consider the whole time series as one data point, what classification and feature extraction algorithm should I look for?

    Eagerly waiting for your reply. Many thanks
    Simon

  20. Evelyn August 11, 2016 at 7:25 am #

    Hi Jason,

    I have a question about how we should deal with the over-sampled dataset. Two ways come to my mind, and I am now going with the first one, which seems to overfit badly.

    1- Oversample the whole dataset, then split it to training and testing sets (or cross validation).

    2- After splitting the original dataset, perform oversampling on the training set only and test on the original data test set.

    My dataset has multiple labels; there is one majority label and the number of samples for the other labels is quite small, some even at a ratio of 100:1. How should I deal with this dataset?

    Thanks a lot!

    • Jason Brownlee August 15, 2016 at 11:25 am #

      I would suggest separating out a validation dataset for later use.

      I would suggest applying your procedure (say oversampling) within the folds of a cross validation process where possible. Otherwise, just on the training dataset for a train/test split.

  21. Marta AZ August 11, 2016 at 11:07 pm #

    Thank you for the article. It is a very good and effective summary of a wide and complicated problem.

    Please go on with your blog!

  22. Mohammed Tantawy August 30, 2016 at 11:37 am #

    Great Post Jason

    • Jason Brownlee August 31, 2016 at 8:44 am #

      Thanks Mohammed, I’m glad you found it useful.

  23. Linara September 5, 2016 at 8:01 pm #

    Thanks for tips!

    Can you please elaborate more or give some useful sources for the Penalized models? I am using logistic regression with standard log likelihood loss function ( -mean(teacher*log(predicted) + (1-teacher)*log(1-predicted)) ) and I want to know what exactly is a correct way to make it pay more attention to 1-class, because my data has about 0.33% of 1-class examples and all the others are 0-class.

    • Jason Brownlee September 6, 2016 at 9:48 am #

      Perhaps you could experiment with weighting observations for one class or another. I have seen this be very effective with regression methods.

      • Linara September 6, 2016 at 6:30 pm #

        The main question is more about which part should be more “important”. Do I have to put more weight on the error term obtained from the rare class (e.g. to have something like -mean(0.9*teacher*log(predicted) + 0.1*(1-teacher)*log(1-predicted))), or the other way around, do I have to “penalize” the big class (e.g. -mean(0.1*teacher*log(predicted) + 0.9*(1-teacher)*log(1-predicted)))? Because penalizing suggests I have to do something to the big class, while weighting is the thing that has to be larger for the rare class, and this terminology completely confuses me.
        And do the weights have anything to do, by value, with the frequency of the rare class?

        • Jason Brownlee September 7, 2016 at 10:26 am #

          Yes, the weight is to rebalance the observations for each class, so the sum of observations for each class are equal.

          • K Rajesh September 11, 2016 at 2:53 am #

            Sir, I am also working on this type of imbalanced multi-class problem. In my case, accuracy values are overly dependent on the normalization procedure.
            The following discussion will give an overview of my problem.

            Is it possible to do feature normalization with respect to class? Ex: a 10×10 data matrix with two classes, each class of size 5×5. Now normalize the 25 features of class 1 and the 25 features of class 2 separately. Is this process acceptable?

          • Jason Brownlee September 12, 2016 at 8:28 am #

            No. You will not know the class of new data in the future, therefore you won’t know what procedure to use.

          • Abdul October 20, 2016 at 4:59 pm #

            Sir, how do I specify weights in RF?

            I need help in specifying weights for each split instead of gini indexing.

            Your response will be appreciated.

          • Jason Brownlee October 21, 2016 at 8:33 am #

            I don’t know about weighted random forest or weighted gini calculation, sorry Abdul.

  24. Erik Yao September 23, 2016 at 5:11 am #

    Thank you, Jason! I am playing around with a 19:1 dataset and your post provides a lot of techniques to handle the imbalance. I am very interested to try them one by one to see what I can get at best.

  25. Gilmar October 4, 2016 at 1:30 pm #

    Great post Jason!!!

  26. charisfauzan October 13, 2016 at 3:47 pm #

    Great post sir, It is useful for me…

  27. Sarah November 16, 2016 at 6:27 am #

    Great post. Read it almost 5 times.

  28. Chris January 3, 2017 at 9:58 pm #

    Please check my question here. I don’t know what is happening.

    http://cs.stackexchange.com/questions/68212/big-number-of-false-positives-in-binary-classification

    • Jason Brownlee January 4, 2017 at 8:52 am #

      Hi Chris, perhaps you could write a one sentence summary of your problem?

  29. Licheng January 15, 2017 at 7:25 am #

    Hi Jason, Thanks for the great article.

    I have a question about an imbalanced multiclass problem (7 classes). I tried oversampling with SMOTE and it seems that what it does is match the class with the fewest samples to the class with the most samples; nothing changes with the other classes. I wonder if this is how it should be.

  30. Natheer Alabsi January 25, 2017 at 3:52 pm #

    Hi Jason,

    Is it acceptable to intentionally choose an imbalanced subset of the two classes available in the data for training if that will increase the accuracy of the model?

    • Jason Brownlee January 26, 2017 at 4:43 am #

      I like your thinking Natheer, try and see!

      Consider an ensemble of a suite of models, biased in different ways.

      Do not use the accuracy measure to evaluate your results though, it will not give a clear idea of how well your model is actually performing.

      • Natheer Alabsi January 26, 2017 at 9:14 pm #

        Thanks for the reply. But I want to use only one sample from the negative class (did not buy the product) and a large sample from the positive class (bought the product). I noticed it improved the accuracy so much.

        • Jason Brownlee January 27, 2017 at 12:06 pm #

          Hi Natheer, in this case accuracy would be an invalid measure of performance.

          • Natheer Alabsi January 27, 2017 at 12:12 pm #

            I know the accuracy of the overall model is meaningless, but it gives the best increase in recall over other situations.

  31. Jingjing February 27, 2017 at 1:31 pm #

    Thank you for your effort. It’s really helpful! Do you have some imbalanced datasets? I cannot find suitable datasets for my algorithm.

  32. deep March 22, 2017 at 7:02 pm #

    Hi Jason,

    Thanks for uploading a very nice, informative article about imbalanced dataset handling. I am trying to build a deep learning model for classification. I have a dataset consisting of approx 100k samples with around 36k features and six different classes with an imbalanced class distribution. The largest class has approx 48k samples while the smallest one has around 2k samples. The other classes have sample counts like 18k, 15k, 12k and 5k. I am considering the usage of SMOTE for synthetic data generation for all small classes (18k-2k) up to 48k (the biggest class). Is that a scientifically appropriate approach? If not, what else can I do?
