Imbalanced Classification with the Fraudulent Credit Card Transactions Dataset

Last Updated on August 21, 2020

Fraud is a major problem for credit card companies, both because of the large volume of transactions that are completed each day and because many fraudulent transactions look a lot like normal transactions.

Identifying fraudulent credit card transactions is a common type of imbalanced binary classification where the focus is on the positive class (is fraud).

As such, metrics like precision and recall can be used to summarize model performance in terms of class labels, and precision-recall curves can be used to summarize model performance across a range of probability thresholds when mapping predicted probabilities to class labels.

This gives the operator of the model control over how predictions are made, specifically whether to bias the model toward false positive or false negative errors.

In this tutorial, you will discover how to develop and evaluate a model for the imbalanced credit card fraud dataset.

After completing this tutorial, you will know:

  • How to load and explore the dataset and generate ideas for data preparation and model selection.
  • How to systematically evaluate a suite of machine learning models with a robust test harness.
  • How to fit a final model and use it to predict the probability of fraud for specific cases.

Let’s get started.

How to Predict the Probability of Fraudulent Credit Card Transactions

Photo by Andrea Schaffer, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

  1. Credit Card Fraud Dataset
  2. Explore the Dataset
  3. Model Test and Baseline Result
  4. Evaluate Models
  5. Make Predictions on New Data

Credit Card Fraud Dataset

In this project, we will use a standard imbalanced machine learning dataset referred to as the “Credit Card Fraud Detection” dataset.

The data represents credit card transactions made by European cardholders over two days in September 2013.

The dataset is credited to the Machine Learning Group at the Free University of Brussels (Université Libre de Bruxelles) and a suite of publications by Andrea Dal Pozzolo, et al.

All details of the cardholders have been anonymized via a principal component analysis (PCA) transform. In place of the original features, a total of 28 principal components are provided. In addition, the time in seconds elapsed between each transaction and the first transaction in the dataset is provided, as is the purchase amount (presumably in Euros).

Each record is classified as normal (class “0”) or fraudulent (class “1”), and the transactions are heavily skewed towards normal. Specifically, there are 492 fraudulent credit card transactions out of a total of 284,807 transactions, or about 0.172% of all transactions.

It contains a subset of online transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, where the positive class (frauds) account for 0.172% of all transactions …

Calibrating Probability with Undersampling for Unbalanced Classification, 2015.

Some publications use the ROC area under curve metric, although the website for the dataset recommends using the precision-recall area under curve metric, given the severe class imbalance.

Given the class imbalance ratio, we recommend measuring the accuracy using the Area Under the Precision-Recall Curve (AUPRC).

Credit Card Fraud Detection, Kaggle.

Next, let’s take a closer look at the data.

Explore the Dataset

First, download and unzip the dataset and save it in your current working directory with the name “creditcard.csv”.

Review the contents of the file. The first few lines should contain only comma-separated numeric values, with no column names.

Note that this version of the dataset has the header line removed. If you download the dataset from Kaggle, you must remove the header line first.
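If your copy came from Kaggle, a short script can drop the header row for you. The sketch below is one way to do it using only the standard library. The helper name strip_header() and the file names are hypothetical, it assumes the header is exactly the first line of the file, and it is demonstrated on a tiny stand-in file with made-up values rather than real transactions.

```python
# remove the first (header) line from a csv file
import os
import tempfile

def strip_header(src_path, dst_path):
    """Copy src_path to dst_path without its first line; return rows written."""
    count = 0
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        next(src, None)  # skip the header line
        for line in src:
            dst.write(line)
            count += 1
    return count

# tiny stand-in file (made-up values, not real transactions)
workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, 'raw.csv')
clean_path = os.path.join(workdir, 'clean.csv')
with open(raw_path, 'w') as f:
    f.write('"Time","V1","Amount","Class"\n0,-1.5,10.0,0\n1,0.5,2.5,1\n')

n_rows = strip_header(raw_path, clean_path)
print('Rows written: %d' % n_rows)
```

On the real download, you would call strip_header() with the path of the Kaggle file as the source and 'creditcard.csv' as the destination.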

We can see that the first column is the time, which is an integer, and the second-to-last column is the purchase amount. The final column contains the class label. We can see that the PCA-transformed features take both positive and negative values and are stored with many decimal places.

The time column is unlikely to be useful and can probably be removed. The difference in scale between the PCA variables and the purchase amount suggests that data scaling should be used for those algorithms that are sensitive to the scale of input variables.

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location of the file and the fact that there is no header line.

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

We can also summarize the number of examples in each class using the Counter object.

Tying this together, the complete example of loading and summarizing the dataset is listed below.
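A minimal sketch of these steps might look like the following. It assumes the file has no header line (hence header=None) and that the class label is the last column. The summarize() helper is a name chosen here for illustration, and the demonstration uses a tiny stand-in file with made-up values so that it runs without the real dataset.

```python
# load a headerless csv file and summarize its shape and class distribution
import os
import tempfile
from collections import Counter
from pandas import read_csv

def summarize(path):
    """Load a headerless CSV, print its shape and the class breakdown."""
    df = read_csv(path, header=None)
    print(df.shape)
    # the class label is assumed to be the last column
    target = df.values[:, -1]
    counter = Counter(target)
    for label, count in sorted(counter.items()):
        per = count / len(target) * 100
        print('Class=%d, Count=%d, Percentage=%.3f%%' % (label, count, per))
    return df, counter

# tiny stand-in file (made-up values, not real transactions)
path = os.path.join(tempfile.mkdtemp(), 'standin.csv')
with open(path, 'w') as f:
    f.write('0,1.5,0\n1,-0.2,0\n2,0.3,0\n3,2.2,1\n4,0.1,0\n')

df, counter = summarize(path)
```

To run it on the real data, pass 'creditcard.csv' to summarize() instead of the stand-in path.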

Running the example first loads the dataset and confirms the number of rows and columns, which are 284,807 rows and 30 input variables and 1 target variable.

The class distribution is then summarized, confirming the severe skew in the class distribution, with about 99.827 percent of transactions marked as normal and about 0.173 percent marked as fraudulent. This generally matches the description of the dataset in the paper.

We can also take a look at the distribution of the input variables by creating a histogram for each.

Because of the large number of variables, the plots can look cluttered. Therefore we will disable the axis labels so that we can focus on the histograms. We will also increase the number of bins used in each histogram to help better see the data distribution.

The complete example of creating histograms of all input variables is listed below.
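One way to sketch this is with the DataFrame hist() function, hiding the tick labels on each subplot. The plot_histograms() helper is a name chosen here for illustration, and the data is a random stand-in with six made-up columns rather than the real dataset.

```python
# histograms of all variables with the tick labels hidden
from numpy.random import default_rng
from pandas import DataFrame
from matplotlib import pyplot

def plot_histograms(df, bins=100):
    """Draw one histogram per column and hide the axis tick labels."""
    axes = df.hist(bins=bins)
    for axis in axes.flatten():
        axis.set_xticklabels([])
        axis.set_yticklabels([])
    return axes

# random stand-in data (not the real dataset)
rng = default_rng(1)
columns = ['V%d' % i for i in range(1, 7)]
df = DataFrame(rng.normal(size=(500, 6)), columns=columns)

axes = plot_histograms(df)
# pyplot.show()  # uncomment to display the figure in an interactive session
```

On the real data, you would load 'creditcard.csv' into the DataFrame first and drop the class column before plotting.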

We can see that the distribution of most of the PCA components is Gaussian, and many may be centered around zero, suggesting that the variables were standardized as part of the PCA transform.

Histogram of Input Variables in the Credit Card Fraud Dataset

The amount variable might be interesting and does not appear on the histogram.

This suggests that the distribution of the amount values may be skewed. We can create a 5-number summary of this variable to get a better idea of the transaction sizes.

The complete example is listed below.
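As a rough sketch, the Pandas describe() function reports the minimum, quartiles, and maximum, along with the count, mean, and standard deviation. The example below uses a made-up, right-skewed stand-in column named 'Amount' rather than the real data.

```python
# summarize the distribution of a skewed amount variable
from numpy.random import default_rng
from pandas import DataFrame

# made-up right-skewed stand-in for the purchase amounts
rng = default_rng(1)
df = DataFrame({'Amount': rng.lognormal(mean=3.0, sigma=1.5, size=1000)})

summary = df['Amount'].describe()
print(summary)
```

With the real file loaded using header=None, the amount would be the second-to-last column (index 29), so df[29].describe() would be the equivalent call.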

Running the example, we can see that most amounts are small, with a mean of about 88 and the middle 50 percent of observations between 5 and 77.

The largest value is about 25,691, which is pulling the distribution up and might be an outlier (e.g. someone purchased a car on their credit card).

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

Model Test and Baseline Result

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 284807/10 or 28,480 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is, about 99.8 percent normal and 0.2 percent fraudulent transactions. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use 3 repeats.

This means a single model will be fit and evaluated 10 * 3 or 30 times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
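To make the procedure concrete, the sketch below splits a small set of stand-in labels (2,000 rows with 1 percent positives, not the real dataset) and counts the positive examples that land in each test fold. The random_state value is an arbitrary choice.

```python
# show that repeated stratified k-fold preserves the class mix in each fold
from numpy import zeros
from sklearn.model_selection import RepeatedStratifiedKFold

# stand-in labels: 2,000 rows with 20 positive examples
y = zeros(2000, dtype=int)
y[:20] = 1
X = zeros((2000, 1))  # placeholder features, the split depends on y only

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
fold_positives = [int(y[test_ix].sum()) for _, test_ix in cv.split(X, y)]

print('Number of train/test splits: %d' % len(fold_positives))
print('Positives per test fold: %s' % sorted(set(fold_positives)))
```

Every test fold receives the same share of the minority class, which is what keeps the per-fold scores comparable on such a skewed dataset.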

We will use the recommended metric of area under precision-recall curve or PR AUC.

This requires that a given algorithm first predict a probability or probability-like measure. The predicted probabilities are then evaluated using precision and recall at a range of different thresholds for mapping probability to class labels, and the area under the curve of these thresholds is reported as the performance of the model.

This metric focuses on the positive class, which is desirable for such a severe class imbalance. It also allows the operator of a final model to choose a threshold for mapping probabilities to class labels (fraud or non-fraud transactions) that best balances the precision and recall of the final model.

We can define a function to load the dataset and split the columns into input and output variables. The load_dataset() function below implements this.
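A minimal version of such a function, assuming a headerless file with the class label in the last column, could look like this. It is demonstrated on a tiny stand-in file with made-up values.

```python
# load a headerless csv file and split it into inputs and output
import os
import tempfile
from pandas import read_csv

def load_dataset(full_path):
    """Return the input columns X and the last column y as NumPy arrays."""
    data = read_csv(full_path, header=None).values
    X, y = data[:, :-1], data[:, -1]
    return X, y

# tiny stand-in file (made-up values, not real transactions)
path = os.path.join(tempfile.mkdtemp(), 'standin.csv')
with open(path, 'w') as f:
    f.write('0,1.5,9.99,0\n1,-0.2,25.0,0\n2,2.2,1.0,1\n')

X, y = load_dataset(path)
print(X.shape, y.shape)
```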

We can then define a function that will calculate the precision-recall area under curve for a given set of predictions.

This involves first calculating the precision-recall curve for the predictions via the precision_recall_curve() function. The output recall and precision values for each threshold can then be provided as arguments to the auc() function to calculate the area under the curve. The pr_auc() function below implements this.
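A sketch of the function, followed by a toy check with four made-up predictions in which both positive examples are ranked above both negative examples:

```python
# calculate the area under the precision-recall curve
from sklearn.metrics import auc
from sklearn.metrics import precision_recall_curve

def pr_auc(y_true, probas_pred):
    """Area under the precision-recall curve for predicted probabilities."""
    # precision and recall at each candidate threshold
    p, r, _ = precision_recall_curve(y_true, probas_pred)
    # trapezoidal area with recall on the x-axis and precision on the y-axis
    return auc(r, p)

# toy example: the positives receive the highest scores
y_true = [0, 0, 1, 1]
score = pr_auc(y_true, [0.1, 0.2, 0.8, 0.9])
print('PR AUC: %.3f' % score)
```

A perfect ranking like this one yields the maximum score of 1.0.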

We can then define a function that will evaluate a given model on the dataset and return a list of PR AUC scores for each fold and repeat.

The evaluate_model() function below implements this, taking the dataset and model as arguments and returning the list of scores. The make_scorer() function is used to define the precision-recall AUC metric and indicates that a model must predict probabilities in order to be evaluated.

Finally, we can evaluate a baseline model on the dataset using this test harness.

A model that predicts the positive class (class 1) for all examples will provide a baseline performance when using the precision-recall area under curve metric.

This can be achieved using the DummyClassifier class from the scikit-learn library, setting the “strategy” argument to ‘constant’ and the “constant” argument to ‘1’ to predict the positive class.

Once the model is evaluated, we can report the mean and standard deviation of the PR AUC scores directly.

Tying this together, the complete example of loading the dataset, evaluating a baseline model, and reporting the performance is listed below.
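The sketch below shows how the pieces of the test harness fit together. Two things differ from the description above and are assumptions made here: the data is a random stand-in (2,000 rows, 30 inputs, 1 percent positives) so that the example runs without the file, and the metric is wrapped in a plain scorer function, pr_auc_scorer(), instead of make_scorer(), because the make_scorer() argument used to request probabilities has changed between scikit-learn releases.

```python
# evaluate a constant-prediction baseline with repeated stratified k-fold
from numpy import mean, std, zeros
from numpy.random import default_rng
from sklearn.dummy import DummyClassifier
from sklearn.metrics import auc, precision_recall_curve
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def pr_auc(y_true, probas_pred):
    """Area under the precision-recall curve."""
    p, r, _ = precision_recall_curve(y_true, probas_pred)
    return auc(r, p)

def pr_auc_scorer(model, X, y):
    """Scorer callable: score the predicted probability of class 1."""
    probs = model.predict_proba(X)[:, 1]
    return pr_auc(y, probs)

def evaluate_model(X, y, model):
    """Return one PR AUC score per fold and repeat (10 x 3 = 30 scores)."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    return cross_val_score(model, X, y, scoring=pr_auc_scorer, cv=cv)

# random stand-in data (not the real dataset): 1 percent positives
rng = default_rng(1)
X = rng.normal(size=(2000, 30))
y = zeros(2000, dtype=int)
y[:20] = 1

# baseline that always predicts the positive class
model = DummyClassifier(strategy='constant', constant=1)
scores = evaluate_model(X, y, model)
print('Mean PR AUC: %.3f (%.3f)' % (mean(scores), std(scores)))
```

On this stand-in data every fold scores (1 + 0.01) / 2 = 0.505: a constant prediction gives a single point at recall 1.0 with precision equal to the positive rate, and the area is the trapezoid between that point and the curve's end point at recall 0.0 and precision 1.0. To use the real data, replace the stand-in arrays with the inputs and labels loaded from 'creditcard.csv'.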

Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded and that we have 30 input variables.

Next, the average of the PR AUC scores is reported.

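In this case, we can see that the baseline algorithm achieves a mean PR AUC of about 0.501.

This score provides a reference point for model skill on this test harness: any model that achieves an average PR AUC above about 0.5 improves on the baseline, whereas models that score below this value do not. Note that a constant prediction produces a single point on the precision-recall curve (a recall of 1.0 with a precision of about 0.002), and the score of about 0.5 is the trapezoidal area between that point and the curve's fixed end point at a recall of 0.0 and a precision of 1.0, so it is best treated as a rough reference rather than a true no-skill level.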

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

Evaluate Models

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The goal is to both demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.

The reported performance is good but not highly optimized (e.g. hyperparameters are not tuned).

Can you do better? If you can achieve better PR AUC performance using the same test harness, I’d love to hear about it. Let me know in the comments below.

Evaluate Machine Learning Algorithms

Let’s start by evaluating a mixture of machine learning models on the dataset.

It can be a good idea to spot check a suite of different nonlinear algorithms on a dataset to quickly flush out what works well and deserves further attention and what doesn’t.

We will evaluate the following machine learning models on the credit card fraud dataset:

  • Decision Tree (CART)
  • k-Nearest Neighbors (KNN)
  • Bagged Decision Trees (BAG)
  • Random Forest (RF)
  • Extra Trees (ET)

We will use mostly default model hyperparameters, with the exception of the number of trees in the ensemble algorithms, which we will set to a reasonable default of 100. We will also standardize the input variables prior to providing them as input to the KNN algorithm.

We will define each model in turn and add them to a list so that we can evaluate them sequentially. The get_models() function below defines the list of models for evaluation, as well as a list of model short names for plotting the results later.
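A sketch of such a function is shown below, using the five models and short names listed above. The step names 's' and 'm' in the KNN pipeline are arbitrary labels chosen here.

```python
# define the models to evaluate, with short names for plotting
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def get_models():
    """Return parallel lists of models and their short names."""
    models, names = list(), list()
    # CART
    models.append(DecisionTreeClassifier())
    names.append('CART')
    # KNN, with inputs standardized inside a pipeline
    steps = [('s', StandardScaler()), ('m', KNeighborsClassifier())]
    models.append(Pipeline(steps=steps))
    names.append('KNN')
    # bagged decision trees
    models.append(BaggingClassifier(n_estimators=100))
    names.append('BAG')
    # random forest
    models.append(RandomForestClassifier(n_estimators=100))
    names.append('RF')
    # extra trees
    models.append(ExtraTreesClassifier(n_estimators=100))
    names.append('ET')
    return models, names

models, names = get_models()
print(names)
```

Placing the scaler inside the pipeline means it is fit on the training folds only during cross-validation, which avoids leaking information from the test folds.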

We can then enumerate the list of models in turn and evaluate each, storing the scores for later evaluation.

At the end of the run, we can plot each sample of scores as a box and whisker plot with the same scale so that we can directly compare the distributions.

Tying this all together, the complete example of an evaluation of a suite of machine learning algorithms on the credit card fraud dataset is listed below.

Running the example evaluates each algorithm in turn and reports the mean and standard deviation PR AUC.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that all of the tested algorithms have skill, achieving a PR AUC above the baseline of about 0.5. The results suggest that the ensembles of decision tree algorithms all do well on this dataset, although the KNN with standardization of the dataset seems to perform the best on average.

A figure is created showing one box and whisker plot for each algorithm’s sample of results. The box shows the middle 50 percent of the data, the orange line in the middle of each box shows the median of the sample, and the green triangle in each box shows the mean of the sample.

We can see that the distributions of scores for the KNN and ensembles of decision trees are tight and means seem to coincide with medians, suggesting the distributions may be symmetrical and are probably Gaussian and that the scores are probably quite stable.

Box and Whisker Plot of Machine Learning Models on the Imbalanced Credit Card Fraud Dataset

Now that we have seen how to evaluate models on this dataset, let’s look at how we can use a final model to make predictions.

Make Predictions on New Data

In this section, we will fit a final model and use it to make predictions on single rows of data.

We will use the KNN model, which achieved a PR AUC of about 0.867, as our final model. Fitting the final model involves defining a Pipeline to scale the numerical variables prior to fitting the model.

The Pipeline can then be used to make predictions on new data directly and will automatically scale new data using the same operations as performed on the training dataset.

First, we can define the model as a pipeline.

Once defined, we can fit it on the entire training dataset.

Once fit, we can use it to make predictions for new data by calling the predict_proba() function. This will return the probability for each class.

We can retrieve the predicted probability for the positive class, which an operator of the model might use to interpret the prediction.

For example:

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know the outcome.

The complete example is listed below.
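The sketch below walks through these steps: define the pipeline, fit it, and read off the probability of the positive class for a few rows. Because the real rows are not reproduced here, it uses a synthetic imbalanced dataset from make_classification() as a stand-in, so the printed probabilities say nothing about the real fraud data.

```python
# fit a scaled KNN pipeline and predict the probability of the positive class
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data with about 1 percent positives (not the real dataset)
X, y = make_classification(n_samples=2000, n_features=30, weights=[0.99],
                           random_state=1)

# define the model as a pipeline so new rows are scaled like the training data
model = Pipeline(steps=[('s', StandardScaler()), ('m', KNeighborsClassifier())])
# fit on the entire training dataset
model.fit(X, y)

# a few rows where the outcome is known
normal_rows = X[y == 0][:3]
positive_rows = X[y == 1][:3]

# column 1 of predict_proba() holds the probability of class 1
normal_probs = model.predict_proba(normal_rows)[:, 1]
positive_probs = model.predict_proba(positive_rows)[:, 1]

print('Known normal cases:')
for prob in normal_probs:
    print('>Probability of class 1: %.3f' % prob)
print('Known positive cases:')
for prob in positive_probs:
    print('>Probability of class 1: %.3f' % prob)
```

With the real data, you would fit the pipeline on the inputs and labels loaded from 'creditcard.csv' and pass in rows taken from that file. An operator could then compare each probability against a threshold of their choosing instead of the default of 0.5.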

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label of normal cases chosen from the dataset file. We can see that all cases are correctly predicted.

Then some fraud cases are used as input to the model and the label is predicted. Most, but not all, of the examples are predicted correctly with the default threshold, which highlights the need for a user of the model to select an appropriate probability threshold.

In this tutorial, you discovered how to develop and evaluate a model for the imbalanced credit card fraud classification dataset.

Specifically, you learned:

  • How to load and explore the dataset and generate ideas for data preparation and model selection.
  • How to systematically evaluate a suite of machine learning models with a robust test harness.
  • How to fit a final model and use it to predict the probability of fraud for specific cases.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

28 Responses to Imbalanced Classification with the Fraudulent Credit Card Transactions Dataset

  1. Madriss March 11, 2020 at 6:31 am #

    Hi Jason, very good article as always. Don’t you think the results would have been better if you used class_weights on your models?

    • Jason Brownlee March 11, 2020 at 6:35 am #


      Try it, see if you can get better results.

  2. Alexander March 13, 2020 at 5:45 am #

    Hi Jason, very useful tutorial! Please help to adapt the pr_auc and evaluate_model functions for the case of multiclass (few classes) classification.

  3. Fawaz Mokbal March 13, 2020 at 8:25 pm #

    really wonderful, thank you

  4. Ron Stauffer April 24, 2020 at 3:42 am #

    Although my SKFold PR AUC scores generally follow yours: KNN 0.869 (0.042), ET 0.862 (0.043), RF (with balanced class weight) 0.858 (0.046), both RF and ET outperformed KNN when employed on an 80/20 train/test split of the dataset. RF achieved a PR AUC score of 0.845 and a Recall score on the fraud class of 0.75 (with 0.94 Precision); ET achieved a PR AUC score of 0.838 and also a Recall score of 0.75 (with 0.93 Precision). By comparison, KNN achieved a PR AUC score of 0.807 and a Recall score of 0.69 (with 0.92 Precision). It is surprising to me that these three models essentially performed equivalently on this dataset.

  5. saurabhk April 28, 2020 at 4:35 pm #

    Hi Jason, here you didn’t seem to use the cost-sensitive algorithms that you discussed in an earlier post. Is there any reason or strategy for not using them here?

    • Jason Brownlee April 29, 2020 at 6:18 am #

      Yes, I used an appropriate metric, then I could not achieve better performance than using standard models with that metric.

      If you can do better on the same test harness, please share your results.

  6. Pierre May 21, 2020 at 3:45 am #

    Hi Jason, nice article!
    I’m having trouble understanding how the DummyClassifier can bring a PR AUC of 0.501. It should have a fixed recall of 100% and precision of .2%, whatever the threshold. Am I missing something here?


  7. Gianluca Bontempi July 3, 2020 at 1:28 am #

    For more research works related to this topic refer to

  8. Pavan October 8, 2020 at 7:49 pm #

    Thank you Jason for this succinct and well-explained post. I would like to ask why no undersampling or oversampling was performed, as the data is skewed with very few fraud transactions. Thank you!!

    • Jason Brownlee October 9, 2020 at 6:42 am #

      You’re welcome.

      It did not help on this problem.

  9. ashish tiwari November 10, 2020 at 5:36 am #

    I have got much better results in a simple way from most of the algorithms.

    by svm= 99.93504441557529
    by random_forest = 99.96137776061234
    by decision tree = 99.92275552122467
    by knn= 99.95611109160492
    by logistic regression = 99.91748885221726
    by Gradient Boosting Classifier = 99.77177767634564

    • Jason Brownlee November 10, 2020 at 6:47 am #

      Wow, did you use the same evaluation procedure as above?

      • kiki December 29, 2020 at 12:19 pm #

        me too..
        I train with 100% data of the dataset and test with the same data, I got above 99% accuracy. however, when I try to split 50% data for training and 50% for testing, the result was very terrible.

        • Jason Brownlee December 29, 2020 at 1:32 pm #

          Perhaps use repeated k-fold cross-validation to estimate model performance instead of a simple train/test split.

  10. Ram January 22, 2021 at 4:51 pm #

    Hi Jason,

    Can you please make a blog post on anomaly detection for a data without any prior labels? Also, if you have resources, please point me to streaming anomaly detection implementations. Thanks a lot.

  11. Anil Joshi March 31, 2021 at 5:54 am #

    Hello Jason, Thanks for posting this!

    Quick question – Why are we using the strategy as ‘constant’ whereas it should be ‘stratified’ for the PR-AUC metric? Am I missing something here?

    I think you mentioned this in one of your posts:

    “Predicting a constant value, like the majority class or minority class will result in an invalid PR Curve (e.g. a point) and in turn an invalid PR AUC score. Scores for models that predict a constant value should be ignored”


  12. Anil Joshi April 7, 2021 at 8:35 am #

    Cool, Thanks for the response! So, the correct sampling strategy should be stratified then.
