How to Develop an Imbalanced Classification Model to Detect Oil Spills

Many imbalanced classification tasks require a skillful model that predicts a crisp class label, where both classes are equally important.

An example of an imbalanced classification problem where a class label is required and both classes are equally important is the detection of oil spills or slicks in satellite images. The detection of a spill requires mobilizing an expensive response, and missing an event is equally expensive, causing damage to the environment.

One way to evaluate imbalanced classification models that predict crisp labels is to calculate the accuracy separately on the positive class and the negative class, referred to as sensitivity and specificity respectively. These two measures can then be combined using the geometric mean, referred to as the G-mean, which is insensitive to the skewed class distribution and reports on the skill of the model on both classes.

In this tutorial, you will discover how to develop a model to predict the presence of an oil spill in satellite images and evaluate it using the G-mean metric.

After completing this tutorial, you will know:

• How to load and explore the dataset and generate ideas for data preparation and model selection.
• How to evaluate a suite of probabilistic models and improve their performance with appropriate data preparation.
• How to fit a final model and use it to predict class labels for specific cases.


Let's get started.

• Updated Jan/2021: Updated links for API documentation.

Develop an Imbalanced Classification Model to Detect Oil Spills
Photo by Lenny K Photography, some rights reserved.

Tutorial Overview

This tutorial is divided into five parts; they are:

1. Oil Spill Dataset
2. Explore the Dataset
3. Model Test and Baseline Result
4. Evaluate Models
    1. Evaluate Probabilistic Models
    2. Evaluate Balanced Logistic Regression
    3. Evaluate Data Sampling With Probabilistic Models
5. Make Prediction on New Data

Oil Spill Dataset

In this project, we will use a standard imbalanced machine learning dataset referred to as the “oil spill” dataset, “oil slicks” dataset or simply “oil.”

The dataset was introduced in the 1998 paper by Miroslav Kubat, et al. titled “Machine Learning for the Detection of Oil Spills in Satellite Radar Images.” The dataset is often credited to Robert Holte, a co-author of the paper.

The dataset was developed by starting with satellite images of the ocean, some of which contain an oil spill and some that do not. Images were split into sections and processed using computer vision algorithms to provide a vector of features to describe the contents of the image section or patch.

The input to [the system] is a raw pixel image from a radar satellite. Image processing techniques are used […] The output of the image processing is a fixed-length feature vector for each suspicious region. During normal operation these feature vectors are fed into a classifier to decide which images and which regions within an image to present for human inspection.

The task is, given a vector that describes the contents of a patch of a satellite image, to predict whether the patch contains an oil spill or not, e.g. from the illegal or accidental dumping of oil in the ocean.

There are 937 cases. Each case is comprised of 48 numerical computer vision derived features, a patch number, and a class label.

A total of nine satellite images were processed into patches. Cases in the dataset are ordered by image and the first column of the dataset represents the patch number for the image. This was provided for the purposes of estimating model performance per-image. In this case, we are not interested in the image or patch number and this first column can be removed.

The normal case, no oil spill, is assigned the class label of 0, whereas an oil spill is indicated by a class label of 1. There are 896 cases of no oil spill and 41 cases of an oil spill.

The second critical feature of the oil spill domain can be called an imbalanced training set: there are very many more negative examples (lookalikes) than positive examples (oil slicks). Against the 41 positive examples we have 896 negative examples; the majority class thus comprises almost 96% of the data.

We do not have access to the program used to prepare computer vision features from the satellite images, therefore we are restricted to work with the extracted features that were collected and made available.

Next, let’s take a closer look at the data.


Explore the Dataset

Review the contents of the file.

The first few lines of the file should look as follows:

We can see that the first column contains integers for the patch number. We can also see that the computer vision derived features are real-valued with differing scales such as thousands in the second column and fractions in other columns.

All input variables are numeric, and there are no missing values marked with a “?” character.

Firstly, we can load the CSV dataset and confirm the number of rows and columns.

The dataset can be loaded as a DataFrame using the read_csv() Pandas function, specifying the location and the fact that there is no header line.

Once loaded, we can summarize the number of rows and columns by printing the shape of the DataFrame.

We can also summarize the number of examples in each class using the Counter object.

Tying this together, the complete example of loading and summarizing the dataset is listed below.
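As a rough sketch of the loading and summarizing steps, the snippet below prints the shape and the class breakdown. The file name oil-spill.csv is an assumption, so the read_csv() call is left as a comment and a stand-in DataFrame with the row, column, and class counts quoted above is built instead; the feature values are random placeholders, not the real data. The summarize_classes() helper is a hypothetical name introduced for illustration.

```python
from collections import Counter

import numpy as np
import pandas as pd


def summarize_classes(df):
    """Return {label: (count, percentage)} for the last column of df."""
    target = df.values[:, -1]
    counts = Counter(target)
    return {label: (n, 100.0 * n / len(target)) for label, n in counts.items()}


# With the real file, the DataFrame would be loaded like this
# (the file name is an assumption):
#   df = pd.read_csv("oil-spill.csv", header=None)

# Stand-in with the dimensions described in the text: 937 rows, 50 columns,
# 896 negatives and 41 positives. Feature values are random placeholders.
rng = np.random.default_rng(1)
features = rng.normal(size=(937, 49))
labels = np.array([0.0] * 896 + [1.0] * 41)
df = pd.DataFrame(np.column_stack([features, labels]))

# summarize the number of rows and columns
print(df.shape)

# summarize the class distribution
for label, (n, pct) in summarize_classes(df).items():
    print(f"Class={int(label)}, Count={n}, Percentage={pct:.3f}%")
```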

Running the example first loads the dataset and confirms the number of rows and columns.

The class distribution is then summarized, confirming the number of oil spills and non-spills and the percentage of cases in the minority and majority classes.

We can also take a look at the distribution of each variable by creating a histogram for each.

With 50 variables, it is a lot of plots, but we might spot some interesting patterns. Also, with so many plots, we must turn off the axis labels and plot titles to reduce the clutter. The complete example is listed below.
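One way this could look is sketched below using the DataFrame.hist() method from Pandas. The plot_histograms() helper and the five stand-in columns are assumptions made so the snippet runs on its own; with the real dataset, the loaded DataFrame would be passed in instead.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display

import numpy as np
import pandas as pd
from matplotlib import pyplot


def plot_histograms(df):
    """Draw one histogram per column, with titles and tick labels removed."""
    axes = df.hist()
    for axis in axes.flatten():
        axis.set_title("")
        axis.set_xticklabels([])
        axis.set_yticklabels([])
    return axes


# Stand-in columns with different shapes and scales (placeholders only).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    0: rng.normal(size=500),                            # Gaussian-like
    1: rng.exponential(size=500),                       # exponential-like
    2: rng.integers(0, 4, size=500),                    # discrete
    3: rng.normal(loc=1000.0, scale=250.0, size=500),   # values in the thousands
    4: rng.uniform(size=500),                           # fractions
})
axes = plot_histograms(df)
```

To view the figure interactively, drop the matplotlib.use("Agg") line and call pyplot.show() at the end.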

Running the example creates the figure with one histogram subplot for each of the 50 variables in the dataset.

We can see many different distributions, some with Gaussian-like distributions, others with seemingly exponential or discrete distributions.

Depending on the choice of modeling algorithms, we would expect scaling the distributions to the same range to be useful, and perhaps the use of some power transforms.

Histogram of Each Variable in the Oil Spill Dataset

Now that we have reviewed the dataset, let’s look at developing a test harness for evaluating candidate models.

Model Test and Baseline Result

We will evaluate candidate models using repeated stratified k-fold cross-validation.

The k-fold cross-validation procedure provides a good general estimate of model performance that is not too optimistically biased, at least compared to a single train-test split. We will use k=10, meaning each fold will contain about 937/10 or about 94 examples.

Stratified means that each fold will contain the same mixture of examples by class, that is about 96% to 4% non-spill and spill. Repeated means that the evaluation process will be performed multiple times to help avoid fluke results and better capture the variance of the chosen model. We will use three repeats.

This means a single model will be fit and evaluated 10 * 3, or 30, times and the mean and standard deviation of these runs will be reported.

This can be achieved using the RepeatedStratifiedKFold scikit-learn class.
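A small sketch can make the fold arithmetic concrete. It builds a label vector with the class counts given earlier (896 and 41), uses dummy features since only the labels drive the stratification, and counts how many examples and how many positives land in each test fold. The random_state value is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Labels with the class counts from the dataset description.
y = np.array([0] * 896 + [1] * 41)
# Features do not affect how the folds are drawn, so zeros are enough here.
X = np.zeros((len(y), 1))

# 10 folds, repeated 3 times, gives 30 train/test splits in total.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

fold_sizes, fold_positives = [], []
for train_idx, test_idx in cv.split(X, y):
    fold_sizes.append(len(test_idx))
    fold_positives.append(int(y[test_idx].sum()))

print("splits:", len(fold_sizes))
print("test fold size: min=%d max=%d" % (min(fold_sizes), max(fold_sizes)))
print("positives per test fold: min=%d max=%d" % (min(fold_positives), max(fold_positives)))
```

Each test fold holds only a handful of oil spill cases, which is one reason the scores vary noticeably from fold to fold and why repeating the procedure helps.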

We are predicting class labels of whether a satellite image patch contains a spill or not. There are many measures we could use, although the authors of the paper chose to report the sensitivity, specificity, and the geometric mean of the two scores, called the G-mean.

To this end, we have mainly used the geometric mean (g-mean) […] This measure has the distinctive property of being independent of the distribution of examples between classes, and is thus robust in circumstances where this distribution might change with time or be different in the training and testing sets.

Recall that the sensitivity is a measure of the accuracy for the positive class and specificity is a measure of the accuracy of the negative class.

• Sensitivity = TruePositives / (TruePositives + FalseNegatives)
• Specificity = TrueNegatives / (TrueNegatives + FalsePositives)

The G-mean, the geometric mean of these two scores, seeks a balance between them: poor performance on either one results in a low G-mean score.

• G-Mean = sqrt(Sensitivity * Specificity)

We can calculate the G-mean for a set of predictions made by a model using the geometric_mean_score() function provided by the imbalanced-learn library.
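To make the formulas concrete, here is a small worked sketch that computes sensitivity, specificity, and the G-mean by hand from a confusion matrix. It uses only scikit-learn rather than geometric_mean_score(), which for a binary problem like this one should report the same quantity. The g_mean_from_labels() helper and the ten example labels are made up for illustration, and the function assumes both classes appear in y_true.

```python
from math import sqrt

from sklearn.metrics import confusion_matrix


def g_mean_from_labels(y_true, y_pred):
    """Return (sensitivity, specificity, g_mean) for binary labels 0/1."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # accuracy on the positive class
    specificity = tn / (tn + fp)  # accuracy on the negative class
    return sensitivity, specificity, sqrt(sensitivity * specificity)


# Eight negatives and two positives (illustrative labels only).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# Two false positives and one false negative.
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

sens, spec, g = g_mean_from_labels(y_true, y_pred)
print("Sensitivity=%.3f Specificity=%.3f G-mean=%.3f" % (sens, spec, g))

# Predicting the majority class everywhere gives perfect specificity,
# zero sensitivity, and therefore a G-mean of zero.
_, _, g_majority = g_mean_from_labels(y_true, [0] * len(y_true))
print("G-mean for all-majority predictions=%.3f" % g_majority)
```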

First, we can define a function to load the dataset and split the columns into input and output variables. We will also drop column 22 because the column contains a single value, and the first column that defines the image patch number. The load_dataset() function below implements this.
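A minimal sketch of such a function follows. It is split in two so the column handling can be checked without the file: prepare_dataset() is a hypothetical helper that works on an already loaded DataFrame, and load_dataset() wraps it around read_csv(). The stand-in DataFrame only mimics the layout described above (patch number in column 0, a constant column 22, label in the last column); its values are placeholders.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def prepare_dataset(df):
    """Drop the constant column (22) and the patch number column (0),
    then split into inputs X and label-encoded target y."""
    df = df.drop(columns=[22, 0])
    data = df.values
    X, y = data[:, :-1], data[:, -1]
    # ensure the target is encoded as integers 0 and 1
    y = LabelEncoder().fit_transform(y)
    return X, y


def load_dataset(full_path):
    """Load the CSV (no header line) and prepare it for modeling."""
    return prepare_dataset(pd.read_csv(full_path, header=None))


# Stand-in frame with the described layout (placeholder values).
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(937, 50)))
df[0] = np.arange(1, 938)                        # patch number
df[22] = 0.0                                     # single-valued column
df[49] = np.array([0.0] * 896 + [1.0] * 41)      # class label

X, y = prepare_dataset(df)
print(X.shape, y.shape)
```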

We can then define a function that will evaluate a given model on the dataset and return a list of G-Mean scores for each fold and repeat.

The evaluate_model() function below implements this, taking the dataset and model as arguments and returning the list of scores.

Finally, we can evaluate a baseline model on the dataset using this test harness.

A model that predicts the majority class label (0) or the minority class label (1) for all cases will result in a G-mean of zero. As such, a good default strategy would be to randomly predict one class label or another with a 50% probability and aim for a G-mean of about 0.5.

This can be achieved using the DummyClassifier class from the scikit-learn library and setting the “strategy” argument to ‘uniform’.

Once the model is evaluated, we can report the mean and standard deviation of the G-mean scores directly.

Tying this together, the complete example of loading the dataset, evaluating a baseline model and reporting the performance is listed below.
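The sketch below shows one way the harness and baseline could fit together. Two things are assumptions rather than the tutorial's own code: the G-mean is computed with scikit-learn's recall_score() instead of geometric_mean_score(), and the data comes from make_classification() with a similar size and class skew instead of the oil spill file, so the printed scores will not match the numbers discussed in this tutorial.

```python
from numpy import mean, sqrt, std
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for labels 0/1."""
    sensitivity = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
    specificity = recall_score(y_true, y_pred, pos_label=0, zero_division=0)
    return sqrt(sensitivity * specificity)


def evaluate_model(X, y, model):
    """Return the 30 G-mean scores from 3 repeats of stratified 10-fold CV."""
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    return cross_val_score(model, X, y, scoring=make_scorer(g_mean), cv=cv)


# Stand-in data: 937 rows, 47 inputs, roughly 96% to 4% class split.
X, y = make_classification(n_samples=937, n_features=47, weights=[0.956],
                           random_state=1)

# Baseline: guess each label with 50% probability.
uniform_scores = evaluate_model(X, y, DummyClassifier(strategy="uniform", random_state=1))
print("Uniform G-mean: %.3f (%.3f)" % (mean(uniform_scores), std(uniform_scores)))

# For contrast: always predicting the majority class scores zero on every fold.
majority_scores = evaluate_model(X, y, DummyClassifier(strategy="most_frequent"))
print("Majority G-mean: %.3f (%.3f)" % (mean(majority_scores), std(majority_scores)))
```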

Running the example first loads and summarizes the dataset.

We can see that we have the correct number of rows loaded, and that we have 47 computer vision derived input variables, with the constant value column (index 22) and the patch number column (index 0) removed.

Importantly, we can see that the class labels have the correct mapping to integers, with 0 for the majority class and 1 for the minority class, as is customary for imbalanced binary classification datasets.

Next, the average of the G-Mean scores is reported.

In this case, we can see that the baseline algorithm achieves a G-Mean of about 0.47, close to the 0.5 expected from uniform random guessing. This score provides a lower limit on model skill; any model that achieves an average G-Mean above about 0.47 (or really above 0.5) has skill, whereas models that achieve a score below this value do not have skill on this dataset.

It is interesting to note that a good G-mean reported in the paper was about 0.811, although the model evaluation procedure was different. This provides a rough target for “good” performance on this dataset.

Now that we have a test harness and a baseline in performance, we can begin to evaluate some models on this dataset.

Evaluate Models

In this section, we will evaluate a suite of different techniques on the dataset using the test harness developed in the previous section.

The goal is to both demonstrate how to work through the problem systematically and to demonstrate the capability of some techniques designed for imbalanced classification problems.

The reported performance is good, but not highly optimized (e.g. hyperparameters are not tuned).

What score can you get? If you can achieve better G-mean performance using the same test harness, I’d love to hear about it. Let me know in the comments below.

Evaluate Probabilistic Models

Let’s start by evaluating some probabilistic models on the dataset.

Probabilistic models are those models that are fit on the data under a probabilistic framework and often perform well in general for imbalanced classification datasets.

We will evaluate the following probabilistic models with default hyperparameters on the dataset:

• Logistic Regression (LR)
• Linear Discriminant Analysis (LDA)
• Gaussian Naive Bayes (NB)

Both LR and LDA are sensitive to the scale of the input variables, and often expect and/or perform better if input variables with different scales are normalized or standardized as a pre-processing step.

In this case, we will standardize the dataset prior to fitting each model. This will be achieved using a Pipeline and the StandardScaler class. The use of a Pipeline ensures that the StandardScaler is fit on the training dataset and applied to the train and test sets within each k-fold cross-validation evaluation, avoiding any data leakage that might result in an optimistic result.

We can define a list of models to evaluate on our test harness as follows:

Once defined, we can enumerate the list and evaluate each in turn. The mean and standard deviation of G-mean scores can be printed during evaluation and the sample of scores can be stored.
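A sketch of the model definitions and the evaluation loop might look like the following. It is self-contained, so it repeats a small harness, computes the G-mean with recall_score() rather than geometric_mean_score(), and runs on make_classification() stand-in data; those are assumptions for illustration, and the scores it prints are not the tutorial's results. The scaled() helper is a hypothetical convenience.

```python
from numpy import mean, sqrt, std
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def g_mean(y_true, y_pred):
    sensitivity = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
    specificity = recall_score(y_true, y_pred, pos_label=0, zero_division=0)
    return sqrt(sensitivity * specificity)


def evaluate_model(X, y, model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    return cross_val_score(model, X, y, scoring=make_scorer(g_mean), cv=cv)


def scaled(model):
    """Wrap a model so standardization is fit on each training fold only."""
    return Pipeline(steps=[("t", StandardScaler()), ("m", model)])


# Models to compare, keyed by the short names used in the text.
models = {
    "LR": scaled(LogisticRegression(solver="liblinear")),
    "LDA": scaled(LinearDiscriminantAnalysis()),
    "NB": scaled(GaussianNB()),
}

# Stand-in data with a similar size and class skew (not the real dataset).
X, y = make_classification(n_samples=937, n_features=47, weights=[0.956],
                           random_state=1)

results = {}
for name, model in models.items():
    scores = evaluate_model(X, y, model)
    results[name] = scores
    print(">%s %.3f (%.3f)" % (name, mean(scores), std(scores)))
```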

Algorithms can be compared directly based on their mean G-mean score.

At the end of the run, we can use the scores to create a box and whisker plot for each algorithm.

Creating the plots side by side allows the distributions to be compared with regard to both the mean score and the middle 50 percent of the distribution between the 25th and 75th percentiles.

Tying this together, the complete example comparing three probabilistic models on the oil spill dataset using the test harness is listed below.

Running the example evaluates each of the probabilistic models on the dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

You may see some warnings from the LDA algorithm such as “Variables are collinear”. These can be safely ignored for now, but suggest that the algorithm could benefit from feature selection to remove some of the variables.

In this case, we can see that each algorithm has skill, achieving a mean G-mean above 0.5. The results suggest that LDA might be the best performing of the models tested.

The distribution of the G-mean scores is summarized using a figure with a box and whisker plot for each algorithm. We can see that the distribution for both LDA and NB is compact and skillful and that the LR may have a few results during the run where the method performed poorly, pushing the distribution down.

This highlights that it is not just the mean performance, but also the consistency of the model that should be considered when selecting a model.

Box and Whisker Plot of Probabilistic Models on the Imbalanced Oil Spill Dataset

We’re off to a good start, but we can do better.

Evaluate Balanced Logistic Regression

The logistic regression algorithm supports a modification that weights classification errors in inverse proportion to the class frequencies, so that errors on the rare class count for more.

This allows the model to better learn the class boundary in favor of the minority class, which might help overall G-mean performance. We can achieve this by setting the “class_weight” argument of the LogisticRegression to ‘balanced’.

As mentioned, logistic regression is sensitive to the scale of input variables and can perform better with normalized or standardized inputs; as such, it is a good idea to test both for a given dataset. Additionally, a power transform can be used to spread out the distribution of each input variable and make those variables with a Gaussian-like distribution more Gaussian. This can benefit models like Logistic Regression that make assumptions about the distribution of input variables.

The power transform will use the Yeo-Johnson method, which supports positive and negative inputs, but we will also normalize the data prior to the transform. The PowerTransformer class used for the transform will also standardize each variable after the transform.

We will compare a LogisticRegression with a balanced class weighting to the same algorithm with three different data preparation schemes, specifically normalization, standardization, and a power transform.

Tying this together, the comparison of balanced logistic regression with different data preparation schemes is listed below.
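The four variations could be sketched as below. As before, this is a hedged, self-contained version: the harness is repeated, the G-mean is computed with recall_score(), and the data is a make_classification() stand-in, so the printed numbers are not the ones discussed in this section. The balanced_lr() helper and the model labels are hypothetical names.

```python
from numpy import mean, sqrt, std
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler


def g_mean(y_true, y_pred):
    sensitivity = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
    specificity = recall_score(y_true, y_pred, pos_label=0, zero_division=0)
    return sqrt(sensitivity * specificity)


def evaluate_model(X, y, model):
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    return cross_val_score(model, X, y, scoring=make_scorer(g_mean), cv=cv)


def balanced_lr():
    """Logistic regression with errors weighted inversely to class frequency."""
    return LogisticRegression(solver="liblinear", class_weight="balanced")


models = {
    # no data preparation
    "Balanced": balanced_lr(),
    # normalization to the range [0, 1]
    "Balanced-Norm": Pipeline(steps=[("t", MinMaxScaler()), ("m", balanced_lr())]),
    # standardization to zero mean and unit variance
    "Balanced-Std": Pipeline(steps=[("t", StandardScaler()), ("m", balanced_lr())]),
    # normalization, then Yeo-Johnson power transform (standardizes by default)
    "Balanced-Power": Pipeline(steps=[("t1", MinMaxScaler()),
                                      ("t2", PowerTransformer()),
                                      ("m", balanced_lr())]),
}

# Stand-in data with a similar size and class skew (not the real dataset).
X, y = make_classification(n_samples=937, n_features=47, weights=[0.956],
                           random_state=1)

results = {}
for name, model in models.items():
    results[name] = evaluate_model(X, y, model)
    print(">%s %.3f (%.3f)" % (name, mean(results[name]), std(results[name])))
```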

Running the example evaluates each version of the balanced logistic regression model on the dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

You may see some warnings from the first balanced LR model, such as “Liblinear failed to converge”. These warnings can be safely ignored for now, but they suggest that the algorithm could benefit from scaled inputs, more training iterations, or feature selection to remove some of the variables.

In this case, we can see that the balanced version of logistic regression performs much better than all of the probabilistic models evaluated in the previous section.

The results suggest that perhaps the use of balanced LR with data normalization for pre-processing performs the best on this dataset with a mean G-mean score of about 0.852. This is in the range of, or better than, the results reported in the 1998 paper.

A figure is created with box and whisker plots for each algorithm, allowing the distribution of results to be compared.

We can see that the distribution for the balanced LR is tighter in general than the non-balanced version in the previous section. We can also see that the median result (orange line) for the normalized version is higher than the mean, above 0.9, which is impressive. A mean different from the median suggests a skewed distribution for the results, pulling the mean down with a few bad outcomes.

Box and Whisker Plot of Balanced Logistic Regression Models on the Imbalanced Oil Spill Dataset

We now have excellent results with little work; let’s see if we can take it one step further.

Evaluate Data Sampling With Probabilistic Models

Data sampling provides a way to better prepare the imbalanced training dataset prior to fitting a model.

Perhaps the most popular data sampling method is the SMOTE oversampling technique for creating new synthetic examples for the minority class. This can be paired with the edited nearest neighbor (ENN) algorithm that will locate and remove examples from the dataset that are ambiguous, making it easier for models to learn to discriminate between the two classes.

This combination is called SMOTE-ENN and can be implemented using the SMOTEENN class from the imbalanced-learn library; for example:

SMOTE and ENN both work better when the input data is scaled beforehand. This is because both techniques involve using the nearest neighbor algorithm internally and this algorithm is sensitive to input variables with different scales. Therefore, we will require the data to be normalized as a first step, then sampled, then used as input to the (unbalanced) logistic regression model.

As such, we can use the Pipeline class provided by the imbalanced-learn library to create a sequence of data transforms including the data sampling method, and ending with the logistic regression model.

We will compare four variations of the logistic regression model with data sampling, specifically:

• SMOTEENN + LR
• Normalization + SMOTEENN + LR
• Standardization + SMOTEENN + LR
• Normalization + Power + SMOTEENN + LR

The expectation is that LR will perform better with SMOTEENN, and that SMOTEENN will perform better with standardization or normalization. The last case does a lot, first normalizing the dataset, then applying the power transform, standardizing the result (recall that the PowerTransformer class will standardize the output by default), applying SMOTEENN, then finally fitting a logistic regression model.

These combinations can be defined as follows:

Tying this together, the complete example is listed below.

Running the example evaluates each version of the SMOTEENN with logistic regression model on the dataset.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the addition of SMOTEENN improves the performance of the default LR algorithm, achieving a mean G-mean of 0.852 compared to 0.621 seen in the first set of experimental results. This is even better than balanced LR without any data scaling (previous section) that achieved a G-mean of about 0.846.

The results suggest that perhaps the final combination of normalization, power transform, and standardization achieves a slightly better score than the default LR with SMOTEENN with a G-mean of about 0.873, although the warning messages suggest some problems that need to be ironed out.

The distribution of results can be compared with box and whisker plots. We can see the distributions all roughly have the same tight spread and that the difference in means of the results can be used to select a model.

Box and Whisker Plot of Logistic Regression Models with Data Sampling on the Imbalanced Oil Spill Dataset

Make Prediction on New Data

The use of SMOTEENN with Logistic Regression directly, without any data scaling, probably provides a simple and well-performing model that could be used going forward.

This model had a mean G-mean of about 0.852 on our test harness.

We will use this as our final model and use it to make predictions on new data.

First, we can define the model as a pipeline.

Once defined, we can fit it on the entire training dataset.

Once fit, we can use it to make predictions for new data by calling the predict() function. This will return the class label of 0 for no oil spill, or 1 for an oil spill.

For example:

To demonstrate this, we can use the fit model to make some predictions of labels for a few cases where we know there is no oil spill, and a few where we know there is.

The complete example is listed below.

Running the example first fits the model on the entire training dataset.

Then the fit model is used to predict the label for cases where we know there is no oil spill, chosen from the dataset file. We can see that all cases are correctly predicted.

Then some cases of actual oil spills are used as input to the model and the label is predicted. As we might have hoped, the correct labels are again predicted.


Summary

In this tutorial, you discovered how to develop a model to predict the presence of an oil spill in satellite images and evaluate it using the G-mean metric.

Specifically, you learned:

• How to load and explore the dataset and generate ideas for data preparation and model selection.
• How to evaluate a suite of probabilistic models and improve their performance with appropriate data preparation.
• How to fit a final model and use it to predict class labels for specific cases.

Do you have any questions?


30 Responses to How to Develop an Imbalanced Classification Model to Detect Oil Spills

1. Edivaldo February 22, 2020 at 4:16 am #

Hi Jason,

Congratulations.

2. Tiago Cordeiro February 22, 2020 at 10:07 am #

Hi Jason,

Thanks for the article.
I want help with the book I have to buy.

I have a classic imbalance problem: predicting credit defaults for large corporations.
But I don’t want to treat this problem with classic methods (like logistic regression or LDA).
I saw in an article that the heterogeneous set (medium weight set) did an excellent job on a similar problem.

Which book is the most advisable?

Thank you.

• Tiago Cordeiro February 22, 2020 at 12:55 pm #

#Heterogeneous ensembles(Weighted average ensemble)

• Jason Brownlee February 23, 2020 at 7:21 am #
• Tyler OConnell February 25, 2020 at 4:32 pm #

Excellent work! I love it 😀

3. Ganesh Yenurkar February 24, 2020 at 4:38 am #

hi sir…nice work in the field of Machine learning…actually I am now pursuing My PhD….Kindly suggest me some kind of the research components to carry out my research work….

4. Shkr hurrah February 25, 2020 at 9:47 am #

Hey ,Jason Brownlee
Will you please write an eloberated article on text summerisation NLP .

• Jason Brownlee February 25, 2020 at 11:18 am #

Thanks for the suggestion.

5. MUneeb February 25, 2020 at 9:34 pm #

I want to know how to deal with the data which is not Normal for the regression task. LIke price prediction where data is not normally distributed. How can we deal with that?

• Jason Brownlee February 26, 2020 at 8:18 am #

Try modeling it directly as a baseline. Then perhaps try power transform and other transforms and review any change/lift in performance.

6. abc February 26, 2020 at 10:33 am #

Seems typo in smoteenn code:

SMOTEENN(enn=EditedNearestNeighbours(sampling_strategy='minority'))

• Jason Brownlee February 26, 2020 at 11:40 am #

Thanks, what is the typo exactly?

7. Carlos March 5, 2020 at 3:19 am #

Hi Jason,

Thanks again for you detailed and excellent explanation of each step in building the model.

Did you choose, a logistic regression model for the sake of reproducibility with the paper?. Just curious if with a Boosting algorithm you could get a higher accuracy.

Thanks!
Carlos.

• Jason Brownlee March 5, 2020 at 6:40 am #

No, I tested a ton of algorithms and found that it performed well.

8. Jeremie March 17, 2020 at 4:41 am #

Thanks for this presentation.
I am quite uncomfortable about estimating the performance using the resampled dataset. I mean using the output dataset of the resampling function (Smoteenn in this article). I would prefer to use the original data.
Nevertheless, if we expect the a smoothed boundary to the function to be estimate, the resampling makes the boundary more “visible” also possibly more noisy.
In the other hand, one advandtave of smoteenn is the removing step that approximatly keep the resampled dataset in a convex space (almost).

Still, I don’t have a clear overview of this, do you ?
Thanks,

• Jason Brownlee March 17, 2020 at 8:23 am #

I think you misunderstand data resampling.

We fit on resampled data, but we estimate on raw data.

• Jeremie March 18, 2020 at 8:00 pm #

I am agree with this strategy : fit on resampled data, and estimate on raw data.

Nevertheless, I don’t understand where in the whole code from section “Evaluate Data Sampling With Probabilistic Models”, the performance estimation is computed only on raw data.
In the cross-validation loop “cross_val_score”, you give the raw data (X, y) to the pipeline that contains the resampling method. Do you mean that the cross_val_score apply the resampling only for the training step and keep the raw data for the prediction/evaluation ? I don’t think it is the case.
Thanks!

9. Thierry April 23, 2020 at 11:22 pm #

Great tutorial, as always!

I have a question regarding the PowerTransformer step in the pipeline.

I would imagine than only some of the features, not all, need this kind of transformation to have a more “Gaussian-like” distribution.

Doesn’t it hurt other features when we apply this transformation to the whole dataset?
Would there be a way, within the pipeline, to tell PowerTransformer to apply the transformation only to certain features and not others?

• Thierry April 23, 2020 at 11:30 pm #

Update : I tried to achieve this by defining a custom function which fits the transformer to selected columns only. I’m not sure if that makes sense or not. It doesn’t have any impact on the score at all…

def powtransf():
    pt = PowerTransformer()
    columns=[0,1,2,4,7,9,15,16,21,26,30,32,33,38,39,40,41,42,43,44,45,46]
    return pt.fit(X[:,columns])

[…]

models.append(Pipeline(steps=[('t', MinMaxScaler()),('p',powtransf()),('m', LogisticRegression(solver='liblinear', class_weight = 'balanced'))]))

• Jason Brownlee April 24, 2020 at 5:45 am #

If no impact on score, it is not needed on this task, might be useful on other projects…

• Jason Brownlee April 24, 2020 at 5:44 am #

Very likely.

Ideally, we would check each feature in turn.

10. Fan June 27, 2020 at 7:08 pm #

Hello, I would like to ask, what are the digitized image features, I can not find 50 so many, what way do you get so many digital features?

• Jason Brownlee June 28, 2020 at 5:45 am #

You can see the paper “Machine Learning for the Detection of Oil Spills in Satellite Radar Images.” linked in the tutorial that explains the data.

11. Leidy Castro April 21, 2021 at 6:08 am #

Hi, thanks for sharing this, I am novice in Machine Learning. My question is: How could I apply this model to a new image? I don’t really understand what the features are and how to convert satellite images to use this model.

• Jason Brownlee April 22, 2021 at 5:35 am #

You would have to extract features from the new image then feed those features into a model.

In this case, we don’t have the tools used to extract features from images.