Auto-Sklearn for Automated Machine Learning in Python

By Jason Brownlee on September 12, 2020 in Python Machine Learning

Automated Machine Learning (AutoML) refers to techniques for automatically discovering well-performing models for predictive modeling tasks with very little user involvement.

Auto-Sklearn is an open-source library for performing AutoML in Python. It makes use of the popular Scikit-Learn machine learning library for data transforms and machine learning algorithms and uses a Bayesian Optimization search procedure to efficiently discover a top-performing model pipeline for a given dataset.

In this tutorial, you will discover how to use Auto-Sklearn for AutoML with Scikit-Learn machine learning algorithms in Python.

After completing this tutorial, you will know:

Auto-Sklearn is an open-source library for AutoML with scikit-learn data preparation and machine learning models.
How to use Auto-Sklearn to automatically discover top-performing models for classification tasks.
How to use Auto-Sklearn to automatically discover top-performing models for regression tasks.

Let’s get started.

Tutorial Overview

This tutorial is divided into four parts; they are:

AutoML With Auto-Sklearn
Install and Using Auto-Sklearn
Auto-Sklearn for Classification
Auto-Sklearn for Regression

AutoML With Auto-Sklearn

Automated Machine Learning, or AutoML for short, is a process of discovering the best-performing pipeline of data transforms, model, and model configuration for a dataset.

AutoML often involves the use of sophisticated optimization algorithms, such as Bayesian Optimization, to efficiently navigate the space of possible models and model configurations and quickly discover what works well for a given predictive modeling task. It allows non-expert machine learning practitioners to quickly and easily discover what works well or even best for a given dataset with very little technical background or direct input.

Auto-Sklearn is an open-source Python library for AutoML using machine learning models from the scikit-learn machine learning library.

It was developed by Matthias Feurer, et al. and described in their 2015 paper titled “Efficient and Robust Automated Machine Learning.”

… we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters).

— Efficient and Robust Automated Machine Learning, 2015.

The benefit of Auto-Sklearn is that, in addition to discovering the data preparation and model that performs best for a dataset, it is also able to learn from models that performed well on similar datasets and to automatically create an ensemble of top-performing models discovered as part of the optimization process.

This system, which we dub AUTO-SKLEARN, improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization.

— Efficient and Robust Automated Machine Learning, 2015.

The authors provide a useful depiction of their system in the paper, provided below.

Overview of the Auto-Sklearn System.
Taken from: Efficient and Robust Automated Machine Learning, 2015.

Install and Using Auto-Sklearn

The first step is to install the Auto-Sklearn library, which can be achieved using pip, as follows:

sudo pip install auto-sklearn

Once installed, we can import the library and print the version number to confirm it was installed successfully:

# print autosklearn version
import autosklearn
print('autosklearn: %s' % autosklearn.__version__)

Running the example prints the version number.

Your version number should be the same or higher.

autosklearn: 0.6.0
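
The "same or higher" check can also be done in code. Below is a minimal sketch that compares dotted version strings numerically. The helper names version_tuple() and is_at_least() are hypothetical and are not part of Auto-Sklearn; the sketch assumes simple versions like 0.6.0.

```python
# compare dotted version strings numerically (hypothetical helpers)
import re

def version_tuple(version):
    # turn '0.8.0' into (0, 8, 0), stopping at the first non-numeric piece
    parts = []
    for piece in version.split('.'):
        match = re.match(r'\d+', piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def is_at_least(installed, required):
    # pad the shorter tuple with zeros so that '0.8' equals '0.8.0'
    a, b = version_tuple(installed), version_tuple(required)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b

# a plain string comparison would get this wrong, as '0.10.0' < '0.6.0' as text
print(is_at_least('0.10.0', '0.6.0'))
```

In practice, you could pass autosklearn.__version__ as the first argument.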

Using Auto-Sklearn is straightforward.

Depending on whether your prediction task is classification or regression, you create and configure an instance of the AutoSklearnClassifier or AutoSklearnRegressor class, fit it on your dataset, and that’s it. The resulting model can then be used to make predictions directly or saved to file (using pickle) for later use.

...
# define search
model = AutoSklearnClassifier()
# perform the search
model.fit(X_train, y_train)
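
Saving the fitted model with pickle might look like the sketch below. The helper names save_model() and load_model() are hypothetical, and a small dictionary stands in for the fitted model so that the example runs without performing a search. It assumes the fitted model can be pickled, as described above.

```python
# save and load an object with pickle (hypothetical helpers)
import os
import pickle
import tempfile

def save_model(model, path):
    # write the object to a binary file
    with open(path, 'wb') as handle:
        pickle.dump(model, handle)

def load_model(path):
    # read the object back from the binary file
    with open(path, 'rb') as handle:
        return pickle.load(handle)

# stand-in for a fitted AutoSklearnClassifier
stand_in = {'name': 'stand-in model', 'score': 0.9}
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'model.pkl')
    save_model(stand_in, path)
    restored = load_model(path)
print(restored == stand_in)
```

Only load pickle files that you trust, and ideally load them with the same library versions that were used to save them.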

The AutoSklearnClassifier and AutoSklearnRegressor classes accept many configuration options as constructor arguments.

By default, candidate models are evaluated on a train-test (holdout) split of your dataset during the search, and this default is recommended for both speed and simplicity.

Importantly, you should set the “n_jobs” argument to the number of cores in your system, e.g. 8 if you have 8 cores.

The optimization process will run for as long as you allow, measured in seconds. By default, it will run for one hour (3,600 seconds).

I recommend setting the “time_left_for_this_task” argument to the number of seconds you want the process to run. For example, 5 to 10 minutes or less is probably plenty for many small predictive modeling tasks (fewer than 1,000 rows).

We will use 5 minutes (300 seconds) for the examples in this tutorial. We will also limit the time allocated to each model evaluation to 30 seconds via the “per_run_time_limit” argument. For example:

...
# define search
model = AutoSklearnClassifier(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)
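
Rather than hard-coding these values, they could be collected in one place. Below is a minimal sketch, assuming a hypothetical helper named search_settings() that converts minutes to seconds and reads the core count from the operating system. The check that a single evaluation fits inside the total budget is my own assumption, not a rule taken from the library.

```python
# build the search settings described above (hypothetical helper)
import os

def search_settings(minutes=5, per_run_seconds=30):
    # total budget for the search, in seconds
    total_seconds = int(minutes * 60)
    # assumed sanity check: one evaluation should fit inside the total budget
    if per_run_seconds >= total_seconds:
        raise ValueError('per-run limit must be smaller than the total budget')
    return {
        'time_left_for_this_task': total_seconds,
        'per_run_time_limit': per_run_seconds,
        # cpu_count() reports logical CPUs and may return None
        'n_jobs': os.cpu_count() or 1,
    }

settings = search_settings()
print(settings)
# the dictionary could then be unpacked into the constructor:
# model = AutoSklearnClassifier(**settings)
```

Note that os.cpu_count() counts logical CPUs, which may be more than the number of physical cores.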

You can limit the algorithms considered in the search, as well as the data transforms.

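By default, the search will create an ensemble of top-performing models discovered as part of the search, and it will warm-start the search with configurations that performed well on similar datasets (meta-learning). Sometimes, this can lead to overfitting. The ensemble can be disabled by setting the “ensemble_size” argument to 1, and the meta-learning warm start can be disabled by setting “initial_configurations_via_metalearning” to 0.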

...
# define search
model = AutoSklearnClassifier(ensemble_size=1, initial_configurations_via_metalearning=0)

At the end of a run, the list of models can be accessed, as well as other details.

Perhaps the most useful feature is the sprint_statistics() method, which summarizes the search and the performance of the final model.

...
# summarize performance
print(model.sprint_statistics())

Now that we are familiar with the Auto-Sklearn library, let’s look at some worked examples.

Auto-Sklearn for Classification

In this section, we will use Auto-Sklearn to discover a model for the sonar dataset.

The sonar dataset is a standard machine learning dataset comprised of 208 rows of data with 60 numerical input variables and a target variable with two class values, i.e. binary classification.

Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve an accuracy of about 53 percent. A top-performing model can achieve accuracy on this same test harness of about 88 percent. This provides the bounds of expected performance on this dataset.
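
To make the idea of a naive model concrete, the sketch below computes the accuracy of always predicting the most common class. This is one common choice of naive classifier and is an assumption here, as the text does not say which baseline was used. The labels are a toy example and not the sonar data.

```python
# accuracy of always predicting the most common class
from collections import Counter

def majority_class_accuracy(labels):
    # count each class and take the size of the largest one
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# toy labels only, not the sonar dataset
toy_labels = ['M'] * 8 + ['R'] * 7
print('Naive accuracy: %.3f' % majority_class_accuracy(toy_labels))
```

On a nearly balanced two-class problem, this baseline sits just above 50 percent, which is the lower bound any useful model must beat.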

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

We will use Auto-Sklearn to find a good model for the sonar dataset.

First, we will split the dataset into train and test sets and allow the process to find a good model on the training set, then later evaluate the performance of what was found on the holdout test set.

...
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

The AutoSklearnClassifier is configured to run for 5 minutes with 8 cores and limit each model evaluation to 30 seconds.

...
# define search
model = AutoSklearnClassifier(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)

The search is then performed on the training dataset.

...
# perform the search
model.fit(X_train, y_train)

Afterward, a summary of the search and best-performing model is reported.

...
# summarize
print(model.sprint_statistics())

Finally, we evaluate the performance of the model that was prepared on the holdout test dataset.

...
# evaluate best model
y_hat = model.predict(X_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)

Tying this together, the complete example is listed below.

# example of auto-sklearn for the sonar classification dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from autosklearn.classification import AutoSklearnClassifier
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# print(dataframe.head())
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
# minimally prepare dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define search
model = AutoSklearnClassifier(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)
# perform the search
model.fit(X_train, y_train)
# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(X_test)
acc = accuracy_score(y_test, y_hat)
print("Accuracy: %.3f" % acc)
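
The "minimally prepare dataset" step in the listing above converts the string class labels to integers. Below is a plain-Python sketch of that idea, assuming the labels are short strings such as 'M' and 'R'. The helper encode_labels() is hypothetical; it mimics the sorted-class ordering that LabelEncoder uses and is not the scikit-learn implementation.

```python
# map string class labels to integers in sorted order (hypothetical helper)
def encode_labels(labels):
    # sorted unique classes define the integer codes
    classes = sorted(set(labels))
    index = {label: i for i, label in enumerate(classes)}
    return [index[label] for label in labels], classes

encoded, classes = encode_labels(['R', 'M', 'M', 'R'])
print(encoded, classes)
```

Keeping the list of classes makes it possible to map integer predictions back to the original labels later.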

Running the example will take about five minutes, given the hard limit we imposed on the run.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

At the end of the run, a summary is printed showing that 1,054 models were evaluated and the estimated performance of the final model was 91 percent.

auto-sklearn results:
Dataset name: f4c282bd4b56d4db7e5f7fe1a6a8edeb
Metric: accuracy
Best validation score: 0.913043
Number of target algorithm runs: 1054
Number of successful target algorithm runs: 952
Number of crashed target algorithm runs: 94
Number of target algorithms that exceeded the time limit: 8
Number of target algorithms that exceeded the memory limit: 0
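
If you want to work with these numbers in code, the text summary can be split into fields. Below is a minimal sketch, assuming the summary keeps the "name: value" layout shown above. The helper parse_statistics() is hypothetical and is not part of Auto-Sklearn.

```python
# turn the text summary into a dictionary (hypothetical helper)
def parse_statistics(text):
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(':')
        value = value.strip()
        # skip the title line and any line without a value
        if not sep or not value:
            continue
        # store counts as int, scores as float, everything else as text
        try:
            stats[key.strip()] = int(value)
        except ValueError:
            try:
                stats[key.strip()] = float(value)
            except ValueError:
                stats[key.strip()] = value
    return stats

# the summary printed by the run above
report = """auto-sklearn results:
Dataset name: f4c282bd4b56d4db7e5f7fe1a6a8edeb
Metric: accuracy
Best validation score: 0.913043
Number of target algorithm runs: 1054
Number of successful target algorithm runs: 952
Number of crashed target algorithm runs: 94
Number of target algorithms that exceeded the time limit: 8
Number of target algorithms that exceeded the memory limit: 0"""

stats = parse_statistics(report)
runs = stats['Number of target algorithm runs']
successful = stats['Number of successful target algorithm runs']
print('Successful runs: %.1f%%' % (100.0 * successful / runs))
```

In a real script, the text would come from model.sprint_statistics() rather than a pasted string.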

We then evaluate the model on the holdout dataset and see that classification accuracy of 81.2 percent was achieved, which is reasonably skillful.

Accuracy: 0.812

Auto-Sklearn for Regression

In this section, we will use Auto-Sklearn to discover a model for the auto insurance dataset.

The auto insurance dataset is a standard machine learning dataset comprised of 63 rows of data with one numerical input variable and a numerical target variable.

Using a test harness of repeated 10-fold cross-validation with three repeats, a naive model can achieve a mean absolute error (MAE) of about 66. A top-performing model can achieve a MAE on this same test harness of about 28. This provides the bounds of expected performance on this dataset.
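
To make the MAE and the naive model concrete, the sketch below scores a model that always predicts the median of the training targets. Using the median is an assumption, as the text does not say which naive model was used. The numbers are toy values and not the auto insurance data.

```python
# MAE of a naive model that always predicts the training median
from statistics import median

def mae_score(y_true, y_pred):
    # average of the absolute differences
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# toy values only, not the auto insurance dataset
train_y = [10.0, 20.0, 30.0, 100.0]
test_y = [15.0, 25.0, 60.0]
naive_prediction = median(train_y)
naive_mae = mae_score(test_y, [naive_prediction] * len(test_y))
print('Naive MAE: %.3f' % naive_mae)
```

The MAE is in the same units as the target variable, and lower values are better.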

The dataset involves predicting the total amount in claims (thousands of Swedish Kronor) given the number of claims for different geographical regions.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the auto insurance dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 63 rows of data with one input variable.

(63, 1) (63,)

We will use Auto-Sklearn to find a good model for the auto insurance dataset.

We can use the same process as was used in the previous section, although we will use the AutoSklearnRegressor class instead of the AutoSklearnClassifier.

...
# define search
model = AutoSklearnRegressor(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8)

By default, the regressor will optimize the R^2 metric.
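
For reference, R^2 compares the squared error of the predictions to the squared error of always predicting the mean. Below is a minimal sketch of the calculation on toy values. The function r2_manual() is a hypothetical helper written for illustration and is not the library's implementation.

```python
# manual calculation of R^2 on toy values (hypothetical helper)
def r2_manual(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    # squared error of the predictions
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # squared error of always predicting the mean
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]
print('R^2: %.3f' % r2_manual(y_true, y_pred))
```

Unlike the MAE, R^2 has no units and higher values are better, with 1.0 being a perfect fit and 0.0 matching a model that always predicts the mean.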

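In this case, we are interested in the mean absolute error, or MAE, which we can specify via the “metric” argument when defining the search. An earlier version of this example passed the argument to the fit() function instead, which readers report fails with a TypeError on current releases of the library.

...
# define search, optimizing MAE
model = AutoSklearnRegressor(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8, metric=auto_mean_absolute_error)
# perform the search
model.fit(X_train, y_train)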

The complete example is listed below.

# example of auto-sklearn for the insurance regression dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from autosklearn.regression import AutoSklearnRegressor
from autosklearn.metrics import mean_absolute_error as auto_mean_absolute_error
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/auto-insurance.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
data = data.astype('float32')
X, y = data[:, :-1], data[:, -1]
# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# define search, optimizing MAE
model = AutoSklearnRegressor(time_left_for_this_task=5*60, per_run_time_limit=30, n_jobs=8, metric=auto_mean_absolute_error)
# perform the search
model.fit(X_train, y_train)
# summarize
print(model.sprint_statistics())
# evaluate best model
y_hat = model.predict(X_test)
mae = mean_absolute_error(y_test, y_hat)
print("MAE: %.3f" % mae)

Running the example will take about five minutes, given the hard limit we imposed on the run.

You might see some warning messages during the run and you can safely ignore them, such as:

Target Algorithm returned NaN or inf as quality. Algorithm run is treated as CRASHED, cost is set to 1.0 for quality scenarios. (Change value through "cost_for_crash"-option.)

At the end of the run, a summary is printed showing that 1,759 models were evaluated and the estimated performance of the final model was a MAE of about 30.

auto-sklearn results:
Dataset name: ff51291d93f33237099d48c48ee0f9ad
Metric: mean_absolute_error
Best validation score: 29.911203
Number of target algorithm runs: 1759
Number of successful target algorithm runs: 1362
Number of crashed target algorithm runs: 394
Number of target algorithms that exceeded the time limit: 3
Number of target algorithms that exceeded the memory limit: 0

We then evaluate the model on the holdout dataset and see that a MAE of 26 was achieved, which is a great result.

MAE: 26.498

Summary

In this tutorial, you discovered how to use Auto-Sklearn for AutoML with Scikit-Learn machine learning algorithms in Python.

Specifically, you learned:

Auto-Sklearn is an open-source library for AutoML with scikit-learn data preparation and machine learning models.
How to use Auto-Sklearn to automatically discover top-performing models for classification tasks.
How to use Auto-Sklearn to automatically discover top-performing models for regression tasks.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

75 Responses to Auto-Sklearn for Automated Machine Learning in Python

Alessandro F. September 7, 2020 at 7:29 pm #

Hello Jason,

I’ve tried to run Auto-Sklearn on Google Colab without success.
It seems there is a conflict with the current version of Pandas (1.0.x) while the package runs on a prior version (0.25.x).

I’ll try to solve this issue.
Anyway, your explanations are always the best !

Regards,
Alessandro

Jason Brownlee September 8, 2020 at 6:47 am #

Perhaps try running on your workstation.

Gábor September 8, 2020 at 8:29 pm #

I also had a problem with Pandas >0.25.x. Downgrading to 0.25.3 and replacing the arff package with liac-arff fixed it.

Jason Brownlee September 9, 2020 at 6:47 am #

Here are the versions I’m using if that helps at all:

python: 3.6.12
scipy: 1.4.1
numpy: 1.18.5
matplotlib: 3.3.0
pandas: 1.1.1
statsmodels: 0.12.0
sklearn: 0.23.2
skopt: 0.8.1
autosklearn: 0.8.0
nltk: 3.5
gensim: 3.8.3
xgboost 1.0.2
lightgbm 2.3.1
catboost 0.24.1
opencv: 3.4.10
imblearn: 0.7.0
fbprophet: 0.7.1
tensorflow: 2.3.0
theano: 1.0.5
keras: 2.4.3
mtcnn: 0.1.0
keras-vggface: 0.6

Junaid Akhtar September 12, 2020 at 4:09 pm #

how did you compute this list?

Jason Brownlee September 13, 2020 at 5:59 am #

I printed the version of each library in turn with this script:

# check versions of main machine learning libraries

# python
import platform
print('python: %s' % platform.python_version())
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)

# scikit-optimize
import skopt
print('skopt: %s' % skopt.__version__)
# autosklearn
import autosklearn
print('autosklearn: %s' % autosklearn.__version__)

# nltk
import nltk
print('nltk: %s' % nltk.__version__)
# gensim
import gensim
print('gensim: %s' % gensim.__version__)
# xgboost
import xgboost
print("xgboost", xgboost.__version__)
# lightgbm
import lightgbm
print("lightgbm", lightgbm.__version__)
# catboost
import catboost
print("catboost", catboost.__version__)
# opencv
import cv2
print('opencv: %s' % cv2.__version__)
# imblearn learn
import imblearn
print('imblearn: %s' % imblearn.__version__)
# fbprophet
import fbprophet
print('fbprophet: %s' % fbprophet.__version__)

# tensorflow
import tensorflow
print('tensorflow: %s' % tensorflow.__version__)
# theano
import theano
print('theano: %s' % theano.__version__)
# keras
import keras
print('keras: %s' % keras.__version__)

# MTCNN
import mtcnn
print('mtcnn: %s' % mtcnn.__version__)
# keras-vggface
import keras_vggface
print('keras-vggface: %s' % keras_vggface.__version__)

Gabriel September 8, 2020 at 3:49 am #

Is it possible to evaluate the automatically selected model by hand? I mean, is there a method to call for all the parameters and configuration set to the model at the end?

- Jason Brownlee September 8, 2020 at 6:52 am #
  
  Yes, once defined you can evaluate it on your own test harness.
  
shamik September 8, 2020 at 11:25 am #

This is such informative information. Really very useful. I wanted to use this in my live project but could not move ahead.
This package does not seem to be supported by windows as per their official website. Neither could find a way out for Anaconda which I am using at present.
This command mentioned above does not work in Anaconda:
sudo pip install autosklearn

Kindly help how we can use it in Anaconda env.

- Jason Brownlee September 8, 2020 at 1:36 pm #
  
  You’re welcome.
  
  Sorry to hear that. Perhaps try using conda to install the package?
  
Manoj Ishi September 8, 2020 at 7:06 pm #

TypeError: ‘generator’ object is not subscriptable
error is occurred while running the classification problem

- Jason Brownlee September 9, 2020 at 6:46 am #
  
  This may help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
  
- Patrick Ryan September 14, 2020 at 11:45 am #
  
  I saw the same thing. According to this GitHub issue:
  
  https://github.com/automl/auto-sklearn/issues/380
  
  you should
  
  pip install liac-arff
  
  first remove arff:
  
  pip uninstall arff
  
  then
  
  pip install liac-arff
  
  - Jason Brownlee September 14, 2020 at 2:45 pm #
    
    Thanks for sharing.
    
Gábor Stikkel September 9, 2020 at 4:55 am #

Thank you for yet another great article Jason!
I got an error for the regression part:

—————————————————————————
TypeError Traceback (most recent call last)
in
17 model = AutoSklearnRegressor(time_left_for_this_task=60, per_run_time_limit=30, n_jobs=8)#, metric=auto_mean_absolute_error)
18 # perform the search
—> 19 model.fit(X_train, y_train, metric=auto_mean_absolute_error)
20 # summarize
21 print(model.sprint_statistics())

TypeError: fit() got an unexpected keyword argument ‘metric’

I think the metric part shall be moved to the model definition (see commented part above).

[Keep rocking!]

- Jason Brownlee September 9, 2020 at 6:53 am #
  
  Sorry to hear that, perhaps check that all of your libraries are up to date?
  
  - Adeliya September 10, 2020 at 11:26 pm #
    
    Hi!
    
    I got the same error. And actually looking at the documentation “metric” is a parameter of AutoSklearnRegressor(), not fit().
    
    Thanks for tutorial!
    
    - Jason Brownlee September 11, 2020 at 5:58 am #
      
      Nice work.
      
Bill September 11, 2020 at 5:41 am #

Thanks for the tutorial,

We are really keen to see some tutorials on Knowledge Graphs and Knowledge Graph Embeddings 🙂

Thanks,
Bill

- Jason Brownlee September 11, 2020 at 6:04 am #
  
  Thanks Bill.
  
Timothy September 11, 2020 at 8:02 am #

Looks like just what I need. Unfortunately, install was not successful. Will have to revisit when I have lots of time to troubleshoot.

- Jason Brownlee September 11, 2020 at 1:28 pm #
  
  Sorry to hear that.
  
Xu Zhang September 11, 2020 at 8:25 am #

Thank you for your great post.

Does Auto-Sklearn always get better performance compared to fine-tuned individual models?

- Jason Brownlee September 11, 2020 at 1:28 pm #
  
  No, but it can find a good model quickly.
  
ct dyana September 11, 2020 at 8:47 am #

Tq for the informative explanation…I like it much…

but I am interested to know what algorithm does python used in auto sklearn for ml? Does it perform as same as using weka..
How I want to know the algorithm for classification and regression that applicable in lib python?

Tqvm

- Jason Brownlee September 11, 2020 at 1:29 pm #
  
  You’re welcome.
  
  It uses Bayesian Optimization to search for an appropriate algorithm.
  
Amit Barik September 11, 2020 at 4:13 pm #

Hi Jason,

I have windows10 laptop. When i tried installing auto-sklearn, it said it couldn’t becoz of system compatibility issues. Then i checked in git and got to know that we can’t install in windows machine. Am i right? Does it run only on unix machines?

Please help!!!

Regards,
Amit

- Jason Brownlee September 12, 2020 at 6:05 am #
  
  Sorry, I don’t know about windows machines.
  
- Hemanth Varma September 15, 2020 at 3:16 am #
  
  It is not possible to run auto-sklearn on a Windows machine.
  
  https://automl.github.io/auto-sklearn/master/installation.html
  
  - Jason Brownlee September 15, 2020 at 5:28 am #
    
    Perhaps try a virtual machine?
    
Santosh September 11, 2020 at 6:06 pm #

Can we retrieve the parameters that were used internally to get the best results?

- Jason Brownlee September 12, 2020 at 6:05 am #
  
  Yes, the best model includes the hyperparameters used.
  
Mark Littlewood September 11, 2020 at 6:30 pm #

How do you know what algo it has selected eg GBM or is it an ensemble

- Jason Brownlee September 12, 2020 at 6:06 am #
  
  You can use model.show_models() to show the ensemble of models.
  
Anubha Pearline S September 11, 2020 at 8:25 pm #

Hi Jason,
How to check which model is chosen thro AutoSklearn?

- Jason Brownlee September 12, 2020 at 6:11 am #
  
  You can use the following to show the models in the final ensemble:
  
  ...
  print(model.show_models())
  
shaheen mohammed saleh September 12, 2020 at 1:12 am #

Please how to install auto-sklearn in anaconda windows 10 because auto-sklearn has the following system requirements:

Linux operating system (for example Ubuntu) (get Linux here),

Python (>=3.5) (get Python here).

C++ compiler (with C++11 supports) (get GCC here) and

SWIG (version 3.0.* is required; >=4.0.0 is not supported) (get SWIG here).

- Jason Brownlee September 12, 2020 at 6:18 am #
  
  Perhaps use an AWS EC2 instance or a linux virtual machine.
  
Rahul Goswami September 12, 2020 at 2:51 am #

For everyone facing issues installing auto sklearn on windows in Google Colab, should run on jupyter as well but have not tried out.Run the below codes and auto sklearn runs without issue

!apt-get install swig -y
!pip install Cython numpy

# sometimes you have to run the next command twice on colab
# I haven’t figured out why
!pip install auto-sklearn

Reply
- Jason Brownlee September 12, 2020 at 6:19 am #
  
  Thanks for sharing.
  
  Reply
Rahul Goswami September 12, 2020 at 3:14 am #

Hi Jason,

I am getting this error when trying out the classifier for auto-sklearn:

TypeError: 'generator' object is not subscriptable

The error occurs while running the classification problem.

Kindly help if possible and thanks for all the great blogs.

Reply
- Jason Brownlee September 12, 2020 at 6:20 am #
  
  Sorry, I have not seen this error. Perhaps try searching/posting on stackoverflow.
  
  Reply
TC September 18, 2020 at 2:04 pm #

Hi Jason,

Thanks for writing this blog, as there are very few articles online covering auto-sklearn.

A question: does auto-sklearn really offer any “feature engineering”? It looks like the feature preprocessors just do dimensionality reduction or compression. For example, if we have a datetime variable, can auto-sklearn automatically create hour, day, month, and year features from it?

Reply
- Jason Brownlee September 18, 2020 at 2:52 pm #
  
  Not really. You could add those modules.
  
  Reply
  - TC September 18, 2020 at 6:06 pm #
    
    Could you provide an example of how to “add those modules”?
    
    Reply
    - Jason Brownlee September 19, 2020 at 6:48 am #
      
      Thanks for the suggestion, perhaps in the future.
      
      Reply
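One way to approach the datetime question in the thread above is to derive the calendar features yourself before passing the data to auto-sklearn. The sketch below is only an illustration using the Python standard library. The helper name expand_datetime and the sample timestamps are hypothetical, and nothing here is part of auto-sklearn itself.

```python
from datetime import datetime


def expand_datetime(value, fmt="%Y-%m-%d %H:%M:%S"):
    """Split one timestamp string into numeric year, month, day, and hour features."""
    dt = datetime.strptime(value, fmt)
    return [dt.year, dt.month, dt.day, dt.hour]


# hypothetical raw column of timestamp strings
raw = ["2020-09-18 14:04:00", "2020-12-31 23:59:59"]

# one row of engineered features per timestamp
features = [expand_datetime(v) for v in raw]
print(features)
```

The resulting numeric columns could then be joined to the other input features before calling fit().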
TC September 18, 2020 at 6:13 pm #

Hi Jason,

“Using a test harness of repeated stratified 10-fold cross-validation with three repeats, a naive model can achieve an accuracy of about 53 percent. A top-performing model can achieve accuracy on this same test harness of about 88 percent. This provides the bounds of expected performance on this dataset.”

Could you tell me which algorithms you used to get the naive and top-performing models respectively?

Reply
- Jason Brownlee September 19, 2020 at 6:48 am #
  
  You can see here:
  https://machinelearningmastery.com/results-for-standard-classification-and-regression-machine-learning-datasets/
  
  Reply
TC September 18, 2020 at 6:15 pm #

Hi Jason,

“At the end of the run, a summary is printed showing that 1,759 models were evaluated and the estimated performance of the final model was a MAE of 29.”

Is there any way to see what algorithms auto-sklearn uses to generate these 1,759 models? Does auto-sklearn include xgboost as one of the algorithms to build models?

Reply
- Jason Brownlee September 19, 2020 at 6:49 am #
  
  Yes, as follows:
  
  ...
  print(model.show_models())
  
  I don’t know if it supports xgboost off hand, sorry.
  
  Reply
Gustavo September 22, 2020 at 12:16 pm #

Does it use cross-validation to choose the best model?

Reply
- Jason Brownlee September 22, 2020 at 1:39 pm #
  
  Good question, I’m not sure off the cuff.
  
  Perhaps check the library documentation?
  
  Reply
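On the cross-validation question above: as far as the auto-sklearn documentation describes it, the search scores candidates on a holdout split by default, and k-fold cross-validation can be requested through the resampling_strategy and resampling_strategy_arguments parameters. The sketch below only builds those settings as a dictionary so it can run without the library installed. The helper name cv_search_settings and the values are hypothetical, and the parameter names should be checked against the installed version.

```python
def cv_search_settings(folds=5, time_budget=300):
    """Keyword arguments asking auto-sklearn to score candidates with k-fold CV."""
    return {
        "time_left_for_this_task": time_budget,  # total search time in seconds
        "resampling_strategy": "cv",  # assumption: the documented default is "holdout"
        "resampling_strategy_arguments": {"folds": folds},
    }


settings = cv_search_settings(folds=5)
print(settings)

# With auto-sklearn installed, the settings could be passed to the estimator:
# from autosklearn.classification import AutoSklearnClassifier
# model = AutoSklearnClassifier(**settings)
```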
Zineb September 29, 2020 at 6:20 am #

Hi Jason, Simplified and useful as usual. Thanks a lot.
I feel a bit frustrated because, with Automated ML, it seems there is no longer any need to spend time diving into the different data preprocessing steps and testing different techniques to build a good model.
What do you think?

Reply
- Jason Brownlee September 29, 2020 at 7:43 am #
  
  I think AutoML is great for a quick model or to get a quick idea of what works.
  
  We can still get better/best results from hand crafted models. Case in point are ML competitions. As soon as competitions are consistently won by AutoML, it’s time to move up the stack.
  
  Reply
josh October 9, 2020 at 4:20 pm #

Hi Jason, I tried the same code in Google Colab, but it generates the error below:

model.fit(X_train,y_train)
TypeError: __init__() got an unexpected keyword argument 'local_directory'

Could you please help? Thanks!

Reply
- Jason Brownlee October 10, 2020 at 6:59 am #
  
  This is a common question that I answer here:
  https://machinelearningmastery.com/faq/single-faq/do-code-examples-run-on-google-colab
  
  Reply
  - Ketan Mahajan May 7, 2022 at 9:15 pm #
    
    There are so many things there. How can we find it?
    
    Reply
    - James Carmichael May 8, 2022 at 9:55 am #
      
      Hi Ketan…Please clarify your question so that we can better assist you.
      
      Reply
Dominick October 21, 2020 at 6:51 am #

Hello. This article was really fascinating, particularly since I was looking for thoughts on this subject last Sunday.

Reply
- Jason Brownlee October 21, 2020 at 7:49 am #
  
  Thanks!
  
  Reply
Klemen January 7, 2021 at 7:55 am #

Very interesting. I will try this out. Does it also perform some feature selection? And does it preprocess input data (normalization, categorical values – one hot encoding)? Or is all of that still part of manual preprocessing?

Reply
- Jason Brownlee January 7, 2021 at 9:41 am #
  
  Thanks.
  
  Good question, I believe it does involve selecting some data prep.
  
  I’d recommend double checking the documentation.
  
  Reply
frederic kleinemann February 24, 2021 at 1:37 am #

Hello ,

I am trying to run the Auto-Sklearn classification example using the sonar.csv dataset, and each time I get this error: EOFError: unexpected EOF.

~/anaconda3/lib/python3.8/multiprocessing/forkserver.py in read_signed(fd)
332 s = os.read(fd, length - len(data))
333 if not s:
--> 334 raise EOFError('unexpected EOF')
335 data += s
336 return SIGNED_STRUCT.unpack(data)[0]

EOFError: unexpected EOF

I am using auto-sklearn 0.12.3, and I have tried all the examples from the Auto-Sklearn project; they work well. So I was wondering whether you have encountered this problem and, if so, how you solved it.

thank you

Reply
- Jason Brownlee February 24, 2021 at 5:34 am #
  
  Sorry, I have not seen this problem.
  
  Perhaps try posting the code/error on stackoverflow or as an issue on the auto-sklearn project itself?
  
  Reply
Kai-Yun Li March 17, 2021 at 5:34 am #

Hi Jason, I’m using the Auto-Sklearn for the classification task, and it runs well,
however, it doesn’t offer too many visualization examples,
I was wondering if I can import wandb to deal with such a task?

https://wandb.ai/lavanyashukla/visualize-sklearn/reports/Visualize-Scikit-Models--Vmlldzo0ODIzNg

If not, is there any other way to show better plots?

Thank you

Reply
- Jason Brownlee March 17, 2021 at 6:10 am #
  
  Perhaps once the search finds a final model you can re-fit it and visualize it.
  
  Reply
JG April 28, 2021 at 4:41 am #

Hi Jason,

Thank you for the post. It seems very interesting!

While trying to install autosklearn on my Mac with python 3.6 (installed following your post:
https://machinelearningmastery.com/install-python-3-environment-mac-os-x-machine-learning-deep-learning/)

with your command
% sudo pip install autosklearn

I got the following error:
ERROR: Could not find a version that satisfies the requirement autosklearn (from versions: none)
ERROR: No matching distribution found for autosklearn

I tried other options, following autosklearn suggestions :
% curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip3 install
% sudo pip3 install -U auto-sklearn

but that does not work either 🙁

any suggestion?
thanks

Reply
- Jason Brownlee April 28, 2021 at 6:03 am #
  
  Ouch!?
  
  Sorry to hear it, perhaps the lib has not been updated recently to keep track of sklearn.
  
  Perhaps a new env is required with some versions (maybe sklearn) rolled back?
  
  Reply
Vinit August 16, 2021 at 10:31 pm #

Hi jason,

Thanks for the wonderful article. I learnt a lot; however, I have one doubt:

For binary classification, does the auto-sklearn classifier use a probability threshold of 0.5 by default? Can it be changed?

Reply
- Adrian Tam August 17, 2021 at 7:49 am #
  
  It can be changed, of course. Use the predict_proba() function of the classifier to get the probabilities and apply your own threshold. However, a 0.5 threshold makes mathematical sense because inverting the result gives the exact opposite for binary classification. That’s why it is the default.
  
  Reply
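As a rough illustration of the threshold discussion above, the sketch below converts positive-class probabilities into labels with a custom cut-off. It uses plain Python lists so it runs on its own. The helper name apply_threshold and the sample probabilities are hypothetical; with a fitted classifier the probabilities would typically come from something like model.predict_proba(X_test)[:, 1]. Treating a probability exactly equal to the threshold as positive is a choice made here, not a library rule.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels using a custom cut-off."""
    return [1 if p >= threshold else 0 for p in probabilities]


# hypothetical positive-class probabilities for four samples
proba = [0.10, 0.35, 0.50, 0.80]

default_labels = apply_threshold(proba)      # cut-off of 0.5
strict_labels = apply_threshold(proba, 0.7)  # stricter cut-off, fewer positives
print(default_labels, strict_labels)
```

Raising the threshold trades recall for precision on the positive class, so the value is usually chosen against a validation set rather than fixed in advance.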
Mary December 22, 2021 at 11:18 am #

Hello,

Thank you for your post.

However, how can we report what is the selected model and its parameters? Also, can we select the metric we want to use for evaluation in the search process?

Thank you.

Reply
- James Carmichael February 28, 2022 at 12:19 pm #
  
  Hi Mary…The following is a great discussion of this concept:
  
  https://github.com/automl/auto-sklearn/issues/872
  
  Reply
Sushant Pawar May 4, 2023 at 4:14 am #

Great article man!
Can you give some tips on how to avoid overfitting?

Reply
- James Carmichael May 4, 2023 at 6:30 am #
  
  Hi Sushant…The following resource may be of interest to you:
  
  https://machinelearningmastery.com/introduction-to-regularization-to-reduce-overfitting-and-improve-generalization-error/
  
  Reply
