The post How to Develop a Reusable Framework to Spot-Check Algorithms in Python appeared first on Machine Learning Mastery.

]]>Unlike grid searching and other types of algorithm tuning that seek the optimal algorithm or optimal configuration for an algorithm, spot-checking is intended to evaluate a diverse set of algorithms rapidly and provide a rough first-cut result. This first cut result may be used to get an idea if a problem or problem representation is indeed predictable, and if so, the types of algorithms that may be worth investigating further for the problem.

Spot-checking is an approach to help overcome the “hard problem” of applied machine learning and encourage you to clearly think about the higher-order search problem being performed in any machine learning project.

In this tutorial, you will discover the usefulness of spot-checking algorithms on a new predictive modeling problem and how to develop a standard framework for spot-checking algorithms in python for classification and regression problems.

After completing this tutorial, you will know:

- Spot-checking provides a way to quickly discover the types of algorithms that perform well on your predictive modeling problem.
- How to develop a generic framework for loading data, defining models, evaluating models, and summarizing results.
- How to apply the framework for classification and regression problems.

Let’s get started.

This tutorial is divided into five parts; they are:

- Spot-Check Algorithms
- Spot-Checking Framework in Python
- Spot-Checking for Classification
- Spot-Checking for Regression
- Framework Extension

We cannot know beforehand what algorithms will perform well on a given predictive modeling problem.

This is the hard part of applied machine learning that can only be resolved via systematic experimentation.

Spot-checking is an approach to this problem.

It involves rapidly testing a large suite of diverse machine learning algorithms on a problem in order to quickly discover what algorithms might work and where to focus attention.

**It is fast**; it by-passes the days or weeks of preparation and analysis and playing with algorithms that may not ever lead to a result.**It is objective**, allowing you to discover what might work well for a problem rather than going with what you used last time.**It gets results**; you will actually fit models, make predictions and know if your problem can be predicted and what baseline skill may look like.

Spot-checking may require that you work with a small sample of your dataset in order to turn around results quickly.

Finally, the results from spot checking are a jumping-off point. A starting point. They suggest where to focus attention on the problem, not what the best algorithm might be. The process is designed to shake you out of typical thinking and analysis and instead focus on results.

You can learn more about spot-checking in the post:

Now that we know what spot-checking is, let’s look at how we can systematically perform spot-checking in Python.

In this section we will build a framework for a script that can be used for spot-checking machine learning algorithms on a classification or regression problem.

There are four parts to the framework that we need to develop; they are:

- Load Dataset
- Define Models
- Evaluate Models
- Summarize Results

Let’s take a look at each in turn.

The first step of the framework is to load the data.

The function must be implemented for a given problem and be specialized to that problem. It will likely involve loading data from one or more CSV files.

We will call this function *load_data()*; it will take no arguments and return the inputs (*X*) and outputs (*y*) for the prediction problem.

# load the dataset, returns X and y elements def load_dataset(): X, y = None, None return X, y

The next step is to define the models to evaluate on the predictive modeling problem.

The models defined will be specific to the type predictive modeling problem, e.g. classification or regression.

The defined models should be diverse, including a mixture of:

- Linear Models.
- Nonlinear Models.
- Ensemble Models.

Each model should be a given a good chance to perform well on the problem. This might be mean providing a few variations of the model with different common or well known configurations that perform well on average.

We will call this function *define_models()*. It will return a dictionary of model names mapped to scikit-learn model object. The name should be short, like ‘*svm*‘ and may include a configuration detail, e.g. ‘knn-7’.

The function will also take a dictionary as an optional argument; if not provided, a new dictionary is created and populated. If a dictionary is provided, models are added to it.

This is to add flexibility if you would like to have multiple functions for defining models, or add a large number of models of a specific type with different configurations.

# create a dict of standard models to evaluate {name:object} def define_models(models=dict()): # ... return models

The idea is not to grid search model parameters; that can come later.

Instead, each model should be given an opportunity to perform well (i.e. not optimally). This might mean trying many combinations of parameters in some cases, e.g. in the case of gradient boosting.

The next step is the evaluation of the defined models on the loaded dataset.

The scikit-learn library provides the ability to pipeline models during evaluation. This allows the data to be transformed prior to being used to fit a model, and this is done in a correct way such that the transforms are prepared on the training data and applied to the test data.

We can define a function that prepares a given model prior to evaluation to allow specific transforms to be used during the spot checking process. They will be performed in a blanket way to all models. This can be useful to perform operations such as standardization, normalization, and feature selection.

We will define a function named *make_pipeline()* that takes a defined model and returns a pipeline. Below is an example of preparing a pipeline that will first standardize the input data, then normalize it prior to fitting the model.

# create a feature preparation pipeline for a model def make_pipeline(model): steps = list() # standardization steps.append(('standardize', StandardScaler())) # normalization steps.append(('normalize', MinMaxScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline

This function can be expanded to add other transforms, or simplified to return the provided model with no transforms.

Now we need to evaluate a prepared model.

We will use a standard of evaluating models using k-fold cross-validation. The evaluation of each defined model will result in a list of results. This is because 10 different versions of the model will have been fit and evaluated, resulting in a list of k scores.

We will define a function named *evaluate_model()* that will take the data, a defined model, a number of folds, and a performance metric used to evaluate the results. It will return the list of scores.

The function calls *make_pipeline()* for the defined model to prepare any data transforms required, then calls the cross_val_score() scikit-learn function. Importantly, the *n_jobs* argument is set to -1 to allow the model evaluations to occur in parallel, harnessing as many cores as you have available on your hardware.

# evaluate a single model def evaluate_model(X, y, model, folds, metric): # create the pipeline pipeline = make_pipeline(model) # evaluate model scores = cross_val_score(pipeline, X, y, scoring=metric, cv=folds, n_jobs=-1) return scores

It is possible for the evaluation of a model to fail with an exception. I have seen this especially in the case of some models from the statsmodels library.

It is also possible for the evaluation of a model to result in a lot of warning messages. I have seen this especially in the case of using XGBoost models.

We do not care about exceptions or warnings when spot checking. We only want to know what does work and what works well. Therefore, we can trap exceptions and ignore all warnings when evaluating each model.

The function named *robust_evaluate_model()* implements this behavior. The *evaluate_model()* is called in a way that traps exceptions and ignores warnings. If an exception occurs and no result was possible for a given model, a *None* result is returned.

# evaluate a model and try to trap errors and and hide warnings def robust_evaluate_model(X, y, model, folds, metric): scores = None try: with warnings.catch_warnings(): warnings.filterwarnings("ignore") scores = evaluate_model(X, y, model, folds, metric) except: scores = None return scores

Finally, we can define the top-level function for evaluating the list of defined models.

We will define a function named *evaluate_models()* that takes the dictionary of models as an argument and returns a dictionary of model names to lists of results.

The number of folds in the cross-validation process can be specified by an optional argument that defaults to 10. The metric calculated on the predictions from the model can also be specified by an optional argument and defaults to classification accuracy.

For a full list of supported metrics, see this list:

Any None results are skipped and not added to the dictionary of results.

Importantly, we provide some verbose output, summarizing the mean and standard deviation of each model after it was evaluated. This is helpful if the spot checking process on your dataset takes minutes to hours.

# evaluate a dict of models {name:object}, returns {name:score} def evaluate_models(X, y, models, folds=10, metric='accuracy'): results = dict() for name, model in models.items(): # evaluate the model scores = robust_evaluate_model(X, y, model, folds, metric) # show process if scores is not None: # store a result results[name] = scores mean_score, std_score = mean(scores), std(scores) print('>%s: %.3f (+/-%.3f)' % (name, mean_score, std_score)) else: print('>%s: error' % name) return results

Note that if for some reason you want to see warnings and errors, you can update the *evaluate_models()* to call the *evaluate_model()* function directly, by-passing the robust error handling. I find this useful when testing out new methods or method configurations that fail silently.

Finally, we can evaluate the results.

Really, we only want to know what algorithms performed well.

Two useful ways to summarize the results are:

- Line summaries of the mean and standard deviation of the top 10 performing algorithms.
- Box and whisker plots of the top 10 performing algorithms.

The line summaries are quick and precise, although assume a well behaving Gaussian distribution, which may not be reasonable.

The box and whisker plots assume no distribution and provide a visual way to directly compare the distribution of scores across models in terms of median performance and spread of scores.

We will define a function named *summarize_results()* that takes the dictionary of results, prints the summary of results, and creates a boxplot image that is saved to file. The function takes an argument to specify if the evaluation score is maximizing, which by default is *True*. The number of results to summarize can also be provided as an optional parameter, which defaults to 10.

The function first orders the scores before printing the summary and creating the box and whisker plot.

# print and plot the top n results def summarize_results(results, maximize=True, top_n=10): # check for no results if len(results) == 0: print('no results') return # determine how many results to summarize n = min(top_n, len(results)) # create a list of (name, mean(scores)) tuples mean_scores = [(k,mean(v)) for k,v in results.items()] # sort tuples by mean score mean_scores = sorted(mean_scores, key=lambda x: x[1]) # reverse for descending order (e.g. for accuracy) if maximize: mean_scores = list(reversed(mean_scores)) # retrieve the top n for summarization names = [x[0] for x in mean_scores[:n]] scores = [results[x[0]] for x in mean_scores[:n]] # print the top n print() for i in range(n): name = names[i] mean_score, std_score = mean(results[name]), std(results[name]) print('Rank=%d, Name=%s, Score=%.3f (+/- %.3f)' % (i+1, name, mean_score, std_score)) # boxplot for the top n pyplot.boxplot(scores, labels=names) _, labels = pyplot.xticks() pyplot.setp(labels, rotation=90) pyplot.savefig('spotcheck.png')

Now that we have specialized a framework for spot-checking algorithms in Python, let’s look at how we can apply it to a classification problem.

We will generate a binary classification problem using the make_classification() function.

The function will generate 1,000 samples with 20 variables, with some redundant variables and two classes.

# load the dataset, returns X and y elements def load_dataset(): return make_classification(n_samples=1000, n_classes=2, random_state=1)

As a classification problem, we will try a suite of classification algorithms, specifically:

- Logistic Regression
- Ridge Regression
- Stochastic Gradient Descent Classifier
- Passive Aggressive Classifier

I tried LDA and QDA, but they sadly crashed down in the C-code somewhere.

- k-Nearest Neighbors
- Classification and Regression Trees
- Extra Tree
- Support Vector Machine
- Naive Bayes

- AdaBoost
- Bagged Decision Trees
- Random Forest
- Extra Trees
- Gradient Boosting Machine

Further, I added multiple configurations for a few of the algorithms like Ridge, kNN, and SVM in order to give them a good chance on the problem.

The full *define_models()* function is listed below.

# create a dict of standard models to evaluate {name:object} def define_models(models=dict()): # linear models models['logistic'] = LogisticRegression() alpha = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for a in alpha: models['ridge-'+str(a)] = RidgeClassifier(alpha=a) models['sgd'] = SGDClassifier(max_iter=1000, tol=1e-3) models['pa'] = PassiveAggressiveClassifier(max_iter=1000, tol=1e-3) # non-linear models n_neighbors = range(1, 21) for k in n_neighbors: models['knn-'+str(k)] = KNeighborsClassifier(n_neighbors=k) models['cart'] = DecisionTreeClassifier() models['extra'] = ExtraTreeClassifier() models['svml'] = SVC(kernel='linear') models['svmp'] = SVC(kernel='poly') c_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for c in c_values: models['svmr'+str(c)] = SVC(C=c) models['bayes'] = GaussianNB() # ensemble models n_trees = 100 models['ada'] = AdaBoostClassifier(n_estimators=n_trees) models['bag'] = BaggingClassifier(n_estimators=n_trees) models['rf'] = RandomForestClassifier(n_estimators=n_trees) models['et'] = ExtraTreesClassifier(n_estimators=n_trees) models['gbm'] = GradientBoostingClassifier(n_estimators=n_trees) print('Defined %d models' % len(models)) return models

That’s it; we are now ready to spot check algorithms on the problem.

The complete example is listed below.

# binary classification spot check script import warnings from numpy import mean from numpy import std from matplotlib import pyplot from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import MinMaxScaler from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.linear_model import RidgeClassifier from sklearn.linear_model import SGDClassifier from sklearn.linear_model import PassiveAggressiveClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.tree import ExtraTreeClassifier from sklearn.svm import SVC from sklearn.naive_bayes import GaussianNB from sklearn.ensemble import AdaBoostClassifier from sklearn.ensemble import BaggingClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import ExtraTreesClassifier from sklearn.ensemble import GradientBoostingClassifier # load the dataset, returns X and y elements def load_dataset(): return make_classification(n_samples=1000, n_classes=2, random_state=1) # create a dict of standard models to evaluate {name:object} def define_models(models=dict()): # linear models models['logistic'] = LogisticRegression() alpha = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for a in alpha: models['ridge-'+str(a)] = RidgeClassifier(alpha=a) models['sgd'] = SGDClassifier(max_iter=1000, tol=1e-3) models['pa'] = PassiveAggressiveClassifier(max_iter=1000, tol=1e-3) # non-linear models n_neighbors = range(1, 21) for k in n_neighbors: models['knn-'+str(k)] = KNeighborsClassifier(n_neighbors=k) models['cart'] = DecisionTreeClassifier() models['extra'] = ExtraTreeClassifier() models['svml'] = SVC(kernel='linear') models['svmp'] = SVC(kernel='poly') c_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for c in c_values: models['svmr'+str(c)] = SVC(C=c) models['bayes'] = GaussianNB() # ensemble models n_trees = 100 models['ada'] = AdaBoostClassifier(n_estimators=n_trees) models['bag'] = BaggingClassifier(n_estimators=n_trees) models['rf'] = RandomForestClassifier(n_estimators=n_trees) models['et'] = ExtraTreesClassifier(n_estimators=n_trees) models['gbm'] = GradientBoostingClassifier(n_estimators=n_trees) print('Defined %d models' % len(models)) return models # create a feature preparation pipeline for a model def make_pipeline(model): steps = list() # standardization steps.append(('standardize', StandardScaler())) # normalization steps.append(('normalize', MinMaxScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline # evaluate a single model def evaluate_model(X, y, model, folds, metric): # create the pipeline pipeline = make_pipeline(model) # evaluate model scores = cross_val_score(pipeline, X, y, scoring=metric, cv=folds, n_jobs=-1) return scores # evaluate a model and try to trap errors and and hide warnings def robust_evaluate_model(X, y, model, folds, metric): scores = None try: with warnings.catch_warnings(): warnings.filterwarnings("ignore") scores = evaluate_model(X, y, model, folds, metric) except: scores = None return scores # evaluate a dict of models {name:object}, returns {name:score} def evaluate_models(X, y, models, folds=10, metric='accuracy'): results = dict() for name, model in models.items(): # evaluate the model scores = robust_evaluate_model(X, y, model, folds, metric) # show process if scores is not None: # store a result results[name] = scores mean_score, std_score = mean(scores), std(scores) print('>%s: %.3f (+/-%.3f)' % (name, mean_score, std_score)) else: print('>%s: error' % name) return results # print and plot the top n results def summarize_results(results, maximize=True, top_n=10): # check for no results if len(results) == 0: print('no results') return # determine how many results to summarize n = min(top_n, len(results)) # create a list of (name, mean(scores)) tuples mean_scores = [(k,mean(v)) for k,v in results.items()] # sort tuples by mean score mean_scores = sorted(mean_scores, key=lambda x: x[1]) # reverse for descending order (e.g. for accuracy) if maximize: mean_scores = list(reversed(mean_scores)) # retrieve the top n for summarization names = [x[0] for x in mean_scores[:n]] scores = [results[x[0]] for x in mean_scores[:n]] # print the top n print() for i in range(n): name = names[i] mean_score, std_score = mean(results[name]), std(results[name]) print('Rank=%d, Name=%s, Score=%.3f (+/- %.3f)' % (i+1, name, mean_score, std_score)) # boxplot for the top n pyplot.boxplot(scores, labels=names) _, labels = pyplot.xticks() pyplot.setp(labels, rotation=90) pyplot.savefig('spotcheck.png') # load dataset X, y = load_dataset() # get model list models = define_models() # evaluate models results = evaluate_models(X, y, models) # summarize results summarize_results(results)

Running the example prints one line per evaluated model, ending with a summary of the top 10 performing algorithms on the problem.

We can see that ensembles of decision trees performed the best for this problem. This suggests a few things:

- Ensembles of decision trees might be a good place to focus attention.
- Gradient boosting will likely do well if further tuned.
- A “good” performance on the problem is about 86% accuracy.
- The relatively high performance of ridge regression suggests the need for feature selection.

... >bag: 0.862 (+/-0.034) >rf: 0.865 (+/-0.033) >et: 0.858 (+/-0.035) >gbm: 0.867 (+/-0.044) Rank=1, Name=gbm, Score=0.867 (+/- 0.044) Rank=2, Name=rf, Score=0.865 (+/- 0.033) Rank=3, Name=bag, Score=0.862 (+/- 0.034) Rank=4, Name=et, Score=0.858 (+/- 0.035) Rank=5, Name=ada, Score=0.850 (+/- 0.035) Rank=6, Name=ridge-0.9, Score=0.848 (+/- 0.038) Rank=7, Name=ridge-0.8, Score=0.848 (+/- 0.038) Rank=8, Name=ridge-0.7, Score=0.848 (+/- 0.038) Rank=9, Name=ridge-0.6, Score=0.848 (+/- 0.038) Rank=10, Name=ridge-0.5, Score=0.848 (+/- 0.038)

A box and whisker plot is also created to summarize the results of the top 10 well performing algorithms.

The plot shows the elevation of the methods comprised of ensembles of decision trees. The plot enforces the notion that further attention on these methods would be a good idea.

If this were a real classification problem, I would follow-up with further spot checks, such as:

- Spot check with various different feature selection methods.
- Spot check without data scaling methods.
- Spot check with a course grid of configurations for gradient boosting in sklearn or XGBoost.

Next, we will see how we can apply the framework to a regression problem.

We can explore the same framework for regression predictive modeling problems with only very minor changes.

We can use the make_regression() function to generate a contrived regression problem with 1,000 examples and 50 features, some of them redundant.

The defined *load_dataset()* function is listed below.

# load the dataset, returns X and y elements def load_dataset(): return make_regression(n_samples=1000, n_features=50, noise=0.1, random_state=1)

We can then specify a *get_models()* function that defines a suite of regression methods.

Scikit-learn does offer a wide range of linear regression methods, which is excellent. Not all of them may be required on your problem. I would recommend a minimum of linear regression and elastic net, the latter with a good suite of alpha and lambda parameters.

Nevertheless, we will test the full suite of methods on this problem, including:

- Linear Regression
- Lasso Regression
- Ridge Regression
- Elastic Net Regression
- Huber Regression
- LARS Regression
- Lasso LARS Regression
- Passive Aggressive Regression
- RANSAC Regressor
- Stochastic Gradient Descent Regression
- Theil Regression

- k-Nearest Neighbors
- Classification and Regression Tree
- Extra Tree
- Support Vector Regression

- AdaBoost
- Bagged Decision Trees
- Random Forest
- Extra Trees
- Gradient Boosting Machines

The full *get_models()* function is listed below.

# create a dict of standard models to evaluate {name:object} def get_models(models=dict()): # linear models models['lr'] = LinearRegression() alpha = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for a in alpha: models['lasso-'+str(a)] = Lasso(alpha=a) for a in alpha: models['ridge-'+str(a)] = Ridge(alpha=a) for a1 in alpha: for a2 in alpha: name = 'en-' + str(a1) + '-' + str(a2) models[name] = ElasticNet(a1, a2) models['huber'] = HuberRegressor() models['lars'] = Lars() models['llars'] = LassoLars() models['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3) models['ranscac'] = RANSACRegressor() models['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3) models['theil'] = TheilSenRegressor() # non-linear models n_neighbors = range(1, 21) for k in n_neighbors: models['knn-'+str(k)] = KNeighborsRegressor(n_neighbors=k) models['cart'] = DecisionTreeRegressor() models['extra'] = ExtraTreeRegressor() models['svml'] = SVR(kernel='linear') models['svmp'] = SVR(kernel='poly') c_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for c in c_values: models['svmr'+str(c)] = SVR(C=c) # ensemble models n_trees = 100 models['ada'] = AdaBoostRegressor(n_estimators=n_trees) models['bag'] = BaggingRegressor(n_estimators=n_trees) models['rf'] = RandomForestRegressor(n_estimators=n_trees) models['et'] = ExtraTreesRegressor(n_estimators=n_trees) models['gbm'] = GradientBoostingRegressor(n_estimators=n_trees) print('Defined %d models' % len(models)) return models

By default, the framework uses classification accuracy as the method for evaluating model predictions.

This does not make sense for regression, and we can change this something more meaningful for regression, such as mean squared error. We can do this by passing the *metric=’neg_mean_squared_error’* argument when calling *evaluate_models()* function.

# evaluate models results = evaluate_models(models, metric='neg_mean_squared_error')

Note that by default scikit-learn inverts error scores so that that are maximizing instead of minimizing. This is why the mean squared error is negative and will have a negative sign when summarized. Because the score is inverted, we can continue to assume that we are maximizing scores in the *summarize_results()* function and do not need to specify *maximize=False* as we might expect when using an error metric.

The complete code example is listed below.

# regression spot check script import warnings from numpy import mean from numpy import std from matplotlib import pyplot from sklearn.datasets import make_regression from sklearn.model_selection import cross_val_score from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import MinMaxScaler from sklearn.pipeline import Pipeline from sklearn.linear_model import LinearRegression from sklearn.linear_model import Lasso from sklearn.linear_model import Ridge from sklearn.linear_model import ElasticNet from sklearn.linear_model import HuberRegressor from sklearn.linear_model import Lars from sklearn.linear_model import LassoLars from sklearn.linear_model import PassiveAggressiveRegressor from sklearn.linear_model import RANSACRegressor from sklearn.linear_model import SGDRegressor from sklearn.linear_model import TheilSenRegressor from sklearn.neighbors import KNeighborsRegressor from sklearn.tree import DecisionTreeRegressor from sklearn.tree import ExtraTreeRegressor from sklearn.svm import SVR from sklearn.ensemble import AdaBoostRegressor from sklearn.ensemble import BaggingRegressor from sklearn.ensemble import RandomForestRegressor from sklearn.ensemble import ExtraTreesRegressor from sklearn.ensemble import GradientBoostingRegressor # load the dataset, returns X and y elements def load_dataset(): return make_regression(n_samples=1000, n_features=50, noise=0.1, random_state=1) # create a dict of standard models to evaluate {name:object} def get_models(models=dict()): # linear models models['lr'] = LinearRegression() alpha = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for a in alpha: models['lasso-'+str(a)] = Lasso(alpha=a) for a in alpha: models['ridge-'+str(a)] = Ridge(alpha=a) for a1 in alpha: for a2 in alpha: name = 'en-' + str(a1) + '-' + str(a2) models[name] = ElasticNet(a1, a2) models['huber'] = HuberRegressor() models['lars'] = Lars() models['llars'] = LassoLars() models['pa'] = PassiveAggressiveRegressor(max_iter=1000, tol=1e-3) models['ranscac'] = RANSACRegressor() models['sgd'] = SGDRegressor(max_iter=1000, tol=1e-3) models['theil'] = TheilSenRegressor() # non-linear models n_neighbors = range(1, 21) for k in n_neighbors: models['knn-'+str(k)] = KNeighborsRegressor(n_neighbors=k) models['cart'] = DecisionTreeRegressor() models['extra'] = ExtraTreeRegressor() models['svml'] = SVR(kernel='linear') models['svmp'] = SVR(kernel='poly') c_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for c in c_values: models['svmr'+str(c)] = SVR(C=c) # ensemble models n_trees = 100 models['ada'] = AdaBoostRegressor(n_estimators=n_trees) models['bag'] = BaggingRegressor(n_estimators=n_trees) models['rf'] = RandomForestRegressor(n_estimators=n_trees) models['et'] = ExtraTreesRegressor(n_estimators=n_trees) models['gbm'] = GradientBoostingRegressor(n_estimators=n_trees) print('Defined %d models' % len(models)) return models # create a feature preparation pipeline for a model def make_pipeline(model): steps = list() # standardization steps.append(('standardize', StandardScaler())) # normalization steps.append(('normalize', MinMaxScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline # evaluate a single model def evaluate_model(X, y, model, folds, metric): # create the pipeline pipeline = make_pipeline(model) # evaluate model scores = cross_val_score(pipeline, X, y, scoring=metric, cv=folds, n_jobs=-1) return scores # evaluate a model and try to trap errors and and hide warnings def robust_evaluate_model(X, y, model, folds, metric): scores = None try: with warnings.catch_warnings(): warnings.filterwarnings("ignore") scores = evaluate_model(X, y, model, folds, metric) except: scores = None return scores # evaluate a dict of models {name:object}, returns {name:score} def evaluate_models(X, y, models, folds=10, metric='accuracy'): results = dict() for name, model in models.items(): # evaluate the model scores = robust_evaluate_model(X, y, model, folds, metric) # show process if scores is not None: # store a result results[name] = scores mean_score, std_score = mean(scores), std(scores) print('>%s: %.3f (+/-%.3f)' % (name, mean_score, std_score)) else: print('>%s: error' % name) return results # print and plot the top n results def summarize_results(results, maximize=True, top_n=10): # check for no results if len(results) == 0: print('no results') return # determine how many results to summarize n = min(top_n, len(results)) # create a list of (name, mean(scores)) tuples mean_scores = [(k,mean(v)) for k,v in results.items()] # sort tuples by mean score mean_scores = sorted(mean_scores, key=lambda x: x[1]) # reverse for descending order (e.g. for accuracy) if maximize: mean_scores = list(reversed(mean_scores)) # retrieve the top n for summarization names = [x[0] for x in mean_scores[:n]] scores = [results[x[0]] for x in mean_scores[:n]] # print the top n print() for i in range(n): name = names[i] mean_score, std_score = mean(results[name]), std(results[name]) print('Rank=%d, Name=%s, Score=%.3f (+/- %.3f)' % (i+1, name, mean_score, std_score)) # boxplot for the top n pyplot.boxplot(scores, labels=names) _, labels = pyplot.xticks() pyplot.setp(labels, rotation=90) pyplot.savefig('spotcheck.png') # load dataset X, y = load_dataset() # get model list models = get_models() # evaluate models results = evaluate_models(X, y, models, metric='neg_mean_squared_error') # summarize results summarize_results(results)

Running the example summarizes the performance of each model evaluated, then prints the performance of the top 10 well performing algorithms.

We can see that many of the linear algorithms perhaps found the same optimal solution on this problem. Notably those methods that performed well use regularization as a type of feature selection, allowing them to zoom in on the optimal solution.

This would suggest the importance of feature selection when modeling this problem and that linear methods would be the area to focus, at least for now.

Reviewing the printed scores of evaluated models also shows how poorly nonlinear and ensemble algorithms performed on this problem.

... >bag: -6118.084 (+/-1558.433) >rf: -6127.169 (+/-1594.392) >et: -5017.062 (+/-1037.673) >gbm: -2347.807 (+/-500.364) Rank=1, Name=lars, Score=-0.011 (+/- 0.001) Rank=2, Name=ranscac, Score=-0.011 (+/- 0.001) Rank=3, Name=lr, Score=-0.011 (+/- 0.001) Rank=4, Name=ridge-0.0, Score=-0.011 (+/- 0.001) Rank=5, Name=en-0.0-0.1, Score=-0.011 (+/- 0.001) Rank=6, Name=en-0.0-0.8, Score=-0.011 (+/- 0.001) Rank=7, Name=en-0.0-0.2, Score=-0.011 (+/- 0.001) Rank=8, Name=en-0.0-0.7, Score=-0.011 (+/- 0.001) Rank=9, Name=en-0.0-0.0, Score=-0.011 (+/- 0.001) Rank=10, Name=en-0.0-0.3, Score=-0.011 (+/- 0.001)

A box and whisker plot is created, not really adding value to the analysis of results in this case.

In this section, we explore some handy extensions of the spot check framework.

I find myself using XGBoost and gradient boosting a lot for straight-forward classification and regression problems.

As such, I like to use a course grid across standard configuration parameters of the method when spot checking.

Below is a function to do this that can be used directly in the spot checking framework.

# define gradient boosting models def define_gbm_models(models=dict(), use_xgb=True): # define config ranges rates = [0.001, 0.01, 0.1] trees = [50, 100] ss = [0.5, 0.7, 1.0] depth = [3, 7, 9] # add configurations for l in rates: for e in trees: for s in ss: for d in depth: cfg = [l, e, s, d] if use_xgb: name = 'xgb-' + str(cfg) models[name] = XGBClassifier(learning_rate=l, n_estimators=e, subsample=s, max_depth=d) else: name = 'gbm-' + str(cfg) models[name] = GradientBoostingClassifier(learning_rate=l, n_estimators=e, subsample=s, max_depth=d) print('Defined %d models' % len(models)) return models

By default, the function will use XGBoost models, but can use the sklearn gradient boosting model if the *use_xgb* argument to the function is set to *False*.

Again, we are not trying to optimally tune GBM on the problem, only very quickly find an area in the configuration space that may be worth investigating further.

This function can be used directly on classification and regression problems with only a minor change from “*XGBClassifier*” to “*XGBRegressor*” and “*GradientBoostingClassifier*” to “*GradientBoostingRegressor*“. For example:

# define gradient boosting models def get_gbm_models(models=dict(), use_xgb=True): # define config ranges rates = [0.001, 0.01, 0.1] trees = [50, 100] ss = [0.5, 0.7, 1.0] depth = [3, 7, 9] # add configurations for l in rates: for e in trees: for s in ss: for d in depth: cfg = [l, e, s, d] if use_xgb: name = 'xgb-' + str(cfg) models[name] = XGBRegressor(learning_rate=l, n_estimators=e, subsample=s, max_depth=d) else: name = 'gbm-' + str(cfg) models[name] = GradientBoostingXGBRegressor(learning_rate=l, n_estimators=e, subsample=s, max_depth=d) print('Defined %d models' % len(models)) return models

To make this concrete, below is the binary classification example updated to also define XGBoost models.

# binary classification spot check script import warnings from numpy import mean from numpy import std from matplotlib import pyplot from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import MinMaxScaler from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.linear_model import RidgeClassifier from sklearn.linear_model import SGDClassifier from sklearn.linear_model import PassiveAggressiveClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.tree import ExtraTreeClassifier from sklearn.svm import SVC from sklearn.naive_bayes import GaussianNB from sklearn.ensemble import AdaBoostClassifier from sklearn.ensemble import BaggingClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import ExtraTreesClassifier from sklearn.ensemble import GradientBoostingClassifier from xgboost import XGBClassifier # load the dataset, returns X and y elements def load_dataset(): return make_classification(n_samples=1000, n_classes=2, random_state=1) # create a dict of standard models to evaluate {name:object} def define_models(models=dict()): # linear models models['logistic'] = LogisticRegression() alpha = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for a in alpha: models['ridge-'+str(a)] = RidgeClassifier(alpha=a) models['sgd'] = SGDClassifier(max_iter=1000, tol=1e-3) models['pa'] = PassiveAggressiveClassifier(max_iter=1000, tol=1e-3) # non-linear models n_neighbors = range(1, 21) for k in n_neighbors: models['knn-'+str(k)] = KNeighborsClassifier(n_neighbors=k) models['cart'] = DecisionTreeClassifier() models['extra'] = ExtraTreeClassifier() models['svml'] = SVC(kernel='linear') models['svmp'] = SVC(kernel='poly') c_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for c in c_values: models['svmr'+str(c)] = SVC(C=c) models['bayes'] = GaussianNB() # ensemble models n_trees = 100 models['ada'] = AdaBoostClassifier(n_estimators=n_trees) models['bag'] = BaggingClassifier(n_estimators=n_trees) models['rf'] = RandomForestClassifier(n_estimators=n_trees) models['et'] = ExtraTreesClassifier(n_estimators=n_trees) models['gbm'] = GradientBoostingClassifier(n_estimators=n_trees) print('Defined %d models' % len(models)) return models # define gradient boosting models def define_gbm_models(models=dict(), use_xgb=True): # define config ranges rates = [0.001, 0.01, 0.1] trees = [50, 100] ss = [0.5, 0.7, 1.0] depth = [3, 7, 9] # add configurations for l in rates: for e in trees: for s in ss: for d in depth: cfg = [l, e, s, d] if use_xgb: name = 'xgb-' + str(cfg) models[name] = XGBClassifier(learning_rate=l, n_estimators=e, subsample=s, max_depth=d) else: name = 'gbm-' + str(cfg) models[name] = GradientBoostingClassifier(learning_rate=l, n_estimators=e, subsample=s, max_depth=d) print('Defined %d models' % len(models)) return models # create a feature preparation pipeline for a model def make_pipeline(model): steps = list() # standardization steps.append(('standardize', StandardScaler())) # normalization steps.append(('normalize', MinMaxScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline # evaluate a single model def evaluate_model(X, y, model, folds, metric): # create the pipeline pipeline = make_pipeline(model) # evaluate model scores = cross_val_score(pipeline, X, y, scoring=metric, cv=folds, n_jobs=-1) return scores # evaluate a model and try to trap errors and and hide warnings def robust_evaluate_model(X, y, model, folds, metric): scores = None try: with warnings.catch_warnings(): warnings.filterwarnings("ignore") scores = evaluate_model(X, y, model, folds, metric) except: scores = None return scores # evaluate a dict of models {name:object}, returns {name:score} def evaluate_models(X, y, models, folds=10, metric='accuracy'): results = dict() for name, model in models.items(): # evaluate the model scores = robust_evaluate_model(X, y, model, folds, metric) # show process if scores is not None: # store a result results[name] = scores mean_score, std_score = mean(scores), std(scores) print('>%s: %.3f (+/-%.3f)' % (name, mean_score, std_score)) else: print('>%s: error' % name) return results # print and plot the top n results def summarize_results(results, maximize=True, top_n=10): # check for no results if len(results) == 0: print('no results') return # determine how many results to summarize n = min(top_n, len(results)) # create a list of (name, mean(scores)) tuples mean_scores = [(k,mean(v)) for k,v in results.items()] # sort tuples by mean score mean_scores = sorted(mean_scores, key=lambda x: x[1]) # reverse for descending order (e.g. for accuracy) if maximize: mean_scores = list(reversed(mean_scores)) # retrieve the top n for summarization names = [x[0] for x in mean_scores[:n]] scores = [results[x[0]] for x in mean_scores[:n]] # print the top n print() for i in range(n): name = names[i] mean_score, std_score = mean(results[name]), std(results[name]) print('Rank=%d, Name=%s, Score=%.3f (+/- %.3f)' % (i+1, name, mean_score, std_score)) # boxplot for the top n pyplot.boxplot(scores, labels=names) _, labels = pyplot.xticks() pyplot.setp(labels, rotation=90) pyplot.savefig('spotcheck.png') # load dataset X, y = load_dataset() # get model list models = define_models() # add gbm models models = define_gbm_models(models) # evaluate models results = evaluate_models(X, y, models) # summarize results summarize_results(results)

Running the example shows that indeed some XGBoost models perform well on the problem.

... >xgb-[0.1, 100, 1.0, 3]: 0.864 (+/-0.044) >xgb-[0.1, 100, 1.0, 7]: 0.865 (+/-0.036) >xgb-[0.1, 100, 1.0, 9]: 0.867 (+/-0.039) Rank=1, Name=xgb-[0.1, 50, 1.0, 3], Score=0.872 (+/- 0.039) Rank=2, Name=et, Score=0.869 (+/- 0.033) Rank=3, Name=xgb-[0.1, 50, 1.0, 9], Score=0.868 (+/- 0.038) Rank=4, Name=xgb-[0.1, 100, 1.0, 9], Score=0.867 (+/- 0.039) Rank=5, Name=xgb-[0.01, 50, 1.0, 3], Score=0.867 (+/- 0.035) Rank=6, Name=xgb-[0.1, 50, 1.0, 7], Score=0.867 (+/- 0.037) Rank=7, Name=xgb-[0.001, 100, 0.7, 9], Score=0.866 (+/- 0.040) Rank=8, Name=xgb-[0.01, 100, 1.0, 3], Score=0.866 (+/- 0.037) Rank=9, Name=xgb-[0.001, 100, 0.7, 3], Score=0.866 (+/- 0.034) Rank=10, Name=xgb-[0.01, 50, 0.7, 3], Score=0.866 (+/- 0.034)

The above results also highlight the noisy nature of the evaluations, e.g. the results of extra trees in this run are different from the run above (0.858 vs 0.869).

We are using k-fold cross-validation to produce a population of scores, but the population is small and the calculated mean will be noisy.

This is fine as long as we take the spot-check results as a starting point and not definitive results of an algorithm on the problem. This is hard to do; it takes discipline in the practitioner.

Alternately, you may want to adapt the framework such that the model evaluation scheme better matches the model evaluation scheme you intend to use for your specific problem.

For example, when evaluating stochastic algorithms like bagged or boosted decision trees, it is a good idea to run each experiment multiple times on the same train/test sets (called repeats) in order to account for the stochastic nature of the learning algorithm.

We can update the *evaluate_model()* function to repeat the evaluation of a given model n-times, with a different split of the data each time, then return all scores. For example, three repeats of 10-fold cross-validation will result in 30 scores from each to calculate a mean performance of a model.

# evaluate a single model def evaluate_model(X, y, model, folds, repeats, metric): # create the pipeline pipeline = make_pipeline(model) # evaluate model scores = list() # repeat model evaluation n times for _ in range(repeats): # perform run scores_r = cross_val_score(pipeline, X, y, scoring=metric, cv=folds, n_jobs=-1) # add scores to list scores += scores_r.tolist() return scores

Alternately, you may prefer to calculate a mean score from each k-fold cross-validation run, then calculate a grand mean of all runs, as described in:

We can then update the *robust_evaluate_model()* function to pass down the repeats argument and the *evaluate_models()* function to define a default, such as 3.

A complete example of the binary classification example with three repeats of model evaluation is listed below.

# binary classification spot check script import warnings from numpy import mean from numpy import std from matplotlib import pyplot from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import MinMaxScaler from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.linear_model import RidgeClassifier from sklearn.linear_model import SGDClassifier from sklearn.linear_model import PassiveAggressiveClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.tree import ExtraTreeClassifier from sklearn.svm import SVC from sklearn.naive_bayes import GaussianNB from sklearn.ensemble import AdaBoostClassifier from sklearn.ensemble import BaggingClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import ExtraTreesClassifier from sklearn.ensemble import GradientBoostingClassifier # load the dataset, returns X and y elements def load_dataset(): return make_classification(n_samples=1000, n_classes=2, random_state=1) # create a dict of standard models to evaluate {name:object} def define_models(models=dict()): # linear models models['logistic'] = LogisticRegression() alpha = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for a in alpha: models['ridge-'+str(a)] = RidgeClassifier(alpha=a) models['sgd'] = SGDClassifier(max_iter=1000, tol=1e-3) models['pa'] = PassiveAggressiveClassifier(max_iter=1000, tol=1e-3) # non-linear models n_neighbors = range(1, 21) for k in n_neighbors: models['knn-'+str(k)] = KNeighborsClassifier(n_neighbors=k) models['cart'] = DecisionTreeClassifier() models['extra'] = ExtraTreeClassifier() models['svml'] = SVC(kernel='linear') models['svmp'] = SVC(kernel='poly') c_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for c in c_values: models['svmr'+str(c)] = SVC(C=c) models['bayes'] = GaussianNB() # ensemble models n_trees = 100 models['ada'] = AdaBoostClassifier(n_estimators=n_trees) models['bag'] = BaggingClassifier(n_estimators=n_trees) models['rf'] = RandomForestClassifier(n_estimators=n_trees) models['et'] = ExtraTreesClassifier(n_estimators=n_trees) models['gbm'] = GradientBoostingClassifier(n_estimators=n_trees) print('Defined %d models' % len(models)) return models # create a feature preparation pipeline for a model def make_pipeline(model): steps = list() # standardization steps.append(('standardize', StandardScaler())) # normalization steps.append(('normalize', MinMaxScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline # evaluate a single model def evaluate_model(X, y, model, folds, repeats, metric): # create the pipeline pipeline = make_pipeline(model) # evaluate model scores = list() # repeat model evaluation n times for _ in range(repeats): # perform run scores_r = cross_val_score(pipeline, X, y, scoring=metric, cv=folds, n_jobs=-1) # add scores to list scores += scores_r.tolist() return scores # evaluate a model and try to trap errors and hide warnings def robust_evaluate_model(X, y, model, folds, repeats, metric): scores = None try: with warnings.catch_warnings(): warnings.filterwarnings("ignore") scores = evaluate_model(X, y, model, folds, repeats, metric) except: scores = None return scores # evaluate a dict of models {name:object}, returns {name:score} def evaluate_models(X, y, models, folds=10, repeats=3, metric='accuracy'): results = dict() for name, model in models.items(): # evaluate the model scores = robust_evaluate_model(X, y, model, folds, repeats, metric) # show process if scores is not None: # store a result results[name] = scores mean_score, std_score = mean(scores), std(scores) print('>%s: %.3f (+/-%.3f)' % (name, mean_score, std_score)) else: print('>%s: error' % name) return results # print and plot the top n results def summarize_results(results, maximize=True, top_n=10): # check for no results if len(results) == 0: print('no results') return # determine how many results to summarize n = min(top_n, len(results)) # create a list of (name, mean(scores)) tuples mean_scores = [(k,mean(v)) for k,v in results.items()] # sort tuples by mean score mean_scores = sorted(mean_scores, key=lambda x: x[1]) # reverse for descending order (e.g. for accuracy) if maximize: mean_scores = list(reversed(mean_scores)) # retrieve the top n for summarization names = [x[0] for x in mean_scores[:n]] scores = [results[x[0]] for x in mean_scores[:n]] # print the top n print() for i in range(n): name = names[i] mean_score, std_score = mean(results[name]), std(results[name]) print('Rank=%d, Name=%s, Score=%.3f (+/- %.3f)' % (i+1, name, mean_score, std_score)) # boxplot for the top n pyplot.boxplot(scores, labels=names) _, labels = pyplot.xticks() pyplot.setp(labels, rotation=90) pyplot.savefig('spotcheck.png') # load dataset X, y = load_dataset() # get model list models = define_models() # evaluate models results = evaluate_models(X, y, models) # summarize results summarize_results(results)

Running the example produces a more robust estimate of the scores.

... >bag: 0.861 (+/-0.037) >rf: 0.859 (+/-0.036) >et: 0.869 (+/-0.035) >gbm: 0.867 (+/-0.044) Rank=1, Name=et, Score=0.869 (+/- 0.035) Rank=2, Name=gbm, Score=0.867 (+/- 0.044) Rank=3, Name=bag, Score=0.861 (+/- 0.037) Rank=4, Name=rf, Score=0.859 (+/- 0.036) Rank=5, Name=ada, Score=0.850 (+/- 0.035) Rank=6, Name=ridge-0.9, Score=0.848 (+/- 0.038) Rank=7, Name=ridge-0.8, Score=0.848 (+/- 0.038) Rank=8, Name=ridge-0.7, Score=0.848 (+/- 0.038) Rank=9, Name=ridge-0.6, Score=0.848 (+/- 0.038) Rank=10, Name=ridge-0.5, Score=0.848 (+/- 0.038)

There will still be some variance in the reported means, but less than a single run of k-fold cross-validation.

The number of repeats may be increased to further reduce this variance, at the cost of longer run times, and perhaps against the intent of spot checking.

I am a big fan of avoiding assumptions and recommendations for data representations prior to fitting models.

Instead, I like to also spot-check multiple representations and transforms of input data, which I refer to as views. I explain this more in the post:

We can update the framework to spot-check multiple different representations for each model.

One way to do this is to update the *evaluate_models()* function so that we can provide a list of *make_pipeline()* functions that can be used for each defined model.

# evaluate a dict of models {name:object}, returns {name:score} def evaluate_models(X, y, models, pipe_funcs, folds=10, metric='accuracy'): results = dict() for name, model in models.items(): # evaluate model under each preparation function for i in range(len(pipe_funcs)): # evaluate the model scores = robust_evaluate_model(X, y, model, folds, metric, pipe_funcs[i]) # update name run_name = str(i) + name # show process if scores is not None: # store a result results[run_name] = scores mean_score, std_score = mean(scores), std(scores) print('>%s: %.3f (+/-%.3f)' % (run_name, mean_score, std_score)) else: print('>%s: error' % run_name) return results

The chosen pipeline function can then be passed along down to the *robust_evaluate_model()* function and to the *evaluate_model()* function where it can be used.

We can then define a bunch of different pipeline functions; for example:

# no transforms pipeline def pipeline_none(model): return model # standardize transform pipeline def pipeline_standardize(model): steps = list() # standardization steps.append(('standardize', StandardScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline # normalize transform pipeline def pipeline_normalize(model): steps = list() # normalization steps.append(('normalize', MinMaxScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline # standardize and normalize pipeline def pipeline_std_norm(model): steps = list() # standardization steps.append(('standardize', StandardScaler())) # normalization steps.append(('normalize', MinMaxScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline

Then create a list of these function names that can be provided to the *evaluate_models()* function.

# define transform pipelines pipelines = [pipeline_none, pipeline_standardize, pipeline_normalize, pipeline_std_norm]

The complete example of the classification case updated to spot check pipeline transforms is listed below.

# binary classification spot check script import warnings from numpy import mean from numpy import std from matplotlib import pyplot from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.preprocessing import StandardScaler from sklearn.preprocessing import MinMaxScaler from sklearn.pipeline import Pipeline from sklearn.linear_model import LogisticRegression from sklearn.linear_model import RidgeClassifier from sklearn.linear_model import SGDClassifier from sklearn.linear_model import PassiveAggressiveClassifier from sklearn.neighbors import KNeighborsClassifier from sklearn.tree import DecisionTreeClassifier from sklearn.tree import ExtraTreeClassifier from sklearn.svm import SVC from sklearn.naive_bayes import GaussianNB from sklearn.ensemble import AdaBoostClassifier from sklearn.ensemble import BaggingClassifier from sklearn.ensemble import RandomForestClassifier from sklearn.ensemble import ExtraTreesClassifier from sklearn.ensemble import GradientBoostingClassifier # load the dataset, returns X and y elements def load_dataset(): return make_classification(n_samples=1000, n_classes=2, random_state=1) # create a dict of standard models to evaluate {name:object} def define_models(models=dict()): # linear models models['logistic'] = LogisticRegression() alpha = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for a in alpha: models['ridge-'+str(a)] = RidgeClassifier(alpha=a) models['sgd'] = SGDClassifier(max_iter=1000, tol=1e-3) models['pa'] = PassiveAggressiveClassifier(max_iter=1000, tol=1e-3) # non-linear models n_neighbors = range(1, 21) for k in n_neighbors: models['knn-'+str(k)] = KNeighborsClassifier(n_neighbors=k) models['cart'] = DecisionTreeClassifier() models['extra'] = ExtraTreeClassifier() models['svml'] = SVC(kernel='linear') models['svmp'] = SVC(kernel='poly') c_values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] for c in c_values: models['svmr'+str(c)] = SVC(C=c) models['bayes'] = GaussianNB() # ensemble models n_trees = 100 models['ada'] = AdaBoostClassifier(n_estimators=n_trees) models['bag'] = BaggingClassifier(n_estimators=n_trees) models['rf'] = RandomForestClassifier(n_estimators=n_trees) models['et'] = ExtraTreesClassifier(n_estimators=n_trees) models['gbm'] = GradientBoostingClassifier(n_estimators=n_trees) print('Defined %d models' % len(models)) return models # no transforms pipeline def pipeline_none(model): return model # standardize transform pipeline def pipeline_standardize(model): steps = list() # standardization steps.append(('standardize', StandardScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline # normalize transform pipeline def pipeline_normalize(model): steps = list() # normalization steps.append(('normalize', MinMaxScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline # standardize and normalize pipeline def pipeline_std_norm(model): steps = list() # standardization steps.append(('standardize', StandardScaler())) # normalization steps.append(('normalize', MinMaxScaler())) # the model steps.append(('model', model)) # create pipeline pipeline = Pipeline(steps=steps) return pipeline # evaluate a single model def evaluate_model(X, y, model, folds, metric, pipe_func): # create the pipeline pipeline = pipe_func(model) # evaluate model scores = cross_val_score(pipeline, X, y, scoring=metric, cv=folds, n_jobs=-1) return scores # evaluate a model and try to trap errors and and hide warnings def robust_evaluate_model(X, y, model, folds, metric, pipe_func): scores = None try: with warnings.catch_warnings(): warnings.filterwarnings("ignore") scores = evaluate_model(X, y, model, folds, metric, pipe_func) except: scores = None return scores # evaluate a dict of models {name:object}, returns {name:score} def evaluate_models(X, y, models, pipe_funcs, folds=10, metric='accuracy'): results = dict() for name, model in models.items(): # evaluate model under each preparation function for i in range(len(pipe_funcs)): # evaluate the model scores = robust_evaluate_model(X, y, model, folds, metric, pipe_funcs[i]) # update name run_name = str(i) + name # show process if scores is not None: # store a result results[run_name] = scores mean_score, std_score = mean(scores), std(scores) print('>%s: %.3f (+/-%.3f)' % (run_name, mean_score, std_score)) else: print('>%s: error' % run_name) return results # print and plot the top n results def summarize_results(results, maximize=True, top_n=10): # check for no results if len(results) == 0: print('no results') return # determine how many results to summarize n = min(top_n, len(results)) # create a list of (name, mean(scores)) tuples mean_scores = [(k,mean(v)) for k,v in results.items()] # sort tuples by mean score mean_scores = sorted(mean_scores, key=lambda x: x[1]) # reverse for descending order (e.g. for accuracy) if maximize: mean_scores = list(reversed(mean_scores)) # retrieve the top n for summarization names = [x[0] for x in mean_scores[:n]] scores = [results[x[0]] for x in mean_scores[:n]] # print the top n print() for i in range(n): name = names[i] mean_score, std_score = mean(results[name]), std(results[name]) print('Rank=%d, Name=%s, Score=%.3f (+/- %.3f)' % (i+1, name, mean_score, std_score)) # boxplot for the top n pyplot.boxplot(scores, labels=names) _, labels = pyplot.xticks() pyplot.setp(labels, rotation=90) pyplot.savefig('spotcheck.png') # load dataset X, y = load_dataset() # get model list models = define_models() # define transform pipelines pipelines = [pipeline_none, pipeline_standardize, pipeline_normalize, pipeline_std_norm] # evaluate models results = evaluate_models(X, y, models, pipelines) # summarize results summarize_results(results)

Running the example shows that we differentiate the results for each pipeline by adding the pipeline number to the beginning of the algorithm description name, e.g. ‘*0rf*‘ means RF with the first pipeline, which is no transforms.

The ensembles of trees algorithms perform well on this problem, and these algorithms are invariant to data scaling. This means that their results on each pipeline will be similar (or the same) and in turn they will crowd out other algorithms in the top-10 list.

... >0gbm: 0.865 (+/-0.044) >1gbm: 0.865 (+/-0.044) >2gbm: 0.865 (+/-0.044) >3gbm: 0.865 (+/-0.044) Rank=1, Name=3rf, Score=0.870 (+/- 0.034) Rank=2, Name=2rf, Score=0.870 (+/- 0.034) Rank=3, Name=1rf, Score=0.870 (+/- 0.034) Rank=4, Name=0rf, Score=0.870 (+/- 0.034) Rank=5, Name=3bag, Score=0.866 (+/- 0.039) Rank=6, Name=2bag, Score=0.866 (+/- 0.039) Rank=7, Name=1bag, Score=0.866 (+/- 0.039) Rank=8, Name=0bag, Score=0.866 (+/- 0.039) Rank=9, Name=3gbm, Score=0.865 (+/- 0.044) Rank=10, Name=2gbm, Score=0.865 (+/- 0.044)

This section provides more resources on the topic if you are looking to go deeper.

- Why you should be Spot-Checking Algorithms on your Machine Learning Problems
- Spot-Check Classification Machine Learning Algorithms in Python with scikit-learn
- Spot-Check Regression Machine Learning Algorithms in Python with scikit-learn
- How to Evaluate the Skill of Deep Learning Models
- Why Applied Machine Learning Is Hard
- A Gentle Introduction to Applied Machine Learning as a Search Problem
- How to Get the Most From Your Machine Learning Data

In this tutorial, you discovered the usefulness of spot-checking algorithms on a new predictive modeling problem and how to develop a standard framework for spot-checking algorithms in python for classification and regression problems.

Specifically, you learned:

- Spot-checking provides a way to quickly discover the types of algorithms that perform well on your predictive modeling problem.
- How to develop a generic framework for loading data, defining models, evaluating models, and summarizing results.
- How to apply the framework for classification and regression problems.

Have you used this framework or do you have some further suggestions to improve it?

Let me know in the comments.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Develop a Reusable Framework to Spot-Check Algorithms in Python appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Probability Scoring Methods in Python appeared first on Machine Learning Mastery.

]]>Develop an Intuition for Different Metrics.

Predicting probabilities instead of class labels for a classification problem can provide additional nuance and uncertainty for the predictions.

The added nuance allows more sophisticated metrics to be used to interpret and evaluate the predicted probabilities. In general, methods for the evaluation of the accuracy of predicted probabilities are referred to as scoring rules or scoring functions.

In this tutorial, you will discover three scoring methods that you can use to evaluate the predicted probabilities on your classification predictive modeling problem.

After completing this tutorial, you will know:

- The log loss score that heavily penalizes predicted probabilities far away from their expected value.
- The Brier score that is gentler than log loss but still penalizes proportional to the distance from the expected value.
- The area under ROC curve that summarizes the likelihood of the model predicting a higher probability for true positive cases than true negative cases.

Let’s get started.

**Update Sept/2018**: Fixed description of AUC no skill.

This tutorial is divided into four parts; they are:

- Log Loss Score
- Brier Score
- ROC AUC Score
- Tuning Predicted Probabilities

Log loss, also called “logistic loss,” “logarithmic loss,” or “cross entropy” can be used as a measure for evaluating predicted probabilities.

Each predicted probability is compared to the actual class output value (0 or 1) and a score is calculated that penalizes the probability based on the distance from the expected value. The penalty is logarithmic, offering a small score for small differences (0.1 or 0.2) and enormous score for a large difference (0.9 or 1.0).

A model with perfect skill has a log loss score of 0.0.

In order to summarize the skill of a model using log loss, the log loss is calculated for each predicted probability, and the average loss is reported.

The log loss can be implemented in Python using the log_loss() function in scikit-learn.

For example:

from sklearn.metrics import log_loss ... model = ... testX, testy = ... # predict probabilities probs = model.predict_proba(testX) # keep the predictions for class 1 only probs = probs[:, 1] # calculate log loss loss = log_loss(testy, probs)

In the binary classification case, the function takes a list of true outcome values and a list of probabilities as arguments and calculates the average log loss for the predictions.

We can make a single log loss score concrete with an example.

Given a specific known outcome of 0, we can predict values of 0.0 to 1.0 in 0.01 increments (101 predictions) and calculate the log loss for each. The result is a curve showing how much each prediction is penalized as the probability gets further away from the expected value. We can repeat this for a known outcome of 1 and see the same curve in reverse.

The complete example is listed below.

# plot impact of logloss for single forecasts from sklearn.metrics import log_loss from matplotlib import pyplot from numpy import array # predictions as 0 to 1 in 0.01 increments yhat = [x*0.01 for x in range(0, 101)] # evaluate predictions for a 0 true value losses_0 = [log_loss([0], [x], labels=[0,1]) for x in yhat] # evaluate predictions for a 1 true value losses_1 = [log_loss([1], [x], labels=[0,1]) for x in yhat] # plot input to loss pyplot.plot(yhat, losses_0, label='true=0') pyplot.plot(yhat, losses_1, label='true=1') pyplot.legend() pyplot.show()

Running the example creates a line plot showing the loss scores for probability predictions from 0.0 to 1.0 for both the case where the true label is 0 and 1.

This helps to build an intuition for the effect that the loss score has when evaluating predictions.

Model skill is reported as the average log loss across the predictions in a test dataset.

As an average, we can expect that the score will be suitable with a balanced dataset and misleading when there is a large imbalance between the two classes in the test set. This is because predicting 0 or small probabilities will result in a small loss.

We can demonstrate this by comparing the distribution of loss values when predicting different constant probabilities for a balanced and an imbalanced dataset.

First, the example below predicts values from 0.0 to 1.0 in 0.1 increments for a balanced dataset of 50 examples of class 0 and 1.

# plot impact of logloss with balanced datasets from sklearn.metrics import log_loss from matplotlib import pyplot from numpy import array # define an imbalanced dataset testy = [0 for x in range(50)] + [1 for x in range(50)] # loss for predicting different fixed probability values predictions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] losses = [log_loss(testy, [y for x in range(len(testy))]) for y in predictions] # plot predictions vs loss pyplot.plot(predictions, losses) pyplot.show()

Running the example, we can see that a model is better-off predicting probabilities values that are not sharp (close to the edge) and are back towards the middle of the distribution.

The penalty of being wrong with a sharp probability is very large.

We can repeat this experiment with an imbalanced dataset with a 10:1 ratio of class 0 to class 1.

# plot impact of logloss with imbalanced datasets from sklearn.metrics import log_loss from matplotlib import pyplot from numpy import array # define an imbalanced dataset testy = [0 for x in range(100)] + [1 for x in range(10)] # loss for predicting different fixed probability values predictions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] losses = [log_loss(testy, [y for x in range(len(testy))]) for y in predictions] # plot predictions vs loss pyplot.plot(predictions, losses) pyplot.show()

Here, we can see that a model that is skewed towards predicting very small probabilities will perform well, optimistically so.

The naive model that predicts a constant probability of 0.1 will be the baseline model to beat.

The result suggests that model skill evaluated with log loss should be interpreted carefully in the case of an imbalanced dataset, perhaps adjusted relative to the base rate for class 1 in the dataset.

The Brier score, named for Glenn Brier, calculates the mean squared error between predicted probabilities and the expected values.

The score summarizes the magnitude of the error in the probability forecasts.

The error score is always between 0.0 and 1.0, where a model with perfect skill has a score of 0.0.

Predictions that are further away from the expected probability are penalized, but less severely as in the case of log loss.

The skill of a model can be summarized as the average Brier score across all probabilities predicted for a test dataset.

The Brier score can be calculated in Python using the brier_score_loss() function in scikit-learn. It takes the true class values (0, 1) and the predicted probabilities for all examples in a test dataset as arguments and returns the average Brier score.

For example:

from sklearn.metrics import brier_score_loss ... model = ... testX, testy = ... # predict probabilities probs = model.predict_proba(testX) # keep the predictions for class 1 only probs = probs[:, 1] # calculate bier score loss = brier_score_loss(testy, probs)

We can evaluate the impact of prediction errors by comparing the Brier score for single probability forecasts in increasing error from 0.0 to 1.0.

The complete example is listed below.

# plot impact of brier for single forecasts from sklearn.metrics import brier_score_loss from matplotlib import pyplot from numpy import array # predictions as 0 to 1 in 0.01 increments yhat = [x*0.01 for x in range(0, 101)] # evaluate predictions for a 1 true value losses = [brier_score_loss([1], [x], pos_label=[1]) for x in yhat] # plot input to loss pyplot.plot(yhat, losses) pyplot.show()

Running the example creates a plot of the probability prediction error in absolute terms (x-axis) to the calculated Brier score (y axis).

We can see a familiar quadratic curve, increasing from 0 to 1 with the squared error.

Model skill is reported as the average Brier across the predictions in a test dataset.

As with log loss, we can expect that the score will be suitable with a balanced dataset and misleading when there is a large imbalance between the two classes in the test set.

We can demonstrate this by comparing the distribution of loss values when predicting different constant probabilities for a balanced and an imbalanced dataset.

First, the example below predicts values from 0.0 to 1.0 in 0.1 increments for a balanced dataset of 50 examples of class 0 and 1.

# plot impact of brier score with balanced datasets from sklearn.metrics import brier_score_loss from matplotlib import pyplot from numpy import array # define an imbalanced dataset testy = [0 for x in range(50)] + [1 for x in range(50)] # brier score for predicting different fixed probability values predictions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] losses = [brier_score_loss(testy, [y for x in range(len(testy))]) for y in predictions] # plot predictions vs loss pyplot.plot(predictions, losses) pyplot.show()

Running the example, we can see that a model is better-off predicting middle of the road probabilities values like 0.5.

Unlike log loss that is quite flat for close probabilities, the parabolic shape shows the clear quadratic increase in the score penalty as the error is increased.

We can repeat this experiment with an imbalanced dataset with a 10:1 ratio of class 0 to class 1.

# plot impact of brier score with imbalanced datasets from sklearn.metrics import brier_score_loss from matplotlib import pyplot from numpy import array # define an imbalanced dataset testy = [0 for x in range(100)] + [1 for x in range(10)] # brier score for predicting different fixed probability values predictions = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] losses = [brier_score_loss(testy, [y for x in range(len(testy))]) for y in predictions] # plot predictions vs loss pyplot.plot(predictions, losses) pyplot.show()

Running the example, we see a very different picture for the imbalanced dataset.

Like the average log loss, the average Brier score will present optimistic scores on an imbalanced dataset, rewarding small prediction values that reduce error on the majority class.

In these cases, Brier score should be compared relative to the naive prediction (e.g. the base rate of the minority class or 0.1 in the above example) or normalized by the naive score.

This latter example is common and is called the Brier Skill Score (BSS).

BSS = 1 - (BS / BS_ref)

Where BS is the Brier skill of model, and BS_ref is the Brier skill of the naive prediction.

The Brier Skill Score reports the relative skill of the probability prediction over the naive forecast.

A good update to the scikit-learn API would be to add a parameter to the *brier_score_loss()* to support the calculation of the Brier Skill Score.

A predicted probability for a binary (two-class) classification problem can be interpreted with a threshold.

The threshold defines the point at which the probability is mapped to class 0 versus class 1, where the default threshold is 0.5. Alternate threshold values allow the model to be tuned for higher or lower false positives and false negatives.

Tuning the threshold by the operator is particularly important on problems where one type of error is more or less important than another or when a model is makes disproportionately more or less of a specific type of error.

The Receiver Operating Characteristic, or ROC, curve is a plot of the true positive rate versus the false positive rate for the predictions of a model for multiple thresholds between 0.0 and 1.0.

Predictions that have no skill for a given threshold are drawn on the diagonal of the plot from the bottom left to the top right. This line represents no-skill predictions for each threshold.

Models that have skill have a curve above this diagonal line that bows towards the top left corner.

Below is an example of fitting a logistic regression model on a binary classification problem and calculating and plotting the ROC curve for the predicted probabilities on a test set of 500 new data instances.

# roc curve from sklearn.datasets import make_classification from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn.metrics import roc_curve from matplotlib import pyplot # generate 2 class dataset X, y = make_classification(n_samples=1000, n_classes=2, random_state=1) # split into train/test sets trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2) # fit a model model = LogisticRegression() model.fit(trainX, trainy) # predict probabilities probs = model.predict_proba(testX) # keep probabilities for the positive outcome only probs = probs[:, 1] # calculate roc curve fpr, tpr, thresholds = roc_curve(testy, probs) # plot no skill pyplot.plot([0, 1], [0, 1], linestyle='--') # plot the roc curve for the model pyplot.plot(fpr, tpr) # show the plot pyplot.show()

Running the example creates an example of a ROC curve that can be compared to the no skill line on the main diagonal.

The integrated area under the ROC curve, called AUC or ROC AUC, provides a measure of the skill of the model across all evaluated thresholds.

An AUC score of 0.5 suggests no skill, e.g. a curve along the diagonal, whereas an AUC of 1.0 suggests perfect skill, all points along the left y-axis and top x-axis toward the top left corner. An AUC of 0.0 suggests perfectly incorrect predictions.

Predictions by models that have a larger area have better skill across the thresholds, although the specific shape of the curves between models will vary, potentially offering opportunity to optimize models by a pre-chosen threshold. Typically, the threshold is chosen by the operator after the model has been prepared.

The AUC can be calculated in Python using the roc_auc_score() function in scikit-learn.

This function takes a list of true output values and predicted probabilities as arguments and returns the ROC AUC.

For example:

from sklearn.metrics import roc_auc_score ... model = ... testX, testy = ... # predict probabilities probs = model.predict_proba(testX) # keep the predictions for class 1 only probs = probs[:, 1] # calculate log loss loss = roc_auc_score(testy, probs)

An AUC score is a measure of the likelihood that the model that produced the predictions will rank a randomly chosen positive example above a randomly chosen negative example. Specifically, that the probability will be higher for a real event (class=1) than a real non-event (class=0).

This is an instructive definition that offers two important intuitions:

**Naive Prediction**. A naive prediction under ROC AUC is any constant probability. If the same probability is predicted for every example, there is no discrimination between positive and negative cases, therefore the model has no skill (AUC=0.5).**Insensitivity to Class Imbalance**. ROC AUC is a summary on the models ability to correctly discriminate a single example across different thresholds. As such, it is unconcerned with the base likelihood of each class.

Below, the example demonstrating the ROC curve is updated to calculate and display the AUC.

# roc auc from sklearn.datasets import make_classification from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn.metrics import roc_auc_score from matplotlib import pyplot # generate 2 class dataset X, y = make_classification(n_samples=1000, n_classes=2, random_state=1) # split into train/test sets trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2) # fit a model model = LogisticRegression() model.fit(trainX, trainy) # predict probabilities probs = model.predict_proba(testX) # keep probabilities for the positive outcome only probs = probs[:, 1] # calculate roc auc auc = roc_auc_score(testy, probs) print(auc)

Running the example calculates and prints the ROC AUC for the logistic regression model evaluated on 500 new examples.

0.9028044871794871

An important consideration in choosing the ROC AUC is that it does not summarize the specific discriminative power of the model, rather the general discriminative power across all thresholds.

It might be a better tool for model selection rather than in quantifying the practical skill of a model’s predicted probabilities.

Predicted probabilities can be tuned to improve or even game a performance measure.

For example, the log loss and Brier scores quantify the average amount of error in the probabilities. As such, predicted probabilities can be tuned to improve these scores in a few ways:

**Making the probabilities less sharp (less confident)**. This means adjusting the predicted probabilities away from the hard 0 and 1 bounds to limit the impact of penalties of being completely wrong.**Shift the distribution to the naive prediction (base rate)**. This means shifting the mean of the predicted probabilities to the probability of the base rate, such as 0.5 for a balanced prediction problem.

Generally, it may be useful to review the calibration of the probabilities using tools like a reliability diagram. This can be achieved using the calibration_curve() function in scikit-learn.

Some algorithms, such as SVM and neural networks, may not predict calibrated probabilities natively. In these cases, the probabilities can be calibrated and in turn may improve the chosen metric. Classifiers can be calibrated in scikit-learn using the CalibratedClassifierCV class.

This section provides more resources on the topic if you are looking to go deeper.

- sklearn.metrics.log_loss API
- sklearn.metrics.brier_score_loss API
- sklearn.metrics.roc_curve API
- sklearn.metrics.roc_auc_score API
- sklearn.calibration.calibration_curve API
- sklearn.calibration.CalibratedClassifierCV API

- Scoring rule, Wikipedia
- Cross entropy, Wikipedia
- Log Loss, fast.ai
- Brier score, Wikipedia
- Receiver operating characteristic, Wikipedia

In this tutorial, you discovered three metrics that you can use to evaluate the predicted probabilities on your classification predictive modeling problem.

Specifically, you learned:

- The log loss score that heavily penalizes predicted probabilities far away from their expected value.
- The Brier score that is gentler than log loss but still penalizes proportional to the distance from the expected value
- The area under ROC curve that summarizes the likelihood of the model predicting a higher probability for true positive cases than true negative cases.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Probability Scoring Methods in Python appeared first on Machine Learning Mastery.

]]>The post How and When to Use a Calibrated Classification Model with scikit-learn appeared first on Machine Learning Mastery.

]]>Predicting probabilities allows some flexibility including deciding how to interpret the probabilities, presenting predictions with uncertainty, and providing more nuanced ways to evaluate the skill of the model.

Predicted probabilities that match the expected distribution of probabilities for each class are referred to as calibrated. The problem is, not all machine learning models are capable of predicting calibrated probabilities.

There are methods to both diagnose how calibrated predicted probabilities are and to better calibrate the predicted probabilities with the observed distribution of each class. Often, this can lead to better quality predictions, depending on how the skill of the model is evaluated.

In this tutorial, you will discover the importance of calibrating predicted probabilities and how to diagnose and improve the calibration of models used for probabilistic classification.

After completing this tutorial, you will know:

- Nonlinear machine learning algorithms often predict uncalibrated class probabilities.
- Reliability diagrams can be used to diagnose the calibration of a model, and methods can be used to better calibrate predictions for a problem.
- How to develop reliability diagrams and calibrate classification models in Python with scikit-learn.

Let’s get started.

This tutorial is divided into four parts; they are:

- Predicting Probabilities
- Calibration of Predictions
- How to Calibrate Probabilities in Python
- Worked Example of Calibrating SVM Probabilities

A classification predictive modeling problem requires predicting or forecasting a label for a given observation.

An alternative to predicting the label directly, a model may predict the probability of an observation belonging to each possible class label.

This provides some flexibility both in the way predictions are interpreted and presented (choice of threshold and prediction uncertainty) and in the way the model is evaluated.

Although a model may be able to predict probabilities, the distribution and behavior of the probabilities may not match the expected distribution of observed probabilities in the training data.

This is especially common with complex nonlinear machine learning algorithms that do not directly make probabilistic predictions and instead use approximations.

The distribution of the probabilities can be adjusted to better match the expected distribution observed in the data. This adjustment is referred to as calibration, as in the calibration of the model or the calibration of the distribution of class probabilities.

[…] we desire that the estimated class probabilities are reflective of the true underlying probability of the sample. That is, the predicted class probability (or probability-like value) needs to be well-calibrated. To be well-calibrated, the probabilities must effectively reflect the true likelihood of the event of interest.

— Page 249, Applied Predictive Modeling, 2013.

There are two concerns in calibrating probabilities; they are diagnosing the calibration of predicted probabilities and the calibration process itself.

A reliability diagram is a line plot of the relative frequency of what was observed (y-axis) versus the predicted probability frequency (x-axis).

Reliability diagrams are common aids for illustrating the properties of probabilistic forecast systems. They consist of a plot of the observed relative frequency against the predicted probability, providing a quick visual intercomparison when tuning probabilistic forecast systems, as well as documenting the performance of the final product

— Increasing the Reliability of Reliability Diagrams, 2007.

Specifically, the predicted probabilities are divided up into a fixed number of buckets along the x-axis. The number of events (class=1) are then counted for each bin (e.g. the relative observed frequency). Finally, the counts are normalized. The results are then plotted as a line plot.

These plots are commonly referred to as ‘*reliability*‘ diagrams in forecast literature, although may also be called ‘*calibration*‘ plots or curves as they summarize how well the forecast probabilities are calibrated.

The better calibrated or more reliable a forecast, the closer the points will appear along the main diagonal from the bottom left to the top right of the plot.

The position of the points or the curve relative to the diagonal can help to interpret the probabilities; for example:

**Below the diagonal**: The model has over-forecast; the probabilities are too large.**Above the diagonal**: The model has under-forecast; the probabilities are too small.

Probabilities, by definition, are continuous, so we expect some separation from the line, often shown as an S-shaped curve showing pessimistic tendencies over-forecasting low probabilities and under-forecasting high probabilities.

Reliability diagrams provide a diagnostic to check whether the forecast value Xi is reliable. Roughly speaking, a probability forecast is reliable if the event actually happens with an observed relative frequency consistent with the forecast value.

— Increasing the Reliability of Reliability Diagrams, 2007.

The reliability diagram can help to understand the relative calibration of the forecasts from different predictive models.

The predictions made by a predictive model can be calibrated.

Calibrated predictions may (or may not) result in an improved calibration on a reliability diagram.

Some algorithms are fit in such a way that their predicted probabilities are already calibrated. Without going into details why, logistic regression is one such example.

Other algorithms do not directly produce predictions of probabilities, and instead a prediction of probabilities must be approximated. Some examples include neural networks, support vector machines, and decision trees.

The predicted probabilities from these methods will likely be uncalibrated and may benefit from being modified via calibration.

Calibration of prediction probabilities is a rescaling operation that is applied after the predictions have been made by a predictive model.

There are two popular approaches to calibrating probabilities; they are the Platt Scaling and Isotonic Regression.

Platt Scaling is simpler and is suitable for reliability diagrams with the S-shape. Isotonic Regression is more complex, requires a lot more data (otherwise it may overfit), but can support reliability diagrams with different shapes (is nonparametric).

Platt Scaling is most effective when the distortion in the predicted probabilities is sigmoid-shaped. Isotonic Regression is a more powerful calibration method that can correct any monotonic distortion. Unfortunately, this extra power comes at a price. A learning curve analysis shows that Isotonic Regression is more prone to overfitting, and thus performs worse than Platt Scaling, when data is scarce.

— Predicting Good Probabilities With Supervised Learning, 2005.

**Note, and this is really important**: better calibrated probabilities may or may not lead to better class-based or probability-based predictions. It really depends on the specific metric used to evaluate predictions.

In fact, some empirical results suggest that the algorithms that can benefit the more from calibrating predicted probabilities include SVMs, bagged decision trees, and random forests.

[…] after calibration the best methods are boosted trees, random forests and SVMs.

— Predicting Good Probabilities With Supervised Learning, 2005.

The scikit-learn machine learning library allows you to both diagnose the probability calibration of a classifier and calibrate a classifier that can predict probabilities.

You can diagnose the calibration of a classifier by creating a reliability diagram of the actual probabilities versus the predicted probabilities on a test set.

In scikit-learn, this is called a calibration curve.

This can be implemented by first calculating the calibration_curve() function. This function takes the true class values for a dataset and the predicted probabilities for the main class (class=1). The function returns the true probabilities for each bin and the predicted probabilities for each bin. The number of bins can be specified via the *n_bins* argument and default to 5.

For example, below is a code snippet showing the API usage:

... # predict probabilities probs = model.predic_proba(testX)[:,1] # reliability diagram fop, mpv = calibration_curve(testy, probs, n_bins=10) # plot perfectly calibrated pyplot.plot([0, 1], [0, 1], linestyle='--') # plot model reliability pyplot.plot(mpv, fop, marker='.') pyplot.show()

A classifier can be calibrated in scikit-learn using the CalibratedClassifierCV class.

There are two ways to use this class: prefit and cross-validation.

You can fit a model on a training dataset and calibrate this prefit model using a hold out validation dataset.

For example, below is a code snippet showing the API usage:

... # prepare data trainX, trainy = ... valX, valy = ... testX, testy = ... # fit base model on training dataset model = ... model.fit(trainX, trainy) # calibrate model on validation data calibrator = CalibratedClassifierCV(model, cv='prefit') calibrator.fit(valX, valy) # evaluate the model yhat = calibrator.predict(testX)

Alternately, the CalibratedClassifierCV can fit multiple copies of the model using k-fold cross-validation and calibrate the probabilities predicted by these models using the hold out set. Predictions are made using each of the trained models.

For example, below is a code snippet showing the API usage:

... # prepare data trainX, trainy = ... testX, testy = ... # define base model model = ... # fit and calibrate model on training data calibrator = CalibratedClassifierCV(model, cv=3) calibrator.fit(trainX, trainy) # evaluate the model yhat = calibrator.predict(testX)

The CalibratedClassifierCV class supports two types of probability calibration; specifically, the parametric ‘*sigmoid*‘ method (Platt’s method) and the nonparametric ‘*isotonic*‘ method which can be specified via the ‘*method*‘ argument.

We can make the discussion of calibration concrete with some worked examples.

In these examples, we will fit a support vector machine (SVM) to a noisy binary classification problem and use the model to predict probabilities, then review the calibration using a reliability diagram and calibrate the classifier and review the result.

SVM is a good candidate model to calibrate because it does not natively predict probabilities, meaning the probabilities are often uncalibrated.

**A note on SVM**: probabilities can be predicted by calling the *decision_function()* function on the fit model instead of the usual *predict_proba()* function. The probabilities are not normalized, but can be normalized when calling the *calibration_curve()* function by setting the ‘*normalize*‘ argument to ‘*True*‘.

The example below fits an SVM model on the test problem, predicted probabilities, and plots the calibration of the probabilities as a reliability diagram,

# SVM reliability diagram from sklearn.datasets import make_classification from sklearn.svm import SVC from sklearn.model_selection import train_test_split from sklearn.calibration import calibration_curve from matplotlib import pyplot # generate 2 class dataset X, y = make_classification(n_samples=1000, n_classes=2, weights=[1,1], random_state=1) # split into train/test sets trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2) # fit a model model = SVC() model.fit(trainX, trainy) # predict probabilities probs = model.decision_function(testX) # reliability diagram fop, mpv = calibration_curve(testy, probs, n_bins=10, normalize=True) # plot perfectly calibrated pyplot.plot([0, 1], [0, 1], linestyle='--') # plot model reliability pyplot.plot(mpv, fop, marker='.') pyplot.show()

Running the example creates a reliability diagram showing the calibration of the SVMs predicted probabilities (solid line) compared to a perfectly calibrated model along the diagonal of the plot (dashed line.)

We can see the expected S-shaped curve of a conservative forecast.

We can update the example to fit the SVM via the CalibratedClassifierCV class using 5-fold cross-validation, using the holdout sets to calibrate the predicted probabilities.

The complete example is listed below.

# SVM reliability diagram with calibration from sklearn.datasets import make_classification from sklearn.svm import SVC from sklearn.calibration import CalibratedClassifierCV from sklearn.model_selection import train_test_split from sklearn.calibration import calibration_curve from matplotlib import pyplot # generate 2 class dataset X, y = make_classification(n_samples=1000, n_classes=2, weights=[1,1], random_state=1) # split into train/test sets trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2) # fit a model model = SVC() calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5) calibrated.fit(trainX, trainy) # predict probabilities probs = calibrated.predict_proba(testX)[:, 1] # reliability diagram fop, mpv = calibration_curve(testy, probs, n_bins=10, normalize=True) # plot perfectly calibrated pyplot.plot([0, 1], [0, 1], linestyle='--') # plot calibrated reliability pyplot.plot(mpv, fop, marker='.') pyplot.show()

Running the example creates a reliability diagram for the calibrated probabilities.

The shape of the calibrated probabilities is different, hugging the diagonal line much better, although still under-forecasting in the upper quadrant.

Visually, the plot suggests a better calibrated model.

We can make the contrast between the two models more obvious by including both reliability diagrams on the same plot.

The complete example is listed below.

# SVM reliability diagrams with uncalibrated and calibrated probabilities from sklearn.datasets import make_classification from sklearn.svm import SVC from sklearn.calibration import CalibratedClassifierCV from sklearn.model_selection import train_test_split from sklearn.calibration import calibration_curve from matplotlib import pyplot # predict uncalibrated probabilities def uncalibrated(trainX, testX, trainy): # fit a model model = SVC() model.fit(trainX, trainy) # predict probabilities return model.decision_function(testX) # predict calibrated probabilities def calibrated(trainX, testX, trainy): # define model model = SVC() # define and fit calibration model calibrated = CalibratedClassifierCV(model, method='sigmoid', cv=5) calibrated.fit(trainX, trainy) # predict probabilities return calibrated.predict_proba(testX)[:, 1] # generate 2 class dataset X, y = make_classification(n_samples=1000, n_classes=2, weights=[1,1], random_state=1) # split into train/test sets trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2) # uncalibrated predictions yhat_uncalibrated = uncalibrated(trainX, testX, trainy) # calibrated predictions yhat_calibrated = calibrated(trainX, testX, trainy) # reliability diagrams fop_uncalibrated, mpv_uncalibrated = calibration_curve(testy, yhat_uncalibrated, n_bins=10, normalize=True) fop_calibrated, mpv_calibrated = calibration_curve(testy, yhat_calibrated, n_bins=10) # plot perfectly calibrated pyplot.plot([0, 1], [0, 1], linestyle='--', color='black') # plot model reliabilities pyplot.plot(mpv_uncalibrated, fop_uncalibrated, marker='.') pyplot.plot(mpv_calibrated, fop_calibrated, marker='.') pyplot.show()

Running the example creates a single reliability diagram showing both the calibrated (orange) and uncalibrated (blue) probabilities.

It is not really an apples-to-apples comparison as the predictions made by the calibrated model are in fact a combination of five submodels.

Nevertheless, we do see a marked difference in the reliability of the calibrated probabilities (very likely caused by the calibration process).

This section provides more resources on the topic if you are looking to go deeper.

- Applied Predictive Modeling, 2013.
- Predicting Good Probabilities With Supervised Learning, 2005.
- Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, 2001.
- Increasing the Reliability of Reliability Diagrams, 2007.

- sklearn.calibration.CalibratedClassifierCV API
- sklearn.calibration.calibration_curve API
- Probability calibration, scikit-learn User Guide
- Probability Calibration curves, scikit-learn
- Comparison of Calibration of Classifiers, scikit-learn

- CAWCAR Verification Website
- Calibration (statistics) on Wikipedia
- Probabilistic classification on Wikipedia
- Scikit correct way to calibrate classifiers with CalibratedClassifierCV on CrossValidated

In this tutorial, you discovered the importance of calibrating predicted probabilities and how to diagnose and improve the calibration of models used for probabilistic classification.

Specifically, you learned:

- Nonlinear machine learning algorithms often predict uncalibrated class probabilities.
- Reliability diagrams can be used to diagnose the calibration of a model, and methods can be used to better calibrate predictions for a problem.
- How to develop reliability diagrams and calibrate classification models in Python with scikit-learn.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How and When to Use a Calibrated Classification Model with scikit-learn appeared first on Machine Learning Mastery.

]]>The post How and When to Use ROC Curves and Precision-Recall Curves for Classification in Python appeared first on Machine Learning Mastery.

]]>This flexibility comes from the way that probabilities may be interpreted using different thresholds that allow the operator of the model to trade-off concerns in the errors made by the model, such as the number of false positives compared to the number of false negatives. This is required when using models where the cost of one error outweighs the cost of other types of errors.

Two diagnostic tools that help in the interpretation of probabilistic forecast for binary (two-class) classification predictive modeling problems are ROC Curves and Precision-Recall curves.

In this tutorial, you will discover ROC Curves, Precision-Recall Curves, and when to use each to interpret the prediction of probabilities for binary classification problems.

After completing this tutorial, you will know:

- ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.
- Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
- ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.

Let’s get started.

**Update Aug/2018**: Fixed bug in the representation of the no skill line for the precision-recall plot. Also fixed typo where I referred to ROC as relative rather than receiver (thanks spellcheck).**Update Nov/2018**: Fixed description on interpreting size of values on each axis, thanks Karl Humphries.

This tutorial is divided into 6 parts; they are:

- Predicting Probabilities
- What Are ROC Curves?
- ROC Curves and AUC in Python
- What Are Precision-Recall Curves?
- Precision-Recall Curves and AUC in Python
- When to Use ROC vs. Precision-Recall Curves?

In a classification problem, we may decide to predict the class values directly.

Alternately, it can be more flexible to predict the probabilities for each class instead. The reason for this is to provide the capability to choose and even calibrate the threshold for how to interpret the predicted probabilities.

For example, a default might be to use a threshold of 0.5, meaning that a probability in [0.0, 0.49] is a negative outcome (0) and a probability in [0.5, 1.0] is a positive outcome (1).

This threshold can be adjusted to tune the behavior of the model for a specific problem. An example would be to reduce more of one or another type of error.

When making a prediction for a binary or two-class classification problem, there are two types of errors that we could make.

**False Positive**. Predict an event when there was no event.**False Negative**. Predict no event when in fact there was an event.

By predicting probabilities and calibrating a threshold, a balance of these two concerns can be chosen by the operator of the model.

For example, in a smog prediction system, we may be far more concerned with having low false negatives than low false positives. A false negative would mean not warning about a smog day when in fact it is a high smog day, leading to health issues in the public that are unable to take precautions. A false positive means the public would take precautionary measures when they didn’t need to.

A common way to compare models that predict probabilities for two-class problems is to use a ROC curve.

A useful tool when predicting the probability of a binary outcome is the Receiver Operating Characteristic curve, or ROC curve.

It is a plot of the false positive rate (x-axis) versus the true positive rate (y-axis) for a number of different candidate threshold values between 0.0 and 1.0. Put another way, it plots the false alarm rate versus the hit rate.

The true positive rate is calculated as the number of true positives divided by the sum of the number of true positives and the number of false negatives. It describes how good the model is at predicting the positive class when the actual outcome is positive.

True Positive Rate = True Positives / (True Positives + False Negatives)

The true positive rate is also referred to as sensitivity.

Sensitivity = True Positives / (True Positives + False Negatives)

The false positive rate is calculated as the number of false positives divided by the sum of the number of false positives and the number of true negatives.

It is also called the false alarm rate as it summarizes how often a positive class is predicted when the actual outcome is negative.

False Positive Rate = False Positives / (False Positives + True Negatives)

The false positive rate is also referred to as the inverted specificity where specificity is the total number of true negatives divided by the sum of the number of true negatives and false positives.

Specificity = True Negatives / (True Negatives + False Positives)

Where:

False Positive Rate = 1 - Specificity

The ROC curve is a useful tool for a few reasons:

- The curves of different models can be compared directly in general or for different thresholds.
- The area under the curve (AUC) can be used as a summary of the model skill.

The shape of the curve contains a lot of information, including what we might care about most for a problem, the expected false positive rate, and the false negative rate.

To make this clear:

- Smaller values on the x-axis of the plot indicate lower false positives and higher true negatives.
- Larger values on the y-axis of the plot indicate higher true positives and lower false negatives.

If you are confused, remember, when we predict a binary outcome, it is either a correct prediction (true positive) or not (false positive). There is a tension between these options, the same with true negative and false positives.

A skilful model will assign a higher probability to a randomly chosen real positive occurrence than a negative occurrence on average. This is what we mean when we say that the model has skill. Generally, skilful models are represented by curves that bow up to the top left of the plot.

A model with no skill is represented at the point [0.5, 0.5]. A model with no skill at each threshold is represented by a diagonal line from the bottom left of the plot to the top right and has an AUC of 0.5.

A model with perfect skill is represented at a point [0.0 ,1.0]. A model with perfect skill is represented by a line that travels from the bottom left of the plot to the top left and then across the top to the top right.

An operator may plot the ROC curve for the final model and choose a threshold that gives a desirable balance between the false positives and false negatives.

We can plot a ROC curve for a model in Python using the *roc_curve()* scikit-learn function.

The function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class. The function returns the false positive rates for each threshold, true positive rates for each threshold and thresholds.

# calculate roc curve fpr, tpr, thresholds = roc_curve(y, probs)

The AUC for the ROC can be calculated using the *roc_auc_score()* function.

Like the *roc_curve()* function, the AUC function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class. It returns the AUC score between 0.0 and 1.0 for no skill and perfect skill respectively.

# calculate AUC auc = roc_auc_score(y, probs) print('AUC: %.3f' % auc)

A complete example of calculating the ROC curve and AUC for a logistic regression model on a small test problem is listed below.

# roc curve and auc from sklearn.datasets import make_classification from sklearn.neighbors import KNeighborsClassifier from sklearn.model_selection import train_test_split from sklearn.metrics import roc_curve from sklearn.metrics import roc_auc_score from matplotlib import pyplot # generate 2 class dataset X, y = make_classification(n_samples=1000, n_classes=2, weights=[1,1], random_state=1) # split into train/test sets trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2) # fit a model model = KNeighborsClassifier(n_neighbors=3) model.fit(trainX, trainy) # predict probabilities probs = model.predict_proba(testX) # keep probabilities for the positive outcome only probs = probs[:, 1] # calculate AUC auc = roc_auc_score(testy, probs) print('AUC: %.3f' % auc) # calculate roc curve fpr, tpr, thresholds = roc_curve(testy, probs) # plot no skill pyplot.plot([0, 1], [0, 1], linestyle='--') # plot the roc curve for the model pyplot.plot(fpr, tpr, marker='.') # show the plot pyplot.show()

Running the example prints the area under the ROC curve.

AUC: 0.895

A plot of the ROC curve for the model is also created showing that the model has skill.

There are many ways to evaluate the skill of a prediction model.

An approach in the related field of information retrieval (finding documents based on queries) measures precision and recall.

These measures are also useful in applied machine learning for evaluating binary classification models.

Precision is a ratio of the number of true positives divided by the sum of the true positives and false positives. It describes how good a model is at predicting the positive class. Precision is referred to as the positive predictive value.

Positive Predictive Power = True Positives / (True Positives + False Positives)

or

Precision = True Positives / (True Positives + False Positives)

Recall is calculated as the ratio of the number of true positives divided by the sum of the true positives and the false negatives. Recall is the same as sensitivity.

Recall = True Positives / (True Positives + False Negatives)

or

Sensitivity = True Positives / (True Positives + False Negatives)

Recall == Sensitivity

Reviewing both precision and recall is useful in cases where there is an imbalance in the observations between the two classes. Specifically, there are many examples of no event (class 0) and only a few examples of an event (class 1).

The reason for this is that typically the large number of class 0 examples means we are less interested in the skill of the model at predicting class 0 correctly, e.g. high true negatives.

Key to the calculation of precision and recall is that the calculations do not make use of the true negatives. It is only concerned with the correct prediction of the minority class, class 1.

A precision-recall curve is a plot of the precision (y-axis) and the recall (x-axis) for different thresholds, much like the ROC curve.

The no-skill line is defined by the total number of positive cases divide by the total number of positive and negative cases. For a dataset with an equal number of positive and negative cases, this is a straight line at 0.5. Points above this line show skill.

A model with perfect skill is depicted as a point at [1.0,1.0]. A skilful model is represented by a curve that bows towards [1.0,1.0] above the flat line of no skill.

There are also composite scores that attempt to summarize the precision and recall; three examples include:

**F score**or F1 score: that calculates the harmonic mean of the precision and recall (harmonic mean because the precision and recall are ratios).**Average precision**: that summarizes the weighted increase in precision with each change in recall for the thresholds in the precision-recall curve.**Area Under Curve**: like the AUC, summarizes the integral or an approximation of the area under the precision-recall curve.

In terms of model selection, F1 summarizes model skill for a specific probability threshold, whereas average precision and area under curve summarize the skill of a model across thresholds, like ROC AUC.

This makes precision-recall and a plot of precision vs. recall and summary measures useful tools for binary classification problems that have an imbalance in the observations for each class.

Precision and recall can be calculated in scikit-learn via the *precision_score()* and *recall_score()* functions.

The precision and recall can be calculated for thresholds using the *precision_recall_curve()* function that takes the true output values and the probabilities for the positive class as output and returns the precision, recall and threshold values.

# calculate precision-recall curve precision, recall, thresholds = precision_recall_curve(testy, probs)

The F1 score can be calculated by calling the *f1_score()* function that takes the true class values and the predicted class values as arguments.

# calculate F1 score f1 = f1_score(testy, yhat)

The area under the precision-recall curve can be approximated by calling the *auc()* function and passing it the recall and precision values calculated for each threshold.

# calculate precision-recall AUC auc = auc(recall, precision)

Finally, the average precision can be calculated by calling the *average_precision_score()* function and passing it the true class values and the predicted class values.

The complete example is listed below.

# precision-recall curve and f1 from sklearn.datasets import make_classification from sklearn.neighbors import KNeighborsClassifier from sklearn.model_selection import train_test_split from sklearn.metrics import precision_recall_curve from sklearn.metrics import f1_score from sklearn.metrics import auc from sklearn.metrics import average_precision_score from matplotlib import pyplot # generate 2 class dataset X, y = make_classification(n_samples=1000, n_classes=2, weights=[1,1], random_state=1) # split into train/test sets trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2) # fit a model model = KNeighborsClassifier(n_neighbors=3) model.fit(trainX, trainy) # predict probabilities probs = model.predict_proba(testX) # keep probabilities for the positive outcome only probs = probs[:, 1] # predict class values yhat = model.predict(testX) # calculate precision-recall curve precision, recall, thresholds = precision_recall_curve(testy, probs) # calculate F1 score f1 = f1_score(testy, yhat) # calculate precision-recall AUC auc = auc(recall, precision) # calculate average precision score ap = average_precision_score(testy, probs) print('f1=%.3f auc=%.3f ap=%.3f' % (f1, auc, ap)) # plot no skill pyplot.plot([0, 1], [0.5, 0.5], linestyle='--') # plot the roc curve for the model pyplot.plot(recall, precision, marker='.') # show the plot pyplot.show()

Running the example first prints the F1, area under curve (AUC) and average precision (AP) scores.

f1=0.836 auc=0.892 ap=0.840

The precision-recall curve plot is then created showing the precision/recall for each threshold compared to a no skill model.

Generally, the use of ROC curves and precision-recall curves are as follows:

- ROC curves should be used when there are roughly equal numbers of observations for each class.
- Precision-Recall curves should be used when there is a moderate to large class imbalance.

The reason for this recommendation is that ROC curves present an optimistic picture of the model on datasets with a class imbalance.

However, ROC curves can present an overly optimistic view of an algorithm’s performance if there is a large skew in the class distribution. […] Precision-Recall (PR) curves, often used in Information Retrieval , have been cited as an alternative to ROC curves for tasks with a large skew in the class distribution.

— The Relationship Between Precision-Recall and ROC Curves, 2006.

Some go further and suggest that using a ROC curve with an imbalanced dataset might be deceptive and lead to incorrect interpretations of the model skill.

[…] the visual interpretability of ROC plots in the context of imbalanced datasets can be deceptive with respect to conclusions about the reliability of classification performance, owing to an intuitive but wrong interpretation of specificity. [Precision-recall curve] plots, on the other hand, can provide the viewer with an accurate prediction of future classification performance due to the fact that they evaluate the fraction of true positives among positive predictions

The main reason for this optimistic picture is because of the use of true negatives in the False Positive Rate in the ROC Curve and the careful avoidance of this rate in the Precision-Recall curve.

If the proportion of positive to negative instances changes in a test set, the ROC curves will not change. Metrics such as accuracy, precision, lift and F scores use values from both columns of the confusion matrix. As a class distribution changes these measures will change as well, even if the fundamental classifier performance does not. ROC graphs are based upon TP rate and FP rate, in which each dimension is a strict columnar ratio, so do not depend on class distributions.

— ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, 2003.

We can make this concrete with a short example.

Below is the same ROC Curve example with a modified problem where there is a 10:1 ratio of class=0 to class=1 observations.

# roc curve and auc on imbalanced dataset from sklearn.datasets import make_classification from sklearn.neighbors import KNeighborsClassifier from sklearn.model_selection import train_test_split from sklearn.metrics import roc_curve from sklearn.metrics import roc_auc_score from matplotlib import pyplot # generate 2 class dataset X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9,0.09], random_state=1) # split into train/test sets trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2) # fit a model model = KNeighborsClassifier(n_neighbors=3) model.fit(trainX, trainy) # predict probabilities probs = model.predict_proba(testX) # keep probabilities for the positive outcome only probs = probs[:, 1] # calculate AUC auc = roc_auc_score(testy, probs) print('AUC: %.3f' % auc) # calculate roc curve fpr, tpr, thresholds = roc_curve(testy, probs) # plot no skill pyplot.plot([0, 1], [0, 1], linestyle='--') # plot the roc curve for the model pyplot.plot(fpr, tpr, marker='.') # show the plot pyplot.show()

Running the example suggests that the model has skill.

AUC: 0.713

Indeed, it has skill, but much of that skill is measured as making correct false negative predictions and there are a lot of false negative predictions to make.

A plot of the ROC Curve confirms the AUC interpretation of a skilful model for most probability thresholds.

We can also repeat the test of the same model on the same dataset and calculate a precision-recall curve and statistics instead.

The complete example is listed below.

# precision-recall curve and auc from sklearn.datasets import make_classification from sklearn.neighbors import KNeighborsClassifier from sklearn.model_selection import train_test_split from sklearn.metrics import precision_recall_curve from sklearn.metrics import f1_score from sklearn.metrics import auc from sklearn.metrics import average_precision_score from matplotlib import pyplot # generate 2 class dataset X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.9,0.09], random_state=1) # split into train/test sets trainX, testX, trainy, testy = train_test_split(X, y, test_size=0.5, random_state=2) # fit a model model = KNeighborsClassifier(n_neighbors=3) model.fit(trainX, trainy) # predict probabilities probs = model.predict_proba(testX) # keep probabilities for the positive outcome only probs = probs[:, 1] # predict class values yhat = model.predict(testX) # calculate precision-recall curve precision, recall, thresholds = precision_recall_curve(testy, probs) # calculate F1 score f1 = f1_score(testy, yhat) # calculate precision-recall AUC auc = auc(recall, precision) # calculate average precision score ap = average_precision_score(testy, probs) print('f1=%.3f auc=%.3f ap=%.3f' % (f1, auc, ap)) # plot no skill pyplot.plot([0, 1], [0.1, 0.1], linestyle='--') # plot the roc curve for the model pyplot.plot(recall, precision, marker='.') # show the plot pyplot.show()

Running the example first prints the F1, AUC and AP scores.

The scores do not look encouraging, given skilful models are generally above 0.5.

f1=0.278 auc=0.302 ap=0.236

From the plot, we can see that after precision and recall crash fast.

This section provides more resources on the topic if you are looking to go deeper.

- A critical investigation of recall and precision as measures of retrieval system performance, 1989.
- The Relationship Between Precision-Recall and ROC Curves, 2006.
- The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets, 2015.
- ROC Graphs: Notes and Practical Considerations for Data Mining Researchers, 2003.

- sklearn.metrics.roc_curve API
- sklearn.metrics.roc_auc_score API
- sklearn.metrics.precision_recall_curve API
- sklearn.metrics.auc API
- sklearn.metrics.average_precision_score API
- Precision-Recall, scikit-learn
- Precision, recall and F-measures, scikit-learn

- Receiver operating characteristic on Wikipedia
- Sensitivity and specificity on Wikipedia
- Precision and recall on Wikipedia
- Information retrieval on Wikipedia
- F1 score on Wikipedia
- ROC and precision-recall with imbalanced datasets, blog.

In this tutorial, you discovered ROC Curves, Precision-Recall Curves, and when to use each to interpret the prediction of probabilities for binary classification problems.

Specifically, you learned:

- ROC Curves summarize the trade-off between the true positive rate and false positive rate for a predictive model using different probability thresholds.
- Precision-Recall curves summarize the trade-off between the true positive rate and the positive predictive value for a predictive model using different probability thresholds.
- ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How and When to Use ROC Curves and Precision-Recall Curves for Classification in Python appeared first on Machine Learning Mastery.

]]>The post How to Make Predictions with scikit-learn appeared first on Machine Learning Mastery.

]]>with scikit-learn models in Python.

Once you choose and fit a final machine learning model in scikit-learn, you can use it to make predictions on new data instances.

There is some confusion amongst beginners about how exactly to do this. I often see questions such as:

How do I make predictions with my model in scikit-learn?

In this tutorial, you will discover exactly how you can make classification and regression predictions with a finalized machine learning model in the scikit-learn Python library.

After completing this tutorial, you will know:

- How to finalize a model in order to make it ready for making predictions.
- How to make class and probability predictions in scikit-learn.
- How to make regression predictions in scikit-learn.

Let’s get started.

This tutorial is divided into 3 parts; they are:

- First Finalize Your Model
- How to Predict With Classification Models
- How to Predict With Regression Models

Before you can make predictions, you must train a final model.

You may have trained models using k-fold cross validation or train/test splits of your data. This was done in order to give you an estimate of the skill of the model on out-of-sample data, e.g. new data.

These models have served their purpose and can now be discarded.

You now must train a final model on all of your available data.

You can learn more about how to train a final model here:

Classification problems are those where the model learns a mapping between input features and an output feature that is a label, such as “*spam*” and “*not spam*.”

Below is sample code of a finalized LogisticRegression model for a simple binary classification problem.

Although we are using *LogisticRegression* in this tutorial, the same functions are available on practically all classification algorithms in scikit-learn.

# example of training a final classification model from sklearn.linear_model import LogisticRegression from sklearn.datasets.samples_generator import make_blobs # generate 2d classification dataset X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1) # fit final model model = LogisticRegression() model.fit(X, y)

After finalizing your model, you may want to save the model to file, e.g. via pickle. Once saved, you can load the model any time and use it to make predictions. For an example of this, see the post:

For simplicity, we will skip this step for the examples in this tutorial.

There are two types of classification predictions we may wish to make with our finalized model; they are class predictions and probability predictions.

A class prediction is: given the finalized model and one or more data instances, predict the class for the data instances.

We do not know the outcome classes for the new data. That is why we need the model in the first place.

We can predict the class for new data instances using our finalized classification model in scikit-learn using the *predict()* function.

For example, we have one or more data instances in an array called *Xnew*. This can be passed to the *predict()* function on our model in order to predict the class values for each instance in the array.

Xnew = [[...], [...]] ynew = model.predict(Xnew)

Let’s make this concrete with an example of predicting multiple data instances at once.

# example of training a final classification model from sklearn.linear_model import LogisticRegression from sklearn.datasets.samples_generator import make_blobs # generate 2d classification dataset X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1) # fit final model model = LogisticRegression() model.fit(X, y) # new instances where we do not know the answer Xnew, _ = make_blobs(n_samples=3, centers=2, n_features=2, random_state=1) # make a prediction ynew = model.predict(Xnew) # show the inputs and predicted outputs for i in range(len(Xnew)): print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))

Running the example predicts the class for the three new data instances, then prints the data and the predictions together.

X=[-0.79415228 2.10495117], Predicted=0 X=[-8.25290074 -4.71455545], Predicted=1 X=[-2.18773166 3.33352125], Predicted=0

If you had just one new data instance, you can provide this as instance wrapped in an array to the *predict()* function; for example:

# example of making a single class prediction from sklearn.linear_model import LogisticRegression from sklearn.datasets.samples_generator import make_blobs # generate 2d classification dataset X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1) # fit final model model = LogisticRegression() model.fit(X, y) # define one new instance Xnew = [[-0.79415228, 2.10495117]] # make a prediction ynew = model.predict(Xnew) print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))

Running the example prints the single instance and the predicted class.

X=[-0.79415228, 2.10495117], Predicted=0

When you prepared your data, you will have mapped the class values from your domain (such as strings) to integer values. You may have used a LabelEncoder.

This *LabelEncoder* can be used to convert the integers back into string values via the *inverse_transform()* function.

For this reason, you may want to save (pickle) the *LabelEncoder* used to encode your y values when fitting your final model.

Another type of prediction you may wish to make is the probability of the data instance belonging to each class.

This is called a probability prediction where given a new instance, the model returns the probability for each outcome class as a value between 0 and 1.

You can make these types of predictions in scikit-learn by calling the *predict_proba()* function, for example:

Xnew = [[...], [...]] ynew = model.predict_proba(Xnew)

This function is only available on those classification models capable of making a probability prediction, which is most, but not all, models.

The example below makes a probability prediction for each example in the *Xnew* array of data instance.

# example of making multiple probability predictions from sklearn.linear_model import LogisticRegression from sklearn.datasets.samples_generator import make_blobs # generate 2d classification dataset X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1) # fit final model model = LogisticRegression() model.fit(X, y) # new instances where we do not know the answer Xnew, _ = make_blobs(n_samples=3, centers=2, n_features=2, random_state=1) # make a prediction ynew = model.predict_proba(Xnew) # show the inputs and predicted probabilities for i in range(len(Xnew)): print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))

Running the instance makes the probability predictions and then prints the input data instance and the probability of each instance belonging to the first and second classes (0 and 1).

X=[-0.79415228 2.10495117], Predicted=[0.94556472 0.05443528] X=[-8.25290074 -4.71455545], Predicted=[3.60980873e-04 9.99639019e-01] X=[-2.18773166 3.33352125], Predicted=[0.98437415 0.01562585]

This can be helpful in your application if you want to present the probabilities to the user for expert interpretation.

Regression is a supervised learning problem where, given input examples, the model learns a mapping to suitable output quantities, such as “0.1” and “0.2”, etc.

Below is an example of a finalized *LinearRegression* model. Again, the functions demonstrated for making regression predictions apply to all of the regression models available in scikit-learn.

# example of training a final regression model from sklearn.linear_model import LinearRegression from sklearn.datasets import make_regression # generate regression dataset X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=1) # fit final model model = LinearRegression() model.fit(X, y)

We can predict quantities with the finalized regression model by calling the *predict()* function on the finalized model.

As with classification, the predict() function takes a list or array of one or more data instances.

The example below demonstrates how to make regression predictions on multiple data instances with an unknown expected outcome.

# example of training a final regression model from sklearn.linear_model import LinearRegression from sklearn.datasets import make_regression # generate regression dataset X, y = make_regression(n_samples=100, n_features=2, noise=0.1) # fit final model model = LinearRegression() model.fit(X, y) # new instances where we do not know the answer Xnew, _ = make_regression(n_samples=3, n_features=2, noise=0.1, random_state=1) # make a prediction ynew = model.predict(Xnew) # show the inputs and predicted outputs for i in range(len(Xnew)): print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))

Running the example makes multiple predictions, then prints the inputs and predictions side-by-side for review.

X=[-1.07296862 -0.52817175], Predicted=-61.32459258381131 X=[-0.61175641 1.62434536], Predicted=-30.922508147981667 X=[-2.3015387 0.86540763], Predicted=-127.34448527071137

The same function can be used to make a prediction for a single data instance as long as it is suitably wrapped in a surrounding list or array.

For example:

# example of training a final regression model from sklearn.linear_model import LinearRegression from sklearn.datasets import make_regression # generate regression dataset X, y = make_regression(n_samples=100, n_features=2, noise=0.1) # fit final model model = LinearRegression() model.fit(X, y) # define one new data instance Xnew = [[-1.07296862, -0.52817175]] # make a prediction ynew = model.predict(Xnew) # show the inputs and predicted outputs print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))

Running the example makes a single prediction and prints the data instance and prediction for review.

X=[-1.07296862, -0.52817175], Predicted=-77.17947088762787

This section provides more resources on the topic if you are looking to go deeper.

- How to Train a Final Machine Learning Model
- Save and Load Machine Learning Models in Python with scikit-learn
- scikit-learn API Reference

In this tutorial, you discovered how you can make classification and regression predictions with a finalized machine learning model in the scikit-learn Python library.

Specifically, you learned:

- How to finalize a model in order to make it ready for making predictions.
- How to make class and probability predictions in scikit-learn.
- How to make regression predictions in scikit-learn.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Make Predictions with scikit-learn appeared first on Machine Learning Mastery.

]]>The post How to Generate Test Datasets in Python with scikit-learn appeared first on Machine Learning Mastery.

]]>The data from test datasets have well-defined properties, such as linearly or non-linearity, that allow you to explore specific algorithm behavior. The scikit-learn Python library provides a suite of functions for generating samples from configurable test problems for regression and classification.

In this tutorial, you will discover test problems and how to use them in Python with scikit-learn.

After completing this tutorial, you will know:

- How to generate multi-class classification prediction test problems.
- How to generate binary classification prediction test problems.
- How to generate linear regression prediction test problems.

Let’s get started.

This tutorial is divided into 3 parts; they are:

- Test Datasets
- Classification Test Problems
- Regression Test Problems

A problem when developing and implementing machine learning algorithms is how do you know whether you have implemented them correctly. They seem to work even with bugs.

Test datasets are small contrived problems that allow you to test and debug your algorithms and test harness. They are also useful for better understanding the behavior of algorithms in response to changes in hyperparameters.

Below are some desirable properties of test datasets:

- They can be generated quickly and easily.
- They contain “known” or “understood” outcomes for comparison with predictions.
- They are stochastic, allowing random variations on the same problem each time they are generated.
- They are small and easily visualized in two dimensions.
- They can be scaled up trivially.

I recommend using test datasets when getting started with a new machine learning algorithm or when developing a new test harness.

scikit-learn is a Python library for machine learning that provides functions for generating a suite of test problems.

In this tutorial, we will look at some examples of generating test problems for classification and regression algorithms.

Classification is the problem of assigning labels to observations.

In this section, we will look at three classification problems: blobs, moons and circles.

The make_blobs() function can be used to generate blobs of points with a Gaussian distribution.

You can control how many blobs to generate and the number of samples to generate, as well as a host of other properties.

The problem is suitable for linear classification problems given the linearly separable nature of the blobs.

The example below generates a 2D dataset of samples with three blobs as a multi-class classification prediction problem. Each observation has two inputs and 0, 1, or 2 class values.

# generate 2d classification dataset X, y = make_blobs(n_samples=100, centers=3, n_features=2)

The complete example is listed below.

from sklearn.datasets.samples_generator import make_blobs from matplotlib import pyplot from pandas import DataFrame # generate 2d classification dataset X, y = make_blobs(n_samples=100, centers=3, n_features=2) # scatter plot, dots colored by class value df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y)) colors = {0:'red', 1:'blue', 2:'green'} fig, ax = pyplot.subplots() grouped = df.groupby('label') for key, group in grouped: group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key]) pyplot.show()

Running the example generates the inputs and outputs for the problem and then creates a handy 2D plot showing points for the different classes using different colors.

Note, your specific dataset and resulting plot will vary given the stochastic nature of the problem generator. This is a feature, not a bug.

We will use this same example structure for the following examples.

The make_moons() function is for binary classification and will generate a swirl pattern, or two moons.

You can control how noisy the moon shapes are and the number of samples to generate.

This test problem is suitable for algorithms that are capable of learning nonlinear class boundaries.

The example below generates a moon dataset with moderate noise.

# generate 2d classification dataset X, y = make_moons(n_samples=100, noise=0.1)

The complete example is listed below.

from sklearn.datasets import make_moons from matplotlib import pyplot from pandas import DataFrame # generate 2d classification dataset X, y = make_moons(n_samples=100, noise=0.1) # scatter plot, dots colored by class value df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y)) colors = {0:'red', 1:'blue'} fig, ax = pyplot.subplots() grouped = df.groupby('label') for key, group in grouped: group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key]) pyplot.show()

Running the example generates and plots the dataset for review, again coloring samples by their assigned class.

The make_circles() function generates a binary classification problem with datasets that fall into concentric circles.

Again, as with the moons test problem, you can control the amount of noise in the shapes.

This test problem is suitable for algorithms that can learn complex non-linear manifolds.

The example below generates a circles dataset with some noise.

# generate 2d classification dataset X, y = make_circles(n_samples=100, noise=0.05)

The complete example is listed below.

from sklearn.datasets import make_circles from matplotlib import pyplot from pandas import DataFrame # generate 2d classification dataset X, y = make_circles(n_samples=100, noise=0.05) # scatter plot, dots colored by class value df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y)) colors = {0:'red', 1:'blue'} fig, ax = pyplot.subplots() grouped = df.groupby('label') for key, group in grouped: group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key]) pyplot.show()

Running the example generates and plots the dataset for review.

Regression is the problem of predicting a quantity given an observation.

The make_regression() function will create a dataset with a linear relationship between inputs and the outputs.

You can configure the number of samples, number of input features, level of noise, and much more.

This dataset is suitable for algorithms that can learn a linear regression function.

The example below will generate 100 examples with one input feature and one output feature with modest noise.

# generate regression dataset X, y = make_regression(n_samples=100, n_features=1, noise=0.1)

The complete example is listed below.

from sklearn.datasets import make_regression from matplotlib import pyplot # generate regression dataset X, y = make_regression(n_samples=100, n_features=1, noise=0.1) # plot regression dataset pyplot.scatter(X,y) pyplot.show()

Running the example will generate the data and plot the X and y relationship, which, given that it is linear, is quite boring.

This section lists some ideas for extending the tutorial that you may wish to explore.

**Compare Algorithms**. Select a test problem and compare a suite of algorithms on the problem and report the performance.**Scale Up Problem**. Select a test problem and explore scaling it up, use progression methods to visualize the results, and perhaps explore model skill vs problem scale for a given algorithm.**Additional Problems**. The library provides a suite of additional test problems; write a code example for each to demonstrate how they work.

If you explore any of these extensions, I’d love to know.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered test problems and how to use them in Python with scikit-learn.

Specifically, you learned:

- How to generate multi-class classification prediction test problems.
- How to generate binary classification prediction test problems.
- How to generate linear regression prediction test problems.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

The post How to Generate Test Datasets in Python with scikit-learn appeared first on Machine Learning Mastery.

]]>The post How to Handle Missing Data with Python appeared first on Machine Learning Mastery.

]]>Data can have missing values for a number of reasons such as observations that were not recorded and data corruption.

Handling missing data is important as many machine learning algorithms do not support data with missing values.

In this tutorial, you will discover how to handle missing data for machine learning with Python.

Specifically, after completing this tutorial you will know:

- How to marking invalid or corrupt values as missing in your dataset.
- How to remove rows with missing data from your dataset.
- How to impute missing values with mean values in your dataset.

Let’s get started.

**Note**: The examples in this post assume that you have Python 2 or 3 with Pandas, NumPy and Scikit-Learn installed, specifically scikit-learn version 0.18 or higher.

**Update March/2018**: Added alternate link to download the dataset as the original appears to have been taken down.

This tutorial is divided into 6 parts:

**Pima Indians Diabetes Dataset**: where we look at a dataset that has known missing values.**Mark Missing Values**: where we learn how to mark missing values in a dataset.**Missing Values Causes Problems**: where we see how a machine learning algorithm can fail when it contains missing values.**Remove Rows With Missing Values**: where we see how to remove rows that contain missing values.**Impute Missing Values**: where we replace missing values with sensible values.**Algorithms that Support Missing Values**: where we learn about algorithms that support missing values.

First, let’s take a look at our sample dataset with missing values.

The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. The variable names are as follows:

- 0. Number of times pregnant.
- 1. Plasma glucose concentration a 2 hours in an oral glucose tolerance test.
- 2. Diastolic blood pressure (mm Hg).
- 3. Triceps skinfold thickness (mm).
- 4. 2-Hour serum insulin (mu U/ml).
- 5. Body mass index (weight in kg/(height in m)^2).
- 6. Diabetes pedigree function.
- 7. Age (years).
- 8. Class variable (0 or 1).

The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.

A sample of the first 5 rows is listed below.

6,148,72,35,0,33.6,0.627,50,1 1,85,66,29,0,26.6,0.351,31,0 8,183,64,0,0,23.3,0.672,32,1 1,89,66,23,94,28.1,0.167,21,0 0,137,40,35,168,43.1,2.288,33,1

This dataset is known to have missing values.

Specifically, there are missing observations for some columns that are marked as a zero value.

We can corroborate this by the definition of those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a zero for body mass index or blood pressure is invalid.

Download the dataset from here and save it to your current working directory with the file name *pima-indians-diabetes.csv *(update: download from here).

In this section, we will look at how we can identify and mark values as missing.

We can use plots and summary statistics to help identify missing or corrupt data.

We can load the dataset as a Pandas DataFrame and print summary statistics on each attribute.

from pandas import read_csv dataset = read_csv('pima-indians-diabetes.csv', header=None) print(dataset.describe())

Running this example produces the following output:

0 1 2 3 4 5 \ count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 6 7 8 count 768.000000 768.000000 768.000000 mean 0.471876 33.240885 0.348958 std 0.331329 11.760232 0.476951 min 0.078000 21.000000 0.000000 25% 0.243750 24.000000 0.000000 50% 0.372500 29.000000 0.000000 75% 0.626250 41.000000 1.000000 max 2.420000 81.000000 1.000000

This is useful.

We can see that there are columns that have a minimum value of zero (0). On some columns, a value of zero does not make sense and indicates an invalid or missing value.

Specifically, the following columns have an invalid zero minimum value:

- 1: Plasma glucose concentration
- 2: Diastolic blood pressure
- 3: Triceps skinfold thickness
- 4: 2-Hour serum insulin
- 5: Body mass index

Let’ confirm this my looking at the raw data, the example prints the first 20 rows of data.

from pandas import read_csv import numpy dataset = read_csv('pima-indians-diabetes.csv', header=None) # print the first 20 rows of data print(dataset.head(20))

Running the example, we can clearly see 0 values in the columns 2, 3, 4, and 5.

0 1 2 3 4 5 6 7 8 0 6 148 72 35 0 33.6 0.627 50 1 1 1 85 66 29 0 26.6 0.351 31 0 2 8 183 64 0 0 23.3 0.672 32 1 3 1 89 66 23 94 28.1 0.167 21 0 4 0 137 40 35 168 43.1 2.288 33 1 5 5 116 74 0 0 25.6 0.201 30 0 6 3 78 50 32 88 31.0 0.248 26 1 7 10 115 0 0 0 35.3 0.134 29 0 8 2 197 70 45 543 30.5 0.158 53 1 9 8 125 96 0 0 0.0 0.232 54 1 10 4 110 92 0 0 37.6 0.191 30 0 11 10 168 74 0 0 38.0 0.537 34 1 12 10 139 80 0 0 27.1 1.441 57 0 13 1 189 60 23 846 30.1 0.398 59 1 14 5 166 72 19 175 25.8 0.587 51 1 15 7 100 0 0 0 30.0 0.484 32 1 16 0 118 84 47 230 45.8 0.551 31 1 17 7 107 74 0 0 29.6 0.254 31 1 18 1 103 30 38 83 43.3 0.183 33 0 19 1 115 70 30 96 34.6 0.529 32 1

We can get a count of the number of missing values on each of these columns. We can do this my marking all of the values in the subset of the DataFrame we are interested in that have zero values as True. We can then count the number of true values in each column.

We can do this my marking all of the values in the subset of the DataFrame we are interested in that have zero values as True. We can then count the number of true values in each column.

from pandas import read_csv dataset = read_csv('pima-indians-diabetes.csv', header=None) print((dataset[[1,2,3,4,5]] == 0).sum())

Running the example prints the following output:

1 5 2 35 3 227 4 374 5 11

We can see that columns 1,2 and 5 have just a few zero values, whereas columns 3 and 4 show a lot more, nearly half of the rows.

This highlights that different “missing value” strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.

In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN.

Values with a NaN value are ignored from operations like sum, count, etc.

We can mark values as NaN easily with the Pandas DataFrame by using the replace() function on a subset of the columns we are interested in.

After we have marked the missing values, we can use the isnull() function to mark all of the NaN values in the dataset as True and get a count of the missing values for each column.

from pandas import read_csv import numpy dataset = read_csv('pima-indians-diabetes.csv', header=None) # mark zero values as missing or NaN dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN) # count the number of NaN values in each column print(dataset.isnull().sum())

Running the example prints the number of missing values in each column. We can see that the columns 1:5 have the same number of missing values as zero values identified above. This is a sign that we have marked the identified missing values correctly.

We can see that the columns 1 to 5 have the same number of missing values as zero values identified above. This is a sign that we have marked the identified missing values correctly.

0 0 1 5 2 35 3 227 4 374 5 11 6 0 7 0 8 0

This is a useful summary. I always like to look at the actual data though, to confirm that I have not fooled myself.

Below is the same example, except we print the first 20 rows of data.

from pandas import read_csv import numpy dataset = read_csv('pima-indians-diabetes.csv', header=None) # mark zero values as missing or NaN dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN) # print the first 20 rows of data print(dataset.head(20))

Running the example, we can clearly see NaN values in the columns 2, 3, 4 and 5. There are only 5 missing values in column 1, so it is not surprising we did not see an example in the first 20 rows.

It is clear from the raw data that marking the missing values had the intended effect.

0 1 2 3 4 5 6 7 8 0 6 148.0 72.0 35.0 NaN 33.6 0.627 50 1 1 1 85.0 66.0 29.0 NaN 26.6 0.351 31 0 2 8 183.0 64.0 NaN NaN 23.3 0.672 32 1 3 1 89.0 66.0 23.0 94.0 28.1 0.167 21 0 4 0 137.0 40.0 35.0 168.0 43.1 2.288 33 1 5 5 116.0 74.0 NaN NaN 25.6 0.201 30 0 6 3 78.0 50.0 32.0 88.0 31.0 0.248 26 1 7 10 115.0 NaN NaN NaN 35.3 0.134 29 0 8 2 197.0 70.0 45.0 543.0 30.5 0.158 53 1 9 8 125.0 96.0 NaN NaN NaN 0.232 54 1 10 4 110.0 92.0 NaN NaN 37.6 0.191 30 0 11 10 168.0 74.0 NaN NaN 38.0 0.537 34 1 12 10 139.0 80.0 NaN NaN 27.1 1.441 57 0 13 1 189.0 60.0 23.0 846.0 30.1 0.398 59 1 14 5 166.0 72.0 19.0 175.0 25.8 0.587 51 1 15 7 100.0 NaN NaN NaN 30.0 0.484 32 1 16 0 118.0 84.0 47.0 230.0 45.8 0.551 31 1 17 7 107.0 74.0 NaN NaN 29.6 0.254 31 1 18 1 103.0 30.0 38.0 83.0 43.3 0.183 33 0 19 1 115.0 70.0 30.0 96.0 34.6 0.529 32 1

Before we look at handling missing values, let’s first demonstrate that having missing values in a dataset can cause problems.

Having missing values in a dataset can cause errors with some machine learning algorithms.

In this section, we will try to evaluate a the Linear Discriminant Analysis (LDA) algorithm on the dataset with missing values.

This is an algorithm that does not work when there are missing values in the dataset.

The below example marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross validation and print the mean accuracy.

from pandas import read_csv import numpy from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.model_selection import KFold from sklearn.model_selection import cross_val_score dataset = read_csv('pima-indians-diabetes.csv', header=None) # mark zero values as missing or NaN dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN) # split dataset into inputs and outputs values = dataset.values X = values[:,0:8] y = values[:,8] # evaluate an LDA model on the dataset using k-fold cross validation model = LinearDiscriminantAnalysis() kfold = KFold(n_splits=3, random_state=7) result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy') print(result.mean())

Running the example results in an error, as follows:

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

This is as we expect.

We are prevented from evaluating an LDA algorithm (and other algorithms) on the dataset with missing values.

Now, we can look at methods to handle the missing values.

The simplest strategy for handling missing data is to remove records that contain a missing value.

We can do this by creating a new Pandas DataFrame with the rows containing missing values removed.

Pandas provides the dropna() function that can be used to drop either columns or rows with missing data. We can use dropna() to remove all rows with missing data, as follows:

from pandas import read_csv import numpy dataset = read_csv('pima-indians-diabetes.csv', header=None) # mark zero values as missing or NaN dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN) # drop rows with missing values dataset.dropna(inplace=True) # summarize the number of rows and columns in the dataset print(dataset.shape)

Running this example, we can see that the number of rows has been aggressively cut from 768 in the original dataset to 392 with all rows containing a NaN removed.

(392, 9)

We now have a dataset that we could use to evaluate an algorithm sensitive to missing values like LDA.

from pandas import read_csv import numpy from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.model_selection import KFold from sklearn.model_selection import cross_val_score dataset = read_csv('pima-indians-diabetes.csv', header=None) # mark zero values as missing or NaN dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN) # drop rows with missing values dataset.dropna(inplace=True) # split dataset into inputs and outputs values = dataset.values X = values[:,0:8] y = values[:,8] # evaluate an LDA model on the dataset using k-fold cross validation model = LinearDiscriminantAnalysis() kfold = KFold(n_splits=3, random_state=7) result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy') print(result.mean())

The example runs successfully and prints the accuracy of the model.

0.78582892934

Removing rows with missing values can be too limiting on some predictive modeling problems, an alternative is to impute missing values.

Imputing refers to using a model to replace missing values.

There are many options we could consider when replacing a missing value, for example:

- A constant value that has meaning within the domain, such as 0, distinct from all other values.
- A value from another randomly selected record.
- A mean, median or mode value for the column.
- A value estimated by another predictive model.

Any imputing performed on the training dataset will have to be performed on new data in the future when predictions are needed from the finalized model. This needs to be taken into consideration when choosing how to impute the missing values.

For example, if you choose to impute with mean column values, these mean column values will need to be stored to file for later use on new data that has missing values.

Pandas provides the fillna() function for replacing missing values with a specific value.

For example, we can use fillna() to replace missing values with the mean value for each column, as follows:

from pandas import read_csv import numpy dataset = read_csv('pima-indians-diabetes.csv', header=None) # mark zero values as missing or NaN dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN) # fill missing values with mean column values dataset.fillna(dataset.mean(), inplace=True) # count the number of NaN values in each column print(dataset.isnull().sum())

Running the example provides a count of the number of missing values in each column, showing zero missing values.

0 0 1 0 2 0 3 0 4 0 5 0 6 0 7 0 8 0

The scikit-learn library provides the Imputer() pre-processing class that can be used to replace missing values.

It is a flexible class that allows you to specify the value to replace (it can be something other than NaN) and the technique used to replace it (such as mean, median, or mode). The Imputer class operates directly on the NumPy array instead of the DataFrame.

The example below uses the Imputer class to replace missing values with the mean of each column then prints the number of NaN values in the transformed matrix.

from pandas import read_csv from sklearn.preprocessing import Imputer import numpy dataset = read_csv('pima-indians-diabetes.csv', header=None) # mark zero values as missing or NaN dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN) # fill missing values with mean column values values = dataset.values imputer = Imputer() transformed_values = imputer.fit_transform(values) # count the number of NaN values in each column print(numpy.isnan(transformed_values).sum())

Running the example shows that all NaN values were imputed successfully.

In either case, we can train algorithms sensitive to NaN values in the transformed dataset, such as LDA.

The example below shows the LDA algorithm trained in the Imputer transformed dataset.

from pandas import read_csv import numpy from sklearn.preprocessing import Imputer from sklearn.discriminant_analysis import LinearDiscriminantAnalysis from sklearn.model_selection import KFold from sklearn.model_selection import cross_val_score dataset = read_csv('pima-indians-diabetes.csv', header=None) # mark zero values as missing or NaN dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN) # split dataset into inputs and outputs values = dataset.values X = values[:,0:8] y = values[:,8] # fill missing values with mean column values imputer = Imputer() transformed_X = imputer.fit_transform(X) # evaluate an LDA model on the dataset using k-fold cross validation model = LinearDiscriminantAnalysis() kfold = KFold(n_splits=3, random_state=7) result = cross_val_score(model, transformed_X, y, cv=kfold, scoring='accuracy') print(result.mean())

Running the example prints the accuracy of LDA on the transformed dataset.

0.766927083333

Try replacing the missing values with other values and see if you can lift the performance of the model.

Maybe missing values have meaning in the data.

Next we will look at using algorithms that treat missing values as just another value when modeling.

Not all algorithms fail when there is missing data.

There are algorithms that can be made robust to missing data, such as k-Nearest Neighbors that can ignore a column from a distance measure when a value is missing.

There are also algorithms that can use the missing value as a unique and different value when building the predictive model, such as classification and regression trees.

Sadly, the scikit-learn implementations of decision trees and k-Nearest Neighbors are not robust to missing values. Although it is being considered.

Nevertheless, this remains as an option if you consider using another algorithm implementation (such as xgboost) or developing your own implementation.

In this tutorial, you discovered how to handle machine learning data that contains missing values.

Specifically, you learned:

- How to mark missing values in a dataset as numpy.nan.
- How to remove rows from the dataset that contain missing values.
- How to replace missing values with sensible values.

Do you have any questions about handling missing values?

Ask your questions in the comments and I will do my best to answer.

The post How to Handle Missing Data with Python appeared first on Machine Learning Mastery.

]]>The post How to Install a Python 3 Environment on Mac OS X for Machine Learning and Deep Learning appeared first on Machine Learning Mastery.

]]>Python itself must be installed first, and then there are many packages to install, and it can be confusing for beginners.

In this tutorial, you will discover how to setup a Python 3 machine learning and deep learning development environment using macports.

After completing this tutorial, you will have a working Python 3 environment to begin learning, practicing, and developing machine learning and deep learning software.

Let’s get started.

**Update Aug/2017**: Added a section on how to keep your environment up to date.

This tutorial is broken down into the following 4 steps:

- Install XCode Tools
- Install Macports
- Install SciPy Libraries
- Install Deep Learning Libraries
- Keep Your Environment Up-to-Date

XCode is the IDE for development on OS X.

Installation of XCode is required because it contains command line tools needed for Python development. In this step, you will install XCode and the XCode command line tools.

This step assumes you already have an Apple App Store account and that you have sufficient administrative privileges to install software on your workstation.

- 1. Open the “
*App Store*” application. Search for “*XCode*” and click the “*Get*” button to install.

You will be prompted to enter your App Store password.

XCode is free and is at least 4.5 GB in size and may take some time to download.

- 2. Open “
*Applications*” and then locate and start “*XCode*“.

You may be prompted with a message to install additional components before XCode can be started. Agree and install.

- 3. Install the XCode Command Line Tools, Open a terminal window and type:

xcode-select --install

A dialog will appear and install required tools.

Confirm the tools are installed by typing:

xcode-select -p

You should see output like:

/Applications/Xcode.app/Contents/Developer

- 4. Agree to the license agreement (if needed). Open a terminal window and type:

sudo xcodebuild -license

Use the “*space*” key to navigate to the bottom and agree.

You now have XCode and the XCode Command Line Tools installed.

Macports is a package management tool for installing development tools on OS X.

In this step, you will install the macports package management tool.

- 1. Visit macports.org
- 2. Click the “
*Download*” button at the top of the page to access the install page. - 3. Download the “
*macOS Package (.pkg) Installer*” for your version of OS X.

At the time of writing, the latest version of OS X is Sierra.

You should now have a package on your workstation. For example:

MacPorts-2.3.5-10.12-Sierra.pkg

- 4. Double click the package and follow through the wizard to install macports.

- 5. Update macports and confirm the system is working as expected. Open a terminal window and type:

sudo port selfupdate

This will update the port command and the list of available ports and is useful to do from time to time.

You should see a message like:

MacPorts base is already the latest version

SciPy is the collection of scientific computing Python libraries needed for machine learning development in Python.

In this step, you will install the Python 3 and SciPy environment.

- 1. Install Python version 3.5 using macports. Open a terminal and type:

sudo port install python35

To make this the default version of Python, type:

sudo port select --set python python35 sudo port select --set python3 python35

Close the terminal window and reopen it.

Confirm that Python 3.5 is now the default Python for the system by typing:

python -V

You should see the message below, or similar:

Python 3.5.3

- 2. Install the SciPy environment, including the libraries:
- NumPy
- SciPy
- Matplotlib
- Pandas
- Statsmodels
- Pip (package manager)

Open a terminal and type:

sudo port install py35-numpy py35-scipy py35-matplotlib py35-pandas py35-statsmodels py35-pip

This may take some time to download and install.

To ensure pip for Python 3 is the default for the system, type:

sudo port select --set pip pip35

- 3. Install scikit-learn using pip. Open the command line and type:

sudo pip install -U scikit-learn

- 4. Confirm the libraries were installed correctly. Open a text editor and write (copy-paste) the following script:

# scipy import scipy print('scipy: %s' % scipy.__version__) # numpy import numpy print('numpy: %s' % numpy.__version__) # matplotlib import matplotlib print('matplotlib: %s' % matplotlib.__version__) # pandas import pandas print('pandas: %s' % pandas.__version__) # statsmodels import statsmodels print('statsmodels: %s' % statsmodels.__version__) # scikit-learn import sklearn print('sklearn: %s' % sklearn.__version__)

Save the script with the filename *versions.py*.

Change directory to the location where you saved the script and type:

python versions.py

The output should look like the following (or similar):

scipy: 0.18.1 numpy: 1.12.0 matplotlib: 2.0.0 pandas: 0.19.2 statsmodels: 0.6.1 sklearn: 0.18.1

What versions did you get?

Paste the output in the comments below.

You can use these commands to update machine learning and SciPy libraries as needed.

Try a scikit-learn tutorial, such as:

In this step, we will install Python libraries used for deep learning, specifically: Theano, TensorFlow, and Keras.

- 1. Install the Theano deep learning library by typing:

sudo pip install theano

- 2. Install the TensorFlow deep learning library by typing:

sudo pip install tensorflow

- 3. To install Keras, type:

sudo pip install keras

- 4. Confirm your deep learning environment is installed and working correctly.

Create a script that prints the version numbers of each library, as we did before for the SciPy environment.

# theano import theano print('theano: %s' % theano.__version__) # tensorflow import tensorflow print('tensorflow: %s' % tensorflow.__version__) # keras import keras print('keras: %s' % keras.__version__)

Save the script to a file *deep_versions.py*.

Run the script by typing:

python deep_versions.py

You should see output like:

theano: 0.8.2 tensorflow: 0.12.1 Using TensorFlow backend. keras: 1.2.1

What versions did you get?

Paste the output in the comments below.

Try a Keras deep learning tutorial, such as:

It is important to keep your environment up to date over time.

It is also important to use the same tools to update your libraries that were used to install the, e.g. macports and pip.

This section provides commands you can use, say once per month, to ensure that your environment is up to date.

The first step is to update macports itself.

sudo port selfupdate

Next, you can update the libraries that you installed using macports.

sudo port upgrade python35 py35-numpy py35-scipy py35-matplotlib py35-pandas py35-statsmodels py35-pip

You can also update all libraries that need an update by typing:

sudo port upgrade outdated

I don’t do this myself as I don’t have control over what is being updated.

Next, we can update the libraries installed with pip.

I don’t want pip to install or update things that can be installed with macports, so I update libraries installed with pip without updating dependencies (e.g. –no-deps)

sudo pip install -U --no-deps keras tensorflow theano scikit-learn

And that’s all you need to do to keep your environment up to date.

If you get some crossover between macports and pip, (e.g. numpy installed by both tools), you can get problems.

To see what exactly is installed with pip type:

sudo pip freeze

This section provides some resources for further reading.

- MacPorts Installation
- Chapter 2. Installing MacPorts
- Chapter 3. Using MacPorts
- Installing the SciPy Stack
- Scikit-learn Installation
- Installing Theano
- Install TensorFlow Anaconda
- Keras Installation

Congratulations, you now have a working Python development environment on Mac OS X for machine learning and deep learning.

You can now learn and practice machine learning and deep learning on your workstation.

How did you do?

Let me know in the comments below.

The post How to Install a Python 3 Environment on Mac OS X for Machine Learning and Deep Learning appeared first on Machine Learning Mastery.

]]>The post How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda appeared first on Machine Learning Mastery.

]]>Python itself must be installed first and then there are many packages to install, and it can be confusing for beginners.

In this tutorial, you will discover how to set up a Python machine learning development environment using Anaconda.

After completing this tutorial, you will have a working Python environment to begin learning, practicing, and developing machine learning and deep learning software.

These instructions are suitable for Windows, Mac OS X, and Linux platforms. I will demonstrate them on OS X, so you may see some mac dialogs and file extensions.

**Update Mar/2017**: Added note that you only need one of Theano or TensorFlow to use Kears for Deep Learning.

In this tutorial, we will cover the following steps:

- Download Anaconda
- Install Anaconda
- Start and Update Anaconda
- Update scikit-learn Library
- Install Deep Learning Libraries

In this step, we will download the Anaconda Python package for your platform.

Anaconda is a free and easy-to-use environment for scientific Python.

- 1. Visit the Anaconda homepage.
- 2. Click “Anaconda” from the menu and click “Download” to go to the download page.

- 3. Choose the download suitable for your platform (Windows, OSX, or Linux):
- Choose Python 3.5
- Choose the Graphical Installer

This will download the Anaconda Python package to your workstation.

I’m on OS X, so I chose the OS X version. The file is about 426 MB.

You should have a file with a name like:

Anaconda3-4.2.0-MacOSX-x86_64.pkg

In this step, we will install the Anaconda Python software on your system.

This step assumes you have sufficient administrative privileges to install software on your system.

- 1. Double click the downloaded file.
- 2. Follow the installation wizard.

Installation is quick and painless.

There should be no tricky questions or sticking points.

The installation should take less than 10 minutes and take up a little more than 1 GB of space on your hard drive.

In this step, we will confirm that your Anaconda Python environment is up to date.

Anaconda comes with a suite of graphical tools called Anaconda Navigator. You can start Anaconda Navigator by opening it from your application launcher.

You can learn all about the Anaconda Navigator here.

You can use the Anaconda Navigator and graphical development environments later; for now, I recommend starting with the Anaconda command line environment called conda.

Conda is fast, simple, it’s hard for error messages to hide, and you can quickly confirm your environment is installed and working correctly.

- 1. Open a terminal (command line window).
- 2. Confirm conda is installed correctly, by typing:

conda -V

You should see the following (or something similar):

conda 4.2.9

- 3. Confirm Python is installed correctly by typing:

python -V

You should see the following (or something similar):

Python 3.5.2 :: Anaconda 4.2.0 (x86_64)

If the commands do not work or have an error, please check the documentation for help for your platform.

See some of the resources in the “Further Reading” section.

- 4. Confirm your conda environment is up-to-date, type:

conda update conda conda update anaconda

You may need to install some packages and confirm the updates.

- 5. Confirm your SciPy environment.

The script below will print the version number of the key SciPy libraries you require for machine learning development, specifically: SciPy, NumPy, Matplotlib, Pandas, Statsmodels, and Scikit-learn.

You can type “python” and type the commands in directly. Alternatively, I recommend opening a text editor and copy-pasting the script into your editor.

# scipy import scipy print('scipy: %s' % scipy.__version__) # numpy import numpy print('numpy: %s' % numpy.__version__) # matplotlib import matplotlib print('matplotlib: %s' % matplotlib.__version__) # pandas import pandas print('pandas: %s' % pandas.__version__) # statsmodels import statsmodels print('statsmodels: %s' % statsmodels.__version__) # scikit-learn import sklearn print('sklearn: %s' % sklearn.__version__)

Save the script as a file with the name: *versions.py*.

On the command line, change your directory to where you saved the script and type:

python versions.py

You should see output like the following:

scipy: 0.18.1 numpy: 1.11.1 matplotlib: 1.5.3 pandas: 0.18.1 statsmodels: 0.6.1 sklearn: 0.17.1

What versions did you get?

Paste the output in the comments below.

In this step, we will update the main library used for machine learning in Python called scikit-learn.

- 1. Update scikit-learn to the latest version.

At the time of writing, the version of scikit-learn shipped with Anaconda is out of date (0.17.1 instead of 0.18.1). You can update a specific library using the conda command; below is an example of updating scikit-learn to the latest version.

At the terminal, type:

conda update scikit-learn

Alternatively, you can update a library to a specific version by typing:

conda install -c anaconda scikit-learn=0.18.1

Confirm the installation was successful and scikit-learn was updated by re-running the *versions.py* script by typing:

python versions.py

You should see output like the following:

scipy: 0.18.1 numpy: 1.11.3 matplotlib: 1.5.3 pandas: 0.18.1 statsmodels: 0.6.1 sklearn: 0.18.1

What versions did you get?

Paste the output in the comments below.

You can use these commands to update machine learning and SciPy libraries as needed.

Try a scikit-learn tutorial, such as:

In this step, we will install Python libraries used for deep learning, specifically: Theano, TensorFlow, and Keras.

**NOTE**: I recommend using Keras for deep learning and Keras only requires one of Theano or TensorFlow to be installed. You do not need both! There may be problems installing TensorFlow on some Windows machines.

- 1. Install the Theano deep learning library by typing:

conda install theano

- 2. Install the TensorFlow deep learning library (all except Windows) by typing:

conda install -c conda-forge tensorflow

Alternatively, you may choose to install using pip and a specific version of tensorflow for your platform.

See the installation instructions for tensorflow.

- 3. Install Keras by typing:

pip install keras

- 4. Confirm your deep learning environment is installed and working correctly.

Create a script that prints the version numbers of each library, as we did before for the SciPy environment.

# theano import theano print('theano: %s' % theano.__version__) # tensorflow import tensorflow print('tensorflow: %s' % tensorflow.__version__) # keras import keras print('keras: %s' % keras.__version__)

Save the script to a file *deep_versions.py*. Run the script by typing:

python deep_versions.py

You should see output like:

theano: 0.8.2.dev-901275534cbfe3fbbe290ce85d1abf8bb9a5b203 tensorflow: 0.12.1 Using TensorFlow backend. keras: 1.2.1

What versions did you get?

Paste your output in the comments below.

Try a Keras deep learning tutorial, such as:

This section provides some links for further reading.

- Anaconda Documentation
- Anaconda Documentation: Installation
- Conda
- Using conda
- Anaconda Navigator
- Installing Theano
- Install TensorFlow Anaconda
- Keras Installation

Congratulations, you now have a working Python development environment for machine learning and deep learning.

You can now learn and practice machine learning and deep learning on your workstation.

How did you go?

Let me know in the comments below.

The post How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda appeared first on Machine Learning Mastery.

]]>The post How to Create a Linux Virtual Machine For Machine Learning Development With Python 3 appeared first on Machine Learning Mastery.

]]>The tools can be installed quickly and easily and you can develop and run large models directly.

In this tutorial, you will discover how to create and setup a Linux virtual machine for machine learning with Python.

After completing this tutorial, you will know:

- How to download and install VirtualBox for managing virtual machines.
- How to download and setup Fedora Linux.
- How to install a SciPy environment for machine learning in Python 3.

This tutorial is suitable if your base operating system is Windows, Mac OS X, and Linux.

Let’s get started.

There are a number of reasons that you may want to use a Linux virtual machine for Python machine learning development.

For example, below is a list of 5 top benefits for using a virtual machine:

- To use tools not available on your system (if you’re on Windows).
- To install and use machine learning tools without impacting your local environment (e.g. use Python 3 tools).
- To have highly customized environments for different projects (Python2 and Python3).
- To save the state of the machine and pick up exactly where you left off (jump from machine to machine).
- To share development environment with other developers (set-up once and reuse many times).

Perhaps the most beneficial point is the first, being able to easily use machine learning tools not supported on your environment.

I’m an OS X user, and even though machine learning tools can be installed using *brew* and *macports*, I still find it easier to setup and use Linux virtual machines for machine learning development.

This tutorial is broken down into 3 parts:

- Download and Install VirtualBox.
- Download and Install Fedora Linux in a Virtual Machine.
- Install Python Machine Learning Environment

VirtualBox is a free open source platform for creating and managing virtual machines.

Once installed, you can create all the virtual machines you like, as long as you have the ISO images or CDs to install from.

- 1. Visit VirtualBox.org
- 2. Click “
*Download VirtualBox*” to access the Downloads page.

- 3. Choose binaries for your workstation.
- 4. Install the software for your system and follow the installation instructions.

- 5. Open the VirtualBox software and confirm it works.

I chose Fedora Linux because I think it is a kinder and gentler Linux than some.

It is a leading edge for RedHat Linux intended for workstations and developers.

Let’s start off by downloading the ISO for Fedora Linux. In this case, the 64-bit version of Fedora 25.

- 1. Visit GetFedora.org.
- 2. Click “
*Workstation*” to access the Workstation page. - 3. Click “
*Download now*” to access the Downloads page. - 4. Under “Other Downloads” click “
*64-bit 1.3GB Live image*“

- 5. You should now have an ISO file with the name:
- “
*Fedora-Workstation-Live-x86_64-25-1.3.iso*“.

- “

We are now ready to create the VM in VirtualBox.

Now, let’s create the Fedora virtual machine in VirtualBox.

- 1. Open the VirtualBox software.
- 2. Click “
*New*” button. - 3. Select the Name and operating system.
- name:
*Fedora25* - type:
*Linux* - version:
*Fedora (64-bit)* - Click “
*Continue*“

- name:

- 4. Configure the Memory Size
- 2048

- 5. Configure the Hard Disk
- Create a virtual hard disk now
- Hard disk file type
- VDI (VirtualBox Disk Image)
- Storage on physical hard disk
- Dynamically allocated
- File location and size:
*10GB*

We are now ready to install Fedora from the ISO image.

Now, let’s install Fedora Linux on the new virtual machine.

- 1. Select the new virtual machine and click the “
*Start*” button. - 2. Click Folder Icon and choose the Fedora ISO file:
- “
*Fedora-Workstation-Live-x86_64-25-1.3.iso*“.

- “

- 3. Click the “
*Start*” button. - 4. Select the first option “
*Start Fedora-Live-Workstation-Live 25*” and press the*Enter*key. - 5. Hit the “
*Esc*” key to skip the check. - 6. Select “
*Live System User*“. - 7. Select “
*Install to Hard Drive*“.

- 8. Complete “
*Language Selection*” (English) - 9. Complete “
*Installation Destination*” (“*ATA VBOX HARDDISK*“).- You may need to wait one minute for the VM to create the hard disk.

- 10. Click “
*Begin Installation*“. - 11. Set root password.
- 12. Create a user for yourself.
- Note down the username and password (so that you can use it later).
- Tick the “
*Make this user administrator*” (so you can install software).

- 13. Wait for the installation to complete… (5 minutes?)
- 14. Click “
*Quit*”, click power icon in top right; select power off.

Fedora Linux has been installed; let’s finalize the installation and make it ready for use.

- 1. In VirtualBox with the Fedora25 VM selected, under “
*Storage*“, click on “*Optical Drive*“.- Select “
*Remove disk from virtual drive*” to eject the ISO image.

- Select “
- 2. Click the “
*Start*” button to start the Fedora Linux installation. - 3. Login as the user you created.

- 4. Finalize installation
- Choose language “
*English*“ - Click “
*Next*“ - Choose Keyboard “
*US*“ - Click “
*Next*“ - Configure Privacy
- Click “
*Next*“ - Connect Your Online Accounts
- Click “
*Skip*“ - Click “
*Start using Fedora*“

- Choose language “
- 5. Close the help system that starts automatically.

We now have a Fedora Linux virtual machine ready to install new software.

Fedora uses Gnome 3 as the window manager.

Gnome 3 is quite different to prior versions of Gnome; you can learn how to get around by using the built-in help system.

Let’s start off by installing the required Python libraries for machine learning development.

- 1. Open the terminal.
- Click “
*Activities*“ - Type “
*terminal*“ - Click icon or press enter

- Click “

- 2. Confirm Python3 was installed.

Type:

python3 --version

- 3. Install the Python machine learning environment. Specifically:
- NumPy
- SciPy
- Pandas
- Matplotlib
- Statsmodels
- Scikit-Learn

DNF is the software installation system, formally yum. The first time you run *dnf*, it will update the database of packages, this might take a minute.

Type:

sudo dnf install python3-numpy python3-scipy python3-scikit-learn python3-pandas python3-matplotlib python3-statsmodels

Enter your password when prompted.

Confirm the installation when prompted by pressing “*y*” and “*enter*“.

Now that the environment is installed, we can confirm it by printing the versions of each required library.

- 1. Open Gedit.
- Click “
*Activities*“ - Type “
*gedit*“ - Click icon or press enter

- Click “
- 2. Type the following script and save it as
*versions.py*in the home directory.

# scipy import scipy print('scipy: %s' % scipy.__version__) # numpy import numpy print('numpy: %s' % numpy.__version__) # matplotlib import matplotlib print('matplotlib: %s' % matplotlib.__version__) # pandas import pandas print('pandas: %s' % pandas.__version__) # scikit-learn import sklearn print('sklearn: %s' % sklearn.__version__) # statsmodels import statsmodels print('statsmodels: %s' % statsmodels.__version__)

There is no copy-paste support; you may want to open Firefox within the VM and navigate to this page and copy paste the script into your Gedit window.

- 3. Run the script in the terminal.

Type:

python3 versions.py

This section lists some tips using the VM for machine learning development.

**Copy-paste and Folder Sharing**. These features require the installation of “Guest Additions” in the Linux VM. I have not been able to get this to install correctly and therefore do not use these features. You can try if you like; let me know how you do in the comments.**Use GitHub**. I recommend storing all of your code in GitHub and checking the code in and out from the VM. It makes life a lot easier for getting code and assets in and out of the VM.**Use Sublime**. I think sublime is a great text editor on Linux for development, better than Gedit at least.**Use AWS for large jobs**. You can use the same procedure to setup Fedora Linux on Amazon Web Services for running large models in the cloud.**VM Tools**. You can save the VM at any point by closing the window. You can also take a snapshot of the VM at any point and return to the snapshot. This can be helpful if you are making large changes to the file system.**Python2**. You can easily install Python2 alongside Python 3 in Linux and use the python (rather than python3) binary or use alternatives to switch between the two.**Notebooks**. Consider running a notebook server inside the VM and opening up the firewall so that you can connect and run from your main workstation outside of the VM.

Do you have any tips to share? Let me know in the comments.

Below are some resources for further reading if you are new to the tools used in this tutorial.

- VirtualBox User Manual
- Fedora Documentation
- Fedora Wiki (tons of help on common topics)
- SciPy Homepage
- Scikit-Learn Homepage

In this tutorial, you discovered how to setup a Linux virtual machine for Python machine learning development.

Specifically, you learned:

- How to download and install VirtualBox, free, open-source software for managing virtual machines.
- How to download and setup Fedora Linux, a friendly Linux distribution for developers.
- How to install and test a Python3 environment for machine learning development.

Did you complete the tutorial?

Let me know how it went in the comments below.

The post How to Create a Linux Virtual Machine For Machine Learning Development With Python 3 appeared first on Machine Learning Mastery.

]]>