The post What Is Semi-Supervised Learning appeared first on Machine Learning Mastery.

]]>Learning problems of this type are challenging as neither supervised nor unsupervised learning algorithms are able to make effective use of the mixtures of labeled and untellable data. As such, specialized semis-supervised learning algorithms are required.

In this tutorial, you will discover a gentle introduction to the field of semi-supervised learning for machine learning.

After completing this tutorial, you will know:

- Semi-supervised learning is a type of machine learning that sits between supervised and unsupervised learning.
- Top books on semi-supervised learning designed to get you up to speed in the field.
- Additional resources on semi-supervised learning, such as review papers and APIs.

Let’s get started.

This tutorial is divided into three parts; they are:

- Semi-Supervised Learning
- Books on Semi-Supervised Learning
- Additional Resources

Semi-supervised learning is a type of machine learning.

It refers to a learning problem (and algorithms designed for the learning problem) that involves a small portion of labeled examples and a large number of unlabeled examples from which a model must learn and make predictions on new examples.

… dealing with the situation where relatively few labeled training points are available, but a large number of unlabeled points are given, it is directly relevant to a multitude of practical problems where it is relatively expensive to produce labeled data …

— Page xiii, Semi-Supervised Learning, 2006.

As such, it is a learning problem that sits between supervised learning and unsupervised learning.

Semi-supervised learning (SSL) is halfway between supervised and unsupervised learning. In addition to unlabeled data, the algorithm is provided with some super- vision information – but not necessarily for all examples. Often, this information will be the targets associated with some of the examples.

— Page 2, Semi-Supervised Learning, 2006.

We require semi-supervised learning algorithms when working with data where labeling examples is challenging or expensive.

Semi-supervised learning has tremendous practical value. In many tasks, there is a paucity of labeled data. The labels y may be difficult to obtain because they require human annotators, special devices, or expensive and slow experiments.

— Page 9, Introduction to Semi-Supervised Learning, 2009.

The sign of an effective semi-supervised learning algorithm is that it can achieve better performance than a supervised learning algorithm fit only on the labeled training examples.

Semi-supervised learning algorithms generally are able to clear this low bar expectation.

… in comparison with a supervised algorithm that uses only labeled data, can one hope to have a more accurate prediction by taking into account the unlabeled points? […] in principle the answer is ‘yes.’”

— Page 4, Semi-Supervised Learning, 2006.

Finally, semi-supervised learning may be used or may contrast inductive and transductive learning.

Generally, inductive learning refers to a learning algorithm that learns from labeled training data and generalizes to new data, such as a test dataset. Transductive learning refers to learning from labeled training data and generalizing to available unlabeled (training) data. Both types of learning tasks may be performed by a semi-supervised learning algorithm.

… there are two distinct goals. One is to predict the labels on future test data. The other goal is to predict the labels on the unlabeled instances in the training sample. We call the former inductive semi-supervised learning, and the latter transductive learning.

— Page 12, Introduction to Semi-Supervised Learning, 2009.

If you are new to the idea of transduction vs. induction, the following tutorial has more information:

Now that we are familiar with semi-supervised learning from a high-level, let’s take a look at top books on the topic.

Semi-supervised learning is a new and fast-moving field of study, and as such, there are very few books on the topic.

There are perhaps two key books on semi-supervised learning that you should consider if you are new to the topic; they are:

Let’s take a closer look at each in turn.

The book “Semi-Supervised Learning” was published in 2006 and was edited by Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien.

This book provides a large number of chapters, each written by top researchers in the field.

It is designed to take you on a tour of the field of research including intuitions, top techniques, and open problems.

The full table of contents is listed below.

- Chapter 01: Introduction to Semi-Supervised Learning
- Part I: Generative Models
- Chapter 02: A Taxonomy for Semi-Supervised Learning Methods
- Chapter 03: Semi-Supervised Text Classification Using EM
- Chapter 04: Risks of Semi-Supervised Learning
- Chapter 05: Probabilistic Semi-Supervised Clustering with Constraints

- Part II: Low-Density Separation
- Chapter 06: Transductive Support Vector Machines
- Chapter 07: Semi-Supervised Learning Using Semi-Definite Programming
- Chapter 08: Gaussian Processes and the Null-Category Noise Model
- Chapter 09: Entropy Regularization
- Chapter 10: Data-Dependent Regularization

- Part III: Graph-Based Methods
- Chapter 11: Label Propagation and Quadratic Criterion
- Chapter 12: The Geometric Basis of Semi-Supervised Learning
- Chapter 13: Discrete Regularization
- Chapter 14: Semi-Supervised Learning with Conditional Harmonic Mixing

- Part IV: Change of Representation
- Chapter 15: Graph Kernels by Spectral Transforms
- Chapter 16: Spectral Methods for Dimensionality Reduction
- Chapter 17: Modifying Distances

- Part V: Semi-Supervised Learning in Practice
- Chapter 18: Large-Scale Algorithms
- Chapter 19: Semi-Supervised Protein Classification Using Cluster Kernels
- Chapter 20: Prediction of Protein Function from Networks
- Chapter 21: Analysis of Benchmarks

- Part VI: Perspectives
- Chapter 22: An Augmented PAC Model for Semi-Supervised Learning
- Chapter 23: Metric-Based Approaches for Semi-Supervised Regression and Classification
- Chapter 24: Transductive Inference and Semi-Supervised Learning
- Chapter 25: A Discussion of Semi-Supervised Learning and Transduction

I highly recommend this book and reading it cover to cover if you are starting out in this field.

The book “Introduction to Semi-Supervised Learning” was published in 2009 and was written by Xiaojin Zhu and Andrew Goldberg.

This book is aimed at students, researchers, and engineers just getting started in the field.

The book is a beginner’s guide to semi-supervised learning. It is aimed at advanced under-graduates, entry-level graduate students and researchers in areas as diverse as Computer Science, Electrical Engineering, Statistics, and Psychology.

— Page xiii, Introduction to Semi-Supervised Learning, 2009.

It’s a shorter read than the above book and a great introduction.

The full table of contents is listed below.

- Chapter 01: Introduction to Statistical Machine Learning
- Chapter 02: Overview of Semi-Supervised Learning
- Chapter 03: Mixture Models and EM
- Chapter 04: Co-Training
- Chapter 05: Graph-Based Semi-Supervised Learning
- Chapter 06: Semi-Supervised Support Vector Machines
- Chapter 07: Human Semi-Supervised Learning
- Chapter 08: Theory and Outlook

I also recommend this book if you’re just starting out for a quick review of the key elements of the field.

There are some additional books on semi-supervised learning that you might also like to consider; they are:

- Semi-Supervised Learning: Background, Applications and Future Directions, 2018.
- Graph-Based Semi-Supervised Learning, 2014.

**Have you read any of the above books?**

What did you think?

**Did I miss your favorite book?**

Let me know in the comments below.

There are additional resources that may be helpful when getting started in the field of semi-supervised learning.

I would recommend reading some review papers.

Some examples of good review papers on semi-supervised learning include:

- Semi-Supervised Learning Literature Survey, 2005.
- Introduction to Semi-Supervised Learning, 2009.
- An Overview of Deep Semi-Supervised Learning, 2020.

In this paper, we provide a comprehensive overview of deep semi-supervised learning, starting with an introduction to the field, followed by a summarization of the dominant semi-supervised approaches in deep learning.

— An Overview of Deep Semi-Supervised Learning, 2020.

It is also a good idea to try out some of the algorithms.

The scikit-learn Python machine learning library provides a few graph-based semi-supervised learning algorithms that you can try:

The Wikipedia article may also provide some useful links for further reading:

In this tutorial, you discovered a gentle introduction to the field of semi-supervised learning for machine learning.

Specifically, you learned:

- Semi-supervised learning is a type of machine learning that sits between supervised and unsupervised learning.
- Top books on semi-supervised learning designed to get you up to speed in the field.
- Additional resources on semi-supervised learning, such as review papers and APIs.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post What Is Semi-Supervised Learning appeared first on Machine Learning Mastery.

]]>The post Sensitivity Analysis of Dataset Size vs. Model Performance appeared first on Machine Learning Mastery.

]]>This depends on the specific datasets and on the choice of model, although it often means that using more data can result in better performance and that discoveries made using smaller datasets to estimate model performance often scale to using larger datasets.

The problem is the relationship is unknown for a given dataset and model, and may not exist for some datasets and models. Additionally, if such a relationship does exist, there may be a point or points of diminishing returns where adding more data may not improve model performance or where datasets are too small to effectively capture the capability of a model at a larger scale.

These issues can be addressed by performing a **sensitivity analysis** to quantify the relationship between dataset size and model performance. Once calculated, we can interpret the results of the analysis and make decisions about how much data is enough, and how small a dataset may be to effectively estimate performance on larger datasets.

In this tutorial, you will discover how to perform a sensitivity analysis of dataset size vs. model performance.

After completing this tutorial, you will know:

- Selecting a dataset size for machine learning is a challenging open problem.
- Sensitivity analysis provides an approach to quantifying the relationship between model performance and dataset size for a given model and prediction problem.
- How to perform a sensitivity analysis of dataset size and interpret the results.

Let’s get started.

This tutorial is divided into three parts; they are:

- Dataset Size Sensitivity Analysis
- Synthetic Prediction Task and Baseline Model
- Sensitivity Analysis of Dataset Size

The amount of training data required for a machine learning predictive model is an open question.

It depends on your choice of model, on the way you prepare the data, and on the specifics of the data itself.

For more on the challenge of selecting a training dataset size, see the tutorial:

One way to approach this problem is to perform a sensitivity analysis and discover how the performance of your model on your dataset varies with more or less data.

This might involve evaluating the same model with different sized datasets and looking for a relationship between dataset size and performance or a point of diminishing returns.

Typically, there is a strong relationship between training dataset size and model performance, especially for nonlinear models. The relationship often involves an improvement in performance to a point and a general reduction in the expected variance of the model as the dataset size is increased.

Knowing this relationship for your model and dataset can be helpful for a number of reasons, such as:

- Evaluate more models.
- Find a better model.
- Decide to gather more data.

You can evaluate a large number of models and model configurations quickly on a smaller sample of the dataset with confidence that the performance will likely generalize in a specific way to a larger training dataset.

This may allow evaluating many more models and configurations than you may otherwise be able to given the time available, and in turn, perhaps discover a better overall performing model.

You may also be able to generalize and estimate the expected performance of model performance to much larger datasets and estimate whether it is worth the effort or expense of gathering more training data.

Now that we are familiar with the idea of performing a sensitivity analysis of model performance to dataset size, let’s look at a worked example.

Before we dive into a sensitivity analysis, let’s select a dataset and baseline model for the investigation.

We will use a synthetic binary (two-class) classification dataset in this tutorial. This is ideal as it allows us to scale the number of generated samples for the same problem as needed.

The make_classification() scikit-learn function can be used to create a synthetic classification dataset. In this case, we will use 20 input features (columns) and generate 1,000 samples (rows). The seed for the pseudo-random number generator is fixed to ensure the same base “problem” is used each time samples are generated.

The example below generates the synthetic classification dataset and summarizes the shape of the generated data.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) # summarize the dataset print(X.shape, y.shape)

Running the example generates the data and reports the size of the input and output components, confirming the expected shape.

(1000, 20) (1000,)

Next, we can evaluate a predictive model on this dataset.

We will use a decision tree (DecisionTreeClassifier) as the predictive model. It was chosen because it is a nonlinear algorithm and has a high variance, which means that we would expect performance to improve with increases in the size of the training dataset.

We will use a best practice of repeated stratified k-fold cross-validation to evaluate the model on the dataset, with 3 repeats and 10 folds.

The complete example of evaluating the decision tree model on the synthetic classification dataset is listed below.

# evaluate a decision tree model on the synthetic classification dataset from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.tree import DecisionTreeClassifier # load dataset X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1) # define model evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # define model model = DecisionTreeClassifier() # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report performance print('Mean Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))

Running the example creates the dataset then estimates the performance of the model on the problem using the chosen test harness.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the mean classification accuracy is about 82.7%.

Mean Accuracy: 0.827 (0.042)

Next, let’s look at how we might perform a sensitivity analysis of dataset size on model performance.

The previous section showed how to evaluate a chosen model on the available dataset.

It raises questions, such as:

Will the model perform better on more data?

More generally, we may have sophisticated questions such as:

Does the estimated performance hold on smaller or larger samples from the problem domain?

These are hard questions to answer, but we can approach them by using a sensitivity analysis. Specifically, we can use a sensitivity analysis to learn:

How sensitive is model performance to dataset size?

Or more generally:

What is the relationship of dataset size to model performance?

There are many ways to perform a sensitivity analysis, but perhaps the simplest approach is to define a test harness to evaluate model performance and then evaluate the same model on the same problem with differently sized datasets.

This will allow the train and test portions of the dataset to increase with the size of the overall dataset.

To make the code easier to read, we will split it up into functions.

First, we can define a function that will prepare (or load) the dataset of a given size. The number of rows in the dataset is specified by an argument to the function.

If you are using this code as a template, this function can be changed to load your dataset from file and select a random sample of a given size.

# load dataset def load_dataset(n_samples): # define the dataset X, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1) return X, y

Next, we need a function to evaluate a model on a loaded dataset.

We will define a function that takes a dataset and returns a summary of the performance of the model evaluated using the test harness on the dataset.

This function is listed below, taking the input and output elements of a dataset and returning the mean and standard deviation of the decision tree model on the dataset.

# evaluate a model def evaluate_model(X, y): # define model evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # define model model = DecisionTreeClassifier() # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # return summary stats return [scores.mean(), scores.std()]

Next, we can define a range of different dataset sizes to evaluate.

The sizes should be chosen proportional to the amount of data you have available and the amount of running time you are willing to expend.

In this case, we will keep the sizes modest to limit running time, from 50 to one million rows on a rough log10 scale.

... # define number of samples to consider sizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000]

Next, we can enumerate each dataset size, create the dataset, evaluate a model on the dataset, and store the results for later analysis.

... # evaluate each number of samples means, stds = list(), list() for n_samples in sizes: # get a dataset X, y = load_dataset(n_samples) # evaluate a model on this dataset size mean, std = evaluate_model(X, y) # store means.append(mean) stds.append(std)

Next, we can summarize the relationship between the dataset size and model performance.

In this case, we will simply plot the result with error bars so we can spot any trends visually.

We will use the standard deviation as a measure of uncertainty on the estimated model performance. This can be achieved by multiplying the value by 2 to cover approximately 95% of the expected performance if the performance follows a normal distribution.

This can be shown on the plot as an error bar around the mean expected performance for a dataset size.

... # define error bar as 2 standard deviations from the mean or 95% err = [min(1, s * 2) for s in stds] # plot dataset size vs mean performance with error bars pyplot.errorbar(sizes, means, yerr=err, fmt='-o')

To make the plot more readable, we can change the scale of the x-axis to log, given that our dataset sizes are on a rough log10 scale.

... # change the scale of the x-axis to log ax = pyplot.gca() ax.set_xscale("log", nonpositive='clip') # show the plot pyplot.show()

And that’s it.

We would generally expect mean model performance to increase with dataset size. We would also expect the uncertainty in model performance to decrease with dataset size.

Tying this all together, the complete example of performing a sensitivity analysis of dataset size on model performance is listed below.

# sensitivity analysis of model performance to dataset size from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.tree import DecisionTreeClassifier from matplotlib import pyplot # load dataset def load_dataset(n_samples): # define the dataset X, y = make_classification(n_samples=int(n_samples), n_features=20, n_informative=15, n_redundant=5, random_state=1) return X, y # evaluate a model def evaluate_model(X, y): # define model evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # define model model = DecisionTreeClassifier() # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # return summary stats return [scores.mean(), scores.std()] # define number of samples to consider sizes = [50, 100, 500, 1000, 5000, 10000, 50000, 100000, 500000, 1000000] # evaluate each number of samples means, stds = list(), list() for n_samples in sizes: # get a dataset X, y = load_dataset(n_samples) # evaluate a model on this dataset size mean, std = evaluate_model(X, y) # store means.append(mean) stds.append(std) # summarize performance print('>%d: %.3f (%.3f)' % (n_samples, mean, std)) # define error bar as 2 standard deviations from the mean or 95% err = [min(1, s * 2) for s in stds] # plot dataset size vs mean performance with error bars pyplot.errorbar(sizes, means, yerr=err, fmt='-o') # change the scale of the x-axis to log ax = pyplot.gca() ax.set_xscale("log", nonpositive='clip') # show the plot pyplot.show()

Running the example reports the status along the way of dataset size vs. estimated model performance.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see the expected trend of increasing mean model performance with dataset size and decreasing model variance measured using the standard deviation of classification accuracy.

We can see that there is perhaps a point of diminishing returns in estimating model performance at perhaps 10,000 or 50,000 rows.

Specifically, we do see an improvement in performance with more rows, but we can probably capture this relationship with little variance with 10K or 50K rows of data.

We can also see a drop-off in estimated performance with 1,000,000 rows of data, suggesting that we are probably maxing out the capability of the model above 100,000 rows and are instead measuring statistical noise in the estimate.

This might mean an upper bound on expected performance and likely that more data beyond this point will not improve the specific model and configuration on the chosen test harness.

>50: 0.673 (0.141) >100: 0.703 (0.135) >500: 0.809 (0.055) >1000: 0.826 (0.044) >5000: 0.835 (0.016) >10000: 0.866 (0.011) >50000: 0.900 (0.005) >100000: 0.912 (0.003) >500000: 0.938 (0.001) >1000000: 0.936 (0.001)

The plot makes the relationship between dataset size and estimated model performance much clearer.

The relationship is nearly linear with a log dataset size. The change in the uncertainty shown as the error bar also dramatically decreases on the plot from very large values with 50 or 100 samples, to modest values with 5,000 and 10,000 samples and practically gone beyond these sizes.

Given the modest spread with 5,000 and 10,000 samples and the practically log-linear relationship, we could probably get away with using 5K or 10K rows to approximate model performance.

We could use these findings as the basis for testing additional model configurations and even different model types.

The danger is that different models may perform very differently with more or less data and it may be wise to repeat the sensitivity analysis with a different chosen model to confirm the relationship holds. Alternately, it may be interesting to repeat the analysis with a suite of different model types.

This section provides more resources on the topic if you are looking to go deeper.

- Sensitivity Analysis of History Size to Forecast Skill with ARIMA in Python
- How Much Training Data is Required for Machine Learning?

In this tutorial, you discovered how to perform a sensitivity analysis of dataset size vs. model performance.

Specifically, you learned:

- Selecting a dataset size for machine learning is a challenging open problem.
- Sensitivity analysis provides an approach to quantifying the relationship between model performance and dataset size for a given model and prediction problem.
- How to perform a sensitivity analysis of dataset size and interpret the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Sensitivity Analysis of Dataset Size vs. Model Performance appeared first on Machine Learning Mastery.

]]>The post Regression Metrics for Machine Learning appeared first on Machine Learning Mastery.

]]>It is different from classification that involves predicting a class label. Unlike classification, you cannot use classification accuracy to evaluate the predictions made by a regression model.

Instead, you must use error metrics specifically designed for evaluating predictions made on regression problems.

In this tutorial, you will discover how to calculate **error metrics for regression** predictive modeling projects.

After completing this tutorial, you will know:

- Regression predictive modeling are those problems that involve predicting a numeric value.
- Metrics for regression involve calculating an error score to summarize the predictive skill of a model.
- How to calculate and report mean squared error, root mean squared error, and mean absolute error.

Let’s get started.

This tutorial is divided into three parts; they are:

- Regression Predictive Modeling
- Evaluating Regression Models
- Metrics for Regression
- Mean Squared Error
- Root Mean Squared Error
- Mean Absolute Error

Predictive modeling is the problem of developing a model using historical data to make a prediction on new data where we do not have the answer.

Predictive modeling can be described as the mathematical problem of approximating a mapping function (f) from input variables (X) to output variables (y). This is called the problem of function approximation.

The job of the modeling algorithm is to find the best mapping function we can given the time and resources available.

For more on approximating functions in applied machine learning, see the post:

Regression predictive modeling is the task of approximating a mapping function (*f*) from input variables (*X*) to a continuous output variable (*y*).

Regression is different from classification, which involves predicting a category or class label.

For more on the difference between classification and regression, see the tutorial:

A continuous output variable is a real-value, such as an integer or floating point value. These are often quantities, such as amounts and sizes.

For example, a house may be predicted to sell for a specific dollar value, perhaps in the range of $100,000 to $200,000.

- A regression problem requires the prediction of a quantity.
- A regression can have real-valued or discrete input variables.
- A problem with multiple input variables is often called a multivariate regression problem.
- A regression problem where input variables are ordered by time is called a time series forecasting problem.

Now that we are familiar with regression predictive modeling, let’s look at how we might evaluate a regression model.

A common question by beginners to regression predictive modeling projects is:

How do I calculate accuracy for my regression model?

Accuracy (e.g. classification accuracy) is a measure for classification, not regression.

**We cannot calculate accuracy for a regression model**.

The skill or performance of a regression model must be reported as an error in those predictions.

This makes sense if you think about it. If you are predicting a numeric value like a height or a dollar amount, you don’t want to know if the model predicted the value exactly (this might be intractably difficult in practice); instead, we want to know how close the predictions were to the expected values.

Error addresses exactly this and summarizes on average how close predictions were to their expected values.

There are three error metrics that are commonly used for evaluating and reporting the performance of a regression model; they are:

- Mean Squared Error (MSE).
- Root Mean Squared Error (RMSE).
- Mean Absolute Error (MAE)

There are many other metrics for regression, although these are the most commonly used. You can see the full list of regression metrics supported by the scikit-learn Python machine learning library here:

In the next section, let’s take a closer look at each in turn.

In this section, we will take a closer look at the popular metrics for regression models and how to calculate them for your predictive modeling project.

Mean Squared Error, or MSE for short, is a popular error metric for regression problems.

It is also an important loss function for algorithms fit or optimized using the least squares framing of a regression problem. Here “*least squares*” refers to minimizing the mean squared error between predictions and expected values.

The MSE is calculated as the mean or average of the squared differences between predicted and expected target values in a dataset.

- MSE = 1 / N * sum for i to N (y_i – yhat_i)^2

Where *y_i* is the i’th expected value in the dataset and *yhat_i* is the i’th predicted value. The difference between these two values is squared, which has the effect of removing the sign, resulting in a positive error value.

The squaring also has the effect of inflating or magnifying large errors. That is, the larger the difference between the predicted and expected values, the larger the resulting squared positive error. This has the effect of “*punishing*” models more for larger errors when MSE is used as a loss function. It also has the effect of “*punishing*” models by inflating the average error score when used as a metric.

We can create a plot to get a feeling for how the change in prediction error impacts the squared error.

The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1 increments. The squared error between each prediction and expected value is calculated and plotted to show the quadratic increase in squared error.

... # calculate error err = (expected[i] - predicted[i])**2

The complete example is listed below.

# example of increase in mean squared error from matplotlib import pyplot from sklearn.metrics import mean_squared_error # real value expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] # predicted value predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0] # calculate errors errors = list() for i in range(len(expected)): # calculate error err = (expected[i] - predicted[i])**2 # store error errors.append(err) # report error print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err)) # plot errors pyplot.plot(errors) pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted) pyplot.xlabel('Predicted Value') pyplot.ylabel('Mean Squared Error') pyplot.show()

Running the example first reports the expected value, predicted value, and squared error for each case.

We can see that the error rises quickly, faster than linear (a straight line).

>1.0, 1.0 = 0.000 >1.0, 0.9 = 0.010 >1.0, 0.8 = 0.040 >1.0, 0.7 = 0.090 >1.0, 0.6 = 0.160 >1.0, 0.5 = 0.250 >1.0, 0.4 = 0.360 >1.0, 0.3 = 0.490 >1.0, 0.2 = 0.640 >1.0, 0.1 = 0.810 >1.0, 0.0 = 1.000

A line plot is created showing the curved or super-linear increase in the squared error value as the difference between the expected and predicted value is increased.

The curve is not a straight line as we might naively assume for an error metric.

The individual error terms are averaged so that we can report the performance of a model with regard to how much error the model makes generally when making predictions, rather than specifically for a given example.

The units of the MSE are squared units.

For example, if your target value represents “*dollars*,” then the MSE will be “*squared dollars*.” This can be confusing for stakeholders; therefore, when reporting results, often the root mean squared error is used instead (*discussed in the next section*).

The mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from the scikit-learn library.

The function takes a one-dimensional array or list of expected values and predicted values and returns the mean squared error value.

... # calculate errors errors = mean_squared_error(expected, predicted)

The example below gives an example of calculating the mean squared error between a list of contrived expected and predicted values.

# example of calculate the mean squared error from sklearn.metrics import mean_squared_error # real value expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] # predicted value predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0] # calculate errors errors = mean_squared_error(expected, predicted) # report error print(errors)

Running the example calculates and prints the mean squared error.

0.35000000000000003

A perfect mean squared error value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good MSE is relative to your specific dataset.

It is a good idea to first establish a baseline MSE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves an MSE better than the MSE for the naive model has skill.

The Root Mean Squared Error, or RMSE, is an extension of the mean squared error.

Importantly, the square root of the error is calculated, which means that the units of the RMSE are the same as the original units of the target value that is being predicted.

For example, if your target variable has the units “*dollars*,” then the RMSE error score will also have the unit “*dollars*” and not “*squared dollars*” like the MSE.

As such, it may be common to use MSE loss to train a regression predictive model, and to use RMSE to evaluate and report its performance.

The RMSE can be calculated as follows:

- RMSE = sqrt(1 / N * sum for i to N (y_i – yhat_i)^2)

Where *y_i* is the i’th expected value in the dataset, *yhat_i* is the i’th predicted value, and *sqrt()* is the square root function.

We can restate the RMSE in terms of the MSE as:

- RMSE = sqrt(MSE)

Note that the RMSE cannot be calculated as the average of the square root of the mean squared error values. This is a common error made by beginners and is an example of Jensen’s inequality.

You may recall that the square root is the inverse of the square operation. MSE uses the square operation to remove the sign of each error value and to punish large errors. The square root reverses this operation, although it ensures that the result remains positive.

The root mean squared error between your expected and predicted values can be calculated using the mean_squared_error() function from the scikit-learn library.

By default, the function calculates the MSE, but we can configure it to calculate the square root of the MSE by setting the “*squared*” argument to *False*.

The function takes a one-dimensional array or list of expected values and predicted values and returns the mean squared error value.

... # calculate errors errors = mean_squared_error(expected, predicted, squared=False)

The example below gives an example of calculating the root mean squared error between a list of contrived expected and predicted values.

# example of calculate the root mean squared error from sklearn.metrics import mean_squared_error # real value expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] # predicted value predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0] # calculate errors errors = mean_squared_error(expected, predicted, squared=False) # report error print(errors)

Running the example calculates and prints the root mean squared error.

0.5916079783099616

A perfect RMSE value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good RMSE is relative to your specific dataset.

It is a good idea to first establish a baseline RMSE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves an RMSE better than the RMSE for the naive model has skill.

Mean Absolute Error, or MAE, is a popular metric because, like RMSE, the units of the error score match the units of the target value that is being predicted.

Unlike the RMSE, the changes in MAE are linear and therefore intuitive.

That is, MSE and RMSE punish larger errors more than smaller errors, inflating or magnifying the mean error score. This is due to the square of the error value. The MAE does not give more or less weight to different types of errors and instead the scores increase linearly with increases in error.

As its name suggests, the MAE score is calculated as the average of the absolute error values. Absolute or *abs()* is a mathematical function that simply makes a number positive. Therefore, the difference between an expected and predicted value may be positive or negative and is forced to be positive when calculating the MAE.

The MAE can be calculated as follows:

- MAE = 1 / N * sum for i to N abs(y_i – yhat_i)

Where *y_i* is the i’th expected value in the dataset, *yhat_i* is the i’th predicted value and *abs()* is the absolute function.

We can create a plot to get a feeling for how the change in prediction error impacts the MAE.

The example below gives a small contrived dataset of all 1.0 values and predictions that range from perfect (1.0) to wrong (0.0) by 0.1 increments. The absolute error between each prediction and expected value is calculated and plotted to show the linear increase in error.

... # calculate error err = abs((expected[i] - predicted[i]))

The complete example is listed below.

# plot of the increase of mean absolute error with prediction error from matplotlib import pyplot from sklearn.metrics import mean_squared_error # real value expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] # predicted value predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0] # calculate errors errors = list() for i in range(len(expected)): # calculate error err = abs((expected[i] - predicted[i])) # store error errors.append(err) # report error print('>%.1f, %.1f = %.3f' % (expected[i], predicted[i], err)) # plot errors pyplot.plot(errors) pyplot.xticks(ticks=[i for i in range(len(errors))], labels=predicted) pyplot.xlabel('Predicted Value') pyplot.ylabel('Mean Absolute Error') pyplot.show()

Running the example first reports the expected value, predicted value, and absolute error for each case.

We can see that the error rises linearly, which is intuitive and easy to understand.

>1.0, 1.0 = 0.000 >1.0, 0.9 = 0.100 >1.0, 0.8 = 0.200 >1.0, 0.7 = 0.300 >1.0, 0.6 = 0.400 >1.0, 0.5 = 0.500 >1.0, 0.4 = 0.600 >1.0, 0.3 = 0.700 >1.0, 0.2 = 0.800 >1.0, 0.1 = 0.900 >1.0, 0.0 = 1.000

A line plot is created showing the straight line or linear increase in the absolute error value as the difference between the expected and predicted value is increased.

The mean absolute error between your expected and predicted values can be calculated using the mean_absolute_error() function from the scikit-learn library.

The function takes a one-dimensional array or list of expected values and predicted values and returns the mean absolute error value.

... # calculate errors errors = mean_absolute_error(expected, predicted)

The example below gives an example of calculating the mean absolute error between a list of contrived expected and predicted values.

# example of calculate the mean absolute error from sklearn.metrics import mean_absolute_error # real value expected = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] # predicted value predicted = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0] # calculate errors errors = mean_absolute_error(expected, predicted) # report error print(errors)

Running the example calculates and prints the mean absolute error.

0.5

A perfect mean absolute error value is 0.0, which means that all predictions matched the expected values exactly.

This is almost never the case, and if it happens, it suggests your predictive modeling problem is trivial.

A good MAE is relative to your specific dataset.

It is a good idea to first establish a baseline MAE for your dataset using a naive predictive model, such as predicting the mean target value from the training dataset. A model that achieves a MAE better than the MAE for the naive model has skill.

This section provides more resources on the topic if you are looking to go deeper.

- How Machine Learning Algorithms Work
- Difference Between Classification and Regression in Machine Learning

- Scikit-Learn API: Regression Metrics.
- Scikit-Learn User Guide Section 3.3.4. Regression metrics.
- sklearn.metrics.mean_squared_error API.
- mean_absolute_error API.

- Mean squared error, Wikipedia.
- Root-mean-square deviation, Wikipedia.
- Mean absolute error, Wikipedia.
- Coefficient of determination, Wikipedia.

In this tutorial, you discovered how to calculate error for regression predictive modeling projects.

Specifically, you learned:

- Regression predictive modeling are those problems that involve predicting a numeric value.
- Metrics for regression involve calculating an error score to summarize the predictive skill of a model.
- How to calculate and report mean squared error, root mean squared error, and mean absolute error.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Regression Metrics for Machine Learning appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to Machine Learning Modeling Pipelines appeared first on Machine Learning Mastery.

]]>Effective use of the model will require appropriate preparation of the input data and hyperparameter tuning of the model.

Collectively, the linear sequence of steps required to prepare the data, tune the model, and transform the predictions is called the **modeling pipeline**. Modern machine learning libraries like the scikit-learn Python library allow this sequence of steps to be defined and used correctly (without data leakage) and consistently (during evaluation and prediction).

Nevertheless, working with modeling pipelines can be confusing to beginners as it requires a shift in perspective of the applied machine learning process.

In this tutorial, you will discover modeling pipelines for applied machine learning.

After completing this tutorial, you will know:

- Applied machine learning is concerned with more than finding a good performing model; it also requires finding an appropriate sequence of data preparation steps and steps for the post-processing of predictions.
- Collectively, the operations required to address a predictive modeling problem can be considered an atomic unit called a modeling pipeline.
- Approaching applied machine learning through the lens of modeling pipelines requires a change in thinking from evaluating specific model configurations to sequences of transforms and algorithms.

Let’s get started.

This tutorial is divided into three parts; they are:

- Finding a Skillful Model Is Not Enough
- What Is a Modeling Pipeline?
- Implications of a Modeling Pipeline

Applied machine learning is the process of discovering the model that performs best for a given predictive modeling dataset.

In fact, it’s more than this.

In addition to discovering which model performs the best on your dataset, you must also discover:

**Data transforms**that best expose the unknown underlying structure of the problem to the learning algorithms.**Model hyperparameters**that result in a good or best configuration of a chosen model.

There may also be additional considerations such as techniques that transform the predictions made by the model, like threshold moving or model calibration for predicted probabilities.

As such, it is common to think of applied machine learning as a large combinatorial search problem across data transforms, models, and model configurations.

This can be quite challenging in practice as it requires that the sequence of one or more data preparation schemes, the model, the model configuration, and any prediction transform schemes must be evaluated consistently and correctly on a given test harness.

Although tricky, it may be manageable with a simple train-test split but becomes quite unmanageable when using k-fold cross-validation or even repeated k-fold cross-validation.

The solution is to use a modeling pipeline to keep everything straight.

A pipeline is a linear sequence of data preparation options, modeling operations, and prediction transform operations.

It allows the sequence of steps to be specified, evaluated, and used as an atomic unit.

**Pipeline**: A linear sequence of data preparation and modeling steps that can be treated as an atomic unit.

To make the idea clear, let’s look at two simple examples:

The first example uses data normalization for the input variables and fits a logistic regression model:

- [Input], [Normalization], [Logistic Regression], [Predictions]

The second example standardizes the input variables, applies RFE feature selection, and fits a support vector machine.

- [Input], [Standardization], [RFE], [SVM], [Predictions]

You can imagine other examples of modeling pipelines.

As an atomic unit, the pipeline can be evaluated using a preferred resampling scheme such as a train-test split or k-fold cross-validation.

This is important for two main reasons:

- Avoid data leakage.
- Consistency and reproducibility.

A modeling pipeline avoids the most common type of data leakage where data preparation techniques, such as scaling input values, are applied to the entire dataset. This is data leakage because it shares knowledge of the test dataset (such as observations that contribute to a mean or maximum known value) with the training dataset, and in turn, may result in overly optimistic model performance.

Instead, data transforms must be prepared on the training dataset only, then applied to the training dataset, test dataset, validation dataset, and any other datasets that require the transform prior to being used with the model.

A modeling pipeline ensures that the sequence of data preparation operations performed is reproducible.

Without a modeling pipeline, the data preparation steps may be performed manually twice: once for evaluating the model and once for making predictions. Any changes to the sequence must be kept consistent in both cases, otherwise differences will impact the capability and skill of the model.

A pipeline ensures that the sequence of operations is defined once and is consistent when used for model evaluation or making predictions.

The Python scikit-learn machine learning library provides a machine learning modeling pipeline via the Pipeline class.

You can learn more about how to use this Pipeline API in this tutorial:

The modeling pipeline is an important tool for machine learning practitioners.

Nevertheless, there are important implications that must be considered when using them.

The main confusion for beginners when using pipelines comes in understanding what the pipeline has learned or the specific configuration discovered by the pipeline.

For example, a pipeline may use a data transform that configures itself automatically, such as the RFECV technique for feature selection.

- When evaluating a pipeline that uses an automatically-configured data transform, what configuration does it choose? or When fitting this pipeline as a final model for making predictions, what configuration did it choose?

**The answer is, it doesn’t matter**.

Another example is the use of hyperparameter tuning as the final step of the pipeline.

The grid search will be performed on the data provided by any prior transform steps in the pipeline and will then search for the best combination of hyperparameters for the model using that data, then fit a model with those hyperparameters on the data.

- When evaluating a pipeline that grid searches model hyperparameters, what configuration does it choose? or When fitting this pipeline as a final model for making predictions, what configuration did it choose?

**The answer again is, it doesn’t matter**.

The answer applies when using a threshold moving or probability calibration step at the end of the pipeline.

The reason is the same reason that we are not concerned about the specific internal structure or coefficients of the chosen model.

For example, when evaluating a logistic regression model, we don’t need to inspect the coefficients chosen on each k-fold cross-validation round in order to choose the model. Instead, we focus on its out-of-fold predictive skill

Similarly, when using a logistic regression model as the final model for making predictions on new data, we do not need to inspect the coefficients chosen when fitting the model on the entire dataset before making predictions.

We can inspect and discover the coefficients used by the model as an exercise in analysis, but it does not impact the selection and use of the model.

This same answer generalizes when considering a modeling pipeline.

We are not concerned about which features may have been automatically selected by a data transform in the pipeline. We are also not concerned about which hyperparameters were chosen for the model when using a grid search as the final step in the modeling pipeline.

In all three cases: the single model, the pipeline with automatic feature selection, and the pipeline with a grid search, we are evaluating the “*model*” or “*modeling pipeline*” as an atomic unit.

The pipeline allows us as machine learning practitioners to move up one level of abstraction and be less concerned with the specific outcomes of the algorithms and more concerned with the capability of a sequence of procedures.

As such, we can focus on evaluating the capability of the algorithms on the dataset, not the product of the algorithms, i.e. the model. Once we have an estimate of the pipeline, we can apply it and be confident that we will get similar performance, on average.

It is a shift in thinking and may take some time to get used to.

It is also the philosophy behind modern AutoML (automatic machine learning) techniques that treat applied machine learning as a large combinatorial search problem.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered modeling pipelines for applied machine learning.

Specifically, you learned:

- Applied machine learning is concerned with more than finding a good performing model; it also requires finding an appropriate sequence of data preparation steps and steps for the post-processing of predictions.
- Collectively, the operations required to address a predictive modeling problem can be considered an atomic unit called a modeling pipeline.
- Approaching applied machine learning through the lens of modeling pipelines requires a change in thinking from evaluating specific model configurations to sequences of transforms and algorithms.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Machine Learning Modeling Pipelines appeared first on Machine Learning Mastery.

]]>The post Semi-Supervised Learning With Label Spreading appeared first on Machine Learning Mastery.

]]>Semi-supervised learning algorithms are unlike supervised learning algorithms that are only able to learn from labeled training data.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagates known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the **label spreading algorithm** for classification predictive modeling.

In this tutorial, you will discover how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

- An intuition for how the label spreading semi-supervised learning algorithm works.
- How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
- How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.

This tutorial is divided into three parts; they are:

- Label Spreading Algorithm
- Semi-Supervised Classification Dataset
- Label Spreading for Semi-Supervised Learning

Label Spreading is a semi-supervised learning algorithm.

The algorithm was introduced by Dengyong Zhou, et al. in their 2003 paper titled “Learning With Local And Global Consistency.”

The intuition for the broader approach of semi-supervised learning is that nearby points in the input space should have the same label, and points in the same structure or manifold in the input space should have the same label.

The key to semi-supervised learning problems is the prior assumption of consistency, which means: (1) nearby points are likely to have the same label; and (2) points on the same structure typically referred to as a cluster or a manifold) are likely to have the same label.

— Learning With Local And Global Consistency, 2003.

The label spreading is inspired by a technique from experimental psychology called spreading activation networks.

This algorithm can be understood intuitively in terms of spreading activation networks from experimental psychology.

— Learning With Local And Global Consistency, 2003.

Points in the dataset are connected in a graph based on their relative distances in the input space. The weight matrix of the graph is normalized symmetrically, much like spectral clustering. Information is passed through the graph, which is adapted to capture the structure in the input space.

The approach is very similar to the label propagation algorithm for semi-supervised learning.

Another similar label propagation algorithm was given by Zhou et al.: at each step a node i receives a contribution from its neighbors j (weighted by the normalized weight of the edge (i,j)), and an additional small contribution given by its initial value

— Page 196, Semi-Supervised Learning, 2006.

After convergence, labels are applied based on nodes that passed on the most information.

Finally, the label of each unlabeled point is set to be the class of which it has received most information during the iteration process.

— Learning With Local And Global Consistency, 2003.

Now that we are familiar with the label spreading algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

In this section, we will define a dataset for semis-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification) and two input variables and 1,000 examples.

... # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an equal 50-50 split (e.g. 500 rows in each).

... # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

... # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

# prepare semi-supervised learning dataset from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) # summarize training set size print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape) print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape) # summarize test set size print('Test Set:', X_test.shape, y_test.shape)

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

Labeled Train Set: (250, 2) (250,) Unlabeled Train Set: (250, 2) (250,) Test Set: (500, 2) (500,)

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

... # define model model = LogisticRegression() # fit model on labeled dataset model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire holdout test dataset and evaluated using classification accuracy.

... # make predictions on hold out test set yhat = model.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

# baseline performance on the semi-supervised learning dataset from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) # define model model = LogisticRegression() # fit model on labeled dataset model.fit(X_train_lab, y_train_lab) # make predictions on hold out test set yhat = model.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the model on the labeled training dataset and evaluates it on the holdout dataset and prints the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve a better accuracy than this.

Accuracy: 84.800

Next, let’s explore how to apply the label spreading algorithm to the dataset.

The label spreading algorithm is available in the scikit-learn Python machine learning library via the LabelSpreading class.

The model can be fit just like any other classification model by calling the *fit()* function and used to make predictions for new data via the *predict()* function.

... # define model model = LabelSpreading() # fit model on training dataset model.fit(..., ...) # make predictions on hold out test set yhat = model.predict(...)

Importantly, the training dataset provided to the *fit()* function must include labeled examples that are ordinal encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset is available via the “*transduction_*” attribute on the *LabelSpreading* class.

... # get labels for entire training dataset data tran_labels = model.transduction_

Now that we are familiar with how to use the label spreading algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the training dataset into a single array.

... # create the training dataset input X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 valued (unlabeled) for each row in the unlabeled portion of the training dataset.

... # create "no label" for unlabeled data nolabel = [-1 for _ in range(len(y_test_unlab))]

This list can then be concatenated with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset.

... # recombine training dataset labels y_train_mixed = concatenate((y_train_lab, nolabel))

We can now train the *LabelSpreading* model on the entire training dataset.

... # define model model = LabelSpreading() # fit model on training dataset model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

... # make predictions on hold out test set yhat = model.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating label spreading on the semi-supervised learning dataset is listed below.

# evaluate label spreading on the semi-supervised learning dataset from numpy import concatenate from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.semi_supervised import LabelSpreading # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) # create the training dataset input X_train_mixed = concatenate((X_train_lab, X_test_unlab)) # create "no label" for unlabeled data nolabel = [-1 for _ in range(len(y_test_unlab))] # recombine training dataset labels y_train_mixed = concatenate((y_train_lab, nolabel)) # define model model = LabelSpreading() # fit model on training dataset model.fit(X_train_mixed, y_train_mixed) # make predictions on hold out test set yhat = model.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the model on the entire training dataset and evaluates it on the holdout dataset and prints the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the label spreading model achieves a classification accuracy of about 85.4 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

Accuracy: 85.400

So far so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label spreading model as follows:

... # get labels for entire training dataset data tran_labels = model.transduction_

We can then use these labels, along with all of the input data, to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

... # define supervised learning model model2 = LogisticRegression() # fit supervised learning model on entire training dataset model2.fit(X_train_mixed, tran_labels) # make predictions on hold out test set yhat = model2.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

# evaluate logistic regression fit on label spreading for semi-supervised learning from numpy import concatenate from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.semi_supervised import LabelSpreading from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) # create the training dataset input X_train_mixed = concatenate((X_train_lab, X_test_unlab)) # create "no label" for unlabeled data nolabel = [-1 for _ in range(len(y_test_unlab))] # recombine training dataset labels y_train_mixed = concatenate((y_train_lab, nolabel)) # define model model = LabelSpreading() # fit model on training dataset model.fit(X_train_mixed, y_train_mixed) # get labels for entire training dataset data tran_labels = model.transduction_ # define supervised learning model model2 = LogisticRegression() # fit supervised learning model on entire training dataset model2.fit(X_train_mixed, tran_labels) # make predictions on hold out test set yhat = model2.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with inferred labels and evaluates it on the holdout dataset, printing the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that this hierarchical approach of semi-supervised model followed by supervised model achieves a classification accuracy of about 85.8 percent on the holdout dataset, slightly better than the semi-supervised learning algorithm used alone that achieved an accuracy of about 85.6 percent.

Accuracy: 85.800

**Can you achieve better results by tuning the hyperparameters of the LabelSpreading model?**

Let me know what you discover in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Semi-Supervised Learning, 2009.
- Chapter 11: Label Propagation and Quadratic Criterion, Semi-Supervised Learning, 2006.

- sklearn.semi_supervised.LabelSpreading API.
- Section 1.14. Semi-Supervised, Scikit-Learn User Guide.
- sklearn.model_selection.train_test_split API.
- sklearn.linear_model.LogisticRegression API.
- sklearn.datasets.make_classification API.

In this tutorial, you discovered how to apply the label spreading algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

- An intuition for how the label spreading semi-supervised learning algorithm works.
- How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
- How to develop and evaluate a label spreading algorithm and use the model output to train a supervised learning algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Semi-Supervised Learning With Label Spreading appeared first on Machine Learning Mastery.

]]>The post Multinomial Logistic Regression With Python appeared first on Machine Learning Mastery.

]]>Logistic regression, by default, is limited to two-class classification problems. Some extensions like one-vs-rest can allow logistic regression to be used for multi-class classification problems, although they require that the classification problem first be transformed into multiple binary classification problems.

Instead, the multinomial logistic regression algorithm is an extension to the logistic regression model that involves changing the loss function to cross-entropy loss and predict probability distribution to a multinomial probability distribution to natively support multi-class classification problems.

In this tutorial, you will discover how to develop multinomial logistic regression models in Python.

After completing this tutorial, you will know:

- Multinomial logistic regression is an extension of logistic regression for multi-class classification.
- How to develop and evaluate multinomial logistic regression and develop a final model for making predictions on new data.
- How to tune the penalty hyperparameter for the multinomial logistic regression model.

Let’s get started.

This tutorial is divided into three parts; they are:

- Multinomial Logistic Regression
- Evaluate Multinomial Logistic Regression Model
- Tune Penalty for Multinomial Logistic Regression

Logistic regression is a classification algorithm.

It is intended for datasets that have numerical input variables and a categorical target variable that has two values or classes. Problems of this type are referred to as binary classification problems.

Logistic regression is designed for two-class problems, modeling the target using a binomial probability distribution function. The class labels are mapped to 1 for the positive class or outcome and 0 for the negative class or outcome. The fit model predicts the probability that an example belongs to class 1.

By default, logistic regression cannot be used for classification tasks that have more than two class labels, so-called multi-class classification.

Instead, it requires modification to support multi-class classification problems.

One popular approach for adapting logistic regression to multi-class classification problems is to split the multi-class classification problem into multiple binary classification problems and fit a standard logistic regression model on each subproblem. Techniques of this type include one-vs-rest and one-vs-one wrapper models.

An alternate approach involves changing the logistic regression model to support the prediction of multiple class labels directly. Specifically, to predict the probability that an input example belongs to each known class label.

The probability distribution that defines multi-class probabilities is called a multinomial probability distribution. A logistic regression model that is adapted to learn and predict a multinomial probability distribution is referred to as Multinomial Logistic Regression. Similarly, we might refer to default or standard logistic regression as Binomial Logistic Regression.

**Binomial Logistic Regression**: Standard logistic regression that predicts a binomial probability (i.e. for two classes) for each input example.**Multinomial Logistic Regression**: Modified version of logistic regression that predicts a multinomial probability (i.e. more than two classes) for each input example.

If you are new to binomial and multinomial probability distributions, you may want to read the tutorial:

Changing logistic regression from binomial to multinomial probability requires a change to the loss function used to train the model (e.g. log loss to cross-entropy loss), and a change to the output from a single probability value to one probability for each class label.

Now that we are familiar with multinomial logistic regression, let’s look at how we might develop and evaluate multinomial logistic regression models in Python.

In this section, we will develop and evaluate a multinomial logistic regression model using the scikit-learn Python machine learning library.

First, we will define a synthetic multi-class classification dataset to use as the basis of the investigation. This is a generic dataset that you can easily replace with your own loaded dataset later.

The make_classification() function can be used to generate a dataset with a given number of rows, columns, and classes. In this case, we will generate a dataset with 1,000 rows, 10 input variables or columns, and 3 classes.

The example below generates the dataset and summarizes the shape of the arrays and the distribution of examples across the three classes.

# test classification dataset from collections import Counter from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1) # summarize the dataset print(X.shape, y.shape) print(Counter(y))

Running the example confirms that the dataset has 1,000 rows and 10 columns, as we expected, and that the rows are distributed approximately evenly across the three classes, with about 334 examples in each class.

(1000, 10) (1000,) Counter({1: 334, 2: 334, 0: 332})

Logistic regression is supported in the scikit-learn library via the LogisticRegression class.

The *LogisticRegression* class can be configured for multinomial logistic regression by setting the “*multi_class*” argument to “*multinomial*” and the “*solver*” argument to a solver that supports multinomial logistic regression, such as “*lbfgs*“.

... # define the multinomial logistic regression model model = LogisticRegression(multi_class='multinomial', solver='lbfgs')

The multinomial logistic regression model will be fit using cross-entropy loss and will predict the integer value for each integer encoded class label.

Now that we are familiar with the multinomial logistic regression API, we can look at how we might evaluate a multinomial logistic regression model on our synthetic multi-class classification dataset.

It is a good practice to evaluate classification models using repeated stratified k-fold cross-validation. The stratification ensures that each cross-validation fold has approximately the same distribution of examples in each class as the whole training dataset.

We will use three repeats with 10 folds, which is a good default, and evaluate model performance using classification accuracy given that the classes are balanced.

The complete example of evaluating multinomial logistic regression for multi-class classification is listed below.

# evaluate multinomial logistic regression model from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1) # define the multinomial logistic regression model model = LogisticRegression(multi_class='multinomial', solver='lbfgs') # define the model evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model and collect the scores n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # report the model performance print('Mean Accuracy: %.3f (%.3f)' % (mean(n_scores), std(n_scores)))

Running the example reports the mean classification accuracy across all folds and repeats of the evaluation procedure.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the multinomial logistic regression model with default penalty achieved a mean classification accuracy of about 68.1 percent on our synthetic classification dataset.

Mean Accuracy: 0.681 (0.042)

We may decide to use the multinomial logistic regression model as our final model and make predictions on new data.

This can be achieved by first fitting the model on all available data, then calling the *predict()* function to make a prediction for new data.

The example below demonstrates how to make a prediction for new data using the multinomial logistic regression model.

# make a prediction with a multinomial logistic regression model from sklearn.datasets import make_classification from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1) # define the multinomial logistic regression model model = LogisticRegression(multi_class='multinomial', solver='lbfgs') # fit the model on the whole dataset model.fit(X, y) # define a single row of input data row = [1.89149379, -0.39847585, 1.63856893, 0.01647165, 1.51892395, -3.52651223, 1.80998823, 0.58810926, -0.02542177, -0.52835426] # predict the class label yhat = model.predict([row]) # summarize the predicted class print('Predicted Class: %d' % yhat[0])

Running the example first fits the model on all available data, then defines a row of data, which is provided to the model in order to make a prediction.

In this case, we can see that the model predicted the class “1” for the single row of data.

Predicted Class: 1

A benefit of multinomial logistic regression is that it can predict calibrated probabilities across all known class labels in the dataset.

This can be achieved by calling the *predict_proba()* function on the model.

The example below demonstrates how to predict a multinomial probability distribution for a new example using the multinomial logistic regression model.

# predict probabilities with a multinomial logistic regression model from sklearn.datasets import make_classification from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, n_redundant=5, n_classes=3, random_state=1) # define the multinomial logistic regression model model = LogisticRegression(multi_class='multinomial', solver='lbfgs') # fit the model on the whole dataset model.fit(X, y) # define a single row of input data row = [1.89149379, -0.39847585, 1.63856893, 0.01647165, 1.51892395, -3.52651223, 1.80998823, 0.58810926, -0.02542177, -0.52835426] # predict a multinomial probability distribution yhat = model.predict_proba([row]) # summarize the predicted probabilities print('Predicted Probabilities: %s' % yhat[0])

Running the example first fits the model on all available data, then defines a row of data, which is provided to the model in order to predict class probabilities.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that class 1 (e.g. the array index is mapped to the class integer value) has the largest predicted probability with about 0.50.

Predicted Probabilities: [0.16470456 0.50297138 0.33232406]

Now that we are familiar with evaluating and using multinomial logistic regression models, let’s explore how we might tune the model hyperparameters.

An important hyperparameter to tune for multinomial logistic regression is the penalty term.

This term imposes pressure on the model to seek smaller model weights. This is achieved by adding a weighted sum of the model coefficients to the loss function, encouraging the model to reduce the size of the weights along with the error while fitting the model.

A popular type of penalty is the L2 penalty that adds the (weighted) sum of the squared coefficients to the loss function. A weighting of the coefficients can be used that reduces the strength of the penalty from full penalty to a very slight penalty.

By default, the *LogisticRegression* class uses the L2 penalty with a weighting of coefficients set to 1.0. The type of penalty can be set via the “*penalty*” argument with values of “*l1*“, “*l2*“, “*elasticnet*” (e.g. both), although not all solvers support all penalty types. The weighting of the coefficients in the penalty can be set via the “*C*” argument.

... # define the multinomial logistic regression model with a default penalty LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='l2', C=1.0)

The weighting for the penalty is actually the inverse weighting, perhaps penalty = 1 – C.

From the documentation:

C : float, default=1.0

Inverse of regularization strength; must be a positive float. Like in support vector machines, smaller values specify stronger regularization.

This means that values close to 1.0 indicate very little penalty and values close to zero indicate a strong penalty. A C value of 1.0 may indicate no penalty at all.

**C close to 1.0**: Light penalty.**C close to 0.0**: Strong penalty.

The penalty can be disabled by setting the “*penalty*” argument to the string “*none*“.

... # define the multinomial logistic regression model without a penalty LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='none')

Now that we are familiar with the penalty, let’s look at how we might explore the effect of different penalty values on the performance of the multinomial logistic regression model.

It is common to test penalty values on a log scale in order to quickly discover the scale of penalty that works well for a model. Once found, further tuning at that scale may be beneficial.

We will explore the L2 penalty with weighting values in the range from 0.0001 to 1.0 on a log scale, in addition to no penalty or 0.0.

The complete example of evaluating L2 penalty values for multinomial logistic regression is listed below.

# tune regularization for multinomial logistic regression from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import LogisticRegression from matplotlib import pyplot # get the dataset def get_dataset(): X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=1, n_classes=3) return X, y # get a list of models to evaluate def get_models(): models = dict() for p in [0.0, 0.0001, 0.001, 0.01, 0.1, 1.0]: # create name for model key = '%.4f' % p # turn off penalty in some cases if p == 0.0: # no penalty in this case models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='none') else: models[key] = LogisticRegression(multi_class='multinomial', solver='lbfgs', penalty='l2', C=p) return models # evaluate a give model using cross-validation def evaluate_model(model, X, y): # define the evaluation procedure cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate the model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) return scores # define dataset X, y = get_dataset() # get the models to evaluate models = get_models() # evaluate the models and store results results, names = list(), list() for name, model in models.items(): # evaluate the model and collect the scores scores = evaluate_model(model, X, y) # store the results results.append(scores) names.append(name) # summarize progress along the way print('>%s %.3f (%.3f)' % (name, mean(scores), std(scores))) # plot model performance for comparison pyplot.boxplot(results, labels=names, showmeans=True) pyplot.show()

Running the example reports the mean classification accuracy for each configuration along the way.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a C value of 1.0 has the best score of about 77.7 percent, which is the same as using no penalty that achieves the same score.

>0.0000 0.777 (0.037) >0.0001 0.683 (0.049) >0.0010 0.762 (0.044) >0.0100 0.775 (0.040) >0.1000 0.774 (0.038) >1.0000 0.777 (0.037)

A box and whisker plot is created for the accuracy scores for each configuration and all plots are shown side by side on a figure on the same scale for direct comparison.

In this case, we can see that the larger penalty we use on this dataset (i.e. the smaller the C value), the worse the performance of the model.

This section provides more resources on the topic if you are looking to go deeper.

- Logistic Regression Tutorial for Machine Learning
- Logistic Regression for Machine Learning
- A Gentle Introduction to Logistic Regression With Maximum Likelihood Estimation
- How To Implement Logistic Regression From Scratch in Python
- Cost-Sensitive Logistic Regression for Imbalanced Classification

In this tutorial, you discovered how to develop multinomial logistic regression models in Python.

Specifically, you learned:

- Multinomial logistic regression is an extension of logistic regression for multi-class classification.
- How to develop and evaluate multinomial logistic regression and develop a final model for making predictions on new data.
- How to tune the penalty hyperparameter for the multinomial logistic regression model.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Multinomial Logistic Regression With Python appeared first on Machine Learning Mastery.

]]>The post Semi-Supervised Learning With Label Propagation appeared first on Machine Learning Mastery.

]]>Semi-supervised learning algorithms are unlike supervised learning algorithms that are only able to learn from labeled training data.

A popular approach to semi-supervised learning is to create a graph that connects examples in the training dataset and propagate known labels through the edges of the graph to label unlabeled examples. An example of this approach to semi-supervised learning is the **label propagation algorithm** for classification predictive modeling.

In this tutorial, you will discover how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

After completing this tutorial, you will know:

- An intuition for how the label propagation semi-supervised learning algorithm works.
- How to develop a semi-supervised classification dataset and establish a baseline in performance with a supervised learning algorithm.
- How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

Let’s get started.

This tutorial is divided into three parts; they are:

- Label Propagation Algorithm
- Semi-Supervised Classification Dataset
- Label Propagation for Semi-Supervised Learning

Label Propagation is a semi-supervised learning algorithm.

The algorithm was proposed in the 2002 technical report by Xiaojin Zhu and Zoubin Ghahramani titled “Learning From Labeled And Unlabeled Data With Label Propagation.”

The intuition for the algorithm is that a graph is created that connects all examples (rows) in the dataset based on their distance, such as Euclidean distance. Nodes in the graph then have label soft labels or label distribution based on the labels or label distributions of examples connected nearby in the graph.

Many semi-supervised learning algorithms rely on the geometry of the data induced by both labeled and unlabeled examples to improve on supervised methods that use only the labeled data. This geometry can be naturally represented by an empirical graph g = (V,E) where nodes V = {1,…,n} represent the training data and edges E represent similarities between them

— Page 193, Semi-Supervised Learning, 2006.

Propagation refers to the iterative nature that labels are assigned to nodes in the graph and propagate along the edges of the graph to connected nodes.

This procedure is sometimes called label propagation, as it “propagates” labels from the labeled vertices (which are fixed) gradually through the edges to all the unlabeled vertices.

— Page 48, Introduction to Semi-Supervised Learning, 2009.

The process is repeated for a fixed number of iterations to strengthen the labels assigned to unlabeled examples.

Starting with nodes 1, 2,…,l labeled with their known label (1 or −1) and nodes l + 1,…,n labeled with 0, each node starts to propagate its label to its neighbors, and the process is repeated until convergence.

— Page 194, Semi-Supervised Learning, 2006.

Now that we are familiar with the Label Propagation algorithm, let’s look at how we might use it on a project. First, we must define a semi-supervised classification dataset.

In this section, we will define a dataset for semis-supervised learning and establish a baseline in performance on the dataset.

First, we can define a synthetic classification dataset using the make_classification() function.

We will define the dataset with two classes (binary classification) and two input variables and 1,000 examples.

... # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1)

Next, we will split the dataset into train and test datasets with an equal 50-50 split (e.g. 500 rows in each).

... # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

Finally, we will split the training dataset in half again into a portion that will have labels and a portion that we will pretend is unlabeled.

... # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train)

Tying this together, the complete example of preparing the semi-supervised learning dataset is listed below.

# prepare semi-supervised learning dataset from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) # summarize training set size print('Labeled Train Set:', X_train_lab.shape, y_train_lab.shape) print('Unlabeled Train Set:', X_test_unlab.shape, y_test_unlab.shape) # summarize test set size print('Test Set:', X_test.shape, y_test.shape)

Running the example prepares the dataset and then summarizes the shape of each of the three portions.

The results confirm that we have a test dataset of 500 rows, a labeled training dataset of 250 rows, and 250 rows of unlabeled data.

Labeled Train Set: (250, 2) (250,) Unlabeled Train Set: (250, 2) (250,) Test Set: (500, 2) (500,)

A supervised learning algorithm will only have 250 rows from which to train a model.

A semi-supervised learning algorithm will have the 250 labeled rows as well as the 250 unlabeled rows that could be used in numerous ways to improve the labeled training dataset.

Next, we can establish a baseline in performance on the semi-supervised learning dataset using a supervised learning algorithm fit only on the labeled training data.

This is important because we would expect a semi-supervised learning algorithm to outperform a supervised learning algorithm fit on the labeled data alone. If this is not the case, then the semi-supervised learning algorithm does not have skill.

In this case, we will use a logistic regression algorithm fit on the labeled portion of the training dataset.

... # define model model = LogisticRegression() # fit model on labeled dataset model.fit(X_train_lab, y_train_lab)

The model can then be used to make predictions on the entire hold out test dataset and evaluated using classification accuracy.

... # make predictions on hold out test set yhat = model.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of evaluating a supervised learning algorithm on the semi-supervised learning dataset is listed below.

# baseline performance on the semi-supervised learning dataset from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) # define model model = LogisticRegression() # fit model on labeled dataset model.fit(X_train_lab, y_train_lab) # make predictions on hold out test set yhat = model.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the model on the labeled training dataset and evaluates it on the holdout dataset and prints the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm achieved a classification accuracy of about 84.8 percent.

We would expect an effective semi-supervised learning algorithm to achieve better accuracy than this.

Accuracy: 84.800

Next, let’s explore how to apply the label propagation algorithm to the dataset.

The Label Propagation algorithm is available in the scikit-learn Python machine learning library via the LabelPropagation class.

The model can be fit just like any other classification model by calling the *fit()* function and used to make predictions for new data via the *predict()* function.

... # define model model = LabelPropagation() # fit model on training dataset model.fit(..., ...) # make predictions on hold out test set yhat = model.predict(...)

Importantly, the training dataset provided to the *fit()* function must include labeled examples that are integer encoded (as per normal) and unlabeled examples marked with a label of -1.

The model will then determine a label for the unlabeled examples as part of fitting the model.

After the model is fit, the estimated labels for the labeled and unlabeled data in the training dataset is available via the “*transduction_*” attribute on the *LabelPropagation* class.

... # get labels for entire training dataset data tran_labels = model.transduction_

Now that we are familiar with how to use the Label Propagation algorithm in scikit-learn, let’s look at how we might apply it to our semi-supervised learning dataset.

First, we must prepare the training dataset.

We can concatenate the input data of the training dataset into a single array.

... # create the training dataset input X_train_mixed = concatenate((X_train_lab, X_test_unlab))

We can then create a list of -1 valued (unlabeled) for each row in the unlabeled portion of the training dataset.

... # create "no label" for unlabeled data nolabel = [-1 for _ in range(len(y_test_unlab))]

This list can then be concatenated with the labels from the labeled portion of the training dataset to correspond with the input array for the training dataset.

... # recombine training dataset labels y_train_mixed = concatenate((y_train_lab, nolabel))

We can now train the *LabelPropagation* model on the entire training dataset.

... # define model model = LabelPropagation() # fit model on training dataset model.fit(X_train_mixed, y_train_mixed)

Next, we can use the model to make predictions on the holdout dataset and evaluate the model using classification accuracy.

Tying this together, the complete example of evaluating label propagation on the semi-supervised learning dataset is listed below.

# evaluate label propagation on the semi-supervised learning dataset from numpy import concatenate from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.semi_supervised import LabelPropagation # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) # create the training dataset input X_train_mixed = concatenate((X_train_lab, X_test_unlab)) # create "no label" for unlabeled data nolabel = [-1 for _ in range(len(y_test_unlab))] # recombine training dataset labels y_train_mixed = concatenate((y_train_lab, nolabel)) # define model model = LabelPropagation() # fit model on training dataset model.fit(X_train_mixed, y_train_mixed) # make predictions on hold out test set yhat = model.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the model on the entire training dataset and evaluates it on the holdout dataset and prints the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the label propagation model achieves a classification accuracy of about 85.6 percent, which is slightly higher than a logistic regression fit only on the labeled training dataset that achieved an accuracy of about 84.8 percent.

Accuracy: 85.600

So far, so good.

Another approach we can use with the semi-supervised model is to take the estimated labels for the training dataset and fit a supervised learning model.

Recall that we can retrieve the labels for the entire training dataset from the label propagation model as follows:

... # get labels for entire training dataset data tran_labels = model.transduction_

We can then use these labels along with all of the input data to train and evaluate a supervised learning algorithm, such as a logistic regression model.

The hope is that the supervised learning model fit on the entire training dataset would achieve even better performance than the semi-supervised learning model alone.

... # define supervised learning model model2 = LogisticRegression() # fit supervised learning model on entire training dataset model2.fit(X_train_mixed, tran_labels) # make predictions on hold out test set yhat = model2.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Tying this together, the complete example of using the estimated training set labels to train and evaluate a supervised learning model is listed below.

# evaluate logistic regression fit on label propagation for semi-supervised learning from numpy import concatenate from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.semi_supervised import LabelPropagation from sklearn.linear_model import LogisticRegression # define dataset X, y = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, random_state=1) # split into train and test X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y) # split train into labeled and unlabeled X_train_lab, X_test_unlab, y_train_lab, y_test_unlab = train_test_split(X_train, y_train, test_size=0.50, random_state=1, stratify=y_train) # create the training dataset input X_train_mixed = concatenate((X_train_lab, X_test_unlab)) # create "no label" for unlabeled data nolabel = [-1 for _ in range(len(y_test_unlab))] # recombine training dataset labels y_train_mixed = concatenate((y_train_lab, nolabel)) # define model model = LabelPropagation() # fit model on training dataset model.fit(X_train_mixed, y_train_mixed) # get labels for entire training dataset data tran_labels = model.transduction_ # define supervised learning model model2 = LogisticRegression() # fit supervised learning model on entire training dataset model2.fit(X_train_mixed, tran_labels) # make predictions on hold out test set yhat = model2.predict(X_test) # calculate score for test set score = accuracy_score(y_test, yhat) # summarize score print('Accuracy: %.3f' % (score*100))

Running the algorithm fits the semi-supervised model on the entire training dataset, then fits a supervised learning model on the entire training dataset with inferred labels and evaluates it on the holdout dataset, printing the classification accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that this hierarchical approach of the semi-supervised model followed by supervised model achieves a classification accuracy of about 86.2 percent on the holdout dataset, even better than the semi-supervised learning used alone that achieved an accuracy of about 85.6 percent.

Accuracy: 86.200

**Can you achieve better results by tuning the hyperparameters of the LabelPropagation model?**

Let me know what you discover in the comments below.

This section provides more resources on the topic if you are looking to go deeper.

- Introduction to Semi-Supervised Learning, 2009.
- Chapter 11: Label Propagation and Quadratic Criterion, Semi-Supervised Learning, 2006.

- sklearn.semi_supervised.LabelPropagation API.
- Section 1.14. Semi-Supervised, Scikit-Learn User Guide.
- sklearn.model_selection.train_test_split API.
- sklearn.linear_model.LogisticRegression API.
- sklearn.datasets.make_classification API.

In this tutorial, you discovered how to apply the label propagation algorithm to a semi-supervised learning classification dataset.

Specifically, you learned:

- An intuition for how the label propagation semi-supervised learning algorithm works.
- How to develop and evaluate a label propagation algorithm and use the model output to train a supervised learning algorithm.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Semi-Supervised Learning With Label Propagation appeared first on Machine Learning Mastery.

]]>The post Perceptron Algorithm for Classification in Python appeared first on Machine Learning Mastery.

]]>It may be considered one of the first and one of the simplest types of artificial neural networks. It is definitely not “deep” learning but is an important building block.

Like logistic regression, it can quickly learn a linear separation in feature space for two-class classification tasks, although unlike logistic regression, it learns using the stochastic gradient descent optimization algorithm and does not predict calibrated probabilities.

In this tutorial, you will discover the Perceptron classification machine learning algorithm.

After completing this tutorial, you will know:

- The Perceptron Classifier is a linear algorithm that can be applied to binary classification tasks.
- How to fit, evaluate, and make predictions with the Perceptron model with Scikit-Learn.
- How to tune the hyperparameters of the Perceptron algorithm on a given dataset.

Let’s get started.

This tutorial is divided into 3=three parts; they are:

- Perceptron Algorithm
- Perceptron With Scikit-Learn
- Tune Perceptron Hyperparameters

The Perceptron algorithm is a two-class (binary) classification machine learning algorithm.

It is a type of neural network model, perhaps the simplest type of neural network model.

It consists of a single node or neuron that takes a row of data as input and predicts a class label. This is achieved by calculating the weighted sum of the inputs and a bias (set to 1). The weighted sum of the input of the model is called the activation.

**Activation**= Weights * Inputs + Bias

If the activation is above 0.0, the model will output 1.0; otherwise, it will output 0.0.

**Predict 1**: If Activation > 0.0**Predict 0**: If Activation <= 0.0

Given that the inputs are multiplied by model coefficients, like linear regression and logistic regression, it is good practice to normalize or standardize data prior to using the model.

The Perceptron is a linear classification algorithm. This means that it learns a decision boundary that separates two classes using a line (called a hyperplane) in the feature space. As such, it is appropriate for those problems where the classes can be separated well by a line or linear model, referred to as linearly separable.

The coefficients of the model are referred to as input weights and are trained using the stochastic gradient descent optimization algorithm.

Examples from the training dataset are shown to the model one at a time, the model makes a prediction, and error is calculated. The weights of the model are then updated to reduce the errors for the example. This is called the Perceptron update rule. This process is repeated for all examples in the training dataset, called an epoch. This process of updating the model using examples is then repeated for many epochs.

Model weights are updated with a small proportion of the error each batch, and the proportion is controlled by a hyperparameter called the learning rate, typically set to a small value. This is to ensure learning does not occur too quickly, resulting in a possibly lower skill model, referred to as premature convergence of the optimization (search) procedure for the model weights.

- weights(t + 1) = weights(t) + learning_rate * (expected_i – predicted_) * input_i

Training is stopped when the error made by the model falls to a low level or no longer improves, or a maximum number of epochs is performed.

The initial values for the model weights are set to small random values. Additionally, the training dataset is shuffled prior to each training epoch. This is by design to accelerate and improve the model training process. Because of this, the learning algorithm is stochastic and may achieve different results each time it is run. As such, it is good practice to summarize the performance of the algorithm on a dataset using repeated evaluation and reporting the mean classification accuracy.

The learning rate and number of training epochs are hyperparameters of the algorithm that can be set using heuristics or hyperparameter tuning.

For more about the Perceptron algorithm, see the tutorial:

Now that we are familiar with the Perceptron algorithm, let’s explore how we can use the algorithm in Python.

The Perceptron algorithm is available in the scikit-learn Python machine learning library via the Perceptron class.

The class allows you to configure the learning rate (*eta0*), which defaults to 1.0.

... # define model model = Perceptron(eta0=1.0)

The implementation also allows you to configure the total number of training epochs (*max_iter*), which defaults to 1,000.

... # define model model = Perceptron(max_iter=1000)

The scikit-learn implementation of the Perceptron algorithm also provides other configuration options that you may want to explore, such as early stopping and the use of a penalty loss.

We can demonstrate the Perceptron classifier with a worked example.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to create a dataset with 1,000 examples, each with 20 input variables.

The example creates and summarizes the dataset.

# test classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and confirms the number of rows and columns of the dataset.

(1000, 10) (1000,)

We can fit and evaluate a Perceptron model using repeated stratified k-fold cross-validation via the RepeatedStratifiedKFold class. We will use 10 folds and three repeats in the test harness.

We will use the default configuration.

... # create the model model = Perceptron()

The complete example of evaluating the Perceptron model for the synthetic binary classification task is listed below.

# evaluate a perceptron model on the dataset from numpy import mean from numpy import std from sklearn.datasets import make_classification from sklearn.model_selection import cross_val_score from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import Perceptron # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1) # define model model = Perceptron() # define model evaluation method cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # evaluate model scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1) # summarize result print('Mean Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Running the example evaluates the Perceptron algorithm on the synthetic dataset and reports the average accuracy across the three repeats of 10-fold cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Consider running the example a few times.

In this case, we can see that the model achieved a mean accuracy of about 84.7 percent.

Mean Accuracy: 0.847 (0.052)

We may decide to use the Perceptron classifier as our final model and make predictions on new data.

This can be achieved by fitting the model pipeline on all available data and calling the predict() function passing in a new row of data.

We can demonstrate this with a complete example listed below.

# make a prediction with a perceptron model on the dataset from sklearn.datasets import make_classification from sklearn.linear_model import Perceptron # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1) # define model model = Perceptron() # fit model model.fit(X, y) # define new data row = [0.12777556,-3.64400522,-2.23268854,-1.82114386,1.75466361,0.1243966,1.03397657,2.35822076,1.01001752,0.56768485] # make a prediction yhat = model.predict([row]) # summarize prediction print('Predicted Class: %d' % yhat)

Running the example fits the model and makes a class label prediction for a new row of data.

Predicted Class: 1

Next, we can look at configuring the model hyperparameters.

The hyperparameters for the Perceptron algorithm must be configured for your specific dataset.

Perhaps the most important hyperparameter is the learning rate.

A large learning rate can cause the model to learn fast, but perhaps at the cost of lower skill. A smaller learning rate can result in a better-performing model but may take a long time to train the model.

You can learn more about exploring learning rates in the tutorial:

It is common to test learning rates on a log scale between a small value such as 1e-4 (or smaller) and 1.0. We will test the following values in this case:

... # define grid grid = dict() grid['eta0'] = [0.0001, 0.001, 0.01, 0.1, 1.0]

The example below demonstrates this using the GridSearchCV class with a grid of values we have defined.

# grid search learning rate for the perceptron from sklearn.datasets import make_classification from sklearn.model_selection import GridSearchCV from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import Perceptron # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1) # define model model = Perceptron() # define model evaluation method cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # define grid grid = dict() grid['eta0'] = [0.0001, 0.001, 0.01, 0.1, 1.0] # define search search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1) # perform the search results = search.fit(X, y) # summarize print('Mean Accuracy: %.3f' % results.best_score_) print('Config: %s' % results.best_params_) # summarize all means = results.cv_results_['mean_test_score'] params = results.cv_results_['params'] for mean, param in zip(means, params): print(">%.3f with: %r" % (mean, param))

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that a smaller learning rate than the default results in better performance with learning rate 0.0001 and 0.001 both achieving a classification accuracy of about 85.7 percent as compared to the default of 1.0 that achieved an accuracy of about 84.7 percent.

Mean Accuracy: 0.857 Config: {'eta0': 0.0001} >0.857 with: {'eta0': 0.0001} >0.857 with: {'eta0': 0.001} >0.853 with: {'eta0': 0.01} >0.847 with: {'eta0': 0.1} >0.847 with: {'eta0': 1.0}

Another important hyperparameter is how many epochs are used to train the model.

This may depend on the training dataset and could vary greatly. Again, we will explore configuration values on a log scale between 1 and 1e+4.

... # define grid grid = dict() grid['max_iter'] = [1, 10, 100, 1000, 10000]

We will use our well-performing learning rate of 0.0001 found in the previous search.

... # define model model = Perceptron(eta0=0.0001)

The complete example of grid searching the number of training epochs is listed below.

# grid search total epochs for the perceptron from sklearn.datasets import make_classification from sklearn.model_selection import GridSearchCV from sklearn.model_selection import RepeatedStratifiedKFold from sklearn.linear_model import Perceptron # define dataset X, y = make_classification(n_samples=1000, n_features=10, n_informative=10, n_redundant=0, random_state=1) # define model model = Perceptron(eta0=0.0001) # define model evaluation method cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1) # define grid grid = dict() grid['max_iter'] = [1, 10, 100, 1000, 10000] # define search search = GridSearchCV(model, grid, scoring='accuracy', cv=cv, n_jobs=-1) # perform the search results = search.fit(X, y) # summarize print('Mean Accuracy: %.3f' % results.best_score_) print('Config: %s' % results.best_params_) # summarize all means = results.cv_results_['mean_test_score'] params = results.cv_results_['params'] for mean, param in zip(means, params): print(">%.3f with: %r" % (mean, param))

Running the example will evaluate each combination of configurations using repeated cross-validation.

Your specific results may vary given the stochastic nature of the learning algorithm. Try running the example a few times.

In this case, we can see that epochs 10 to 10,000 result in about the same classification accuracy. An interesting exception would be to explore configuring learning rate and number of training epochs at the same time to see if better results can be achieved.

Mean Accuracy: 0.857 Config: {'max_iter': 10} >0.850 with: {'max_iter': 1} >0.857 with: {'max_iter': 10} >0.857 with: {'max_iter': 100} >0.857 with: {'max_iter': 1000} >0.857 with: {'max_iter': 10000}

This section provides more resources on the topic if you are looking to go deeper.

- How To Implement The Perceptron Algorithm From Scratch In Python
- Understand the Impact of Learning Rate on Neural Network Performance
- How to Configure the Learning Rate When Training Deep Learning Neural Networks

- Neural Networks for Pattern Recognition, 1995.
- Pattern Recognition and Machine Learning, 2006.
- Artificial Intelligence: A Modern Approach, 3rd edition, 2015.

In this tutorial, you discovered the Perceptron classification machine learning algorithm.

Specifically, you learned:

- The Perceptron Classifier is a linear algorithm that can be applied to binary classification tasks.
- How to fit, evaluate, and make predictions with the Perceptron model with Scikit-Learn.
- How to tune the hyperparameters of the Perceptron algorithm on a given dataset.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Perceptron Algorithm for Classification in Python appeared first on Machine Learning Mastery.

]]>The post A Gentle Introduction to PyCaret for Machine Learning appeared first on Machine Learning Mastery.

]]>It is a Python version of the Caret machine learning package in R, popular because it allows models to be evaluated, compared, and tuned on a given dataset with just a few lines of code.

The PyCaret library provides these features, allowing the machine learning practitioner in Python to spot check a suite of standard machine learning algorithms on a classification or regression dataset with a single function call.

In this tutorial, you will discover the PyCaret Python open source library for machine learning.

After completing this tutorial, you will know:

- PyCaret is a Python version of the popular and widely used caret machine learning package in R.
- How to use PyCaret to easily evaluate and compare standard machine learning models on a dataset.
- How to use PyCaret to easily tune the hyperparameters of a well-performing machine learning model.

Let’s get started.

This tutorial is divided into four parts; they are:

- What Is PyCaret?
- Sonar Dataset
- Comparing Machine Learning Models
- Tuning Machine Learning Models

PyCaret is an open source Python machine learning library inspired by the caret R package.

The goal of the caret package is to automate the major steps for evaluating and comparing machine learning algorithms for classification and regression. The main benefit of the library is that a lot can be achieved with very few lines of code and little manual configuration. The PyCaret library brings these capabilities to Python.

PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. It is well suited for seasoned data scientists who want to increase the productivity of their ML experiments by using PyCaret in their workflows or for citizen data scientists and those new to data science with little or no background in coding.

The PyCaret library automates many steps of a machine learning project, such as:

- Defining the data transforms to perform (
*setup()*) - Evaluating and comparing standard models (
*compare_models()*) - Tuning model hyperparameters (
*tune_model()*)

As well as many more features not limited to creating ensembles, saving models, and deploying models.

The PyCaret library has a wealth of documentation for using the API; you can get started here:

We will not explore all of the features of the library in this tutorial; instead, we will focus on simple machine learning model comparison and hyperparameter tuning.

You can install PyCaret using your Python package manager, such as pip. For example:

pip install pycaret

Once installed, we can confirm that the library is available in your development environment and is working correctly by printing the installed version.

# check pycaret version import pycaret print('PyCaret: %s' % pycaret.__version__)

Running the example will load the PyCaret library and print the installed version number.

Your version number should be the same or higher.

PyCaret: 2.0.0

If you need help installing PyCaret for your system, you can see the installation instructions here:

Now that we are familiar with what PyCaret is, let’s explore how we might use it on a machine learning project.

We will use the Sonar standard binary classification dataset. You can learn more about it here:

We can download the dataset directly from the URL and load it as a Pandas DataFrame.

... # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv' # load the dataset df = read_csv(url, header=None) # summarize the shape of the dataset print(df.shape)

The PyCaret seems to require that a dataset has column names, and our dataset does not have column names, so we can set the column number as the column name directly.

... # set column names as the column number n_cols = df.shape[1] df.columns = [str(i) for i in range(n_cols)]

Finally, we can summarize the first few rows of data.

... # summarize the first few rows of data print(df.head())

Tying this together, the complete example of loading and summarizing the Sonar dataset is listed below.

# load the sonar dataset from pandas import read_csv # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv' # load the dataset df = read_csv(url, header=None) # summarize the shape of the dataset print(df.shape) # set column names as the column number n_cols = df.shape[1] df.columns = [str(i) for i in range(n_cols)] # summarize the first few rows of data print(df.head())

Running the example first loads the dataset and reports the shape, showing it has 208 rows and 61 columns.

The first five rows are then printed showing that the input variables are all numeric and the target variable is column “60” and has string labels.

(208, 61) 0 1 2 3 4 ... 56 57 58 59 60 0 0.0200 0.0371 0.0428 0.0207 0.0954 ... 0.0180 0.0084 0.0090 0.0032 R 1 0.0453 0.0523 0.0843 0.0689 0.1183 ... 0.0140 0.0049 0.0052 0.0044 R 2 0.0262 0.0582 0.1099 0.1083 0.0974 ... 0.0316 0.0164 0.0095 0.0078 R 3 0.0100 0.0171 0.0623 0.0205 0.0205 ... 0.0050 0.0044 0.0040 0.0117 R 4 0.0762 0.0666 0.0481 0.0394 0.0590 ... 0.0072 0.0048 0.0107 0.0094 R

Next, we can use PyCaret to evaluate and compare a suite of standard machine learning algorithms to quickly discover what works well on this dataset.

In this section, we will evaluate and compare the performance of standard machine learning models on the Sonar classification dataset.

First, we must set the dataset with the PyCaret library via the setup() function. This requires that we provide the Pandas DataFrame and specify the name of the column that contains the target variable.

The *setup()* function also allows you to configure simple data preparation, such as scaling, power transforms, missing data handling, and PCA transforms.

We will specify the data, target variable, and turn off HTML output, verbose output, and requests for user feedback.

... # setup the dataset grid = setup(data=df, target=df.columns[-1], html=False, silent=True, verbose=False)

Next, we can compare standard machine learning models by calling the *compare_models()* function.

By default, it will evaluate models using 10-fold cross-validation, sort results by classification accuracy, and return the single best model.

These are good defaults, and we don’t need to change a thing.

... # evaluate models and compare models best = compare_models()

Call the *compare_models()* function will also report a table of results summarizing all of the models that were evaluated and their performance.

Finally, we can report the best-performing model and its configuration.

Tying this together, the complete example of evaluating a suite of standard models on the Sonar classification dataset is listed below.

# compare machine learning algorithms on the sonar classification dataset from pandas import read_csv from pycaret.classification import setup from pycaret.classification import compare_models # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv' # load the dataset df = read_csv(url, header=None) # set column names as the column number n_cols = df.shape[1] df.columns = [str(i) for i in range(n_cols)] # setup the dataset grid = setup(data=df, target=df.columns[-1], html=False, silent=True, verbose=False) # evaluate models and compare models best = compare_models() # report the best model print(best)

Running the example will load the dataset, configure the PyCaret library, evaluate a suite of standard models, and report the best model found for the dataset.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the “*Extra Trees Classifier*” has the best accuracy on the dataset with a score of about 86.95 percent.

We can then see the configuration of the model that was used, which looks like it used default hyperparameter values.

Model Accuracy AUC Recall Prec. F1 \ 0 Extra Trees Classifier 0.8695 0.9497 0.8571 0.8778 0.8631 1 CatBoost Classifier 0.8695 0.9548 0.8143 0.9177 0.8508 2 Light Gradient Boosting Machine 0.8219 0.9096 0.8000 0.8327 0.8012 3 Gradient Boosting Classifier 0.8010 0.8801 0.7690 0.8110 0.7805 4 Ada Boost Classifier 0.8000 0.8474 0.7952 0.8071 0.7890 5 K Neighbors Classifier 0.7995 0.8613 0.7405 0.8276 0.7773 6 Extreme Gradient Boosting 0.7995 0.8934 0.7833 0.8095 0.7802 7 Random Forest Classifier 0.7662 0.8778 0.6976 0.8024 0.7345 8 Decision Tree Classifier 0.7533 0.7524 0.7119 0.7655 0.7213 9 Ridge Classifier 0.7448 0.0000 0.6952 0.7574 0.7135 10 Naive Bayes 0.7214 0.8159 0.8286 0.6700 0.7308 11 SVM - Linear Kernel 0.7181 0.0000 0.6286 0.7146 0.6309 12 Logistic Regression 0.7100 0.8104 0.6357 0.7263 0.6634 13 Linear Discriminant Analysis 0.6924 0.7510 0.6667 0.6762 0.6628 14 Quadratic Discriminant Analysis 0.5800 0.6308 0.1095 0.5000 0.1750 Kappa MCC TT (Sec) 0 0.7383 0.7446 0.1415 1 0.7368 0.7552 1.9930 2 0.6410 0.6581 0.0134 3 0.5989 0.6090 0.1413 4 0.5979 0.6123 0.0726 5 0.5957 0.6038 0.0019 6 0.5970 0.6132 0.0287 7 0.5277 0.5438 0.1107 8 0.5028 0.5192 0.0035 9 0.4870 0.5003 0.0030 10 0.4488 0.4752 0.0019 11 0.4235 0.4609 0.0024 12 0.4143 0.4285 0.0059 13 0.3825 0.3927 0.0034 14 0.1172 0.1792 0.0033 ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1, oob_score=False, random_state=2728, verbose=0, warm_start=False)

We could use this configuration directly and fit a model on the entire dataset and use it to make predictions on new data.

We can also use the table of results to get an idea of the types of models that perform well on the dataset, in this case, ensembles of decision trees.

Now that we are familiar with how to compare machine learning models using PyCaret, let’s look at how we might use the library to tune model hyperparameters.

In this section, we will tune the hyperparameters of a machine learning model on the Sonar classification dataset.

We must load and set up the dataset as we did before when comparing models.

... # setup the dataset grid = setup(data=df, target=df.columns[-1], html=False, silent=True, verbose=False)

We can tune model hyperparameters using the *tune_model()* function in the PyCaret library.

The function takes an instance of the model to tune as input and knows what hyperparameters to tune automatically. A random search of model hyperparameters is performed and the total number of evaluations can be controlled via the “*n_iter*” argument.

By default, the function will optimize the ‘*Accuracy*‘ and will evaluate the performance of each configuration using 10-fold cross-validation, although this sensible default configuration can be changed.

We can perform a random search of the extra trees classifier as follows:

... # tune model hyperparameters best = tune_model(ExtraTreesClassifier(), n_iter=200)

The function will return the best-performing model, which can be used directly or printed to determine the hyperparameters that were selected.

It will also print a table of the results for the best configuration across the number of folds in the k-fold cross-validation (e.g. 10 folds).

Tying this together, the complete example of tuning the hyperparameters of the extra trees classifier on the Sonar dataset is listed below.

# tune model hyperparameters on the sonar classification dataset from pandas import read_csv from sklearn.ensemble import ExtraTreesClassifier from pycaret.classification import setup from pycaret.classification import tune_model # define the location of the dataset url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv' # load the dataset df = read_csv(url, header=None) # set column names as the column number n_cols = df.shape[1] df.columns = [str(i) for i in range(n_cols)] # setup the dataset grid = setup(data=df, target=df.columns[-1], html=False, silent=True, verbose=False) # tune model hyperparameters best = tune_model(ExtraTreesClassifier(), n_iter=200, choose_better=True) # report the best model print(best)

Running the example first loads the dataset and configures the PyCaret library.

A grid search is then performed reporting the performance of the best-performing configuration across the 10 folds of cross-validation and the mean accuracy.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the random search found a configuration with an accuracy of about 75.29 percent, which is not better than the default configuration from the previous section that achieved a score of about 86.95 percent.

Accuracy AUC Recall Prec. F1 Kappa MCC 0 0.8667 1.0000 1.0000 0.7778 0.8750 0.7368 0.7638 1 0.6667 0.8393 0.4286 0.7500 0.5455 0.3119 0.3425 2 0.6667 0.8036 0.2857 1.0000 0.4444 0.2991 0.4193 3 0.7333 0.7321 0.4286 1.0000 0.6000 0.4444 0.5345 4 0.6667 0.5714 0.2857 1.0000 0.4444 0.2991 0.4193 5 0.8571 0.8750 0.6667 1.0000 0.8000 0.6957 0.7303 6 0.8571 0.9583 0.6667 1.0000 0.8000 0.6957 0.7303 7 0.7857 0.8776 0.5714 1.0000 0.7273 0.5714 0.6325 8 0.6429 0.7959 0.2857 1.0000 0.4444 0.2857 0.4082 9 0.7857 0.8163 0.5714 1.0000 0.7273 0.5714 0.6325 Mean 0.7529 0.8270 0.5190 0.9528 0.6408 0.4911 0.5613 SD 0.0846 0.1132 0.2145 0.0946 0.1571 0.1753 0.1485 ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='gini', max_depth=1, max_features='auto', max_leaf_nodes=None, max_samples=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=4, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=120, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)

We might be able to improve upon the grid search by specifying to the *tune_model()* function what hyperparameters to search and what ranges to search.

This section provides more resources on the topic if you are looking to go deeper.

In this tutorial, you discovered the PyCaret Python open source library for machine learning.

Specifically, you learned:

- PyCaret is a Python version of the popular and widely used caret machine learning package in R.
- How to use PyCaret to easily evaluate and compare standard machine learning models on a dataset.
- How to use PyCaret to easily tune the hyperparameters of a well-performing machine learning model.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to PyCaret for Machine Learning appeared first on Machine Learning Mastery.

]]>The post How to Identify Overfitting Machine Learning Models in Scikit-Learn appeared first on Machine Learning Mastery.

]]>An analysis of learning dynamics can help to identify whether a model has overfit the training dataset and may suggest an alternate configuration to use that could result in better predictive performance.

Performing an analysis of learning dynamics is straightforward for algorithms that learn incrementally, like neural networks, but it is less clear how we might perform the same analysis with other algorithms that do not learn incrementally, such as decision trees, k-nearest neighbors, and other general algorithms in the scikit-learn machine learning library.

In this tutorial, you will discover how to identify overfitting for machine learning models in Python.

After completing this tutorial, you will know:

- Overfitting is a possible cause of poor generalization performance of a predictive model.
- Overfitting can be analyzed for machine learning models by varying key model hyperparameters.
- Although overfitting is a useful tool for analysis, it must not be confused with model selection.

Let’s get started.

This tutorial is divided into five parts; they are:

- What Is Overfitting
- How to Perform an Overfitting Analysis
- Example of Overfitting in Scikit-Learn
- Counterexample of Overfitting in Scikit-Learn
- Separate Overfitting Analysis From Model Selection

Overfitting refers to an unwanted behavior of a machine learning algorithm used for predictive modeling.

It is the case where model performance on the training dataset is improved at the cost of worse performance on data not seen during training, such as a holdout test dataset or new data.

We can identify if a machine learning model has overfit by first evaluating the model on the training dataset and then evaluating the same model on a holdout test dataset.

If the performance of the model on the training dataset is significantly better than the performance on the test dataset, then the model may have overfit the training dataset.

We care about overfitting because it is a common cause for “*poor generalization*” of the model as measured by high “*generalization error*.” That is error made by the model when making predictions on new data.

This means, if our model has poor performance, maybe it is because it has overfit.

But what does it mean if a model’s performance is “*significantly better*” on the training set compared to the test set?

For example, it is common and perhaps normal for the model to have better performance on the training set than the test set.

As such, we can perform an analysis of the algorithm on the dataset to better expose the overfitting behavior.

An overfitting analysis is an approach for exploring how and when a specific model is overfitting on a specific dataset.

It is a tool that can help you learn more about the learning dynamics of a machine learning model.

This might be achieved by reviewing the model behavior during a single run for algorithms like neural networks that are fit on the training dataset incrementally.

A plot of the model performance on the train and test set can be calculated at each point during training and plots can be created. This plot is often called a learning curve plot, showing one curve for model performance on the training set and one curve for the test set for each increment of learning.

If you would like to learn more about learning curves for algorithms that learn incrementally, see the tutorial:

The common pattern for overfitting can be seen on learning curve plots, where model performance on the training dataset continues to improve (e.g. loss or error continues to fall or accuracy continues to rise) and performance on the test or validation set improves to a point and then begins to get worse.

If this pattern is observed, then training should stop at that point where performance gets worse on the test or validation set for algorithms that learn incrementally

This makes sense for algorithms that learn incrementally like neural networks, but what about other algorithms?

**How do you perform an overfitting analysis for machine learning algorithms in scikit-learn?**

One approach for performing an overfitting analysis on algorithms that do not learn incrementally is by varying a key model hyperparameter and evaluating the model performance on the train and test sets for each configuration.

To make this clear, let’s explore a case of analyzing a model for overfitting in the next section.

In this section, we will look at an example of overfitting a machine learning model to a training dataset.

First, let’s define a synthetic classification dataset.

We will use the make_classification() function to define a binary (two class) classification prediction problem with 10,000 examples (rows) and 20 input features (columns).

The example below creates the dataset and summarizes the shape of the input and output components.

# synthetic classification dataset from sklearn.datasets import make_classification # define dataset X, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1) # summarize the dataset print(X.shape, y.shape)

Running the example creates the dataset and reports the shape, confirming our expectations.

(10000, 20) (10000,)

Next, we need to split the dataset into train and test subsets.

We will use the train_test_split() function and split the data into 70 percent for training a model and 30 percent for evaluating it.

# split a dataset into train and test sets from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split # create dataset X, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1) # split into train test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # summarize the shape of the train and test sets print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example splits the dataset and we can confirm that we have 7,000 examples for training and 3,000 for evaluating a model.

(7000, 20) (3000, 20) (7000,) (3000,)

Next, we can explore a machine learning model overfitting the training dataset.

We will use a decision tree via the DecisionTreeClassifier and test different tree depths with the “*max_depth*” argument.

Shallow decision trees (e.g. few levels) generally do not overfit but have poor performance (high bias, low variance). Whereas deep trees (e.g. many levels) generally do overfit and have good performance (low bias, high variance). A desirable tree is one that is not so shallow that it has low skill and not so deep that it overfits the training dataset.

We evaluate decision tree depths from 1 to 20.

... # define the tree depths to evaluate values = [i for i in range(1, 21)]

We will enumerate each tree depth, fit a tree with a given depth on the training dataset, then evaluate the tree on both the train and test sets.

The expectation is that as the depth of the tree increases, performance on train and test will improve to a point, and as the tree gets too deep, it will begin to overfit the training dataset at the expense of worse performance on the holdout test set.

... # evaluate a decision tree for each depth for i in values: # configure the model model = DecisionTreeClassifier(max_depth=i) # fit model on the training dataset model.fit(X_train, y_train) # evaluate on the train dataset train_yhat = model.predict(X_train) train_acc = accuracy_score(y_train, train_yhat) train_scores.append(train_acc) # evaluate on the test dataset test_yhat = model.predict(X_test) test_acc = accuracy_score(y_test, test_yhat) test_scores.append(test_acc) # summarize progress print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc))

At the end of the run, we will then plot all model accuracy scores on the train and test sets for visual comparison.

... # plot of train and test scores vs tree depth pyplot.plot(values, train_scores, '-o', label='Train') pyplot.plot(values, test_scores, '-o', label='Test') pyplot.legend() pyplot.show()

Tying this together, the complete example of exploring different tree depths on the synthetic binary classification dataset is listed below.

# evaluate decision tree performance on train and test sets with different tree depths from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.tree import DecisionTreeClassifier from matplotlib import pyplot # create dataset X, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1) # split into train test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # define lists to collect scores train_scores, test_scores = list(), list() # define the tree depths to evaluate values = [i for i in range(1, 21)] # evaluate a decision tree for each depth for i in values: # configure the model model = DecisionTreeClassifier(max_depth=i) # fit model on the training dataset model.fit(X_train, y_train) # evaluate on the train dataset train_yhat = model.predict(X_train) train_acc = accuracy_score(y_train, train_yhat) train_scores.append(train_acc) # evaluate on the test dataset test_yhat = model.predict(X_test) test_acc = accuracy_score(y_test, test_yhat) test_scores.append(test_acc) # summarize progress print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc)) # plot of train and test scores vs tree depth pyplot.plot(values, train_scores, '-o', label='Train') pyplot.plot(values, test_scores, '-o', label='Test') pyplot.legend() pyplot.show()

Running the example fits and evaluates a decision tree on the train and test sets for each tree depth and reports the accuracy scores.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see a trend of increasing accuracy on the training dataset with the tree depth to a point around a depth of 19-20 levels where the tree fits the training dataset perfectly.

We can also see that the accuracy on the test set improves with tree depth until a depth of about eight or nine levels, after which accuracy begins to get worse with each increase in tree depth.

This is exactly what we would expect to see in a pattern of overfitting.

We would choose a tree depth of eight or nine before the model begins to overfit the training dataset.

>1, train: 0.769, test: 0.761 >2, train: 0.808, test: 0.804 >3, train: 0.879, test: 0.878 >4, train: 0.902, test: 0.896 >5, train: 0.915, test: 0.903 >6, train: 0.929, test: 0.918 >7, train: 0.942, test: 0.921 >8, train: 0.951, test: 0.924 >9, train: 0.959, test: 0.926 >10, train: 0.968, test: 0.923 >11, train: 0.977, test: 0.925 >12, train: 0.983, test: 0.925 >13, train: 0.987, test: 0.926 >14, train: 0.992, test: 0.921 >15, train: 0.995, test: 0.920 >16, train: 0.997, test: 0.913 >17, train: 0.999, test: 0.918 >18, train: 0.999, test: 0.918 >19, train: 1.000, test: 0.914 >20, train: 1.000, test: 0.913

A figure is also created that shows line plots of the model accuracy on the train and test sets with different tree depths.

The plot clearly shows that increasing the tree depth in the early stages results in a corresponding improvement in both train and test sets.

This continues until a depth of around 10 levels, after which the model is shown to overfit the training dataset at the cost of worse performance on the holdout dataset.

This analysis is interesting. It shows why the model has a worse hold-out test set performance when “*max_depth*” is set to large values.

But it is not required.

We can just as easily choose a “*max_depth*” using a grid search without performing an analysis on why some values result in better performance and some result in worse performance.

In fact, in the next section, we will show where this analysis can be misleading.

Sometimes, we may perform an analysis of machine learning model behavior and be deceived by the results.

A good example of this is varying the number of neighbors for the k-nearest neighbors algorithms, which we can implement using the KNeighborsClassifier class and configure via the “*n_neighbors*” argument.

Let’s forget how KNN works for the moment.

We can perform the same analysis of the KNN algorithm as we did in the previous section for the decision tree and see if our model overfits for different configuration values. In this case, we will vary the number of neighbors from 1 to 50 to get more of the effect.

The complete example is listed below.

# evaluate knn performance on train and test sets with different numbers of neighbors from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split from sklearn.metrics import accuracy_score from sklearn.neighbors import KNeighborsClassifier from matplotlib import pyplot # create dataset X, y = make_classification(n_samples=10000, n_features=20, n_informative=5, n_redundant=15, random_state=1) # split into train test sets X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # define lists to collect scores train_scores, test_scores = list(), list() # define the tree depths to evaluate values = [i for i in range(1, 51)] # evaluate a decision tree for each depth for i in values: # configure the model model = KNeighborsClassifier(n_neighbors=i) # fit model on the training dataset model.fit(X_train, y_train) # evaluate on the train dataset train_yhat = model.predict(X_train) train_acc = accuracy_score(y_train, train_yhat) train_scores.append(train_acc) # evaluate on the test dataset test_yhat = model.predict(X_test) test_acc = accuracy_score(y_test, test_yhat) test_scores.append(test_acc) # summarize progress print('>%d, train: %.3f, test: %.3f' % (i, train_acc, test_acc)) # plot of train and test scores vs number of neighbors pyplot.plot(values, train_scores, '-o', label='Train') pyplot.plot(values, test_scores, '-o', label='Test') pyplot.legend() pyplot.show()

Running the example fits and evaluates a KNN model on the train and test sets for each number of neighbors and reports the accuracy scores.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

Recall, we are looking for a pattern where performance on the test set improves and then starts to get worse, and performance on the training set continues to improve.

We do not see this pattern.

Instead, we see that accuracy on the training dataset starts at perfect accuracy and falls with almost every increase in the number of neighbors.

We also see that performance of the model on the holdout test improves to a value of about five neighbors, holds level and begins a downward trend after that.

>1, train: 1.000, test: 0.919 >2, train: 0.965, test: 0.916 >3, train: 0.962, test: 0.932 >4, train: 0.957, test: 0.932 >5, train: 0.954, test: 0.935 >6, train: 0.953, test: 0.934 >7, train: 0.952, test: 0.932 >8, train: 0.951, test: 0.933 >9, train: 0.949, test: 0.933 >10, train: 0.950, test: 0.935 >11, train: 0.947, test: 0.934 >12, train: 0.947, test: 0.933 >13, train: 0.945, test: 0.932 >14, train: 0.945, test: 0.932 >15, train: 0.944, test: 0.932 >16, train: 0.944, test: 0.934 >17, train: 0.943, test: 0.932 >18, train: 0.943, test: 0.935 >19, train: 0.942, test: 0.933 >20, train: 0.943, test: 0.935 >21, train: 0.942, test: 0.933 >22, train: 0.943, test: 0.933 >23, train: 0.941, test: 0.932 >24, train: 0.942, test: 0.932 >25, train: 0.942, test: 0.931 >26, train: 0.941, test: 0.930 >27, train: 0.941, test: 0.932 >28, train: 0.939, test: 0.932 >29, train: 0.938, test: 0.931 >30, train: 0.938, test: 0.931 >31, train: 0.937, test: 0.931 >32, train: 0.938, test: 0.931 >33, train: 0.937, test: 0.930 >34, train: 0.938, test: 0.931 >35, train: 0.937, test: 0.930 >36, train: 0.937, test: 0.928 >37, train: 0.936, test: 0.930 >38, train: 0.937, test: 0.930 >39, train: 0.935, test: 0.929 >40, train: 0.936, test: 0.929 >41, train: 0.936, test: 0.928 >42, train: 0.936, test: 0.929 >43, train: 0.936, test: 0.930 >44, train: 0.935, test: 0.929 >45, train: 0.935, test: 0.929 >46, train: 0.934, test: 0.929 >47, train: 0.935, test: 0.929 >48, train: 0.934, test: 0.929 >49, train: 0.934, test: 0.929 >50, train: 0.934, test: 0.929

A figure is also created that shows line plots of the model accuracy on the train and test sets with different numbers of neighbors.

The plots make the situation clearer. It looks as though the line plot for the training set is dropping to converge with the line for the test set. Indeed, this is exactly what is happening.

Now, recall how KNN works.

The “*model*” is really just the entire training dataset stored in an efficient data structure. Skill for the “*model*” on the training dataset should be 100 percent and anything less is unforgivable.

In fact, this argument holds for any machine learning algorithm and slices to the core of the confusion around overfitting for beginners.

Overfitting can be an explanation for poor performance of a predictive model.

Creating learning curve plots that show the learning dynamics of a model on the train and test dataset is a helpful analysis for learning more about a model on a dataset.

**But overfitting should not be confused with model selection.**

We choose a predictive model or model configuration based on its out-of-sample performance. That is, its performance on new data not seen during training.

The reason we do this is that in predictive modeling, we are primarily interested in a model that makes skillful predictions. We want the model that can make the best possible predictions given the time and computational resources we have available.

This might mean we choose a model that looks like it has overfit the training dataset. In which case, an overfit analysis might be misleading.

It might also mean that the model has poor or terrible performance on the training dataset.

In general, if we cared about model performance on the training dataset in model selection, then we would expect a model to have perfect performance on the training dataset. It’s data we have available; we should not tolerate anything less.

As we saw with the KNN example above, we can achieve perfect performance on the training set by storing the training set directly and returning predictions with one neighbor at the cost of poor performance on any new data.

**Wouldn’t a model that performs well on both train and test datasets be a better model?**

Maybe. But, maybe not.

This argument is based on the idea that a model that performs well on both train and test sets has a better understanding of the underlying problem.

A corollary is that a model that performs well on the test set but poor on the training set is lucky (e.g. a statistical fluke) and a model that performs well on the train set but poor on the test set is overfit.

I believe this is the sticking point for beginners that often ask how to fix overfitting for their scikit-learn machine learning model.

The worry is that a model must perform well on both train and test sets, otherwise, they are in trouble.

**This is not the case**.

Performance on the training set is not relevant during model selection. You must focus on the out-of-sample performance only when choosing a predictive model.

This section provides more resources on the topic if you are looking to go deeper.

- How to Avoid Overfitting in Deep Learning Neural Networks
- Overfitting and Underfitting With Machine Learning Algorithms
- How to use Learning Curves to Diagnose Machine Learning Model Performance

- sklearn.datasets.make_classification API.
- sklearn.model_selection.train_test_split API.
- sklearn.tree.DecisionTreeClassifier API.
- sklearn.neighbors.KNeighborsClassifier API.

In this tutorial, you discovered how to identify overfitting for machine learning models in Python.

Specifically, you learned:

- Overfitting is a possible cause of poor generalization performance of a predictive model.
- Overfitting can be analyzed for machine learning models by varying key model hyperparameters.
- Although overfitting is a useful tool for analysis, it must not be confused with model selection.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post How to Identify Overfitting Machine Learning Models in Scikit-Learn appeared first on Machine Learning Mastery.

]]>