A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library

By Jason Brownlee on August 16, 2020 in Python Machine Learning

If you are a Python programmer, or you are looking for a robust library to bring machine learning into a production system, then scikit-learn is a library you will want to seriously consider.

In this post you will get an overview of the scikit-learn library and useful references for learning more.

Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Where did it come from?

Scikit-learn was initially developed by David Cournapeau as a Google Summer of Code project in 2007.

Later Matthieu Brucher joined the project and started to use it as a part of his thesis work. In 2010 INRIA got involved and the first public release (v0.1 beta) was published in late January 2010.

The project now has more than 30 active contributors and has had paid sponsorship from INRIA, Google, Tinyclues and the Python Software Foundation.

Scikit-learn Homepage

What is scikit-learn?

Scikit-learn provides a range of supervised and unsupervised learning algorithms via a consistent interface in Python.
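
To make the "consistent interface" concrete, here is a minimal sketch. The choice of estimators and their settings (n_neighbors, max_iter, n_clusters, random_state) are my own assumptions for illustration and are not taken from the library's documentation. The point is only that very different algorithms are driven by the same fit and predict calls.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Three different supervised algorithms, one calling convention.
estimators = [
    DecisionTreeClassifier(random_state=0),
    KNeighborsClassifier(n_neighbors=5),
    LogisticRegression(max_iter=1000),
]

training_accuracy = {}
for estimator in estimators:
    estimator.fit(X, y)                 # learn from features and labels
    predictions = estimator.predict(X)  # predict a class label per row
    name = type(estimator).__name__
    # score() returns mean accuracy; here it is measured on the training data
    training_accuracy[name] = estimator.score(X, y)
    print(name, round(training_accuracy[name], 3))

# An unsupervised model follows the same pattern, just without labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(X)
cluster_labels = kmeans.predict(X)      # a cluster index per row
print(cluster_labels[:10])
```

Swapping one algorithm for another is usually a one-line change, which is a large part of what makes the library easy to experiment with.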

It is licensed under a permissive simplified BSD license and is distributed with many Linux distributions, encouraging academic and commercial use.

The library is built upon SciPy (Scientific Python), which must be installed before you can use scikit-learn. This stack includes:

NumPy: Base n-dimensional array package
SciPy: Fundamental library for scientific computing
Matplotlib: Comprehensive 2D/3D plotting
IPython: Enhanced interactive console
Sympy: Symbolic mathematics
Pandas: Data structures and analysis
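
A quick way to confirm that the core of the stack is importable is to print the installed versions. This is only a sketch: it checks NumPy, SciPy and scikit-learn, and assumes the other packages in the list are optional for your work.

```python
import numpy
import scipy
import sklearn

# Collect the version string each package reports about itself.
versions = {
    module.__name__: module.__version__
    for module in (numpy, scipy, sklearn)
}

for name, version in versions.items():
    print(name, version)
```

If any of these imports fails with an ImportError, that package still needs to be installed.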

Extensions or modules for SciPy are conventionally named SciKits. As such, the module that provides learning algorithms is named scikit-learn.

The vision for the library is a level of robustness and support required for use in production systems. This means a deep focus on concerns such as ease of use, code quality, collaboration, documentation and performance.

Although the interface is Python, C libraries are leveraged for performance, such as NumPy for array and matrix operations, LAPACK, LibSVM and the careful use of Cython.

What are the features?

The library is focused on modeling data. It is not focused on loading, manipulating and summarizing data. For these features, refer to NumPy and Pandas.
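
As a small illustration of that division of labour, the sketch below uses NumPy to summarize the bundled Iris data. The dataset and the statistics chosen are my own assumptions for the example. scikit-learn is used only to fetch the arrays.

```python
import numpy as np
from sklearn.datasets import load_iris

dataset = load_iris()
X = dataset.data  # a plain NumPy array with one row per flower

# Per-feature summary statistics are computed with NumPy, not scikit-learn.
means = X.mean(axis=0)
stds = X.std(axis=0)
for name, mean, std in zip(dataset.feature_names, means, stds):
    print(f"{name}: mean={mean:.2f}, std={std:.2f}")

# Class balance, again with NumPy.
classes, counts = np.unique(dataset.target, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```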

Screenshot taken from a demo of the mean-shift clustering algorithm

Some popular groups of models provided by scikit-learn include:

Clustering: for grouping unlabeled data, using algorithms such as KMeans.
Cross Validation: for estimating the performance of supervised models on unseen data.
Datasets: for test datasets and for generating datasets with specific properties for investigating model behavior.
Dimensionality Reduction: for reducing the number of attributes in data for summarization, visualization and feature selection, using methods such as principal component analysis.
Ensemble methods: for combining the predictions of multiple supervised models.
Feature extraction: for defining attributes in image and text data.
Feature selection: for identifying meaningful attributes from which to create supervised models.
Parameter Tuning: for getting the most out of supervised models.
Manifold Learning: for summarizing and depicting complex multi-dimensional data.
Supervised Models: a vast array not limited to generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
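
Two of the groups listed above, cross validation and parameter tuning, can be sketched in a few lines. The dataset, the number of folds and the grid of max_depth values are arbitrary choices of mine for illustration and are not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Cross validation: estimate accuracy on unseen data using 5 folds.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())

# Parameter tuning: cross-validate each candidate tree depth and keep the best.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 4, 5]},
    cv=5,
)
search.fit(X, y)
print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

Both helpers accept any estimator that follows the fit/predict convention, so the decision tree here could be replaced by another supervised model.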

Example: Classification and Regression Trees

I want to give you an example to show you how easy it is to use the library.

In this example, we use the Classification and Regression Trees (CART) decision tree algorithm to model the Iris flower dataset.

This dataset ships with the library as an example dataset and is loaded directly. The classifier is fit on the data and then predictions are made on the training data.

Finally, a classification report and a confusion matrix are printed.

# Sample Decision Tree Classifier
from sklearn import datasets
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# load the iris datasets
dataset = datasets.load_iris()
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(dataset.data, dataset.target)
print(model)
# make predictions
expected = dataset.target
predicted = model.predict(dataset.data)
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

Running this example produces the following output, showing you the details of the trained model, the skill of the model according to some common metrics and a confusion matrix.

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
             precision    recall  f1-score   support

          0       1.00      1.00      1.00        50
          1       1.00      1.00      1.00        50
          2       1.00      1.00      1.00        50

avg / total       1.00      1.00      1.00       150

[[50  0  0]
 [ 0 50  0]
 [ 0  0 50]]
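
The perfect scores above come from predicting the same rows the tree was trained on, so they are an optimistic picture of the model's skill. A minimal sketch of a fairer check is to hold back part of the data for testing. The 30% test size and the random_state values below are arbitrary assumptions of mine.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

dataset = datasets.load_iris()

# Hold out 30% of the rows, keeping the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data,
    dataset.target,
    test_size=0.3,
    random_state=1,
    stratify=dataset.target,
)

# Fit on the training rows only.
model = DecisionTreeClassifier(random_state=1)
model.fit(X_train, y_train)

# Evaluate on rows the model has not seen.
predicted = model.predict(X_test)
accuracy = metrics.accuracy_score(y_test, predicted)
confusion = metrics.confusion_matrix(y_test, predicted)
print(accuracy)
print(confusion)
```

Expect the held-out accuracy to be a little below the training figure; the exact number depends on the split.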

Who is using it?

The scikit-learn testimonials page lists Inria, Mendeley, wise.io, Evernote, Telecom ParisTech and AWeber as users of the library.

If this is a small sample of the companies that have presented on their use, then there are very likely tens to hundreds of larger organizations using the library.

It has good test coverage and managed releases and is suitable for prototype and production projects alike.

Resources

If you are interested in learning more, check out the scikit-learn homepage, which includes documentation and related resources.

You can get the code from the GitHub repository, and releases are historically available on the SourceForge project.

Documentation

I recommend starting out with the quick-start tutorial and flicking through the user guide and example gallery for algorithms that interest you.

Ultimately, scikit-learn is a library and the API reference will be the best documentation for getting things done.

Quick Start Tutorial http://scikit-learn.org/stable/tutorial/basic/tutorial.html
User Guide http://scikit-learn.org/stable/user_guide.html
API Reference http://scikit-learn.org/stable/modules/classes.html
Example Gallery http://scikit-learn.org/stable/auto_examples/index.html

Papers

If you are interested in more information about how the project started and its vision, there are some papers you may want to check out.

Books

If you are looking for a good book, I recommend “Building Machine Learning Systems with Python”. It’s well written and the examples are interesting.

23 Responses to A Gentle Introduction to Scikit-Learn: A Python Machine Learning Library

Joe McCarthy April 19, 2014 at 1:09 am #

This is a great overview of scikit-learn.

I recently learned about IPython Notebooks during a Strata 2014 session by Brian Granger, and have since found lots of valuable pythonic and machine learning resources provided through notebooks posted on GitHub and/or hosted on ipython.org.

Here are two I would recommend:

PyCon 2014 Scikit-learn Tutorial by Jake VanderPlas

Parallel Machine Learning with scikit-learn and IPython by Olivier Grisel (also offered at Strata 2014)

FWIW, I put together my own IPython Notebook on Python for Data Science, designed to provide a rapid on-ramp primer for people with knowledge of other programming languages to learn enough about Python to effectively use scikit-learn and other more advanced machine learning and scientific computing tools.

- jasonb April 19, 2014 at 5:20 am #
  
  Hey Joe, thanks for the links mate.
  
  Your own Python for Data Science notebook is amazing. It’s going to take me some time to digest fully. Thanks for sharing!
  
- Abhishek November 5, 2016 at 2:45 pm #
  
  Thanks for sharing!!
  
- Eugênio August 1, 2017 at 3:23 am #
  
  Thanks !
  
Martin May 8, 2014 at 5:50 am #

Two corrections:
It’s matplotlib not mathplotlib and that’ll do 3d plots as well as 2d.

- jasonb May 8, 2014 at 7:53 am #
  
  Thanks, fixed.
  
jai March 27, 2015 at 10:18 pm #

Thanks jasonb for providing such a valuable tutorial on ML,

Basically I am a biologist, and for the past 1-2 years I have been getting involved in machine learning. Presently I am working with scikit-learn and have some previous experience with WEKA 6, which is the best open source GUI-based tool for ML to the best of my understanding. In scikit-learn I am struggling badly at one point, i.e. feature selection. If I compare with WEKA, it provides various feature selection methods and the result gives you a list of selected descriptors which can be saved easily in the form of reduced data.

Can you provide any suggestion on how I can perform the same task with scikit-learn feature selection methods and come up with the list of names of the selected features?

Thanks

MB August 28, 2015 at 7:52 am #

Hi Jason,

I am wondering if you have run into this before. We have trained a model with training data and tested it with test data of 100 instances, for example, and we got around 70% accuracy. An interesting aspect of scikit-learn is that the predict function takes n_samples; this is fine when we are building and testing a model. But if I had to take this to production, I am having issues:
1. I can send only single request (instance) at a time. 2. If we test record by record, our accuracy drops to 30%. Do you have any idea why?

Robin White January 21, 2016 at 3:39 pm #

Cool!
And I would also like to introduce the course you can learn machine learning in Python http://www.thedevmasters.com/machine-learning-using-python/ I have taken that course before, then I could build my own library of Python scripts. I am sure that you will be satisfied with this bootcamp!

Alan May 18, 2017 at 8:27 pm #

Hi Jason,

I am very new to Machine Learning.

When I tried to paste the code (for Classification and Regression Tree) above in my Python, it said:

Traceback (most recent call last):
File "C:/Users/Desktop/DecisionTree.py", line 2, in <module>
from sklearn import datasets
ImportError: No module named 'sklearn'

It seemed like I need to install some package for my Python?

Thank you so much!!!

- Jason Brownlee May 19, 2017 at 8:19 am #
  
  Yes, you need to install sklearn.
  
  This tutorial will help get you started:
  https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
  
nandini January 31, 2018 at 4:52 pm #

Hi Jason,

What is the use of the sklearn-porter package? What is the main purpose of this module?

Please can you explain it.

- Jason Brownlee February 1, 2018 at 7:15 am #
  
  I’ve not heard of it sorry.
  
  - Balasubramanian Janakiraman May 22, 2022 at 8:07 pm #
    
    Hi Jason,
    
    It's amazing to read your books. I would request that you enhance the Machine Learning Mastery book with what to look for in model accuracy. For example, how good is linear regression, what parameters should we look at, and is there any scope for improvement on the model? How do we find out that we have hit max optimization on an ML algorithm? It would be great if you could point out where that is covered in any of your books.
    
    Regards,
    Bala.J
    
    - James Carmichael May 23, 2022 at 10:42 am #
      
      Thank you for the feedback!
      
Jesús Martínez March 27, 2018 at 2:01 am #

Hey, Jason! Nice overview of sklearn, one of the most intuitive, useful and popular machine learning libraries nowadays!

You made a little typo here: “Example: Classification and Regression Tress”. Maybe you meant “Trees”? 🙂

Keep up the good work!

- Jason Brownlee March 27, 2018 at 6:38 am #
  
  Thanks, fixed.
  
Michal January 14, 2020 at 11:06 pm #

Thanks Jason for a great article. I am familiar with ML and the basic methods of scikit-learn. Where can I find more advanced material about advanced functions like ensembles, feature selection and so on?

- Jason Brownlee January 15, 2020 at 8:26 am #
  
  Perhaps start here:
  https://machinelearningmastery.com/start-here/#python
  
Miriam November 19, 2020 at 4:05 pm #

Hi! I am an undergraduate college student looking to do my thesis project on predictive diagnostics. I was wondering if scikit is beginner-friendly and can be used at an undergraduate level?

- Jason Brownlee November 20, 2020 at 6:42 am #
  
  Yes, it is a great place to start.
  
A million thanks July 25, 2022 at 11:52 am #

Thank you very much Jason.
Very useful guidance
I am encouraged to learn M/L after reading your blog.

- James Carmichael July 26, 2022 at 8:34 am #
  
  You are very welcome! We greatly appreciate your support and feedback!
  