Python Machine Learning Mini-Course

From Developer to Machine Learning Practitioner in 14 Days

Python is one of the fastest-growing platforms for applied machine learning.

In this mini-course, you will discover how you can get started, build accurate models and confidently complete predictive modeling machine learning projects using Python in 14 days.

This is a big and important post. You might want to bookmark it.

Let’s get started.

Update Oct/2016: Updated examples for sklearn v0.18.

Photo by Dave Young, some rights reserved.

Who Is This Mini-Course For?

Before we get started, let’s make sure you are in the right place.

The list below provides some general guidelines as to who this course was designed for.

Don’t panic if you don’t match these points exactly; you might just need to brush up in one area or another to keep up.

  • Developers who know how to write a little code. This means that it is not a big deal for you to pick up a new programming language like Python once you know the basic syntax. It does not mean you’re a wizard coder, just that you can follow a basic C-like language with little effort.
  • Developers who know a little machine learning. This means you know the basics of machine learning like cross-validation, some algorithms and the bias-variance trade-off. It does not mean that you are a machine learning Ph.D., just that you know the landmarks or know where to look them up.

This mini-course is neither a textbook on Python nor a textbook on machine learning.

It will take you from a developer who knows a little machine learning to a developer who can get results using the Python ecosystem, the rising platform for professional machine learning.


Mini-Course Overview

This mini-course is broken down into 14 lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hard core!). It really depends on the time you have available and your level of enthusiasm.

Below are 14 lessons that will get you started and productive with machine learning in Python:

  • Lesson 1: Download and Install Python and SciPy ecosystem.
  • Lesson 2: Get Around In Python, NumPy, Matplotlib and Pandas.
  • Lesson 3: Load Data From CSV.
  • Lesson 4: Understand Data with Descriptive Statistics.
  • Lesson 5: Understand Data with Visualization.
  • Lesson 6: Prepare For Modeling by Pre-Processing Data.
  • Lesson 7: Algorithm Evaluation With Resampling Methods.
  • Lesson 8: Algorithm Evaluation Metrics.
  • Lesson 9: Spot-Check Algorithms.
  • Lesson 10: Model Comparison and Selection.
  • Lesson 11: Improve Accuracy with Algorithm Tuning.
  • Lesson 12: Improve Accuracy with Ensemble Predictions.
  • Lesson 13: Finalize And Save Your Model.
  • Lesson 14: Hello World End-to-End Project.

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions and even post results in the comments below.

The lessons expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to look for help on the Python platform (hint: I have all of the answers directly on this blog; use the search feature).

I do provide more help in the early lessons because I want you to build up some confidence and momentum.

Hang in there, don’t give up!

Lesson 1: Download and Install Python and SciPy

You cannot get started with machine learning in Python until you have access to the platform.

Today’s lesson is easy: you must download and install the Python 2.7 platform on your computer.

Visit the Python homepage and download Python for your operating system (Linux, OS X or Windows). Install Python on your computer. You may need to use a platform specific package manager such as macports on OS X or yum on RedHat Linux.

You also need to install the SciPy platform and the scikit-learn library. I recommend using the same approach that you used to install Python.

You can install everything at once (much easier) with Anaconda, which is recommended for beginners.

Start Python for the first time by typing “python” at the command line.

Check the versions of everything you are going to need using the code below:
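A check along the following lines does the job (a minimal sketch; it relies only on the standard `__version__` attribute that each of these libraries exposes):

```python
# Check the versions of Python and the key SciPy-ecosystem libraries.
import sys
print("Python: %s" % sys.version)

import scipy
print("scipy: %s" % scipy.__version__)

import numpy
print("numpy: %s" % numpy.__version__)

import matplotlib
print("matplotlib: %s" % matplotlib.__version__)

import pandas
print("pandas: %s" % pandas.__version__)

import sklearn
print("sklearn: %s" % sklearn.__version__)
```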

If there are any errors, stop.

Now is the time to fix them.

Lesson 2: Get Around In Python, NumPy, Matplotlib and Pandas

You need to be able to read and write basic Python scripts.

As a developer, you can pick up new programming languages pretty quickly. Python is case sensitive, uses the hash character (#) for comments and uses whitespace to indicate code blocks (whitespace matters).

Today’s task is to practice the basic syntax of the Python programming language and important SciPy data structures in the Python interactive environment.

  • Practice assignment, working with lists and flow control in Python.
  • Practice working with NumPy arrays.
  • Practice creating simple plots in Matplotlib.
  • Practice working with Pandas Series and DataFrames.

For example, below is a simple example of creating a Pandas DataFrame.
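A minimal sketch: build a small NumPy array, then wrap it in a DataFrame with named rows and columns (the row and column names here are arbitrary illustrations):

```python
import numpy
import pandas

# Create a 2x3 NumPy array of data.
myarray = numpy.array([[1, 2, 3], [4, 5, 6]])
# Name the rows and columns, then wrap the array in a DataFrame.
rownames = ['a', 'b']
colnames = ['one', 'two', 'three']
mydataframe = pandas.DataFrame(myarray, index=rownames, columns=colnames)
print(mydataframe)
```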

Lesson 3: Load Data From CSV

Machine learning algorithms need data. You can load your own data from CSV files, but when you are getting started with machine learning in Python, you should practice on standard machine learning datasets.

Your task for today’s lesson is to get comfortable loading data into Python and to find and load standard machine learning datasets.

There are many excellent standard machine learning datasets in CSV format that you can download and practice with on the UCI machine learning repository.

  • Practice loading CSV files into Python using the csv.reader() function in the standard library.
  • Practice loading CSV files using NumPy and the numpy.loadtxt() function.
  • Practice loading CSV files using Pandas and the pandas.read_csv() function.

To get you started, below is a snippet that will load the Pima Indians onset of diabetes dataset using Pandas directly from the UCI Machine Learning Repository.
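A sketch of such a snippet (note: the original shortlink to the dataset has expired, so the URL below is an assumption, a commonly used public mirror; substitute your own copy of the CSV if it moves):

```python
from pandas import read_csv

# Load the Pima Indians onset of diabetes dataset directly from a URL.
# NOTE: this mirror URL is an assumption; point it at your own copy if needed.
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)
# Confirm the expected 768 rows and 9 columns.
print(data.shape)
```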

Well done for making it this far! Hang in there.

Any questions so far? Ask in the comments.

Lesson 4: Understand Data with Descriptive Statistics

Once you have loaded your data into Python you need to be able to understand it.

The better you can understand your data, the better and more accurate the models that you can build. The first step to understanding your data is to use descriptive statistics.

Today your lesson is to learn how to use descriptive statistics to understand your data. I recommend using the helper functions provided on the Pandas DataFrame.

  • Understand your data using the head() function to look at the first few rows.
  • Review the dimensions of your data with the shape property.
  • Look at the data types for each attribute with the dtypes property.
  • Review the distribution of your data with the describe() function.
  • Calculate pairwise correlation between your variables using the corr() function.

The below example loads the Pima Indians onset of diabetes dataset and summarizes the distribution of each attribute.
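A sketch of that example using the describe() helper (the dataset URL is an assumed public mirror; swap in your own copy if needed):

```python
from pandas import read_csv

# Load the Pima Indians dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)
# Summarize each attribute: count, mean, std, min, quartiles and max.
description = data.describe()
print(description)
```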

Try it out!

Lesson 5: Understand Data with Visualization

Continuing on from yesterday’s lesson, you must spend time to better understand your data.

A second way to improve your understanding of your data is by using data visualization techniques (e.g. plotting).

Today, your lesson is to learn how to use plotting in Python to understand attributes alone and their interactions. Again, I recommend using the helper functions provided on the Pandas DataFrame.

  • Use the hist() function to create a histogram of each attribute.
  • Use the plot(kind='box') function to create box-and-whisker plots of each attribute.
  • Use the pandas.plotting.scatter_matrix() function to create pairwise scatterplots of all attributes.

For example, the snippet below will load the diabetes dataset and create a scatterplot matrix of the dataset.
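A sketch of that snippet (the dataset URL is an assumed public mirror; note that scatter_matrix lives in pandas.plotting in current pandas releases):

```python
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot

# Load the Pima Indians dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(url, names=names)
# Draw a grid of pairwise scatterplots (histograms on the diagonal).
axes = scatter_matrix(data)
pyplot.show()
```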

Sample Scatter Plot Matrix

Lesson 6: Prepare For Modeling by Pre-Processing Data

Your raw data may not be in the best shape for modeling.

Sometimes you need to preprocess your data in order to best present the inherent structure of the problem in your data to the modeling algorithms. In today’s lesson, you will use the pre-processing capabilities provided by scikit-learn.

The scikit-learn library provides two standard idioms for transforming data, each useful in different circumstances: Fit and Multiple Transform, and Combined Fit-and-Transform.

There are many techniques that you can use to prepare your data for modeling. For example, try out some of the following:

  • Standardize numerical data (e.g. mean of 0 and standard deviation of 1) using the scale and center options.
  • Normalize numerical data (e.g. to a range of 0-1) using the range option.
  • Explore more advanced feature engineering such as Binarizing.

For example, the snippet below loads the Pima Indians onset of diabetes dataset, calculates the parameters needed to standardize the data, then creates a standardized copy of the input data.
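A sketch of that snippet using StandardScaler, which separates the fit step (calculating the parameters) from the transform step (the dataset URL is an assumed public mirror):

```python
from numpy import set_printoptions
from pandas import read_csv
from sklearn.preprocessing import StandardScaler

# Load the Pima Indians dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
# Separate the 8 input attributes from the class attribute.
X = array[:, 0:8]
Y = array[:, 8]
# Fit: calculate the mean and standard deviation of each attribute.
scaler = StandardScaler().fit(X)
# Transform: create a standardized copy (mean 0, standard deviation 1).
rescaledX = scaler.transform(X)
set_printoptions(precision=3)
print(rescaledX[0:5, :])
```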

Lesson 7: Algorithm Evaluation With Resampling Methods

The dataset used to train a machine learning algorithm is called a training dataset. The dataset used to train an algorithm cannot be used to give you reliable estimates of the accuracy of the model on new data. This is a big problem because the whole idea of creating the model is to make predictions on new data.

You can use statistical methods called resampling methods to split your training dataset into subsets: some are used to train the model, and others are held back and used to estimate the accuracy of the model on unseen data.

Your goal with today’s lesson is to practice using the different resampling methods available in scikit-learn, for example:

  • Split a dataset into training and test sets.
  • Estimate the accuracy of an algorithm using k-fold cross validation.
  • Estimate the accuracy of an algorithm using leave-one-out cross validation.

The snippet below uses scikit-learn to estimate the accuracy of the Logistic Regression algorithm on the Pima Indians onset of diabetes dataset using 10-fold cross validation.
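A sketch of that snippet (the dataset URL is an assumed public mirror; shuffle=True and max_iter=1000 are additions needed on recent scikit-learn versions, and are assumptions beyond the 2016 original):

```python
from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the Pima Indians dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# 10-fold cross validation; recent scikit-learn requires shuffle=True
# whenever a random_state is supplied.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
# max_iter is raised so the lbfgs solver converges on this unscaled data.
model = LogisticRegression(max_iter=1000)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean() * 100.0, results.std() * 100.0))
```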

What accuracy did you get? Let me know in the comments.

Did you realize that this is the halfway point? Well done!

Lesson 8: Algorithm Evaluation Metrics

There are many different metrics that you can use to evaluate the skill of a machine learning algorithm on a dataset.

You can specify the metric used for your test harness in scikit-learn via the model_selection.cross_val_score() function, and defaults can be used for regression and classification problems. Your goal with today’s lesson is to practice using the different algorithm performance metrics available in the scikit-learn package.

  • Practice using the Accuracy and LogLoss metrics on a classification problem.
  • Practice generating a confusion matrix and a classification report.
  • Practice using RMSE and RSquared metrics on a regression problem.

The snippet below demonstrates calculating the LogLoss metric on the Pima Indians onset of diabetes dataset.
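A sketch of that snippet (same assumed mirror URL and harness additions as Lesson 7; note that scikit-learn reports the negated log loss via the 'neg_log_loss' scoring string, so larger, i.e. closer to zero, is better):

```python
from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Load the Pima Indians dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = LogisticRegression(max_iter=1000)
# 'neg_log_loss' returns the negated log loss (larger is better).
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_log_loss')
print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))
```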

What log loss did you get? Let me know in the comments.

Lesson 9: Spot-Check Algorithms

You cannot possibly know which algorithm will perform best on your data beforehand.

You have to discover it using a process of trial and error. I call this spot-checking algorithms. The scikit-learn library provides an interface to many machine learning algorithms and tools to compare the estimated accuracy of those algorithms.

In this lesson, you must practice spot checking different machine learning algorithms.

  • Spot check linear algorithms on a dataset (e.g. linear regression, logistic regression and linear discriminant analysis).
  • Spot check some non-linear algorithms on a dataset (e.g. KNN, SVM and CART).
  • Spot-check some sophisticated ensemble algorithms on a dataset (e.g. random forest and stochastic gradient boosting).

For example, the snippet below spot-checks the K-Nearest Neighbors algorithm on the Boston House Price dataset.
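A sketch of that snippet (the whitespace-delimited Boston housing CSV is loaded from an assumed public mirror, since scikit-learn no longer bundles this dataset; substitute your own copy if needed):

```python
from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Load the Boston house price dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv"
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS',
         'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
dataframe = read_csv(url, sep=r"\s+", names=names)
array = dataframe.values
# 13 input attributes; MEDV (median house value) is the output.
X = array[:, 0:13]
Y = array[:, 13]
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
model = KNeighborsRegressor()
# Negated mean squared error: closer to zero is better.
results = cross_val_score(model, X, Y, cv=kfold, scoring='neg_mean_squared_error')
print(results.mean())
```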

What mean squared error did you get? Let me know in the comments.

Lesson 10: Model Comparison and Selection

Now that you know how to spot check machine learning algorithms on your dataset, you need to know how to compare the estimated performance of different algorithms and select the best model.

In today’s lesson, you will practice comparing the accuracy of machine learning algorithms in Python with scikit-learn.

  • Compare linear algorithms to each other on a dataset.
  • Compare nonlinear algorithms to each other on a dataset.
  • Compare different configurations of the same algorithm to each other.
  • Create plots of the results comparing algorithms.

The example below compares Logistic Regression and Linear Discriminant Analysis to each other on the Pima Indians onset of diabetes dataset.
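A sketch of that comparison, running both algorithms through an identical cross-validation harness (same assumed mirror URL and harness additions as the earlier lessons):

```python
from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the Pima Indians dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# Evaluate each algorithm with the same 10-fold cross-validation harness.
models = [('LR', LogisticRegression(max_iter=1000)),
          ('LDA', LinearDiscriminantAnalysis())]
results = []
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
```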

Which algorithm got better results? Can you do better? Let me know in the comments.

Lesson 11: Improve Accuracy with Algorithm Tuning

Once you have found one or two algorithms that perform well on your dataset, you may want to improve the performance of those models.

One way to increase the performance of an algorithm is to tune its parameters to your specific dataset.

The scikit-learn library provides two ways to search for combinations of parameters for a machine learning algorithm. Your goal in today’s lesson is to practice each.

  • Tune the parameters of an algorithm using a grid search that you specify.
  • Tune the parameters of an algorithm using a random search.

The snippet below is an example of using a grid search for the Ridge Regression algorithm on the Pima Indians onset of diabetes dataset.
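A sketch of that grid search (same assumed mirror URL; the alpha values tried are illustrative, and GridSearchCV uses its default cross-validation internally):

```python
import numpy
from pandas import read_csv
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Load the Pima Indians dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# Grid of alpha values to try for the Ridge Regression algorithm.
alphas = numpy.array([1.0, 0.1, 0.01, 0.001, 0.0001])
param_grid = dict(alpha=alphas)
model = Ridge()
grid = GridSearchCV(estimator=model, param_grid=param_grid)
grid.fit(X, Y)
# Report the best cross-validated score and the alpha that achieved it.
print(grid.best_score_)
print(grid.best_estimator_.alpha)
```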

Which parameters achieved the best results? Can you do better? Let me know in the comments.

Lesson 12: Improve Accuracy with Ensemble Predictions

Another way that you can improve the performance of your models is to combine the predictions from multiple models.

Some models provide this capability built-in such as random forest for bagging and stochastic gradient boosting for boosting. Another type of ensembling called voting can be used to combine the predictions from multiple different models together.

In today’s lesson, you will practice using ensemble methods.

  • Practice bagging ensembles with the random forest and extra trees algorithms.
  • Practice boosting ensembles with the gradient boosting machine and AdaBoost algorithms.
  • Practice voting ensembles by combining the predictions from multiple models together.

The snippet below demonstrates how you can use the Random Forest algorithm (a bagged ensemble of decision trees) on the Pima Indians onset of diabetes dataset.
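A sketch of that snippet (same assumed mirror URL; the 100 trees and 3 features per split are illustrative parameter choices):

```python
from pandas import read_csv
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

# Load the Pima Indians dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# A bagged ensemble of 100 decision trees, each split considering 3 random features.
model = RandomForestClassifier(n_estimators=100, max_features=3, random_state=7)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())
```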

Can you devise a better ensemble? Let me know in the comments.

Lesson 13: Finalize And Save Your Model

Once you have found a well-performing model on your machine learning problem, you need to finalize it.

In today’s lesson, you will practice the tasks related to finalizing your model.

  • Practice making predictions with your model on new data (data unseen during training and testing).
  • Practice saving trained models to file and loading them up again.

For example, the snippet below shows how you can create a Logistic Regression model, save it to file, then load it later and make predictions on unseen data.
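A sketch of that workflow using pickle from the standard library (same assumed mirror URL; a held-back test split stands in for the unseen data, and the filename is arbitrary):

```python
import pickle
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the Pima Indians dataset (mirror URL is an assumption).
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
X = array[:, 0:8]
Y = array[:, 8]
# Hold back a test set to stand in for "unseen" data.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, Y_train)
# Save the trained model to disk.
filename = 'finalized_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
# Some time later: load the model and evaluate it on the unseen data.
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
result = loaded_model.score(X_test, Y_test)
print(result)
```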

Lesson 14: Hello World End-to-End Project

You now know how to complete each task of a predictive modeling machine learning problem.

In today’s lesson, you need to practice putting the pieces together and working through a standard machine learning dataset end-to-end.

Work through the iris dataset end-to-end (the hello world of machine learning).

This includes the steps:

  1. Understanding your data using descriptive statistics and visualization.
  2. Preprocessing the data to best expose the structure of the problem.
  3. Spot-checking a number of algorithms using your own test harness.
  4. Improving results using algorithm parameter tuning.
  5. Improving results using ensemble methods.
  6. Finalizing the model ready for future use.

Take it slowly and record your results along the way.

What model did you use? What results did you get? Let me know in the comments.

The End!
(Look How Far You Have Come)

You made it. Well done!

Take a moment and look back at how far you have come.

  • You started off with an interest in machine learning and a strong desire to be able to practice and apply machine learning using Python.
  • You downloaded, installed and started Python, perhaps for the first time, and began to get familiar with the syntax of the language.
  • Slowly and steadily over the course of a number of lessons you learned how the standard tasks of a predictive modeling machine learning project map onto the Python platform.
  • Building upon the recipes for common machine learning tasks you worked through your first machine learning problems end-to-end using Python.
  • Using a standard template plus the recipes and experience you have gathered, you are now capable of working through new and different predictive modeling machine learning problems on your own.

Don’t make light of this; you have come a long way in a short amount of time.

This is just the beginning of your machine learning journey with Python. Keep practicing and developing your skills.

Summary

How Did You Go With The Mini-Course?
Did you enjoy this mini-course?

Do you have any questions? Were there any sticking points?
Let me know. Leave a comment below.




78 Responses to Python Machine Learning Mini-Course

  1. erdem September 30, 2016 at 12:23 am #

    Accuracy: 76.951% (4.841%)

  2. Joe Dorocak October 18, 2016 at 3:50 am #

    Hi Jason. Thanks for ALL you do. I was doing “Lesson 7: Algorithm Evaluation With Resampling Methods”, when I ran into the following challenges running Python 35 with sklearn VERSION 0.18. :

    c:\python35\lib\site-packages\sklearn\cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
    “This module will be removed in 0.20.”, DeprecationWarning)

    ALSO:

    TypeError Traceback (most recent call last)
    in ()
    51 done results = cross_val_score
    52 “””
    --> 53 print("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0)

    TypeError: unsupported operand type(s) for %: ‘NoneType’ and ‘tuple’

  3. Joe Dorocak October 18, 2016 at 4:07 am #

    Continuation of above reply:

    Jason i think your print statement:
    print("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0)

    should look like:
    print("Accuracy: %.3f (%.3f)" % (results.mean()*100.0, results.std()*100.0))

    Thanks again for the GREAT info.

    Love and peace,
    Joe

    • Jason Brownlee October 18, 2016 at 5:56 am #

      Glad to hear you worked it out. Perhaps it was a Python 3 thing? The code works in Python 2.7

      I will look at the Deprecation Warning ASAP.

      • Joe Dorocak October 18, 2016 at 6:24 am #

        Thanks for the reply, Jason.

        Love and peace,
        Joe

  4. Joe Doroak October 25, 2016 at 5:57 am #

    Hi Jason,

    Here’s what i got for the log loss == ‘neg_log_loss’ scoring on the LogisticRegression Model

    model: LogisticRegression – scoring: neg_log_loss
    – results summary: -49.255 mean (4.705) std
    – sorted(results):
    [-0.57565879615204196, -0.52778706048371593, -0.52755866512803806, -0.51792016214361636, -0.5127963295718494, -0.49019538734940965, -0.47043507959473152, -0.4514763172464305, -0.44345852864232038, -0.40816890220694385]

    Thanks for the great work. I’ll take up your email course after i finish with this.

    Love and peace,
    Joe

    • Jason Brownlee October 25, 2016 at 8:33 am #

      Thanks Joe, nice work.

      • Joe Dorocak October 29, 2016 at 7:04 am #

        Dear Jason,

        Regarding “Lesson 9: Spot-Check Algorithms”, I would like to know how can I use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, AND which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE?

        I have published a post on my blog titled “Naive Spot-Check of AI Algorithms” which references your work. The post generates 36 Spot-Check Cases, using (3 Datasets x 4 Models(Algorithms) x 3 Scorings). There were 11 out of 36 Cases that returned numerical results. The other 25 Cases returned Errors or Warnings.

        Again, I would like to know how can I use Data Preparation for various (Dataset, Model(Algorithm), Scoring) combinations, AND which (Dataset, Model(Algorithm), Scoring) combinations are JUST INCOMPATIBLE?

        Thanks for the GREAT work.

        Love and peace,
        Joe

  5. Sooraj Maharjan October 28, 2016 at 4:52 pm #

    Hi Jason, thanks for the post. I’m running into issues while executing Lesson 7

    from sklearn.model_selection import KFold
    Traceback (most recent call last):

    File “”, line 1, in
    from sklearn.model_selection import KFold

    ImportError: No module named ‘sklearn.model_selection’

    I’ve also updated my version of spyder, which according to few posts online says should fix, but the issue prevails. Please help! Thanks!

    • Jason Brownlee October 29, 2016 at 7:37 am #

      Hi Sooraj, you must update scikit-learn to v0.18 or newer.

      • Sooraj Maharjan October 29, 2016 at 5:25 pm #

        Thanks Jason! I did that and it worked. I’m actually using Anaconda so that I don’t have to install packages individually, but since it is using the latest Python (3.5.2) and you used a previous version, it isn’t running as smoothly.

        This time I’m running into an issue of an unsupported operand (TypeError: unsupported operand type(s) for %: ‘NoneType’ and ‘tuple’) while trying to print accuracy results similar to Joe Dorocak, and his solution didn’t work for me. I’ll fiddle with it some more and hopefully I’ll find a fix.

        Nevertheless, I got following for accuracy w/o formatting:
        (76.951469583048521, 4.8410519245671946)

        • Sooraj Maharjan October 29, 2016 at 5:33 pm #

          Solution:
          Print statement needed to wrap both the formatting and values within itself
          print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
          Accuracy: 76.951% (4.841%)

          • Jason Brownlee October 30, 2016 at 8:48 am #

            Glad to hear you worked it out Sooraj.

  6. Sooraj Maharjan October 29, 2016 at 5:28 pm #

    P.S. I’m not getting any emails when you post responses. Shouldn’t there be an option to opt in for that? I remember having that option on my blog.

  7. Ignatius November 9, 2016 at 5:04 am #

    Accuracy: 76.951% (4.841%)

  8. Ignatius November 9, 2016 at 5:18 am #

    ‘neg_mean_squared_error’: -107.28683898

  9. Ignatius November 9, 2016 at 5:23 am #

    #Comparison of algorithms:
    LR: 0.769515 (0.048411)
    LDA: 0.773462 (0.051592)

  10. Ignatius November 9, 2016 at 5:25 am #

    Please what does this line actually do:
    KFold(n_splits=10, random_state=7)?

  11. Ignatius November 9, 2016 at 5:26 am #

    And also this line:
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)?

    • Jason Brownlee November 9, 2016 at 9:53 am #

      This line evaluates the model using 10-fold cross validation and returns a list of scores.

  12. Ignatius November 9, 2016 at 5:50 am #

    Done now. Quite interesting and in plain language. Thanks Jason. I thirst for more.

  13. Tobias November 28, 2016 at 7:59 pm #

    Hi Jason,

    I can’t seem to access your data sample:
    https://goo.gl/vhm1eU

    Can’t reach anything on https://archive.ics.uci.edu/.

    Is the data hosted somewhere else as well?

    • Jason Brownlee November 29, 2016 at 8:50 am #

      Hi Tobias,

      Sorry, the UCI Machine Learning Repository that hosts the datasets appears to be down at the moment.

      There is a back-up for the website with all the datasets here:
      http://mlr.cs.umass.edu/ml/

  14. marco December 10, 2016 at 2:48 pm #

    Thanks for the mini-course Jason – it’s been a great intro!

    I completed the end2end project and picked QDA as my algorithm of choice with the following results for accuracy.
    QDA: 0.973333 (0.032660)

    I tested across a number of validation metrics and algorithms and found QDA was consistently the top performer with LDA usually a close second.

    Again, thanks – it’s been an eye opener on how much there is for me to learn!
    cheers
    marco

    • Jason Brownlee December 11, 2016 at 5:21 am #

      Great work marco, and it’s nice to hear about QDA (it has worked well for me in the past as well).

  15. Mohamed December 25, 2016 at 11:37 pm #

    Thanks for the course Mr. Brownlee.
    I have an example of work done thanks to your courses :

    https://www.kaggle.com/mohamedl/d/uciml/pima-indians-diabetes-database/79-17-pima-indians-diabetes-log-regression

    Thanks again for sharing your knowledge.

  16. DSG December 28, 2016 at 3:57 am #

    I got following accuracies:

    Accuracy of Logreg: 76.69685577580314 (3.542589693856446)
    Accuracy of KNeighbors: 74.7470950102529 (5.575841908065769)

  17. DSG January 1, 2017 at 11:00 pm #

    results = {}

    for name, model in models:
        results[name] = cross_val_score(model, X, Y, cv=10, scoring='accuracy')
        print('{} score:{}'.format(name, results[name].mean()))

    logreg score:0.7669685577580314
    lda score:0.7734962406015038

  18. RicardoIracheta January 5, 2017 at 3:43 am #

    Hey it is a really nice introduction to this subject.
    Regarding Lesson 7… I get an error while importing KFold:

    ImportError: cannot import name stable_cumsum

    Hope you can help me with this

    • Jason Brownlee January 5, 2017 at 9:39 am #

      You may want to confirm that you have sklearn 0.18 or higher installed.

      Try running this script:

  19. Servando March 28, 2017 at 8:54 pm #

    Hello Jason !.

    How is python 3 behavior with a machine learning enviroment?

    Will python 2.7 always be the best option for it?

    Thanks !

  20. Madhav Bhattarai March 30, 2017 at 8:22 pm #

    Awesome post Jason. Keep posting more high-quality tutorials like this.

  21. dna_star April 4, 2017 at 11:30 am #

    Hi Jason,

    First, very good website and tutos, nice job!

    Second, why do you keep the labels in X??

    Third, I am implementing my own score function in order to compute multiple scoring metrics at the same time. It works well with all of them except for the log loss.I obtain high values (around 7.8) Here is the code:

    from pandas import read_csv
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import log_loss
    url = "https://goo.gl/vhm1eU"
    names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    dataframe = read_csv(url, names=names)
    array = dataframe.values
    X = array[:,0:7]
    Y = array[:,8]
    kfold = KFold(n_splits=10, random_state=7)
    model = LogisticRegression()

    def my_scorer(estimator, x, y):
        yPred = estimator.predict(x)
        return log_loss(y, yPred)

    results = cross_val_score(model, X, Y, cv=kfold, scoring=my_scorer)
    print results.mean()

    Any explanations?

    Thank you!!
    Best

  22. Pratyush May 28, 2017 at 6:11 pm #

    Accuracy: 76.432% (2.859%)

  23. Pratyush May 29, 2017 at 2:31 am #

    Logloss : -49.266 Error : 4.689

  24. Pratyush May 29, 2017 at 2:56 am #

    lesson 9:
    -107.28683898

  25. Ilan June 13, 2017 at 12:28 pm #

    Hi Jason, I have run through most of the Lesson in this posts and I have to say thank you for that. It has been a while since I’ve been wanting to dig in more into ML and your blog will definitely be of help from now on.
    My results are:

    Lesson 7: Accuracy: 77.996% (5.009%)
    Lesson 8: Logloss: -0.484 (0.061)
    Lesson 9: Mean Sq Error = -28.5854635294

    I used the rescaled and standardised matrix X for all of my analysis.
    My question is: How will I know if rescaling is actually working? Is that given by the context? I suppose in your code you calculated your statistics using the data as raw and unprocessed as possible…

    When should I preprocess the data?

    Wow!! So many questions!!.

    Thank you again

  26. Alex July 7, 2017 at 1:08 pm #

    Hi Jason!

    Really great course! As someone just getting into machine learning, but knows how to code, this is the perfect level for me.

    I had a quick question. I’m going through the Iris dataset, and spot-checking different algorithms the way you demonstrated (results = cross_val_score(model, rescaledX, Y, cv=kfold)), and one of the algorithms I’m checking is the Ridge algorithm.

    Looking at the scores it returns:
    Ridge Results: [ 0. 0. 0. 0.753 0. 0. 0.848 0. 0. 0. ], it seems to perform alright sometimes, then get 0 other times. How come there is so much variation in accuracy between the testing results?

    • Jason Brownlee July 9, 2017 at 10:36 am #

      Glad to hear it Alex.

      Ridge regression is not used for classification generally.

      • Alex July 9, 2017 at 1:50 pm #

        Gotcha, thanks!

  27. ayyappan July 8, 2017 at 2:35 am #

    Hi Jason,

    Seems like we have an error in the following line in “Lesson 7: Algorithm Evaluation With Resampling Methods” ?

    print("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0)

    Should be –

    print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))

    Regards,
    AA

  28. ayyappan July 8, 2017 at 3:51 am #

    Hi Jason,

    Same issue with Lesson 8 –

    Error –

    Logloss: %.3f (%.3f)
    Traceback (most recent call last):
    File “./classification_logloss.py”, line 16, in
    print("Logloss: %.3f (%.3f)") % (results.mean(), results.std())
    TypeError: unsupported operand type(s) for %: ‘NoneType’ and ‘tuple’

    Please change it to –

    print("Logloss: %.3f (%.3f)" % (results.mean(), results.std()))

    Regards,
    AA

  29. Olly Smith July 25, 2017 at 11:35 pm #

    Hi Jason, amazing website, thank you so much for putting this course together.

    For lesson 7 I’m getting 76.951% (4.841%) using Kfold, though I know that’s an accuracy of 76%, I don’t know what the second figure is?

    As for leave-one-out, i’m getting 76.823% (42.196%), and 42% seems whack

    from sklearn.model_selection import LeaveOneOut
    loo = LeaveOneOut()
    model = LogisticRegression()
    results = cross_val_score(model, X, Y, cv=loo)
    print '\nAccuracy of LR using loo cross validation?'
    print("Accuracy: %.3f%% (%.3f%%)") % (results.mean()*100.0, results.std()*100.0)

    I feel like I’m missing out a step with LeaveOneOut regarding splits, but I’ve tried a few things from looking online to no avail.

    • Jason Brownlee July 26, 2017 at 7:56 am #

      The figure in brackets is the standard deviation of the model skill – e.g. how much variance there is in the skill from the mean skill each time the model is run on different data.
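      In code terms, both figures come straight from the array of per-fold scores that cross_val_score returns; a minimal sketch on synthetic data (a stand-in for the lesson's dataset):

      ```python
      # Sketch: the first figure is the mean of the per-fold accuracies,
      # the second (in brackets) is their standard deviation.
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import KFold, cross_val_score

      X, y = make_classification(n_samples=300, n_features=8, random_state=7)
      model = LogisticRegression(max_iter=1000)
      kfold = KFold(n_splits=10, shuffle=True, random_state=7)
      scores = cross_val_score(model, X, y, cv=kfold)  # one accuracy per fold
      print("Accuracy: %.3f%% (%.3f%%)" % (scores.mean() * 100.0, scores.std() * 100.0))
      ```

      On the leave-one-out result specifically: each LOO "fold" scores a single example as either 0 or 1, and the standard deviation of 0/1 scores with mean p is sqrt(p(1-p)), which at p ≈ 0.768 is ≈ 0.422 — so the 42.196% is expected behaviour, not a bug.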

  30. Steven July 26, 2017 at 2:01 pm #

    for Lesson 7:
    Accuracy: 77.475% (5.206%)
    but I used:
    KFold(n_splits=9, random_state=7)

    • Jason Brownlee July 26, 2017 at 4:02 pm #

      Nice work Steven.

      • Steven July 26, 2017 at 4:07 pm #

        I have a question:
        If I use cross validation, how can I know whether it is overfitting or not?

        • Jason Brownlee July 27, 2017 at 7:54 am #

          Great question!

          You could take your chosen model, split your training dataset into train/validation sets, and evaluate the skill of the model on both. Ideally, also use diagnostic plots of the model's skill over training time/iterations.
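          A minimal sketch of that comparison on synthetic data (a stand-in for the lesson's dataset): a large gap between train and validation skill suggests overfitting.

          ```python
          # Sketch: compare skill on the data the model was fit on (train)
          # against held-out data (validation); a big gap suggests overfitting.
          from sklearn.datasets import make_classification
          from sklearn.linear_model import LogisticRegression
          from sklearn.model_selection import train_test_split

          X, y = make_classification(n_samples=300, n_features=8, random_state=7)
          X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=7)
          model = LogisticRegression(max_iter=1000)
          model.fit(X_train, y_train)
          print("Train accuracy:      %.3f" % model.score(X_train, y_train))
          print("Validation accuracy: %.3f" % model.score(X_val, y_val))
          ```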

          • Steven July 27, 2017 at 11:36 am #

            Hi Jason,

            I also found that for KFold, if I use 'n_splits=9', I can get better accuracy than other values like 'n_splits=10' or 'n_splits=8' without other optimizations (I mean I only changed the value of the 'n_splits' parameter).

            So, here is the question: how can I save the model with the highest accuracy found during k-fold cross validation? (That means I want to save the model found when 'n_splits=9' for later production use.)
            Because in my understanding, cross validation contains two functionalities: training the model and evaluating the model.

            Sincerely,
            Steven

          • Jason Brownlee July 28, 2017 at 8:26 am #

            I would not recommend doing that.

            CV is an evaluation scheme to estimate how good the model might be on unseen data. A different number of folds will give different scores with more/less bias.

            For normal machine learning models (e.g. not deep learning), I would recommend re-fitting a final model. Learn more here:
            http://machinelearningmastery.com/train-final-machine-learning-model/

            Does that help?
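            A minimal sketch of that recommended workflow on synthetic data (a stand-in for the lesson's dataset): CV estimates the skill of a configuration, then one final model is re-fit on all the training data.

            ```python
            # Sketch: the CV models are discarded; the final model is re-fit
            # on all available training data and used for new predictions.
            from sklearn.datasets import make_classification
            from sklearn.linear_model import LogisticRegression
            from sklearn.model_selection import KFold, cross_val_score

            X, y = make_classification(n_samples=300, n_features=8, random_state=7)
            model = LogisticRegression(max_iter=1000)

            # 1. CV only estimates how well this configuration generalizes.
            kfold = KFold(n_splits=10, shuffle=True, random_state=7)
            scores = cross_val_score(model, X, y, cv=kfold)
            print("Estimated accuracy: %.3f" % scores.mean())

            # 2. Re-fit one final model on all of the training data.
            model.fit(X, y)
            new_row = X[:1]  # stand-in for a new, unseen observation
            print("Prediction for new row:", model.predict(new_row))
            ```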

          • Steven July 28, 2017 at 6:05 pm #

            if “CV is an evaluation scheme to estimate how good the model might be on unseen data”, shall we use a dataset that hasn’t been touched during the training period to do CV?

            Sincerely,
            Steven

          • Jason Brownlee July 29, 2017 at 8:09 am #

            Not quite, CV will split up your training data into train/validation sets as part of the process.

            You can split your original dataset into train/test and hold back the test for evaluating the final models that you choose.

            This post might make things clearer:
            http://machinelearningmastery.com/difference-test-validation-datasets/
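            A minimal sketch of that split on synthetic data (a stand-in for the lesson's dataset): hold back a test set first, run CV only inside the training portion, and touch the test set once, at the end.

            ```python
            # Sketch: CV happens inside the training data; the held-back test
            # set is used only to evaluate the chosen final model.
            from sklearn.datasets import make_classification
            from sklearn.linear_model import LogisticRegression
            from sklearn.model_selection import train_test_split, cross_val_score

            X, y = make_classification(n_samples=400, n_features=8, random_state=7)
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

            model = LogisticRegression(max_iter=1000)
            cv_scores = cross_val_score(model, X_train, y_train, cv=10)  # CV on train only
            print("CV accuracy on train:   %.3f" % cv_scores.mean())

            model.fit(X_train, y_train)  # final fit on all training data
            print("Held-out test accuracy: %.3f" % model.score(X_test, y_test))
            ```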

  31. Mukesh July 28, 2017 at 8:39 pm #

    Accuracy: 76.951% (4.841%)

  32. Mukesh July 28, 2017 at 8:41 pm #

    LR: 0.769515 (0.048411)
    LDA: 0.773462 (0.051592)

    LDA is best

    Cross validation result with mean 0.775974025974

  33. Dan September 22, 2017 at 10:12 pm #

    Hi Dr.Jason
    could you please explain this line

    model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

    • Jason Brownlee September 23, 2017 at 5:41 am #

      It configures a random forest classifier with the given number of trees (n_estimators) and the number of features to consider at each split (max_features), and stores it in the model variable. The model is not actually fit until fit() is called, e.g. inside cross_val_score().
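      A minimal sketch of that line in context, on synthetic data (a stand-in for the lesson's dataset); the variable names follow the lesson:

      ```python
      # Sketch: the two arguments control the ensemble size and the amount of
      # randomness injected into each tree's splits.
      from sklearn.datasets import make_classification
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.model_selection import cross_val_score

      X, y = make_classification(n_samples=300, n_features=8, random_state=7)
      num_trees = 100    # how many decision trees in the forest
      max_features = 3   # how many features each split may consider
      model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)
      scores = cross_val_score(model, X, y, cv=10)  # fitting happens in here
      print("Accuracy: %.3f" % scores.mean())
      ```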

  34. Arindam October 1, 2017 at 8:59 am #

    Hi Jason,

    Great site and great way to get started. Enjoying going through the mini-course!

    I have a question on Lesson #9
    KNN works as coded in your example with an Accuracy of -88 with my kfold parameters

    Can I use LogisticRegression on this along with Accuracy scoring? When I tried using the LogisticRegression model on the Boston Housing data sample in Lesson #9, I get a bunch of errors - ValueError: Unknown label type: 'continuous'

    • Jason Brownlee October 1, 2017 at 9:10 am #

      Logistic regression is for classification problems (predicting a label), whereas the Boston house price problem is a regression problem (predicting a quantity).

      You cannot use classification algorithms on regression problems.

      • Arindam October 1, 2017 at 10:34 am #

        Ah! Got it. Thanks!
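        [For readers with the same error: on a continuous target, swap in a regression algorithm. A minimal sketch on synthetic regression data (a stand-in for the Boston dataset):]

        ```python
        # Sketch: LinearRegression handles a continuous target;
        # LogisticRegression on the same y raises "Unknown label type: 'continuous'".
        from sklearn.datasets import make_regression
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import KFold, cross_val_score

        X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=7)
        model = LinearRegression()
        kfold = KFold(n_splits=10, shuffle=True, random_state=7)
        results = cross_val_score(model, X, y, cv=kfold, scoring="neg_mean_squared_error")
        print("MSE: %.3f" % results.mean())  # negative because sklearn maximizes scores
        ```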

  35. Arindam October 1, 2017 at 10:33 am #

    Hi Jason,
    Another question – on Lesson 11

    I was trying to tune the parameters using a Random Search. So instead of using GridSearchCV I switched to RandomizedSearchCV, but am having difficulty setting the model (tried using Ridge as in the GridSearchCV example) and also the distribution parameters to try and tune the parameters for RandomizedSearchCV.

    How should I go about setting the model and the param_grid for the RandomizedSearchCV?

    Any pointers would be greatly appreciated.

    Thanks!
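    [For readers with the same question: RandomizedSearchCV mirrors GridSearchCV, except that param_distributions maps parameter names to distributions (or lists) to sample from. A minimal sketch using a Ridge regressor on synthetic data, with alpha sampled from a uniform distribution:]

    ```python
    # Sketch: random search draws n_iter parameter settings from the
    # given distributions instead of enumerating a grid.
    from scipy.stats import uniform
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import RandomizedSearchCV

    X, y = make_regression(n_samples=200, n_features=8, noise=5.0, random_state=7)
    param_dist = {"alpha": uniform(loc=0.0, scale=1.0)}  # sample alpha from [0, 1)
    search = RandomizedSearchCV(estimator=Ridge(), param_distributions=param_dist,
                                n_iter=50, random_state=7)
    search.fit(X, y)
    print(search.best_params_)
    print(search.best_score_)
    ```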

  36. Rajiv menon October 13, 2017 at 5:15 pm #

    Hi Jason, Nice work. I am confused with lesson 6 & later. In lesson 6, you create a preprocessed dataset

    rescaledX = scaler.transform(X).

    However, I did not see it being used in the subsequent chapters. Appreciate if you could help me understand what I am missing. Thanks
    Rajiv

    • Jason Brownlee October 14, 2017 at 5:41 am #

      Scaling data is important in some algorithms when your data is comprised of observations with different units of measure or different scales.
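      To actually use the rescaled data in the later lessons, either pass rescaledX in place of X, or bundle the scaler with the model in a Pipeline so the scaling is re-fit correctly inside each cross-validation fold. A minimal sketch on synthetic data (a stand-in for the lesson's dataset):

      ```python
      # Sketch: the Pipeline fits the scaler on each CV training fold only,
      # avoiding leakage of the validation fold's statistics.
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import MinMaxScaler

      X, y = make_classification(n_samples=300, n_features=8, random_state=7)
      pipeline = Pipeline([("scale", MinMaxScaler()),
                           ("model", LogisticRegression(max_iter=1000))])
      scores = cross_val_score(pipeline, X, y, cv=10)
      print("Accuracy: %.3f" % scores.mean())
      ```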

Leave a Reply