Your First Machine Learning Project in Python Step-By-Step

Do you want to do machine learning using Python, but you’re having trouble getting started?

In this post, you will complete your first machine learning project using Python.

In this step-by-step tutorial you will:

  1. Download and install Python SciPy and get the most useful package for machine learning in Python.
  2. Load a dataset and understand its structure using statistical summaries and data visualization.
  3. Create 6 machine learning models, pick the best and build confidence that the accuracy is reliable.

If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.

Let’s get started!

  • Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
  • Updated Mar/2017: Added links to help setup your Python environment.
Photo by cosmoflash, some rights reserved.

How Do You Start Machine Learning in Python?

The best way to learn machine learning is by designing and completing small projects.

Python Can Be Intimidating When Getting Started

Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use both for research and development and for building production systems.

There are also a lot of modules and libraries to choose from, providing multiple ways to do each task. It can feel overwhelming.

The best way to get started using Python for machine learning is to complete a project.

  • It will force you to install and start the Python interpreter (at the very least).
  • It will give you a bird’s eye view of how to step through a small project.
  • It will give you confidence, maybe enough to go on to your own small projects.

Beginners Need A Small End-to-End Project

Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.

When you are applying machine learning to your own datasets, you are working on a project.

A machine learning project may not be linear, but it has a number of well known steps:

  1. Define Problem.
  2. Prepare Data.
  3. Evaluate Algorithms.
  4. Improve Results.
  5. Present Results.

The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing it, evaluating algorithms and making some predictions.

If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and result-improvement tasks, once you have more confidence.

Hello World of Machine Learning

The best small project to start with on a new tool is the classification of iris flowers (i.e. the iris dataset).

This is a good project because it is so well understood.

  • Attributes are numeric so you have to figure out how to load and handle data.
  • It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
  • It is a multi-class (multinomial) classification problem that may require some specialized handling.
  • It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
  • All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.

Let’s get started with your hello world machine learning project in Python.

Machine Learning in Python: Step-By-Step Tutorial
(start here)

In this section, we are going to work through a small machine learning project end-to-end.

Here is an overview of what we are going to cover:

  1. Installing the Python and SciPy platform.
  2. Loading the dataset.
  3. Summarizing the dataset.
  4. Visualizing the dataset.
  5. Evaluating some algorithms.
  6. Making some predictions.

Take your time. Work through each step.

Type in the commands yourself, or copy-and-paste them to speed things up.

If you have any questions at all, please leave a comment at the bottom of the post.


1. Downloading, Installing and Starting Python SciPy

Get the Python and SciPy platform installed on your system if it is not already.

I do not want to cover this in great detail, because others already have. This is already pretty straightforward, especially if you are a developer. If you do need help, ask a question in the comments.

1.1 Install SciPy Libraries

This tutorial assumes Python version 2.7 or 3.5.

There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries required for this tutorial:

  • scipy
  • numpy
  • matplotlib
  • pandas
  • sklearn

There are many ways to install these libraries. My best advice is to pick one method then be consistent in installing each library.

The scipy installation page provides excellent instructions for installing the above libraries on multiple platforms, such as Linux, Mac OS X and Windows. If you have any doubts or questions, refer to this guide; it has been followed by thousands of people.

  • On Mac OS X, you can use macports to install Python 2.7 and these libraries. For more information on macports, see the homepage.
  • On Linux you can use your package manager, such as yum on Fedora to install RPMs.

If you are on Windows or you are not confident, I would recommend installing the free version of Anaconda that includes everything you need.

Note: This tutorial assumes you have scikit-learn version 0.18 or higher installed.


1.2 Start Python and Check Versions

It is a good idea to make sure your Python environment was installed successfully and is working as expected.

The script below will help you test out your environment. It imports each library required in this tutorial and prints the version.

Open a command line and start the Python interpreter.

I recommend working directly in the interpreter or writing your scripts and running them on the command line rather than big editors and IDEs. Keep things simple and focus on the machine learning not the toolchain.

Type or copy and paste the following script:
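The exact script from the post is not reproduced here; a version-check sketch along these lines will do the job (the printed versions will of course depend on your installation):

```python
# Check the versions of the libraries used in this tutorial
import sys
print('Python: %s' % sys.version)
import scipy
print('scipy: %s' % scipy.__version__)
import numpy
print('numpy: %s' % numpy.__version__)
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
import pandas
print('pandas: %s' % pandas.__version__)
import sklearn
print('sklearn: %s' % sklearn.__version__)
```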

Ideally, your versions should match those used to write this tutorial or be more recent. The APIs do not change quickly, so do not be too concerned if you are a few versions behind; everything in this tutorial will very likely still work for you.

If you get an error, stop. Now is the time to fix it.

If you cannot run the above script cleanly you will not be able to complete this tutorial.

My best advice is to Google search for your error message or post a question on Stack Exchange.

2. Load The Data

We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics by pretty much everyone.

The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.

You can learn more about this dataset on Wikipedia.

In this step we are going to load the iris data from a CSV file at a URL.

2.1 Import libraries

First, let’s import all of the modules, functions and objects we are going to use in this tutorial.
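A block along the following lines covers everything used below. It is a sketch using the modern module names: scikit-learn 0.18+ moved the cross-validation helpers into sklearn.model_selection, and recent pandas exposes scatter_matrix from pandas.plotting.

```python
# Load the libraries used throughout this tutorial
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
```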

Everything should load without error. If you have an error, stop. You need a working SciPy environment before continuing. See the advice above about setting up your environment.

2.2 Load Dataset

We can load the data directly from the UCI Machine Learning repository.

We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.

Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.
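A loading sketch, assuming the UCI URL for the iris CSV; if the download fails, it falls back to the copy of the same 150-row dataset bundled with scikit-learn:

```python
import pandas

# Column names for each attribute (assumption: the post's naming convention)
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

try:
    # Load the CSV directly from the UCI Machine Learning repository
    dataset = pandas.read_csv(url, names=names)
except Exception:
    # Offline fallback: scikit-learn bundles a copy of the same dataset
    from sklearn.datasets import load_iris
    iris = load_iris()
    dataset = pandas.DataFrame(iris.data, columns=names[:-1])
    dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
```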

The dataset should load without incident.

If you do have network problems, you can download the file into your working directory and load it using the same method, changing the URL to the local file name.

3. Summarize the Dataset

Now it is time to take a look at the data.

In this step we are going to take a look at the data a few different ways:

  1. Dimensions of the dataset.
  2. Peek at the data itself.
  3. Statistical summary of all attributes.
  4. Breakdown of the data by the class variable.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
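For example (a self-contained sketch that rebuilds the dataset from scikit-learn's bundled copy of iris, so it runs offline):

```python
import pandas
from sklearn.datasets import load_iris

# Rebuild the dataset from scikit-learn's bundled copy of iris
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# shape is a (rows, columns) tuple
print(dataset.shape)
```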

You should see 150 instances and 5 attributes.

3.2 Peek at the Data

It is also always a good idea to actually eyeball your data.
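The head() function prints the first rows (a self-contained sketch using scikit-learn's bundled copy of iris):

```python
import pandas
from sklearn.datasets import load_iris

# Rebuild the dataset from scikit-learn's bundled copy of iris
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Peek at the first 20 rows
print(dataset.head(20))
```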

You should see the first 20 rows of the data.

3.3 Statistical Summary

Now we can take a look at a summary of each attribute.

This includes the count, mean, the min and max values as well as some percentiles.
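The describe() function produces this summary (a self-contained sketch using scikit-learn's bundled copy of iris):

```python
import pandas
from sklearn.datasets import load_iris

# Rebuild the dataset from scikit-learn's bundled copy of iris
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Count, mean, std, min, percentiles and max for each numeric attribute
print(dataset.describe())
```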

We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.

3.4 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.
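Grouping by the class column and counting gives this breakdown (a self-contained sketch using scikit-learn's bundled copy of iris):

```python
import pandas
from sklearn.datasets import load_iris

# Rebuild the dataset from scikit-learn's bundled copy of iris
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Number of rows belonging to each class
print(dataset.groupby('class').size())
```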

We can see that each class has the same number of instances (50 or 33% of the dataset).

4. Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

  1. Univariate plots to better understand each attribute.
  2. Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

Given that the input variables are numeric, we can create box and whisker plots of each.
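A plotting sketch (self-contained, using scikit-learn's bundled copy of iris; the Agg backend line is an assumption for headless machines and can be removed to view the plots interactively):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; remove to display plots on screen
import matplotlib.pyplot as plt
import pandas
from sklearn.datasets import load_iris

# Rebuild the dataset from scikit-learn's bundled copy of iris
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# One box-and-whisker plot per numeric input attribute, on a 2x2 grid
axes = dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.show()
```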

This gives us a much clearer idea of the distribution of the input attributes:

Box and Whisker Plots

We can also create a histogram of each input variable to get an idea of the distribution.
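A histogram sketch (self-contained, using scikit-learn's bundled copy of iris; the Agg backend line is for headless machines):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; remove to display plots on screen
import matplotlib.pyplot as plt
import pandas
from sklearn.datasets import load_iris

# Rebuild the dataset from scikit-learn's bundled copy of iris
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Histogram of each numeric input variable
axes = dataset.hist()
plt.show()
```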

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

Histogram Plots

4.2 Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
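A scatterplot-matrix sketch (self-contained, using scikit-learn's bundled copy of iris and the modern pandas.plotting location of scatter_matrix):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; remove to display plots on screen
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

# Rebuild the dataset from scikit-learn's bundled copy of iris
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Scatterplot of every pair of numeric attributes
axes = scatter_matrix(dataset)
plt.show()
```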

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

Scatterplot Matrix

5. Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

  1. Separate out a validation dataset.
  2. Set up the test harness to use 10-fold cross-validation.
  3. Build 6 different models to predict species from flower measurements.
  4. Select the best model.

5.1 Create a Validation Dataset

We need to know that the model we created is any good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
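A splitting sketch (self-contained, rebuilding the dataset from scikit-learn's bundled copy of iris; seed value 7 is an assumption carried through the rest of the examples):

```python
import pandas
from sklearn.datasets import load_iris
from sklearn import model_selection

# Rebuild the dataset from scikit-learn's bundled copy of iris
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Split out a validation dataset: 80% train, 20% held back
array = dataset.values
X = array[:, 0:4].astype(float)
Y = array[:, 4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=validation_size, random_state=seed)
```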

You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later.

5.2 Test Harness

We will use 10-fold cross validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.

We are using the metric of ‘accuracy‘ to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the scoring variable when we build and evaluate each model next.
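A harness sketch (self-contained, rebuilding the dataset and split as in Section 5.1 from scikit-learn's bundled copy of iris; note that modern scikit-learn requires shuffle=True when KFold is given a random_state, which the 0.18-era code omitted). KNN here is just a placeholder to show the harness running:

```python
import pandas
from sklearn.datasets import load_iris
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier

# Rebuild the dataset and 80/20 split
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
array = dataset.values
X = array[:, 0:4].astype(float)
Y = array[:, 4]
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=seed)

# 10-fold cross-validation harness with the accuracy metric
scoring = 'accuracy'
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
cv_results = model_selection.cross_val_score(
    KNeighborsClassifier(), X_train, Y_train, cv=kfold, scoring=scoring)
print("KNN: %f (%f)" % (cv_results.mean(), cv_results.std()))
```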

5.3 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 6 different algorithms:

  • Logistic Regression (LR)
  • Linear Discriminant Analysis (LDA)
  • K-Nearest Neighbors (KNN)
  • Classification and Regression Trees (CART)
  • Gaussian Naive Bayes (NB)
  • Support Vector Machines (SVM)

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that each algorithm is evaluated using exactly the same data splits, so the results are directly comparable.

Let’s build and evaluate our six models:
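A spot-checking sketch (self-contained, rebuilding the dataset and split from scikit-learn's bundled copy of iris; max_iter and shuffle=True are adjustments for modern scikit-learn, and exact scores may differ slightly from the post's):

```python
import pandas
from sklearn.datasets import load_iris
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Rebuild the dataset and 80/20 split
iris = load_iris()
col_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=col_names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
array = dataset.values
X = array[:, 0:4].astype(float)
Y = array[:, 4]
seed = 7
scoring = 'accuracy'
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=seed)

# Spot-check six algorithms with identical 10-fold splits
models = [
    ('LR', LogisticRegression(max_iter=200)),  # max_iter raised for the modern solver
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier()),
    ('NB', GaussianNB()),
    ('SVM', SVC()),
]
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
```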

5.4 Select Best Model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

Running the example above, we get an accuracy estimate for each model.

We can see that it looks like KNN has the largest estimated accuracy score.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
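A comparison-plot sketch (self-contained, re-running the spot-check on scikit-learn's bundled copy of iris; the Agg backend line is for headless machines and scores may differ slightly from the post's):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe backend; remove to display plots on screen
import matplotlib.pyplot as plt
import pandas
from sklearn.datasets import load_iris
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Rebuild the dataset, split and cross-validation results
iris = load_iris()
col_names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=col_names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
array = dataset.values
X = array[:, 0:4].astype(float)
Y = array[:, 4]
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=seed)
models = [
    ('LR', LogisticRegression(max_iter=200)),
    ('LDA', LinearDiscriminantAnalysis()),
    ('KNN', KNeighborsClassifier()),
    ('CART', DecisionTreeClassifier()),
    ('NB', GaussianNB()),
    ('SVM', SVC()),
]
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    results.append(model_selection.cross_val_score(
        model, X_train, Y_train, cv=kfold, scoring='accuracy'))
    names.append(name)

# One box-and-whisker plot of the 10 accuracy scores per algorithm
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
```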

You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.

Compare Algorithm Accuracy

6. Make Predictions

The KNN algorithm was the most accurate model that we tested. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.

We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.
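A final-evaluation sketch (self-contained, rebuilding the dataset and split from scikit-learn's bundled copy of iris; the exact accuracy you see may differ from the post's 0.9 because the data copy and library versions differ):

```python
import pandas
from sklearn.datasets import load_iris
from sklearn import model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Rebuild the dataset and 80/20 split
iris = load_iris()
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.DataFrame(iris.data, columns=names[:-1])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
array = dataset.values
X = array[:, 0:4].astype(float)
Y = array[:, 4]
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=seed)

# Fit KNN on all training data and evaluate once on the held-back validation set
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
```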

We can see that the accuracy is 0.9 or 90%. The confusion matrix provides an indication of the three errors made. Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).

You Can Do Machine Learning in Python

Work through the tutorial above. It will take you 5-to-10 minutes, max!

You do not need to understand everything, at least not right now. Your goal is to run through the tutorial end-to-end and get a result; a first pass does not require understanding every detail. List your questions as you go, and make heavy use of the help(“FunctionName”) syntax in Python to learn about the functions you’re using.

You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.

You do not need to be a Python programmer. The syntax of the Python language can be intuitive if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = “b”). This will get you most of the way. You are a developer, you know how to pick up the basics of a language real fast. Just get started and dive into the details later.

You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.

What about the other steps in a machine learning project? We did not cover all of the steps because this is your first project and we need to focus on the key ones: loading data, looking at the data, evaluating some algorithms and making some predictions. In later tutorials we can look at other data preparation and result-improvement tasks.


In this post, you discovered step-by-step how to complete your first machine learning project in Python.

You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.

Your Next Step

Did you work through the tutorial?

  1. Work through the above tutorial.
  2. List any questions you have.
  3. Search or research the answers.
  4. Remember, you can use the help(“FunctionName”) in Python to get help on any function.

Do you have a question? Post it in the comments below.

Frustrated With Python Machine Learning?

Develop Your Own Models and Predictions in Minutes

...with just a few lines of scikit-learn code

Discover how in my new Ebook: Machine Learning Mastery With Python

It covers self-study tutorials and end-to-end projects on topics like:
Loading data, visualization, modeling, algorithm tuning, and much more...

Finally Bring Machine Learning To
Your Own Projects

Skip the Academics. Just Results.

Click to learn more.


189 Responses to Your First Machine Learning Project in Python Step-By-Step

  1. DR Venugopala Rao Manneni June 11, 2016 at 5:58 pm #

    Awesome… But in your Blog please introduce SOM ( Self Organizing maps) for unsupervised methods and also add printing parameters ( Coefficients )code.

    • Jason Brownlee June 14, 2016 at 8:17 am #

      I generally don’t cover unsupervised methods like clustering and projection methods.

      This is because I mainly focus on and teach predictive modeling (e.g. classification and regression) and I just don’t find unsupervised methods that useful.

  2. Jan de Lange June 20, 2016 at 10:43 pm #

    Nice work Jason. Of course there is a lot more to tell about the code and the Models applied if this is intended for people starting out with ML (like me). Rather than telling which “button to press” to make work, it would be nice to know why also. I looked at a sample of you book (advanced) if you are covering the why also, but it looks like it’s limited?

    On this particular example, in my case SVM reached 99.2% and was thus the best Model. I gather this is because the test and training sets are drawn randomly from the data.

  3. Nil June 25, 2016 at 12:42 am #

    Awesome, I have tested the code it is impressive. But how could I use the model to predict if it is Iris-setosa or Iris-versicolor or Iris-virginica when I am given some values representing sepal-length, sepal-width, petal-length and petal-width attributes?

    • Jason Brownlee June 25, 2016 at 5:09 am #

      Great question. You can call model.predict() with some new data.

      For an example, see Part 6 in the above post.

      • JamieFox March 28, 2017 at 6:38 am #

        Dear Jason Brownlee, I was thinking about the same question of Nil. To be precise I was wondering how can I know, after having seen that my model has a good fit, which values of sepal-length, sepal-width, petal-length and petal-width corresponds to Iris-setosa eccc..
        For instance, if I have p predictors and two classes, how can I know which values of the predictors blend to one class or the other. Knowing the value of predictors allows me to use the model in the daily operativity. Thx

        • Jason Brownlee March 28, 2017 at 8:27 am #

          Not knowing the statistical relationship between inputs and outputs is one of the down sides of using neural networks.

          • JamieFox March 29, 2017 at 7:03 am #

            Hi Mr Jason Brownlee, thks for your answer. So all algorithms, such as SVM, LDA, random forest.. have this drawbacks? Can you suggest me something else?
            Because logistic regression is not like this, or am I wrong?

          • Jason Brownlee March 29, 2017 at 9:14 am #

            All algorithms have limitations and assumptions. For example, Logistic Regression makes assumptions about the distribution of variates (Gaussian) and more:

            Nevertheless, we can make useful models (skillful) even when breaking assumptions or pushing past limitations.

  4. Sujon September 6, 2016 at 8:19 am #

    Dear Sir,

    It seems I’m in the right place in right time! I’m doing my master thesis in machine learning from Stockholm University. Could you give me some references for laughter audio conversation to CSV file? You can send me anything on Thanks a lot and wish your very best and will keep in touch.

  5. Sujon September 6, 2016 at 8:32 am #

    Sorry I mean laughter audio to CSV conversion.

    • Jason Brownlee September 6, 2016 at 9:49 am #

      Sorry, I have not seen any laughter audio to CSV conversion tools/techniques.

  6. Roberto U September 19, 2016 at 9:17 am #

    Sweet way of condensing monstrous amount of information in a one-way street. Thanks!

    Just a small thing, you are creating the Kfold inside the loop in the cross validation. Then, you use the same seed to keep the comparison across predictors constant.

    That works, but I think it would be better to take it out of the loop. Not only is more efficient, but it is also much immediately clearer that all predictors are using the same Kfold.

    You can still justify the use of the seeds in terms of replicability; readers getting the same results on their machines.

    Thanks again!

  7. Francisco September 20, 2016 at 2:02 am #

    Hello Jaso.
    Thank you so much for your help with Machine Learning and congratulations for your excellent website.

    I am a beginner in ML and DeepLearning. Should I download Python 2 or Python 3?

    Thank you very much.


    • Jason Brownlee September 20, 2016 at 8:33 am #

      I use Python 2 for all my work, but my students report that most of my examples work in Python 3 with little change.

  8. ShawnJ October 11, 2016 at 5:24 am #


    Thank you so much for putting this together. I am been a software developer for almost two decades and am getting interested in machine learning. Found this tutorial accurate, easy to follow and very informative.

  9. Wendy G October 14, 2016 at 5:37 am #


    Thanks for the great post! I am trying to follow this post by using my own dataset, but I keep getting this error “Unknown label type: array ([some numbers from my dataset])”. So what’s the problem on earth, any possible solutions?


    • Jason Brownlee October 14, 2016 at 9:08 am #

      Hi Wendy,

      Carefully check your data. Maybe print it on the screen and inspect it. You may have some string values that you may need to convert to numbers using data preparation.

  10. fara October 20, 2016 at 7:15 am #

    hi thanks for great tutorial, i’m also new to ML…this really helps but i was wondering what if we have non-numeric values? i have mixture of numeric and non-numeric data and obviously this only works for numeric. do you also have a tutorial for that or would you please send me a source for it? thank you

    • Jason Brownlee October 20, 2016 at 8:41 am #

      Great question fara.

      We need to convert everything to numeric. For categorical values, you can convert them to integers (label encoding) and then to new binary features (one hot encoding).

  11. Mazhar Dootio October 23, 2016 at 9:14 pm #

    Hello Jason
    Thank you for publishing this great machine learning tutorial.
    It is really awesome awesome awesome………..!
    I test your tutorial on python-3 and it works well but what I face here is to load my data set from my local drive. I followed your give instructions but couldn’t be successful.
    My syntax is as under:

    import unicodedata
    url = open(r’C:\Users\mazhar\Anaconda3\Lib\site-packages\sindhi2.csv’, encoding=’utf-8′).readlines()
    names = [‘class’, ‘sno’, ‘gender’, ‘morphology’, ‘stem’,’fword’]
    dataset = pandas.read_csv(url, names=names)

    python-3 jupyter notebook does not loads this. Kindly help me in regard.

    • Jason Brownlee October 24, 2016 at 7:05 am #

      Hi Mazhar, thanks.

      Are you able to load the file on the command line away from the notebook?

      Perhaps the notebook environment is causing trouble?

  12. Mazhar Dootio October 25, 2016 at 3:22 am #

    Dear Jason
    Thank you for response
    I am using Python 3 with anaconda jupyter notebook
    so which python version you would like to suggest me and kindly write here syntax of opening local dataset file from local drive that how can I load utf-8 dataset file from my local drive.

    • Jason Brownlee October 25, 2016 at 8:32 am #

      Hi Mazhar, I teach using Python 2.7 with examples from the command line.

      Many of my students report that the code works in Python 3 and in notebooks with little or no changes.

  13. Andy October 27, 2016 at 11:59 pm #

    Great tutorial but perhaps I’m missing something here. Let’s assume I already know what model to use (perhaps because I know the data well… for example).

    knn = KNeighborsClassifier(), Y_train)

    I then use the models to predict:
    print(knn.predict(an array of variables of a record I want to classify))

    Is this where the whole ML happens?, Y_train)

    What’s the difference between this and say a non ML model/algorithm? Is it that in a non ML model I have to find the coefficients/parameters myself by statistical methods?; and in the ML model the machine does that itself?
    If this is the case then to me it seems that a researcher/coder did most of the work for me and wrap it in a nice function. Am I missing something? What is special here?

    • Jason Brownlee October 28, 2016 at 9:14 am #

      Hi Andy,

      Yes, your comment is generally true.

      The work is in the library and choice of good libraries and training on how to use them well on your project can take you a very long way very quickly.

      Stats is really about small data and understanding the domain (descriptive models). Machine learning, at least in common practice, is leaning towards automation with larger datasets and making predictions (predictive modeling) at the expense of model interpretation/understandability. Prediction performance trumps traditional goals of stats.

      Because of the automation, the focus shifts more toward data quality, problem framing, feature engineering, automatic algorithm tuning and ensemble methods (combining predictive models), with the algorithms themselves taking more of a backseat role.

      Does that make sense?

      • Andy November 3, 2016 at 10:36 pm #

        It does make sense.
        You mentioned ‘data quality’. That’s currently my field of work. I’ve been doing this statistically until now, and very keen to try a different approach. As a practical example how would you use ML to spot an error/outlier using ML instead of stats?
        Let’s say I have a large dataset containing trees: each tree record contains a specie, height, location, crown size, age, etc… (ah! suspiciously similar to the iris flowers dataset 🙂 Is ML a viable method for finding incorrect data and replace with an “estimated” value? The answer I guess is yes. For species I could use almost an identical method to what you presented here; BUT what about continuous values such as tree height?

        • Jason Brownlee November 4, 2016 at 9:08 am #

          Hi Andy,

          Maybe “outliers” are instances that cannot be easily predicted or assigned ambiguous predicted probabilities.

          Instance values can be “fixed” by estimating new values, but whole instance can also be pulled out if data is cheap.

  14. Shailendra Khadayat October 30, 2016 at 2:23 pm #

    Awesome work Jason. This was very helpful and expect more tutorials in the future.


  15. Shuvam Ghosh November 16, 2016 at 12:13 am #

    Awesome work. Students need to know how the end results will look like. They need to get motivated to learn and one of the effective means of getting motivated is to be able to see and experience the wonderful end results. Honestly, if i were made to study algorithms and understand them i would get bored. But now since i know what amazing results they give, they will serve as driving forces in me to get into details of it and do more research on it. This is where i hate the orthodox college ways of teaching. First get the theory right then apply. No way. I need to see things first to get motivated.

    • Jason Brownlee November 16, 2016 at 9:29 am #

      Thanks Shuvam,

      I’m glad my results-first approach gels with you. It’s great to have you here.

  16. Puneet November 17, 2016 at 12:08 am #

    Thanks Jason,

    while i am trying to complete this.

    # Spot Check Algorithms
    models = []
    models.append((‘LR’, LogisticRegression()))
    models.append((‘LDA’, LinearDiscriminantAnalysis()))
    models.append((‘KNN’, KNeighborsClassifier()))
    models.append((‘CART’, DecisionTreeClassifier()))
    models.append((‘NB’, GaussianNB()))
    models.append((‘SVM’, SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())

    showing below error.-

    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    IndentationError: expected an indented block-

    • Jason Brownlee November 17, 2016 at 9:54 am #

      Hi Puneet, looks like a copy-paste error.

      Check for any extra new lines or white space around that line that is reporting the error.

  17. Puneet November 17, 2016 at 12:30 am #

    Thanks Json,

    I am new to ML. need your help so i can run this.

    as i have followed the steps but when trying to build and evalute 5 model using this.

    # Spot Check Algorithms
    models = []
    models.append((‘LR’, LogisticRegression()))
    models.append((‘LDA’, LinearDiscriminantAnalysis()))
    models.append((‘KNN’, KNeighborsClassifier()))
    models.append((‘CART’, DecisionTreeClassifier()))
    models.append((‘NB’, GaussianNB()))
    models.append((‘SVM’, SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())

    I am facing the issue mentioned below:
    File "", line 13
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    IndentationError: expected an indented block

    Kindly help.

    • Martin November 18, 2016 at 5:18 am #

      Puneet, you need to indent the block (a tab or four spaces to the right). That is how you build a block in Python.
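      For reference, a correctly indented version of that loop can be sketched as follows. It uses the newer model_selection names and a tiny synthetic dataset so it runs on its own; the variable names mirror the comment above but the data is illustrative, not the iris set.

```python
# Corrected indentation: everything inside the for-loop is shifted
# four spaces to the right of the "for" statement.
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Tiny synthetic stand-in for X_train/Y_train.
X_train = np.random.RandomState(7).rand(60, 4)
Y_train = np.array([0, 1, 2] * 20)

models = [('LR', LogisticRegression()), ('CART', DecisionTreeClassifier())]
results = []
names = []
for name, model in models:
    # These lines form the block that must be indented.
    kfold = KFold(n_splits=10, random_state=7, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
```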

  18. george soilis November 17, 2016 at 10:00 pm #

    just another Python noob here,sending many regards and thanks to Jason :):)

  19. sergio November 22, 2016 at 3:29 pm #

    Does this tutorial work with other data sets? I’m trying to work on a small assignment and I want to use python

    • Jason Brownlee November 23, 2016 at 8:50 am #

      It should provide a great template for new projects sergio.

  20. Albert November 26, 2016 at 1:55 am #

    A very awesome step-by-step for me! Even though I am a beginner in Python, this taught me many things about machine learning ~ supervised ML. I appreciate your sharing!!

  21. Umar Yusuf November 27, 2016 at 4:04 am #

    Thank you for the step by step instructions. This will go a long way for newbies like me getting started with machine learning.

    • Jason Brownlee November 27, 2016 at 10:21 am #

      You’re welcome, I’m glad you found the post useful Umar.

  22. Mike P November 30, 2016 at 6:29 pm #

    Hi Jason,

    Really nice tutorial. I had one question which has had me confused. Once you choose your best model (in this instance KNN), you then train a new model to be used to make predictions against the validation set. Should one not perform k-fold cross-validation on this model to ensure we don’t overfit?

    If this is correct, how would you implement it? From my understanding, cross_val_score will not allow one to generate a confusion matrix.

    I think this is the only thing that I have struggled with in using scikit-learn. If you could help me it would be much appreciated.

    • Jason Brownlee December 1, 2016 at 7:26 am #

      Hi Mike. No.

      Cross-validation is just a method to estimate the skill of a model on new data. Once you have the estimate you can get on with things, like confirming you have not fooled yourself (hold out validation dataset) or make predictions on new data.

      The skill you report is the cross val skill with the mean and stdev to give some idea of confidence or spread.

      Does that make sense?

      • Mike December 2, 2016 at 1:30 am #

        Hi Jason,

        Thanks for the quick response. So to make sure I understand: one would use cross-validation to get an estimate of the skill of a model (the mean of the cross-val scores) or to choose the correct hyperparameters for a particular model.

        Once you have this information, you can just go ahead and train the chosen model with the full training set and test it against the validation set or new data?

        • Jason Brownlee December 2, 2016 at 8:17 am #

          Hi Mike. Correct.

          Additionally, if the validation result confirms your expectations, you can go ahead and train the model on all data you have including the validation dataset and then start using it in production.

          This is a very important topic. I think I’ll write a post about it.
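          That workflow can be sketched end to end; the synthetic data and KNN here stand in for whatever dataset and chosen model you are working with.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data standing in for a real problem.
rng = np.random.RandomState(7)
X = rng.rand(150, 4)
y = rng.randint(0, 3, 150)

# 1. Hold back a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=7)

# 2. Estimate skill on the training set with cross-validation.
cv_scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=10)

# 3. Confirm the estimate on the held-out validation set.
model = KNeighborsClassifier().fit(X_train, y_train)
val_score = model.score(X_val, y_val)

# 4. If the validation result matches expectations, refit on all data
#    (training + validation) before using the model in production.
final_model = KNeighborsClassifier().fit(X, y)
```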

  23. Sahana Venkatesh November 30, 2016 at 8:15 pm #

    This is amazing 🙂 You boosted my morale

  24. Jhon November 30, 2016 at 8:27 pm #

    While doing data visualization and running the dataset.plot(……..) commands, I am getting the following error. Kindly tell me how to fix it:

    ]], dtype=object)

    • Jason Brownlee December 1, 2016 at 7:28 am #

      Looks like no data Jhon. It also looks like it’s printing out an object.

      Are you running in a notebook or on the command line? The code was intended to be run directly (e.g. command line).

  25. Brendon A. Kay December 1, 2016 at 4:20 am #

    Hi Jason,

    Great tutorial. I am a developer with a computer science degree and a heavy interest in machine learning and mathematics, although I don’t quite have the academic background for the latter except for what was required in college. So, this website has really sparked my interest as it has allowed me to learn the field in sort of the “opposite direction”.

    I did notice when executing your code that there was a deprecation warning for the sklearn.cross_validation module. They recommend switching to sklearn.model_selection.

    When switching the modules I adjusted the following line…

    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)

    to:

    kfold = model_selection.KFold(n_folds=num_folds, random_state=seed)

    … and it appears to be working okay. Of course, I had switched all other instances of cross_validation as well, but it seemed to be that the KFold() method dropped the n (number of instances) parameter, which caused a runtime error. Also, I dropped the num_instances variable.

    I could have missed something here, so please let me know if this is not a valid replacement, but thought I’d share!

    Once again, great website!

    • Jason Brownlee December 1, 2016 at 7:33 am #

      Thanks for the support and the kind words Brendon. I really appreciate it (you made my day!)

      Yes, the API has changed/is changing and your updates to the tutorial look good to me, except I think n_folds has become n_splits.

      I will update this example for the new API very soon.
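      For anyone else migrating, the mapping can be sketched as below: n_splits replaces n_folds, and the number of instances is no longer passed because KFold infers the dataset size at split time. The tiny dataset is illustrative only.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Illustrative stand-in data.
X_train = np.random.RandomState(7).rand(30, 4)
Y_train = np.array([0, 1] * 15)

# Old (0.17): cross_validation.KFold(n=num_instances, n_folds=10, random_state=seed)
# New (0.18+): no instance count, and n_folds is renamed n_splits.
kfold = KFold(n_splits=10, random_state=7, shuffle=True)
cv_results = cross_val_score(GaussianNB(), X_train, Y_train, cv=kfold, scoring='accuracy')
```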

  26. Sergio December 1, 2016 at 3:41 pm #

    I’m still having a little trouble understanding step 5.1. I’m trying to apply this tutorial to a new data set, but when I try to evaluate the models from 5.3 I don’t get a result.

    • Jason Brownlee December 2, 2016 at 8:13 am #

      What is the problem exactly Sergio?

      Step 5.1 should create a validation dataset. You can confirm the dataset by printing it out.

      Step 5.3 should print the result of each algorithm as it is trained and evaluated.

      Perhaps check for a copy-paste error or something?

      • sergio December 2, 2016 at 9:13 am #

        Does this tutorial work the exact same way for other data sets? Because I’m not using the Hello World dataset.

        • Jason Brownlee December 3, 2016 at 8:23 am #

          The project template is quite transferable.

          You will need to adapt it for your data and for the types of algorithms you want to test.

  27. Jean-Baptiste Hubert December 11, 2016 at 12:17 am #

    Hi Sir,
    Thank you for the information.
    I am currently a student at an engineering school in France.
    I am working on a data mining project; indeed, I have a lot of data (40 GB) about the prices of the stocks of many companies in the CAC40.
    My goal is to predict the evolution of the yields, and I think that a neural network could be useful.
    My idea is: I take for X the yields from t=0 to t=n and for Y the yields from t=1 to t=n, and the program should find a relation between the data.
    Is that possible? Is it a good way to predict the evolution of the yield?
    Thank you for your time

  28. Ernest Bonat December 15, 2016 at 5:33 pm #

    Hi Jason,

    If I include an new item in the models array as:

    models.append(('LNR - Linear Regression', LinearRegression()))

    with the library:

    from sklearn.linear_model import LinearRegression

    I got an error in \sklearn\utils\, line 529, in check_X_y:

    y = y.astype(np.float64)
    ValueError: could not convert string to float: 'Iris-setosa'

    Let me know how best to fix that! As you can see from my code, I would like to include the linear regression algorithm in my model array too!

    Thank you for your help,


    • Jason Brownlee December 16, 2016 at 5:39 am #

      Hi Ernest, it is a classification problem. We cannot use LinearRegression.

      Try adding another classification algorithm to the list.
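      A small sketch of why: a regressor requires numeric targets, so the string class labels raise exactly the ValueError quoted above, while any classifier accepts them directly. The toy data below is illustrative, not the full iris set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy features with string class labels, like the iris "class" column.
X = np.random.RandomState(7).rand(20, 4)
y = np.array(['Iris-setosa', 'Iris-versicolor'] * 10)

# A regressor needs float targets, so string labels raise ValueError.
try:
    LinearRegression().fit(X, y)
    regression_ok = True
except ValueError:
    regression_ok = False

# A classifier handles the string labels directly.
clf = LogisticRegression().fit(X, y)
```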

  29. Gokul Iyer December 20, 2016 at 2:29 pm #

    Great tutorial! Quick question: when we create the models, we do models.append((name of algorithm, algorithm function)). Is models an array? It seems like a dictionary, since we have a key-value mapping (algorithm name and algorithm function). Thank you!

    • Jason Brownlee December 20, 2016 at 2:47 pm #

      It is a list of tuples where each tuple contains a string name and a model object.
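      A tiny sketch of that structure (the model choices are arbitrary): a list preserves order and allows duplicates, unlike a dictionary, and tuple unpacking in the for-loop gives each part a name.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# A list of (name, model) tuples -- not a dictionary.
models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('NB', GaussianNB()))

# Tuple unpacking names each element of the pair.
for name, model in models:
    print(name, type(model).__name__)
```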

  30. Sasanka ghosh December 21, 2016 at 4:55 am #

    Hi Jason / any gurus,
    Good post and I will follow it, but my question may be a little off track.
    I am asking this as I am a data modeller / aspiring data architect.

    I feel that as a guru you can clarify my doubt. The question is at the end.

    In the current data management environment there are:

    1. Data architecture / physical implementation and choosing appropriate tools: back end, storage, NoSQL, SQL, MPP, sharding, columnar, scale up/out, distributed processing, etc.

    2. In addition to DB-based procedural languages, proficiency in at least one of Java/Python/Scala, etc.

    3. Then comes AI, machine learning, neural networks, etc.

    My question is regarding point 3.

    I believe those are algorithms which need deep functional knowledge and years of experience to add any value to a business.

    They are independent of data models and their physical implementation, and part of the business user domain, not the data architecture domain.

    If I take your example above, say 10k users trying to do a similar kind of thing, then points 1 and 2 will be the data architect's domain and point 3 will be the business analyst's domain; point 2 may overlap between them to some extent.

    A data architect need not be hands-on/proficient in algorithms, i.e. should have just some basic idea, as the data architect's job is not to invent business logic but to implement the business logic physically to satisfy business users/analysts.

    Am I correct in my assumption? I find that certain things are nearly mutually exclusive, and expectations/benchmarks should be set accordingly.

    sasanka ghosh

    • Jason Brownlee December 21, 2016 at 8:46 am #

      Hi Sasanka, sorry, I don’t really follow.

      Are you able to simplify your question?

      • Sasanka ghosh December 21, 2016 at 9:25 pm #

        Hi Jason,
        Many thanks that you bothered to reply.

        I tried to rephrase and be concise, but it is still verbose. Apologies for that.

        Is it expected of a data architect to be an algorithm expert as well as a data model/database expert?

        Algorithms are business centric, and most of the time specific to a particular domain of business.

        To give you an example, SHORTEST PATH (take it as just an example in making my point):
        an organization is providing an app for that service.

        CAVEAT: someone from a computer science department may say that it is a basic thing you learn, but I feel it is still an algorithm, not a data structure.

        If we take the above scenario, in simplistic terms the requirement is as follows:

        1. There will be, say, a million registered users.
        2. One can say at least 10% are using the app at the same time.
        3. At any time they can change their direction as per a contingency, like a military op, so dumping the partial weighted graph to their device is not an option; users will be connected to the main server/server cluster.
        4. The challenge is storing the spatial data in the DB in the correct data model, with scale out and fault tolerance.
        5. Implement the shortest path algorithm and display it dynamically using Python/Java/Cypher/Oracle Spatial/Titan, etc.

        My question is: can a data architect work on this project who does not know the shortest path algorithm, but has sufficient knowledge in other areas, with the algorithm provided to him/her in verbose terms to implement?

        I am asking this because nowadays people are offering ready-made courses, i.e. machine learning, NLP, data scientist, etc., and the scenario is confusing.
        I feel it is misleading, as no one can become an expert in a science overnight, and vice versa.

        I feel algorithms are pure science, a separate discipline.
        But to implement them at large scale, scientists/programmers/architects need to work in tandem with minimal overlap but continuous discussion.

        Last but not least, if I make some sense, what learning curve should I follow to try to become a data architect for unstructured data in general?

        sasanka ghosh

        • Jason Brownlee December 22, 2016 at 6:35 am #

          Really this depends on the industry and the job. I cannot give you good advice for the general case.

          You can get valuable results without being an expert, this applies to most fields.

          Algorithms are a tool, use them as such. They can also be a science, but we practitioners don’t have the time.

          I hope that helps.

          • Sasanka ghosh December 22, 2016 at 7:00 pm #

            Thanks Jason.

            I appreciate your time and response.

            I just wanted to validate this with a real techie/guru like you, as the confusion and lack of a perfect answer are being exploited by management/HR to their own advantage, to practice a use-and-throw policy, or to make people sycophants/redundant without following basic management principles.

            The tech guys, except a few geniuses, are always toiling while management is popping the cork and enjoying at the same time.

            sasanka ghosh

  31. Raveen Sachintha December 21, 2016 at 8:51 pm #

    Hello Jason,
    Thank you very much for these tutorials. I am new to ML, and I find it very encouraging to do an end-to-end project to get started with, rather than reading and reading without seeing an end. This really helped me.

    One question: when I tried this, I got the highest accuracy for SVM.

    LR: 0.966667 (0.040825)
    LDA: 0.975000 (0.038188)
    KNN: 0.983333 (0.033333)
    CART: 0.983333 (0.033333)
    NB: 0.975000 (0.053359)
    SVM: 0.991667 (0.025000)

    So I decided to try that out too:

    svm = SVC()
    svm.fit(X_train, Y_train)
    prediction = svm.predict(X_validation)

    these were my results using SVM,

    [[ 7 0 0]
    [ 0 10 2]
    [ 0 0 11]]
    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 7
    Iris-versicolor 1.00 0.83 0.91 12
    Iris-virginica 0.85 1.00 0.92 11

    avg / total 0.94 0.93 0.93 30

    I am still learning to read these results, but can you tell me why this happened? Why did I get a higher accuracy for SVM instead of KNN? Have I done anything wrong? Or is this possible?

    • Jason Brownlee December 22, 2016 at 6:33 am #

      The results reported are a mean estimated score with some variance (spread).

      It is an estimate on the performance on new data.

      When you apply the method on new data, the performance may be in that range. It may be lower if the method has overfit the training data.

      Overfitting is a challenge and developing a robust test harness to ensure we don’t fool/mislead ourselves during model development is important work.

      I hope that helps as a start.

  32. inzar December 25, 2016 at 7:04 am #

    I want to buy your book.
    I tried this tutorial and the result is very awesome.

    I want to learn from you.


  33. lou December 25, 2016 at 7:29 am #

    Why the leading comma in X = array[:,0:4]?
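    The comma separates the row slice from the column slice in NumPy's 2D indexing; a small self-contained sketch of what `array[:,0:4]` and `array[:,4]` select, using a toy array rather than the iris data:

```python
import numpy as np

# A 4x5 toy array standing in for the loaded dataset values.
array = np.arange(20).reshape(4, 5)

# array[rows, columns]: ':' before the comma selects every row,
# '0:4' after it selects columns 0 through 3, and '4' selects column 4.
X = array[:, 0:4]
Y = array[:, 4]
```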

  34. Thinh December 26, 2016 at 5:05 am #

    In section 1.2, you should warn readers to install scikit-learn.

    • Jason Brownlee December 26, 2016 at 7:49 am #

      Thanks for the note.

      Please see section 1.1 Install SciPy Libraries where it says:

      There are 5 key libraries that you will need to install… sklearn

  35. Tijo L. Peter December 28, 2016 at 10:34 pm #

    Best ML tutorial for Python. Thank you, Jason.

  36. baso December 29, 2016 at 12:38 am #

    When I tried to run it, I got the error message "TypeError: Empty 'DataFrame': no numeric data to plot". Help me.

    • Jason Brownlee December 29, 2016 at 7:18 am #

      Sorry to hear that.

      Perhaps check that you have loaded the data as you expect and that the loaded values are numeric and not strings. Perhaps print the first few rows: print(df.head(5))

      • baso December 29, 2016 at 1:05 pm #

        thanks very much Jason for your time

        It worked. This tutorial is very helpful for me. I'm new to machine learning, but could you explain your simple project above? I did not see X_test and a target.

        Regards in advance

  37. Andrea January 5, 2017 at 1:42 am #

    Thank you for sharing this. I bumped into some installation problems.
    Eventually, to get all dependencies installed on MacOS 10.11.6 I had to run this:

    brew install python
    pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose scikit-learn
    export PATH=$PATH:~/Library/Python/2.7/bin

    • Jason Brownlee January 5, 2017 at 9:21 am #

      Thanks for sharing Andrea.

      I’m a macports guy myself, here’s my recipe:

  38. Sohib January 6, 2017 at 6:26 pm #

    Hi Jason,
    I am following this page as a beginner and have installed Anaconda as recommended.
    As I am on win 10, I installed Anaconda 4.2.0 For Windows Python 2.7 version (x64) and
    I am using Anaconda’s Spyder (python 2.7) IDE.

    I checked all the versions of libraries (as shown in 1.2 Start Python and Check Versions) and got results like below:

    Python: 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
    scipy: 0.18.1
    numpy: 1.11.1
    matplotlib: 1.5.3
    pandas: 0.18.1
    sklearn: 0.17.1

    At the 2.1 Import libraries section, I imported all of them and tried to load the data as shown in
    2.2 Load Dataset. But when I run it, it doesn’t show an output; instead, there is an error:

    Traceback (most recent call last):
    File "C:\Users\gachon\.spyder\", line 4, in
    from sklearn import model_selection
    ImportError: cannot import name model_selection

    Below is my code snippet:

    import pandas
    from pandas.tools.plotting import scatter_matrix
    import matplotlib.pyplot as plt
    from sklearn import model_selection
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    url = ""
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pandas.read_csv(url, names=names)

    When I delete the "from sklearn import model_selection" line, I get the expected result (150, 5).

    Am I missing something here?

    Thank you for your time and endurance!

    • Jason Brownlee January 7, 2017 at 8:23 am #

      Hi Sohib,

      You must have scikit-learn version 0.18 or higher installed.

      Perhaps Anaconda has documentation on how to update sklearn?

      • Sohib January 10, 2017 at 12:15 pm #

        Thank you for reply.

        I updated the scikit-learn version to 0.18.1 and it helped.
        The error disappeared and the result is shown, but one statement,

        'import sitecustomize' failed; use -v for traceback

        is printed above the result.
        I tried to find out why, but could not find the reason.
        Is this going to be a problem in my further steps?
        How can I solve it?

        Thank you in advance!

        • Jason Brownlee January 11, 2017 at 9:25 am #

          I’m glad to hear it fixed your problem.

          Sorry, I don’t know what “import sitecustomize” is or why you need it.

  39. Vishakha January 7, 2017 at 10:10 pm #

    Can I get the same tutorial with Java?

  40. Abhinav January 8, 2017 at 8:27 pm #

    Hi Jason,

    Nice tutorial.

    In the univariate plots, you mentioned the Gaussian distribution.

    According to the univariate plots, sepal-width has a Gaussian distribution. You said there are 2 variables with a Gaussian distribution. Please tell me the other.


    • Jason Brownlee January 9, 2017 at 7:49 am #

      The distribution of the others may be multi-modal. Perhaps a double Gaussian.

  41. Thinh January 13, 2017 at 5:07 am #

    Hi, Jason. Could you please tell me why you chose KNN in the example above?

    • Jason Brownlee January 13, 2017 at 9:16 am #

      Hi Thinh,

      No reason, other than it is an easy algorithm to run and understand, and a good algorithm for a first tutorial.

  42. Scott P January 13, 2017 at 10:25 pm #

    Hi Jason,

    I’m trying to use this code with the KDD Cup ’99 dataset, and I am having trouble with label encoding my dataset into numerical values.

    import pandas
    import numpy
    from pandas.tools.plotting import scatter_matrix
    import matplotlib.pyplot as plt
    from sklearn import preprocessing
    from sklearn import cross_validation
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.preprocessing import LabelEncoder
    from collections import defaultdict

    #Load KDD dataset
    data_set = "NSL-KDD/KDDTrain+.txt"
    names = ['duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot','num_failed_logins','logged_in','num_compromised','su_attempted','num_root','num_file_creations',

    #Diabetes Dataset
    #data_set = "Datasets/"
    #names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    #data_set = "Datasets/"
    #names = ['sepal_length','sepal_width','petal_length','petal_width','class']
    dataset = pandas.read_csv(data_set, names=names)

    array = dataset.values
    X = array[:,0:40]
    Y = array[:,40]

    label_encoder = LabelEncoder()
    label_encoder = label_encoder.fit(Y)
    label_encoded_y = label_encoder.transform(Y)

    validation_size = 0.20
    seed = 7
    X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(X, label_encoded_y, test_size=validation_size, random_state=seed)

    # Test options and evaluation metric
    num_folds = 7
    num_instances = len(X_train)
    seed = 7
    scoring = ‘accuracy’

    # Algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))

    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
        cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100) # multiplying by 100 to show percentage

    # Compare Algorithms
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(111)

    Am I doing something wrong with the LabelEncoding process?

  43. Dan January 14, 2017 at 4:56 am #

    Hi, I’m running a bit of a different setup than yours.

    The modules and version of python I’m using are more recent releases:

    Python: 3.5.2 |Anaconda 4.2.0 (32-bit)| (default, Jul 5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]
    scipy: 0.18.1
    numpy: 1.11.3
    matplotlib: 1.5.3
    pandas: 0.19.2
    sklearn: 0.18.1

    And I’ve gotten SVM as the best algorithm in terms of accuracy at 0.991667 (0.025000).

    Would you happen to know why this is, considering more recent versions?

    I also happened to get a rather different boxplot but I’ll leave it at what I’ve said thus far.

  44. Duncan Carr January 17, 2017 at 1:44 am #

    Hi Jason

    I can’t tell you how grateful I am … I have been trawling through lots of ML stuff to try to get started with a “toy” example. Finally I have found the tutorial I was looking for. Anaconda had an old sklearn (0.17.1) for Windows, which caused the error "ImportError: cannot import name 'model_selection'". That was fixed by running "pip install -U scikit-learn" from the Anaconda command-line prompt, which upgraded it to 0.18. After that, everything in your imports was fine.

    All other tutorials were either too simple or too complicated. Usually the latter!

    Thank you again 🙂

    • Jason Brownlee January 17, 2017 at 7:39 am #

      Glad to hear it Duncan.

      Thanks for the tip for Anaconda users.

      I’m here to help if you have questions!

  45. Malathi January 17, 2017 at 3:13 am #

    Hi Jason,

    Wonderful service. All of your tutorials are very helpful
    to me. Easy to understand.

    Expecting more tutorials on deep neural networks.


    • Jason Brownlee January 17, 2017 at 7:40 am #

      You’re very welcome Malathi, glad to hear it.

  46. Duncan Carr January 17, 2017 at 7:32 pm #

    Hi Jason

    I managed to get it all working – I am chuffed to bits.

    I get exactly the same numbers in the classification report as you do … however, when I changed both seeds to 8 (from 7), then ALL of the numbers end up being 1. Is this good, or bad? I am a bit confused.

    Thanks again.

    • Jason Brownlee January 18, 2017 at 10:14 am #

      Well done Duncan!

      What do you mean all the numbers end up being one?

  47. Duncan Carr January 18, 2017 at 8:02 pm #

    Hi Jason

    I’ve output the "accuracy_score", "confusion_matrix" & "classification_report" for seeds 7, 9 & 10. Why am I getting a perfect score with seed=9? Many thanks.

    Seed 7:
    [[10 0 0]
    [ 0 8 1]
    [ 0 2 9]]

    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 10
    Iris-versicolor 0.80 0.89 0.84 9
    Iris-virginica 0.90 0.82 0.86 11

    avg / total 0.90 0.90 0.90 30

    Seed 9:
    [[13 0 0]
    [ 0 9 0]
    [ 0 0 8]]

    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 13
    Iris-versicolor 1.00 1.00 1.00 9
    Iris-virginica 1.00 1.00 1.00 8

    avg / total 1.00 1.00 1.00 30

    Seed 10:
    [[10 0 0]
    [ 0 12 1]
    [ 0 0 7]]

    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 10
    Iris-versicolor 1.00 0.92 0.96 13
    Iris-virginica 0.88 1.00 0.93 7

    avg / total 0.97 0.97 0.97 30

  48. shivani January 20, 2017 at 8:40 pm #

    from sklearn import model_selection
    is showing "ImportError: cannot import name model_selection".

    • Jason Brownlee January 21, 2017 at 10:25 am #

      You need to update your version of sklearn to 0.18 or higher.

  49. Jim January 22, 2017 at 5:06 pm #


    Excellent Tutorial. New to Python and set a New Years Resolution to try to understand ML. This tutorial was a great start.

    I hit the issue of the sklearn version. I am using Ubuntu 16.04 LTS, which comes with python-sklearn version 0.17. To update to the latest I used the site:
    which gives the commands to add the neuro repository and pull down the 0.18 version.

    Also I would like to note there is an error in section 3.1 Dimensions of the Dataset. Your text states 120 Instances when in fact 150 are returned, which you have in the Printout box.

    Keep up the good work.


    • Jason Brownlee January 23, 2017 at 8:37 am #

      I’m glad to hear you worked around the version issue Jim, nice work!

      Thanks for the note on the typo, fixed!

  50. Raphael January 23, 2017 at 4:15 pm #

    Hi Jason, nice work here. I’m new to your blog. What does the y-axis in the box plots represent?

    • Jason Brownlee January 24, 2017 at 11:01 am #

      Hi Raphael,

      The y-axis in the box-and-whisker plots are the scale or distribution of each variable.

  51. Kayode January 23, 2017 at 8:42 pm #

    Thank you for this wonderful tutorial.

  52. Raphael January 26, 2017 at 2:28 am #

    hi Jason,

    In this line

    what other variable than size could I use? I replaced size with count and got something similar, but not quite. I got key errors for the other things I tried. Is size just a standard command?

  53. Scott January 26, 2017 at 10:35 pm #


    I’m trying to use a different data set (KDD Cup ’99) with the above code, but when I try to run it after modifying “names” and the array to account for the new features, it will not run; it gives me the error: “cannot convert string to a float”.

    In my data set, there are 3 columns that are text and the rest are integers and floats. I have tried label encoding but it gives me the same error. Do you know how I can resolve this?

    • Jason Brownlee January 27, 2017 at 12:08 pm #

      Hi Scott,

      If the values are indeed strings, perhaps you can use a method that supports strings instead of numbers, perhaps like a decision tree.

      If there are only a few string values for the column, a label encoding as integers may be useful.

      Alternatively, perhaps you could try removing those string features from the dataset.

      I hope that helps, let me know how you go.
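      A hedged sketch of the label-encoding option, with hypothetical column values (not the actual KDD fields), keeping one encoder per string column so each mapping can be reversed later:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical mixed dataset: column 1 is text, the others are numeric.
data = np.array([[0.1, 'tcp', 5.0],
                 [0.2, 'udp', 3.0],
                 [0.3, 'tcp', 1.0]], dtype=object)

# Encode each string column separately with its own encoder.
encoders = {}
for col in [1]:
    encoders[col] = LabelEncoder()
    data[:, col] = encoders[col].fit_transform(data[:, col])

# Now every column is numeric and the array can be cast to floats.
X = data.astype(float)
```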

  54. Weston Gross January 31, 2017 at 10:41 am #

    I would like a chart to see the grand scope of everything that Python can do for data science.

    You list 6 basic steps. For example, in the visualizing step, I would like to know what all the charts are, what they are used for, and what Python library each comes from.

    I am extremely new to all this, and understand that some steps have to happen, for example:

    1. Get Data
    2. Validate Data
    3. Missing Data
    4. Machine Learning
    5. Display Findings

    So for missing data, there are techniques to restore the data. What are they and what libraries are used?

    • Jason Brownlee February 1, 2017 at 10:36 am #

      You can handle missing data in a few ways such as:

      1. Remove rows with missing data.
      2. Impute missing data (e.g. use the Imputer class in sklearn)
      3. Use methods that support missing data (e.g. decision trees)

      I hope that helps.
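      The first two options can be sketched with pandas; the mean-fill below mirrors what sklearn's Imputer class does, and the toy frame is purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame with missing values in both columns.
df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# Option 1: remove rows with missing data.
dropped = df.dropna()

# Option 2: impute missing values, here with the column mean.
imputed = df.fillna(df.mean())
```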

  55. Mohammed February 1, 2017 at 1:11 am #

    Hi Jason,

    I am a non-tech data analyst and have used SPSS extensively on academic/business data over the last 6 years.

    I understood the above example very easily.

    I want to work on search and language translation and develop apps.

    What’s the best way forward?

    Do you also provide Skype training / project mentoring?

    Thanks in advance.

    • Jason Brownlee February 1, 2017 at 10:51 am #

      Thanks Mohammed.

      Sorry, I don’t have good advice for language translation applications.

  56. Mohammed February 1, 2017 at 1:14 am #

    I don’t have any development/coding background.

    However, following your guidelines I downloaded SciPy and tested the code.

    Everything worked perfectly fine.

    Looking forward to going all in…

  57. Purvi February 1, 2017 at 7:31 am #

    Hi Jason,

    I am new to Machine learning and am trying out the tutorial. I have following environment :

    >>> import sys
    >>> print('Python: {}'.format(sys.version))
    Python: 2.7.10 (default, Jul 13 2015, 12:05:58)
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
    >>> import scipy
    >>> print('scipy: {}'.format(scipy.__version__))
    scipy: 0.18.1
    >>> import numpy
    >>> print('numpy: {}'.format(numpy.__version__))
    numpy: 1.12.0
    >>> import matplotlib
    >>> print('matplotlib: {}'.format(matplotlib.__version__))
    matplotlib: 2.0.0
    >>> import pandas
    >>> print('pandas: {}'.format(pandas.__version__))
    pandas: 0.19.2
    >>> import sklearn
    >>> print('sklearn: {}'.format(sklearn.__version__))
    sklearn: 0.18.1

    When I try to load the iris dataset, it loads up fine and prints dataset.shape but then my python interpreter hangs. I tried it out 3-4 times and everytime it hangs after I run couple of commands on dataset.
    >>> url = ""
    >>> names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    >>> dataset = pandas.read_csv(url, names=names)
    >>> print(dataset.shape)
    (150, 5)
    >>> print(dataset.head(20))
    sepal-length sepal-width petal-length petal-width class
    0 5.1 3.5 1.4 0.2 Iris-setosa
    1 4.9 3.0 1.4 0.2 Iris-setosa
    2 4.7 3.2 1.3 0.2 Iris-setosa
    3 4.6 3.1 1.5 0.2 Iris-setosa
    4 5.0 3.6 1.4 0.2 Iris-setosa
    5 5.4 3.9 1.7 0.4 Iris-setosa
    6 4.6 3.4 1.4 0.3 Iris-setosa
    7 5.0 3.4 1.5 0.2 Iris-setosa
    8 4.4 2.9 1.4 0.2 Iris-setosa
    9 4.9 3.1 1.5 0.1 Iris-setosa
    10 5.4 3.7 1.5 0.2 Iris-setosa
    11 4.8 3.4 1.6 0.2 Iris-setosa
    12 4.8 3.0 1.4 0.1 Iris-setosa
    13 4.3 3.0 1.1 0.1 Iris-setosa
    14 5.8 4.0 1.2 0.2 Iris-setosa
    15 5.7 4.4 1.5 0.4 Iris-setosa
    16 5.4 3.9 1.3 0.4 Iris-setosa
    17 5.1 3.5 1.4 0.3 Iris-setosa
    18 5.7 3.8 1.7 0.3 Iris-setosa
    19 5.1 3.8 1.5 0.3 Iris-setosa
    >>> print(datase

    It does not let me type anything further.
    I would appreciate your help.


    • Jason Brownlee February 1, 2017 at 10:55 am #

      Hi Purvi, sorry to hear that.

      Perhaps you’re able to comment out the first parts of the tutorial and see if you can progress?

  58. sam February 5, 2017 at 9:24 am #

    Hi Jason

    I am planning to use Python to predict customer attrition. I have a current list of attrited customers with their attributes. I would like to use them as test data and use them to predict any new customers. Can you please help me approach the problem in Python?

    my test data :

    customer1 attribute1 attribute2 attribute3 … attrited

    my new data

    customer N, attribute 1,…… ?

    Thanks for your help in advance.

  59. Kiran Prajapati February 7, 2017 at 6:31 pm #

    Hello Sir, I want to check how accurate my data is. In my data I have 4 columns:

    Taluka, Total_yield, Rain(mm), types_of_soil

    Nasik 12555 63.0 dark black
    Igatpuri 1560 75.0 shallow

    So on,
    First, I have to check whether the data is accurate or not, and the next step is to find the predicted yield using a regression model.
    Here is my model: Total_yield = Rain + types_of_soil

    I use a 0 and 1 binary variable for types_of_soil.

    Can you please help me: how do I calculate how accurate the data is, as a percentage?
    And how do I find the predicted yield?

  60. Saby February 15, 2017 at 9:11 am #

    # Load dataset
    url = ""
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pandas.read_csv(url, names=names)

    The dataset should load without incident.

    If you do have network problems, you can download the file into your working directory and load it using the same method, changing url to the local file name.

    I am a very beginner Python learner (trying to learn ML as well). I tried to load data from my local file but could not succeed. Will you help me out with how exactly the code should be written to open the data from a local file?

    • Jason Brownlee February 15, 2017 at 11:39 am #


      Download the file into your current working directory (where your Python file is located and where you are running the code from).

      Then load it with pandas.read_csv() as above, using the local filename in place of the URL.
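      For example, a self-contained sketch (the two data rows are made-up stand-ins so the example runs on its own; in practice you would download the full iris.data file into your working directory):

```python
import pandas

# write a tiny file in the iris format so this sketch is self-contained;
# normally you would download the real iris.data into the working directory
with open('iris.data', 'w') as f:
    f.write('5.1,3.5,1.4,0.2,Iris-setosa\n')
    f.write('4.9,3.0,1.4,0.2,Iris-setosa\n')

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv('iris.data', names=names)  # local filename instead of the url
print(dataset.shape)
```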

  61. ant February 15, 2017 at 9:54 pm #

    Hi, Jason, first of all thank so much for this amazing lesson.

    Just out of curiosity, I computed all the values obtained with dataset.describe() in Excel, and for the 25% value of petal-length I get 1.57500 instead of 1.60000. I have googled for formatting describe() output unsuccessfully. Is there an explanation? Thanks

    • Jason Brownlee February 16, 2017 at 11:07 am #

      Not sure, perhaps you could look into the Pandas source code?

      • ant February 17, 2017 at 12:23 am #

        OK, I will do.

  62. jacques February 16, 2017 at 4:42 pm #

    Hi Jason

    I don’t quite follow the KFold section.

    We started off with 150 data entries (rows).

    We then use an 80/20 split for training/validation, which leaves us with 120.

    The split of 10 boggles me.
    Does it take 10 items from each class and train with 9? What does the 1 left over do then?

    • Jason Brownlee February 17, 2017 at 9:52 am #

      Hi jacques,

      The 120 records are split into 10 folds. The model is trained on the first 9 folds and evaluated on the records in the 10th. This is repeated so that each fold is given a chance to be the hold out set. 10 models are trained, 10 scores collected and we report the mean of those scores as an estimate of the performance of the model on unseen data.

      Does that help?
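      The fold arithmetic above can be checked with a short sketch (assuming scikit-learn's KFold; shuffle=True is added here because newer releases require it when a random_state is given):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(120)  # stand-in for the 120 training records
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
sizes = [(len(train_idx), len(test_idx)) for train_idx, test_idx in kfold.split(X)]
# 10 iterations: each trains on 9 folds (108 records) and holds out 1 fold (12 records)
```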

  63. Alhassan February 17, 2017 at 4:02 pm #

    I am trying to integrate machine learning into a PHP website I have created. Is there any way I can do that using the guidelines you provided above?

    • Jason Brownlee February 18, 2017 at 8:34 am #

      I have not done this Alhassan.

      Generally, I would advise developing a separate service that could be called using REST calls or similar.

      If you are working on a prototype, you may be able to call out to a program or script from cgi-bin, but this would require careful engineering to be secure in a production environment.

  64. Simão Gonçalves February 20, 2017 at 1:27 am #

    Hi Jason! This tutorial was a great help, I’m truly grateful for this, so thank you.

    I have one question about the tutorial though: in the scatterplot matrix I can’t understand how we make the dots in the graphs whose variables have no relationship between them (like sepal-length with petal-width).

    Could you or someone explain that please? How do you make a dot that represents the relationship between a certain sepal-length and a certain petal-width?

    • Jason Brownlee February 20, 2017 at 9:30 am #

      Hi Simão,

      The x-axis is taken for the values of the first variable (e.g. sepal_length) and the y-axis is taken for the second variable (e.g. petal_width).

      Does that help?

    • Yopo February 21, 2017 at 4:35 am #

      You match each iris instance’s length and width with each other. For example, iris instance number one is represented by a dot, and the dot’s values are that iris’s length and width. So when you take all these values and put them on a graph, you are basically checking to see if there is a relation. As you can see, in some of these plots the dots are scattered all around, but the petal-width vs petal-length graph seems to be linear! This means that those two properties are clearly related. Hope this helped!

  65. Sébastien February 20, 2017 at 9:34 pm #

    Hi Jason,

    from France, and just to say “Thank you for this very clear tutorial!”


  66. Raj February 27, 2017 at 2:53 am #

    Hi Jason,
    I am new to ML & Python. Your post is encouraging and straight to the point of execution. Anyhow, I am facing the below error:

    >>> validataion_size = 0.20
    >>> X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state = seed)
    Traceback (most recent call last):
    File "", line 1, in
    NameError: name 'validation_size' is not defined

    What could I have missed? I didn’t get any errors in the previous steps.

    My Environment details:
    OS: Windows 10
    Python : 3.5.2
    scipy : 0.18.1
    numpy : 1.11.1
    sklearn : 0.18.1
    matplotlib : 0.18.1

    • Jason Brownlee February 27, 2017 at 5:54 am #

      Hi Raj,

      Double check you have the code from section “5.1 Create a Validation Dataset” where validation_size is defined (note that in your snippet the variable is typed validataion_size, with an extra “a”).

      I hope that helps.

  67. Roy March 2, 2017 at 7:38 am #

    Hey Jason,

    Can you please explain what precision, recall, f1-score, and support actually refer to?
    Also, what do the numbers in a confusion matrix refer to?
    [[ 7 0 0]
    [ 0 11 1]
    [ 0 2 9]]
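    For readers wondering the same: in scikit-learn’s confusion matrix, row i counts samples whose actual class is i, and column j counts how they were predicted, so the diagonal holds the correct predictions and the off-diagonal cells are mistakes. A minimal sketch with made-up labels:

```python
from sklearn.metrics import classification_report, confusion_matrix

# made-up actual and predicted labels, purely for illustration
actual    = ['a', 'a', 'b', 'b', 'b', 'c']
predicted = ['a', 'a', 'b', 'c', 'b', 'c']

cm = confusion_matrix(actual, predicted, labels=['a', 'b', 'c'])
# cm[1][2] == 1: one sample whose actual class is 'b' was predicted as 'c'
print(cm)

# precision, recall, f1-score and support (per-class sample count) in one report
print(classification_report(actual, predicted))
```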

  68. santosh March 3, 2017 at 7:29 am #

    What code should I use to load data from my working directory?

  69. David March 7, 2017 at 8:27 am #

    Hi Jason,

    I have a ValueError and I don’t know how I can solve this problem.

    My problem is like this:

    ValueError: could not convert string to float: '2013-06-27 11:30:00.0000000'

    Can you give me some information about fixing this problem?

    Thank you

    • Jason Brownlee March 7, 2017 at 9:39 am #

      It looks like you are trying to load a date-time. You might need to write a custom function to parse the date-time when loading or try removing this column from your dataset.
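      One way to handle the parse at load time is pandas’ parse_dates option; a small sketch using an in-memory file with a made-up column in the same timestamp format:

```python
import io
import pandas

# a made-up two-column file whose first column matches the failing timestamp format
raw = io.StringIO(
    "when,value\n"
    "2013-06-27 11:30:00.0000000,1.5\n"
    "2013-06-28 09:00:00.0000000,2.5\n"
)

df = pandas.read_csv(raw, parse_dates=['when'])
# the 'when' column is now a proper datetime64 column instead of a string;
# alternatively, drop the column entirely with df.drop(columns=['when'])
```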

  70. Saugata De March 8, 2017 at 6:11 am #

    >>> for name, model in models:
    ... kfold = model_selection.Kfold(n_splits=10, random_state=seed)
    ... cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    ... results.append(cv_results)
    ... names.append(name)
    ... msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    ... print(msg)

    After typing this piece of code, it gives me the error below. Can you please help me out, Jason? Since I am new to ML, I don’t have much idea about the error.

    Traceback (most recent call last):
    File "", line 2, in
    AttributeError: module 'sklearn.model_selection' has no attribute 'Kfold'

  71. Ojas March 10, 2017 at 10:58 am #

    Hello Jason ,
    Thanks for writing such a nice and explanatory article for beginners like me, but I have one concern; I tried finding it out on other websites as well but could not come up with any solution.
    Whatever I am writing inside the code editor (Jupyter QtConsole in my case), can this not be saved as a .py file and shared with my other members over GitHub, maybe? I found some hacks, though I think there must be some proper way of sharing the code written in the editor, like without the outputs or plots in between.

    • Jason Brownlee March 11, 2017 at 7:55 am #

      You can write Python code in a text editor and save it as a file. You can then run it on the command line, e.g. python yourscript.py.

      Consider picking up a book on Python.

  72. manoj maracheea March 11, 2017 at 9:37 pm #

    Hello Jason,

    Nice tutorial, I did this today.

    I didn’t really understand everything. (I will follow your advice, will do it again, write all the questions down, and use the help function.)

    The tutorial just works. It took me around 2 hours, typing every single line,
    installing all the dependencies, and running each block to check.

    Thanks, I’ll be visiting your blog from time to time.


    • Jason Brownlee March 12, 2017 at 8:23 am #

      Well done, and thanks for your support.

      Post any questions you have as comments or email me using the “contact” page.

  73. manoj maracheea March 11, 2017 at 9:38 pm #

    I am just a beginner too; I am using Visual Studio Code.

    Looks good.

  74. Vignesh R March 13, 2017 at 9:59 pm #

    What exactly is confusion matrix?

  75. Dan R. March 14, 2017 at 7:09 am #

    Can I ask what the reason for this problem is? Thanks for the answer 🙂 :
    (My code is just at the section where I import all the needed libraries.)
    I have all libraries up to date, but it still gives me this error ->

    File "C:\Users\64dri\Anaconda3\lib\site-packages\sklearn\model_selection\", line 32, in
    from ..utils.fixes import rankdata

    ImportError: cannot import name ‘rankdata’

    ( scipy: 0.18.1
    numpy: 1.11.1
    matplotlib: 1.5.3
    pandas: 0.18.1
    sklearn: 0.17.1)

    • Jason Brownlee March 14, 2017 at 8:31 am #

      Sorry, I have not seen this issue Dan, consider searching or posting to StackOverflow.

  76. Cameron March 15, 2017 at 5:28 am #


    You’re a rockstar, thank you so much for this tutorial and for your books! It’s been hugely helpful in getting me started on machine learning. I was curious, is it possible to add a non-number property column, or will the algorithms only accept numbers?

    For example, if there were a “COLOR” column in the iris dataset, and all Iris-Setosa were blue. how could I get this program to accept and process that COLOR column? I’ve tried a few things and they all seem to fail.

    • Jason Brownlee March 15, 2017 at 8:16 am #

      Great question Cameron!

      sklearn requires all input data to be numbers.

      You can encode labels like colors as integers and model that.

      Further, you can convert the integers to a binary encoding/one-hot encoding, which may be more suitable if there is no ordinal relationship between the labels.
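      A quick sketch of both encodings using pandas (one of several ways to do this; the color values here are made up):

```python
import pandas

colors = pandas.Series(['blue', 'red', 'blue', 'green'])

# integer encoding: one integer per distinct label (alphabetical order here)
codes = colors.astype('category').cat.codes

# one-hot encoding: one binary column per distinct label,
# better when the labels have no ordinal relationship
onehot = pandas.get_dummies(colors)
print(list(onehot.columns))  # one column per color
```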

      • Cameron March 15, 2017 at 2:19 pm #

        Jason, thanks so much for replying! That makes a lot of sense. When you say binary/one-hot encoding, I assume you mean (continuing to use the colors example) adding a column for each color (R,O,Y,G,B,V), and for each flower putting a 1 in the column of its color and a 0 for all of the other colors?
        That’s feasible for 6 colors (adding six columns), but how would I manage if I wanted to choose between 100 colors or 1000 colors? Are there other libraries that could help deal with that?

  77. James March 19, 2017 at 6:54 am #

    for name, model in models:
    ... kfold = cross_vaalidation.KFold(n=num_instances,n_folds=num_folds,random_state=seed)
    ... cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    File "", line 3
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    SyntaxError: invalid syntax
    >>> cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    Traceback (most recent call last):
    File "", line 1, in
    NameError: name 'model' is not defined
    >>> cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)
    Traceback (most recent call last):
    File "", line 1, in
    NameError: name 'kfold' is not defined
    >>> cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv =
    kfold, scoring = scoring)
    Traceback (most recent call last):
    File "", line 1, in
    NameError: name 'kfold' is not defined
    >>> names.append(name)
    Traceback (most recent call last):
    File "", line 1, in
    NameError: name 'name' is not defined

    I am new to Python and am getting these errors after running the section 5.3 models. Please help me.

    • Jason Brownlee March 19, 2017 at 9:12 am #

      It looks like you might not have copied all of the code required for the example.

  78. Mier March 20, 2017 at 10:26 am #

    Hi, I went through your tutorial. It is super great!
    I wonder whether you can recommend a dataset that is similar to iris classification for me to practice on?

  79. Medine H. March 23, 2017 at 2:56 am #

    Hi Jason,

    That’s an amazing tutorial, quite clear and useful.

    Thanks a bunch!

  80. Sean March 23, 2017 at 9:54 am #

    Hi Jason,

    Can you let me know how I can start with fraud detection algorithms for a retail website?


  81. Raja March 24, 2017 at 11:08 am #

    You are doing great work.

    I need your suggestion: I am working on my thesis, and I need to use machine learning.
    Training: positive, negative, others
    Test: unknown data
    I want to train the machine with the training data and test it on unknown data using SVM, Naive Bayes, and KNN.

    How can I make the format of the training and test data?
    And how can I use those algorithms on it
    so that I can get the TP, TN, FP, FN?
    Thank you.

  82. Sey March 26, 2017 at 12:38 am #

    I’m new to machine learning and this was a really helpful tutorial. I have maybe a stupid question: I wanted to plot the predictions and the validation values to make a visual comparison, but it doesn’t seem like I really understood how I can plot it.
    Can you please send me the piece of code, with some explanations, to do it?

    thank you very much

    • Jason Brownlee March 26, 2017 at 6:13 am #

      You can use matplotlib, for example:
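      Something along these lines (a sketch only; the label lists here are made-up stand-ins for the tutorial’s Y_validation and predictions variables):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen; remove this line to pop up a window
import matplotlib.pyplot as plt

# stand-ins for the tutorial's validation labels and model predictions
Y_validation = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica', 'Iris-setosa']
predictions  = ['Iris-setosa', 'Iris-versicolor', 'Iris-versicolor', 'Iris-setosa']

# plot both series by sample index; where the marks differ, the model was wrong
plt.plot(range(len(Y_validation)), Y_validation, 'o', label='actual')
plt.plot(range(len(predictions)), predictions, 'x', label='predicted')
plt.xlabel('validation sample')
plt.legend()
plt.savefig('comparison.png')
```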

  83. Kamol Roy March 26, 2017 at 7:25 am #

    Thanks a lot. It was very helpful.

Leave a Reply