Your First Machine Learning Project in Python Step-By-Step

Do you want to do machine learning using Python, but you’re having trouble getting started?

In this post you will complete your first machine learning project using Python.

In this step-by-step tutorial you will:

  1. Download and install Python SciPy and get the most useful package for machine learning in Python.
  2. Load a dataset and understand its structure using statistical summaries and data visualization.
  3. Create 6 machine learning models, pick the best and build confidence that the accuracy is reliable.

If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.

Let’s get started!

  • Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
Photo by cosmoflash, some rights reserved.

How Do You Start Machine Learning in Python?

The best way to learn machine learning is by designing and completing small projects.

Python Can Be Intimidating When Getting Started

Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use for both research and development and developing production systems.

There are also a lot of modules and libraries to choose from, providing multiple ways to do each task. It can feel overwhelming.

The best way to get started using Python for machine learning is to complete a project.

  • It will force you to install and start the Python interpreter (at the very least).
  • It will give you a bird’s eye view of how to step through a small project.
  • It will give you confidence, maybe to go on to your own small projects.

Beginners Need A Small End-to-End Project

Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.

When you are applying machine learning to your own datasets, you are working on a project.

A machine learning project may not be linear, but it has a number of well known steps:

  1. Define Problem.
  2. Prepare Data.
  3. Evaluate Algorithms.
  4. Improve Results.
  5. Present Results.

The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing data, evaluating algorithms and making some predictions.

If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and result-improvement tasks, later, once you have more confidence.

Hello World of Machine Learning

The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).

This is a good project because it is so well understood.

  • Attributes are numeric so you have to figure out how to load and handle data.
  • It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
  • It is a multi-class classification problem (multinomial) that may require some specialized handling.
  • It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
  • All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.

Let’s get started with your hello world machine learning project in Python.

Machine Learning in Python: Step-By-Step Tutorial (start here)

In this section we are going to work through a small machine learning project end-to-end.

Here is an overview of what we are going to cover:

  1. Installing the Python and SciPy platform.
  2. Loading the dataset.
  3. Summarizing the dataset.
  4. Visualizing the dataset.
  5. Evaluating some algorithms.
  6. Making some predictions.

Take your time. Work through each step.

Try to type in the commands yourself, or copy-and-paste them to speed things up.

If you have any questions at all, please leave a comment at the bottom of the post.



1. Downloading, Installing and Starting Python SciPy

Get the Python and SciPy platform installed on your system if it is not already.

I do not want to cover this in great detail, because others already have. This is already pretty straightforward, especially if you are a developer. If you do need help, ask a question in the comments.

1.1 Install SciPy Libraries

This tutorial assumes Python version 2.7.

There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries required for this tutorial:

  • scipy
  • numpy
  • matplotlib
  • pandas
  • sklearn

There are many ways to install these libraries. My best advice is to pick one method then be consistent in installing each library.

The scipy installation page provides excellent instructions for installing the above libraries on multiple different platforms, such as Linux, Mac OS X and Windows. If you have any doubts or questions, refer to this guide; it has been followed by thousands of people.

  • On Mac OS X, you can use macports to install Python 2.7 and these libraries. For more information on macports, see the homepage.
  • On Linux you can use your package manager, such as yum on Fedora to install RPMs.

If you are on Windows or you are not confident, I would recommend installing the free version of Anaconda that includes everything you need.

Note: This tutorial assumes you have scikit-learn version 0.18 or higher installed.

1.2 Start Python and Check Versions

It is a good idea to make sure your Python environment was installed successfully and is working as expected.

The script below will help you test out your environment. It imports each library required in this tutorial and prints the version.

Open a command line and start the python interpreter:

I recommend working directly in the interpreter, or writing your scripts and running them on the command line, rather than using big editors and IDEs. Keep things simple and focus on the machine learning, not the toolchain.

Type or copy and paste the following script:
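A minimal sketch of such a version-check script, one import and one print per library, looks like this:

```python
# Check the versions of the libraries used in this tutorial.
import sys
print('Python: {}'.format(sys.version))
import scipy
print('scipy: {}'.format(scipy.__version__))
import numpy
print('numpy: {}'.format(numpy.__version__))
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
import pandas
print('pandas: {}'.format(pandas.__version__))
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
```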

Compare the versions printed on your workstation to the requirements noted above.

Ideally, your versions should match or be more recent. The APIs do not change quickly, so do not be too concerned if you are a few versions behind. Everything in this tutorial will very likely still work for you.

If you get an error, stop. Now is the time to fix it.

If you cannot run the above script cleanly you will not be able to complete this tutorial.

My best advice is to Google search for your error message or post a question on Stack Exchange.

2. Load The Data

We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics by pretty much everyone.

The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.

You can learn more about this dataset on Wikipedia.

In this step we are going to load the iris data from a CSV file at a URL.

2.1 Import libraries

First, let’s import all of the modules, functions and objects we are going to use in this tutorial.
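A sketch of an import block covering everything used below, assuming the scikit-learn 0.18+ model_selection API mentioned in the update note:

```python
# Load all of the modules, functions and objects used in this tutorial.
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
```

Note that in very old pandas versions scatter_matrix lived in pandas.tools.plotting rather than pandas.plotting.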

Everything should load without error. If you have an error, stop. You need a working SciPy environment before continuing. See the advice above about setting up your environment.

2.2 Load Dataset

We can load the data directly from the UCI Machine Learning repository.

We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.

Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.
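The loading step boils down to a single pandas.read_csv call. A sketch, assuming the usual UCI repository URL for the iris CSV, with a fallback to scikit-learn's bundled copy of the same data if you are offline:

```python
import pandas

# The CSV has no header row, so specify the column names while loading.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
try:
    dataset = pandas.read_csv(url, names=names)
except Exception:
    # Offline fallback: rebuild the same frame from scikit-learn's bundled copy.
    from sklearn import datasets
    iris = datasets.load_iris()
    dataset = pandas.DataFrame(iris.data, columns=names[:4])
    dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
```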

The dataset should load without incident.

If you do have network problems, you can download the file into your working directory and load it using the same method, changing url to the local file name.

3. Summarize the Dataset

Now it is time to take a look at the data.

In this step we are going to take a look at the data a few different ways:

  1. Dimensions of the dataset.
  2. Peek at the data itself.
  3. Statistical summary of all attributes.
  4. Breakdown of the data by the class variable.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
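As a sketch (using scikit-learn's bundled copy of the data in place of the download, so the snippet is self-contained):

```python
import pandas
from sklearn import datasets

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# shape reports (rows, columns).
print(dataset.shape)
```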

You should see 150 instances and 5 attributes.

3.2 Peek at the Data

It is also always a good idea to actually eyeball your data.
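The head method shows the first n rows; a self-contained sketch (again substituting scikit-learn's bundled copy for the download):

```python
import pandas
from sklearn import datasets

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Peek at the first 20 rows.
print(dataset.head(20))
```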

You should see the first 20 rows of the data.

3.3 Statistical Summary

Now we can take a look at a summary of each attribute.

This includes the count, mean, the min and max values as well as some percentiles.
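A self-contained sketch using describe (with scikit-learn's bundled copy standing in for the download):

```python
import pandas
from sklearn import datasets

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Count, mean, std, min, max and percentiles for each numeric attribute.
print(dataset.describe())
```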

We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.

3.4 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.
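One way to sketch this is a groupby on the class column (bundled data again standing in for the download):

```python
import pandas
from sklearn import datasets

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Number of rows that belong to each class.
print(dataset.groupby('class').size())
```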

We can see that each class has the same number of instances (50 or 33% of the dataset).

4. Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

  1. Univariate plots to better understand each attribute.
  2. Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

Given that the input variables are numeric, we can create box and whisker plots of each.
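A sketch of box-and-whisker plots for the four inputs; the Agg backend line is only there so the snippet also runs headless, and can be dropped when working interactively:

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe; remove this line to display the window
import matplotlib.pyplot as plt
import pandas
from sklearn import datasets

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# One box-and-whisker plot per numeric input variable, on a 2x2 grid.
axes = dataset.plot(kind='box', subplots=True, layout=(2, 2),
                    sharex=False, sharey=False)
plt.show()
```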

This gives us a much clearer idea of the distribution of the input attributes:

Box and Whisker Plots

We can also create a histogram of each input variable to get an idea of the distribution.
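A sketch of the histogram step (same self-contained setup as above, with the Agg backend only for headless runs):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe; remove this line to display the window
import matplotlib.pyplot as plt
import pandas
from sklearn import datasets

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# One histogram per numeric input variable.
axes = dataset.hist()
plt.show()
```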

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

Histogram Plots

4.2 Multivariate Plots

Now we can look at the interactions between the variables.

First let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
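A sketch using pandas' scatter_matrix helper, which draws every pairwise scatterplot with histograms on the diagonal (Agg backend only for headless runs):

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe; remove this line to display the window
import matplotlib.pyplot as plt
import pandas
from pandas.plotting import scatter_matrix
from sklearn import datasets

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Scatterplot of every pair of numeric attributes, histograms on the diagonal.
axes = scatter_matrix(dataset)
plt.show()
```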

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

Scatterplot Matrix

5. Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

  1. Separate out a validation dataset.
  2. Set up the test harness to use 10-fold cross-validation.
  3. Build 6 different models to predict species from flower measurements.
  4. Select the best model.

5.1 Create a Validation Dataset

We need to know that the model we created is any good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
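A sketch of the split with model_selection.train_test_split, assuming a fixed seed of 7 for reproducibility (bundled data standing in for the download; the astype(float) is a small robustness tweak for newer scikit-learn versions):

```python
import pandas
from sklearn import datasets, model_selection

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]

# Split-out: 80% for training, 20% held back for validation.
array = dataset.values
X = array[:, 0:4].astype(float)  # the four measurements
Y = array[:, 4]                  # the species labels
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=7)
```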

You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later.

5.2 Test Harness

We will use 10-fold cross validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.

We are using the metric of ‘accuracy‘ to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will use the scoring variable when we build and evaluate each model next.
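As a sketch, the harness is just a KFold plus a scoring string; here it is exercised once on a single model so you can see the shape of the result (shuffle=True is required alongside random_state in recent scikit-learn versions):

```python
import pandas
from sklearn import datasets, model_selection
from sklearn.neighbors import KNeighborsClassifier

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
X = dataset.values[:, 0:4].astype(float)
Y = dataset.values[:, 4]
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=7)

# The test harness: 10-fold cross-validation scored by accuracy.
seed = 7
scoring = 'accuracy'
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
cv_results = model_selection.cross_val_score(
    KNeighborsClassifier(), X_train, Y_train, cv=kfold, scoring=scoring)
print('mean %.3f (std %.3f)' % (cv_results.mean(), cv_results.std()))
```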

5.3 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 6 different algorithms:

  • Logistic Regression (LR)
  • Linear Discriminant Analysis (LDA)
  • K-Nearest Neighbors (KNN)
  • Classification and Regression Trees (CART)
  • Gaussian Naive Bayes (NB)
  • Support Vector Machines (SVM)

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.

Let’s build and evaluate our six models:
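A sketch of the spot-check loop; max_iter on LogisticRegression is raised here only to avoid convergence warnings in newer scikit-learn versions:

```python
import pandas
from sklearn import datasets, model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
X = dataset.values[:, 0:4].astype(float)
Y = dataset.values[:, 4]
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=7)

# Spot-check six algorithms.
models = []
models.append(('LR', LogisticRegression(max_iter=1000)))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

# Evaluate each model in turn with identical 10-fold splits.
results = []
model_names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = model_selection.cross_val_score(
        model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    model_names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
```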

5.4 Select Best Model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

Running the example above prints the estimated accuracy of each model as a mean and standard deviation.

We can see that it looks like KNN has the largest estimated accuracy score.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
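A sketch of the comparison plot; it reruns the evaluation loop so the snippet stands alone, and the Agg backend line is only there for headless runs:

```python
import matplotlib
matplotlib.use('Agg')  # headless-safe; remove this line to display the window
import matplotlib.pyplot as plt
import pandas
from sklearn import datasets, model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
X = dataset.values[:, 0:4].astype(float)
Y = dataset.values[:, 4]
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=7)

# Re-evaluate the six models, keeping the 10 fold scores for each.
models = [('LR', LogisticRegression(max_iter=1000)),
          ('LDA', LinearDiscriminantAnalysis()),
          ('KNN', KNeighborsClassifier()),
          ('CART', DecisionTreeClassifier()),
          ('NB', GaussianNB()),
          ('SVM', SVC())]
results, model_names = [], []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=7)
    results.append(model_selection.cross_val_score(
        model, X_train, Y_train, cv=kfold, scoring='accuracy'))
    model_names.append(name)

# One box-and-whisker per algorithm, over its 10 accuracy scores.
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
ax.boxplot(results)
ax.set_xticklabels(model_names)
plt.show()
```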

You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.

Compare Algorithm Accuracy

6. Make Predictions

The KNN algorithm was the most accurate model that we tested. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.

We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.
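A sketch of this final check, fitting KNN on the training split and scoring it on the hold-out set; exact numbers may differ slightly from the post's, depending on library versions and the bundled-data stand-in used here:

```python
import pandas
from sklearn import datasets, model_selection
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Stand-in for the loading step: scikit-learn's bundled copy of the iris data.
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
iris = datasets.load_iris()
dataset = pandas.DataFrame(iris.data, columns=names[:4])
dataset['class'] = ['Iris-' + t for t in iris.target_names[iris.target]]
X = dataset.values[:, 0:4].astype(float)
Y = dataset.values[:, 4]
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=7)

# Fit KNN on the training data, then score it on the hold-out validation set.
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
acc = accuracy_score(Y_validation, predictions)
print(acc)
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
```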

We can see that the accuracy is 0.9 or 90%. The confusion matrix provides an indication of the three errors made. Finally the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).

You Can Do Machine Learning in Python

Work through the tutorial above. It will take you 5-to-10 minutes, max!

You do not need to understand everything (at least not right now). Your goal is to run through the tutorial end-to-end and get a result. List down your questions as you go. Make heavy use of the help(“FunctionName”) syntax in Python to learn about all of the functions that you’re using.

You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.

You do not need to be a Python programmer. The syntax of the Python language can be intuitive even if you are new to it. Just like in other languages, focus on function calls (e.g. function()) and assignments (e.g. a = “b”). This will get you most of the way. You are a developer; you know how to pick up the basics of a language fast. Just get started and dive into the details later.

You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.

What about the other steps in a machine learning project? We did not cover all of the steps in a machine learning project because this is your first project and we need to focus on the key steps: loading data, looking at the data, evaluating some algorithms and making some predictions. In later tutorials we can look at other data preparation and result improvement tasks.


In this post you discovered step-by-step how to complete your first machine learning project in Python.

You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.

Your Next Step

Will you work through the tutorial?

  1. Work through the above tutorial.
  2. List any questions you have.
  3. Search or research the answers.
  4. Remember, you can use the help(“FunctionName”) in Python to get help on any function.

Do you have a question? Post it in the comments below.



101 Responses to Your First Machine Learning Project in Python Step-By-Step

  1. DR Venugopala Rao Manneni June 11, 2016 at 5:58 pm #

    Awesome… But in your Blog please introduce SOM ( Self Organizing maps) for unsupervised methods and also add printing parameters ( Coefficients )code.

    • Jason Brownlee June 14, 2016 at 8:17 am #

      I generally don’t cover unsupervised methods like clustering and projection methods.

      This is because I mainly focus on and teach predictive modeling (e.g. classification and regression) and I just don’t find unsupervised methods that useful.

  2. Jan de Lange June 20, 2016 at 10:43 pm #

    Nice work Jason. Of course there is a lot more to tell about the code and the Models applied if this is intended for people starting out with ML (like me). Rather than telling which “button to press” to make work, it would be nice to know why also. I looked at a sample of you book (advanced) if you are covering the why also, but it looks like it’s limited?

    On this particular example, in my case SVM reached 99.2% and was thus the best Model. I gather this is because the test and training sets are drawn randomly from the data.

  3. Nil June 25, 2016 at 12:42 am #

    Awesome, I have tested the code it is impressive. But how could I use the model to predict if it is Iris-setosa or Iris-versicolor or Iris-virginica when I am given some values representing sepal-length, sepal-width, petal-length and petal-width attributes?

    • Jason Brownlee June 25, 2016 at 5:09 am #

      Great question. You can call model.predict() with some new data.

      For an example, see Part 6 in the above post.

  4. Sujon September 6, 2016 at 8:19 am #

    Dear Sir,

    It seems I’m in the right place in right time! I’m doing my master thesis in machine learning from Stockholm University. Could you give me some references for laughter audio conversation to CSV file? You can send me anything on Thanks a lot and wish your very best and will keep in touch.

  5. Sujon September 6, 2016 at 8:32 am #

    Sorry I mean laughter audio to CSV conversion.

    • Jason Brownlee September 6, 2016 at 9:49 am #

      Sorry, I have not seen any laughter audio to CSV conversion tools/techniques.

  6. Roberto U September 19, 2016 at 9:17 am #

    Sweet way of condensing monstrous amount of information in a one-way street. Thanks!

    Just a small thing, you are creating the Kfold inside the loop in the cross validation. Then, you use the same seed to keep the comparison across predictors constant.

    That works, but I think it would be better to take it out of the loop. Not only is more efficient, but it is also much immediately clearer that all predictors are using the same Kfold.

    You can still justify the use of the seeds in terms of replicability; readers getting the same results on their machines.

    Thanks again!

  7. Francisco September 20, 2016 at 2:02 am #

    Hello Jason.
    Thank you so much for your help with Machine Learning and congratulations for your excellent website.

    I am a beginner in ML and DeepLearning. Should I download Python 2 or Python 3?

    Thank you very much.


    • Jason Brownlee September 20, 2016 at 8:33 am #

      I use Python 2 for all my work, but my students report that most of my examples work in Python 3 with little change.

  8. ShawnJ October 11, 2016 at 5:24 am #


    Thank you so much for putting this together. I have been a software developer for almost two decades and am getting interested in machine learning. Found this tutorial accurate, easy to follow and very informative.

  9. Wendy G October 14, 2016 at 5:37 am #


    Thanks for the great post! I am trying to follow this post by using my own dataset, but I keep getting this error “Unknown label type: array ([some numbers from my dataset])”. So what on earth is the problem? Any possible solutions?


    • Jason Brownlee October 14, 2016 at 9:08 am #

      Hi Wendy,

      Carefully check your data. Maybe print it on the screen and inspect it. You may have some string values that you may need to convert to numbers using data preparation.

  10. fara October 20, 2016 at 7:15 am #

    hi thanks for great tutorial, i’m also new to ML…this really helps but i was wondering what if we have non-numeric values? i have mixture of numeric and non-numeric data and obviously this only works for numeric. do you also have a tutorial for that or would you please send me a source for it? thank you

    • Jason Brownlee October 20, 2016 at 8:41 am #

      Great question fara.

      We need to convert everything to numeric. For categorical values, you can convert them to integers (label encoding) and then to new binary features (one hot encoding).

  11. Mazhar Dootio October 23, 2016 at 9:14 pm #

    Hello Jason
    Thank you for publishing this great machine learning tutorial.
    It is really awesome awesome awesome………..!
    I tested your tutorial on Python 3 and it works well, but what I am facing here is loading my dataset from my local drive. I followed your instructions but wasn’t successful.
    My syntax is as under:

    import unicodedata
    url = open(r'C:\Users\mazhar\Anaconda3\Lib\site-packages\sindhi2.csv', encoding='utf-8').readlines()
    names = ['class', 'sno', 'gender', 'morphology', 'stem', 'fword']
    dataset = pandas.read_csv(url, names=names)

    The Python 3 Jupyter notebook does not load this. Kindly help me in this regard.

    • Jason Brownlee October 24, 2016 at 7:05 am #

      Hi Mazhar, thanks.

      Are you able to load the file on the command line away from the notebook?

      Perhaps the notebook environment is causing trouble?

  12. Mazhar Dootio October 25, 2016 at 3:22 am #

    Dear Jason
    Thank you for response
    I am using Python 3 with anaconda jupyter notebook
    so which python version you would like to suggest me and kindly write here syntax of opening local dataset file from local drive that how can I load utf-8 dataset file from my local drive.

    • Jason Brownlee October 25, 2016 at 8:32 am #

      Hi Mazhar, I teach using Python 2.7 with examples from the command line.

      Many of my students report that the code works in Python 3 and in notebooks with little or no changes.

  13. Andy October 27, 2016 at 11:59 pm #

    Great tutorial but perhaps I’m missing something here. Let’s assume I already know what model to use (perhaps because I know the data well… for example).

    knn = KNeighborsClassifier()
    knn.fit(X_train, Y_train)

    I then use the models to predict:
    print(knn.predict(an array of variables of a record I want to classify))

    Is this where the whole ML happens?
    knn.fit(X_train, Y_train)

    What’s the difference between this and say a non ML model/algorithm? Is it that in a non ML model I have to find the coefficients/parameters myself by statistical methods?; and in the ML model the machine does that itself?
    If this is the case then to me it seems that a researcher/coder did most of the work for me and wrap it in a nice function. Am I missing something? What is special here?

    • Jason Brownlee October 28, 2016 at 9:14 am #

      Hi Andy,

      Yes, your comment is generally true.

      The work is in the library and choice of good libraries and training on how to use them well on your project can take you a very long way very quickly.

      Stats is really about small data and understanding the domain (descriptive models). Machine learning, at least in common practice, is leaning towards automation with larger datasets and making predictions (predictive modeling) at the expense of model interpretation/understandability. Prediction performance trumps traditional goals of stats.

      Because of the automation, the focus shifts more toward data quality, problem framing, feature engineering, automatic algorithm tuning and ensemble methods (combining predictive models), with the algorithms themselves taking more of a backseat role.

      Does that make sense?

      • Andy November 3, 2016 at 10:36 pm #

        It does make sense.
        You mentioned ‘data quality’. That’s currently my field of work. I’ve been doing this statistically until now, and very keen to try a different approach. As a practical example how would you use ML to spot an error/outlier using ML instead of stats?
        Let’s say I have a large dataset containing trees: each tree record contains a specie, height, location, crown size, age, etc… (ah! suspiciously similar to the iris flowers dataset 🙂 Is ML a viable method for finding incorrect data and replace with an “estimated” value? The answer I guess is yes. For species I could use almost an identical method to what you presented here; BUT what about continuous values such as tree height?

        • Jason Brownlee November 4, 2016 at 9:08 am #

          Hi Andy,

          Maybe “outliers” are instances that cannot be easily predicted or assigned ambiguous predicted probabilities.

          Instance values can be “fixed” by estimating new values, but whole instance can also be pulled out if data is cheap.

  14. Shailendra Khadayat October 30, 2016 at 2:23 pm #

    Awesome work Jason. This was very helpful and expect more tutorials in the future.


  15. Shuvam Ghosh November 16, 2016 at 12:13 am #

    Awesome work. Students need to know how the end results will look like. They need to get motivated to learn and one of the effective means of getting motivated is to be able to see and experience the wonderful end results. Honestly, if i were made to study algorithms and understand them i would get bored. But now since i know what amazing results they give, they will serve as driving forces in me to get into details of it and do more research on it. This is where i hate the orthodox college ways of teaching. First get the theory right then apply. No way. I need to see things first to get motivated.

    • Jason Brownlee November 16, 2016 at 9:29 am #

      Thanks Shuvam,

      I’m glad my results-first approach gels with you. It’s great to have you here.

  16. Puneet November 17, 2016 at 12:08 am #

    Thanks Jason,

    while i am trying to complete this.

    # Spot Check Algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())

    showing the below error:

    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    IndentationError: expected an indented block

    • Jason Brownlee November 17, 2016 at 9:54 am #

      Hi Puneet, looks like a copy-paste error.

      Check for any extra new lines or white space around that line that is reporting the error.

  17. Puneet November 17, 2016 at 12:30 am #

    Thanks Json,

    I am new to ML. need your help so i can run this.

    as i have followed the steps but when trying to build and evalute 5 model using this.

    # Spot Check Algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())

    facing below mentioned issue.
    File “”, line 13
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    IndentationError: expected an indented block

    Kindly help.

    • Martin November 18, 2016 at 5:18 am #

      Puneet, you need to indent the block (tab or four spaces to the right). That is the way of building a block in Python

  18. george soilis November 17, 2016 at 10:00 pm #

    just another Python noob here,sending many regards and thanks to Jason :):)

  19. sergio November 22, 2016 at 3:29 pm #

    Does this tutorial work with other data sets? I’m trying to work on a small assignment and I want to use python

    • Jason Brownlee November 23, 2016 at 8:50 am #

      It should provide a great template for new projects sergio.

  20. Albert November 26, 2016 at 1:55 am #

    Very Awesome step by step for me ! Even I am beginner of python , this gave me many things about Machine learning ~ supervised ML. Appreciate of your sharing !!

  21. Umar Yusuf November 27, 2016 at 4:04 am #

    Thank you for the step by step instructions. This will go along way for newbies like me getting started with machine learning.

    • Jason Brownlee November 27, 2016 at 10:21 am #

      You’re welcome, I’m glad you found the post useful Umar.

  22. Mike P November 30, 2016 at 6:29 pm #

    Hi Jason,

    Really nice tutorial. I had one question which has had me confused. Once you chose your best model, (in this instance KNN) you then train a new model to be used to make predictions against the validation set. should one not perform K-fold cross-validation on this model to ensure we don’t overfit?

    If this is correct, how would you implement it? From my understanding, cross_val_score will not allow one to generate a confusion matrix.

    I think this is the only thing that I have struggled with in using scikit-learn; if you could help me, it would be much appreciated.

    • Jason Brownlee December 1, 2016 at 7:26 am #

      Hi Mike. No.

      Cross-validation is just a method to estimate the skill of a model on new data. Once you have the estimate you can get on with things, like confirming you have not fooled yourself (hold out validation dataset) or make predictions on new data.

      The skill you report is the cross val skill with the mean and stdev to give some idea of confidence or spread.

      Does that make sense?

      • Mike December 2, 2016 at 1:30 am #

        Hi Jason,

        Thanks for the quick response. So to make sure I understand: one would use cross-validation to get an estimate of the skill of a model (mean of cross val scores) or to choose the correct hyperparameters for a particular model.

        Once you have this information, you can just go ahead and train the chosen model with the full training set and test it against the validation set or new data?

        • Jason Brownlee December 2, 2016 at 8:17 am #

          Hi Mike. Correct.

          Additionally, if the validation result confirms your expectations, you can go ahead and train the model on all data you have including the validation dataset and then start using it in production.

          This is a very important topic. I think I’ll write a post about it.
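The workflow discussed above can be sketched end-to-end on the iris data (an illustrative sketch using the scikit-learn 0.18+ model_selection API; the variable names and fold/seed choices are assumptions, not the tutorial's exact code):

```python
from sklearn import datasets
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

# 1. Hold out a validation set before doing anything else.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=7)

# 2. Use cross-validation on the training set only, to estimate skill.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
scores = cross_val_score(KNeighborsClassifier(), X_train, y_train, cv=kfold)
print("CV estimate: %.3f (%.3f)" % (scores.mean(), scores.std()))

# 3. Fit the chosen model on the full training set and confirm
#    the estimate on the hold-out validation set.
model = KNeighborsClassifier().fit(X_train, y_train)
print("Validation accuracy: %.3f" % model.score(X_val, y_val))
```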

  23. Sahana Venkatesh November 30, 2016 at 8:15 pm #

    This is amazing 🙂 You boosted my morale

  24. Jhon November 30, 2016 at 8:27 pm #

    While doing data visualization and running the command dataset.plot(……..), I am having the following error. Kindly tell me how to fix it:

    ]], dtype=object)

    • Jason Brownlee December 1, 2016 at 7:28 am #

      Looks like no data Jhon. It also looks like it’s printing out an object.

      Are you running in a notebook or on the command line? The code was intended to be run directly (e.g. command line).

  25. Brendon A. Kay December 1, 2016 at 4:20 am #

    Hi Jason,

    Great tutorial. I am a developer with a computer science degree and a heavy interest in machine learning and mathematics, although I don’t quite have the academic background for the latter except for what was required in college. So, this website has really sparked my interest as it has allowed me to learn the field in sort of the “opposite direction”.

    I did notice when executing your code that there was a deprecation warning for the sklearn.cross_validation module. They recommend switching to sklearn.model_selection.

    When switching the modules I adjusted the following line…

    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)

    to:

    kfold = model_selection.KFold(n_folds=num_folds, random_state=seed)

    … and it appears to be working okay. Of course, I had switched all other instances of cross_validation as well, but it seemed to be that the KFold() method dropped the n (number of instances) parameter, which caused a runtime error. Also, I dropped the num_instances variable.

    I could have missed something here, so please let me know if this is not a valid replacement, but thought I’d share!

    Once again, great website!

    • Jason Brownlee December 1, 2016 at 7:33 am #

      Thanks for the support and the kind words Brendon. I really appreciate it (you made my day!)

      Yes, the API has changed/is changing and your updates to the tutorial look good to me, except I think n_folds has become n_splits.

      I will update this example for the new API very soon.
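For reference, the fully migrated line under the 0.18 API looks like this (a sketch; as noted above, n_folds became n_splits and the dataset-size argument was dropped — the fold count and seed here are illustrative):

```python
from sklearn.model_selection import KFold

# Old API (sklearn < 0.18), for comparison:
#   kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
# New API (sklearn >= 0.18): the dataset-size argument is gone
# and n_folds has been renamed n_splits.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
print(kfold.get_n_splits())  # 10
```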

  26. Sergio December 1, 2016 at 3:41 pm #

    I’m still having a little trouble understanding step 5.1. I’m trying to apply this tutorial to a new data set, but when I try to evaluate the models from 5.3, I don’t get a result.

    • Jason Brownlee December 2, 2016 at 8:13 am #

      What is the problem exactly Sergio?

      Step 5.1 should create a validation dataset. You can confirm the dataset by printing it out.

      Step 5.3 should print the result of each algorithm as it is trained and evaluated.

      Perhaps check for a copy-paste error or something?

      • sergio December 2, 2016 at 9:13 am #

        Does this tutorial work the exact same way for other data sets? because I’m not using the Hello World dataset

        • Jason Brownlee December 3, 2016 at 8:23 am #

          The project template is quite transferable.

          You will need to adapt it for your data and for the types of algorithms you want to test.

  27. Jean-Baptiste Hubert December 11, 2016 at 12:17 am #

    Hi Sir,
    Thank you for the information.
    I am currently a student, in Engineering school in France.
    I am working on a data mining project; indeed, I have a lot of data (40 GB) about the prices of the stocks of many companies in the CAC40.
    My goal is to predict the evolution of the yields, and I think that a Neural Network could be useful.
    My idea is: I take for X the yields from "t=0" to "t=n" and for Y the yields from "t=1" to "t=n", and the program should find a relation between the data.
    Is that possible? Is it a good way to predict the evolution of the yield?
    Thank you for your time

  28. Ernest Bonat December 15, 2016 at 5:33 pm #

    Hi Jason,

    If I include a new item in the models array as:

    models.append(('LNR - Linear Regression', LinearRegression()))

    with the library:

    from sklearn.linear_model import LinearRegression

    I got an error in the \sklearn\utils\", line 529, in check_X_y
    y = y.astype(np.float64)

    ValueError: could not convert string to float: 'Iris-setosa'

    Let me know how best to fix that! As you can see from my code, I would like to include the linear regression algorithm in my model array too!

    Thank you for your help,


    • Jason Brownlee December 16, 2016 at 5:39 am #

      Hi Ernest, it is a classification problem. We cannot use LinearRegression.

      Try adding another classification algorithm to the list.
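To illustrate why the error appears, here is a small sketch with made-up data (DecisionTreeClassifier stands in for any of the tutorial's classifiers; the feature values are invented):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array(['Iris-setosa', 'Iris-setosa', 'Iris-versicolor', 'Iris-versicolor'])

# A classifier accepts string class labels directly...
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[1.5]]))  # -> ['Iris-setosa']

# ...whereas LinearRegression needs a numeric target, so fitting it
# on string labels raises the ValueError reported above.
try:
    LinearRegression().fit(X, y)
except ValueError as e:
    print("LinearRegression failed:", e)
```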

  29. Gokul Iyer December 20, 2016 at 2:29 pm #

    Great tutorial! Quick question: when we create the models, we do models.append(name of algorithm, algorithm function); is models an array? Because it seems like a dictionary, since we have a key-value mapping (algorithm name and algorithm function). Thank you!

    • Jason Brownlee December 20, 2016 at 2:47 pm #

      It is a list of tuples where each tuple contains a string name and a model object.
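A minimal sketch of that structure (two of the tutorial's models, for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# A list of (name, model) tuples -- not a dict. A list preserves the
# order the models were appended in and would allow duplicate names.
models = [('KNN', KNeighborsClassifier()), ('SVM', SVC())]
for name, model in models:  # tuple unpacking in the loop header
    print(name, type(model).__name__)
```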

  30. Sasanka ghosh December 21, 2016 at 4:55 am #

    Hi Jason /any Gurus ,
    Good post and I will follow it, but my question may be a little off track.
    I am asking this question as I am a data modeller / aspiring data architect.

    I feel, as a guru/gurus, you can clarify my doubt. The question is at the end.

    In current Data management environment

    1. Data architecture/physical implementation and choosing appropriate tools: back end, storage, NoSQL, SQL, MPP, sharding, columnar, scale up/out, distributed processing, etc.

    2. In addition to DB-based procedural languages, proficiency in at least one of the following, i.e. Java/Python/Scala etc.

    3. Then comes this AI, machine learning, neural networks, etc.

    My question is regarding point 3.

    I believe those are algorithms which need deep functional knowledge and years of experience to add any value to business.

    Those are independent of data models and their physical implementation, and are part of the business user domain, not the data architecture domain.

    If I take your above example, say now 10k users are trying to do a similar kind of thing, then points 1 and 2 will be the data architect's domain and point 3 will be the business analyst's domain. Maybe point 2 can overlap between them to some extent.

    A data architect need not be hands-on/proficient in algorithms, i.e. should have just some basic idea, as the data architect's job is not to invent business logic but to implement the business logic physically to satisfy business users/analysts.

    Am I correct in my assumption, as I find that certain things are nearly mutually exclusive and expectations/benchmarks should be set right?

    sasanka ghosh

    • Jason Brownlee December 21, 2016 at 8:46 am #

      Hi Sasanka, sorry, I don’t really follow.

      Are you able to simplify your question?

      • Sasanka ghosh December 21, 2016 at 9:25 pm #

        Hi Jason ,
        Many thanks that you bothered to reply.

        I tried to rephrase and be concise, but it is still verbose; apologies for that.

        Is it expected for a data architect to be an algorithm expert as well as a data model/database expert?

        Algorithms are business centric as well as specific to a particular domain of business most of the time.

        Giving you an example, i.e. SHORTEST PATH (take it as just an example in making my point):
        an organization is providing an app to provide that service.

        CAVEAT: Someone from the comp science dept may say that it is the basic thing you learn, but I feel it is still an algorithm, not a data structure.

        If we take the above scenario in simplistic terms, the requirements are as follows:

        1. there will be, say, a million registered users
        2. one can say at least 10% are using the app at the same time
        3. at any time they can change their direction as per a contingency, like a military op, so dumping the partial weighted graph to their device is not an option, i.e. users will be connected to the main server/server cluster
        4. the challenge is storing the spatial data in the DB in the correct data model, with scale out and fault tolerance
        5. implement the shortest path algo and display it using Python/Java/Cypher/Oracle Spatial/Titan etc. dynamically

        My question is: can a data architect work on this project who does not know the shortest path algorithm but has sufficient knowledge in other areas, with the algo provided to him/her in verbose terms to implement?

        I am asking this question as nowadays people are offering ready-made courses, i.e. machine learning, NLP, data scientist, etc., and the scenario is confusing.
        I feel it is misleading, as no one can become an expert in a science overnight, and vice versa.

        I feel algorithms are pure science, a separate discipline.
        But to implement them at large scale, scientists/programmers/architects need to work in tandem with minimal overlap but continuous discussion.

        Last but not least, if I make some sense: what learning curve should I follow to try to become a data architect in unstructured data in general?

        sasanka ghosh

        • Jason Brownlee December 22, 2016 at 6:35 am #

          Really this depends on the industry and the job. I cannot give you good advice for the general case.

          You can get valuable results without being an expert, this applies to most fields.

          Algorithms are a tool, use them as such. They can also be a science, but we practitioners don’t have the time.

          I hope that helps.

          • Sasanka ghosh December 22, 2016 at 7:00 pm #

            Thanks Jason.

            I appreciate your time and response.

            I just wanted to validate this with a real techie/guru like you, as the confusion, or lack of a perfect answer, is being exploited by management/HR to their own advantage, to practice a use-and-throw policy or to make people sycophants/redundant without following basic management principles.

            The tech guys, except a "few geniuses", are always toiling while management is opening the cork and enjoying at the same time.

            sasanka ghosh

  31. Raveen Sachintha December 21, 2016 at 8:51 pm #

    Hello Jason,
    Thank you very much for these tutorials. I am new to ML and I find it very encouraging to do an end-to-end project to get started with, rather than reading and reading without seeing an end. This really helped me.

    One question: when I tried this, I got the highest accuracy for SVM.

    LR: 0.966667 (0.040825)
    LDA: 0.975000 (0.038188)
    KNN: 0.983333 (0.033333)
    CART: 0.983333 (0.033333)
    NB: 0.975000 (0.053359)
    SVM: 0.991667 (0.025000)

    So I decided to try that out too:

    svm = SVC()
    svm.fit(X_train, Y_train)
    prediction = svm.predict(X_validation)

    these were my results using SVM,

    [[ 7 0 0]
    [ 0 10 2]
    [ 0 0 11]]
    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 7
    Iris-versicolor 1.00 0.83 0.91 12
    Iris-virginica 0.85 1.00 0.92 11

    avg / total 0.94 0.93 0.93 30

    I am still learning to read these results, but can you tell me why this happened? Why did I get higher accuracy for SVM instead of KNN? Have I done anything wrong, or is it possible?

    • Jason Brownlee December 22, 2016 at 6:33 am #

      The results reported are a mean estimated score with some variance (spread).

      It is an estimate on the performance on new data.

      When you apply the method on new data, the performance may be in that range. It may be lower if the method has overfit the training data.

      Overfitting is a challenge and developing a robust test harness to ensure we don’t fool/mislead ourselves during model development is important work.

      I hope that helps as a start.

  32. inzar December 25, 2016 at 7:04 am #

    I want to buy your book.
    I tried this tutorial and the result is very awesome.

    I want to learn from you.


  33. lou December 25, 2016 at 7:29 am #

    Why the leading comma in X = array[:,0:4]?
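(For readers with the same question: the comma separates the row slice from the column slice in a 2D NumPy array, so `array[:,0:4]` reads "all rows, columns 0 through 3". A small sketch with a toy array:)

```python
import numpy as np

array = np.arange(15).reshape(3, 5)  # 3 rows, 5 columns

X = array[:, 0:4]  # all rows, first four columns (the features)
y = array[:, 4]    # all rows, last column (the class label)

print(X.shape)  # (3, 4)
print(y.shape)  # (3,)
```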

  34. Thinh December 26, 2016 at 5:05 am #

    In 1.2, it should warn to install scikit-learn.

    • Jason Brownlee December 26, 2016 at 7:49 am #

      Thanks for the note.

      Please see section 1.1 Install SciPy Libraries where it says:

      There are 5 key libraries that you will need to install… sklearn

  35. Tijo L. Peter December 28, 2016 at 10:34 pm #

    Best ML tutorial for Python. Thank you, Jason.

  36. baso December 29, 2016 at 12:38 am #

    When I tried to run it, I got the error message "TypeError: Empty 'DataFrame': no numeric data to plot". Help me!

    • Jason Brownlee December 29, 2016 at 7:18 am #

      Sorry to hear that.

      Perhaps check that you have loaded the data as you expect and that the loaded values are numeric and not strings. Perhaps print the first few rows: print(df.head(5))

      • baso December 29, 2016 at 1:05 pm #

        thanks very much Jason for your time

        It worked. This tutorial is very helpful for me. I'm new to machine learning, but could you explain your simple project above? Because I did not see X_test and the target.

        Regards in advance

  37. Andrea January 5, 2017 at 1:42 am #

    Thank you for sharing this. I bumped into some installation problems.
    Eventually, to get all dependencies installed on MacOS 10.11.6, I had to run this:

    brew install python
    pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose scikit-learn
    export PATH=$PATH:~/Library/Python/2.7/bin

    • Jason Brownlee January 5, 2017 at 9:21 am #

      Thanks for sharing Andrea.

      I’m a macports guy myself, here’s my recipe:

  38. Sohib January 6, 2017 at 6:26 pm #

    Hi Jason,
    I am following this page as a beginner and have installed Anaconda as recommended.
    As I am on win 10, I installed Anaconda 4.2.0 For Windows Python 2.7 version (x64) and
    I am using Anaconda’s Spyder (python 2.7) IDE.

    I checked all the versions of libraries (as shown in 1.2 Start Python and Check Versions) and got results like below:

    Python: 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
    scipy: 0.18.1
    numpy: 1.11.1
    matplotlib: 1.5.3
    pandas: 0.18.1
    sklearn: 0.17.1

    At the 2.1 Import libraries section, I imported all of them and tried to load data as shown in
    2.2 Load Dataset. But when I run it, it doesn’t show an output, instead, there is an error:

    Traceback (most recent call last):
    File "C:\Users\gachon\.spyder\", line 4, in
    from sklearn import model_selection
    ImportError: cannot import name model_selection

    Below is my code snippet:

    import pandas
    from pandas.tools.plotting import scatter_matrix
    import matplotlib.pyplot as plt
    from sklearn import model_selection
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    url = ""
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pandas.read_csv(url, names=names)

    When I delete “from sklearn import model_selection” line I get expected results (150, 5).

    Am I missing something here?

    Thank you for your time and endurance!

    • Jason Brownlee January 7, 2017 at 8:23 am #

      Hi Sohib,

      You must have scikit-learn version 0.18 or higher installed.

      Perhaps Anaconda has documentation on how to update sklearn?

      • Sohib January 10, 2017 at 12:15 pm #

        Thank you for reply.

        I updated scikit-learn version to 0.18.1 and it helped.
        The error disappeared, the result is shown, but one statement

        'import sitecustomize' failed; use -v for traceback

        is executed above the result.
        I tried to find out why, but apparently I might not find the reason.
        Is it going to be a problem in my further steps?
        How to solve this?

        Thank you in advance!

        • Jason Brownlee January 11, 2017 at 9:25 am #

          I’m glad to hear it fixed your problem.

          Sorry, I don’t know what “import sitecustomize” is or why you need it.

  39. Vishakha January 7, 2017 at 10:10 pm #

    Can I get the same tutorial with Java?

  40. Abhinav January 8, 2017 at 8:27 pm #

    Hi Jason,

    Nice tutorial.

    In the univariate plots, you mentioned the Gaussian distribution.

    According to the univariate plots, sepal-width had a Gaussian distribution. You said there are 2 variables having a Gaussian distribution. Please tell me the other.


    • Jason Brownlee January 9, 2017 at 7:49 am #

      The distribution of the others may be multi-modal. Perhaps a double Gaussian.

  41. Thinh January 13, 2017 at 5:07 am #

    Hi, Jason. Could you please tell me the reason why you chose KNN in the example above?

    • Jason Brownlee January 13, 2017 at 9:16 am #

      Hi Thinh,

      No reason other than it is an easy algorithm to run and understand, and a good algorithm for a first tutorial.

  42. Scott P January 13, 2017 at 10:25 pm #

    Hi Jason,

    I’m trying to use this code with the KDD Cup ’99 dataset, and I am having trouble with label encoding my dataset into numerical values.

    import pandas
    import numpy
    from pandas.tools.plotting import scatter_matrix
    import matplotlib.pyplot as plt
    from sklearn import preprocessing
    from sklearn import cross_validation
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.preprocessing import LabelEncoder
    from collections import defaultdict

    #Load KDD dataset
    data_set = "NSL-KDD/KDDTrain+.txt"
    names = ['duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot','num_failed_logins','logged_in','num_compromised','su_attempted','num_root','num_file_creations',

    #Diabetes Dataset
    #data_set = "Datasets/"
    #names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
    #data_set = "Datasets/"
    #names = ['sepal_length','sepal_width','petal_length','petal_width','class']
    dataset = pandas.read_csv(data_set, names=names)

    array = dataset.values
    X = array[:,0:40]
    Y = array[:,40]

    label_encoder = LabelEncoder()
    label_encoder = label_encoder.fit(Y)
    label_encoded_y = label_encoder.transform(Y)

    validation_size = 0.20
    seed = 7
    X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(X, label_encoded_y, test_size=validation_size, random_state=seed)

    # Test options and evaluation metric
    num_folds = 7
    num_instances = len(X_train)
    seed = 7
    scoring = 'accuracy'

    # Algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))

    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)  # multiplying by 100 to show percentage

    # Compare Algorithms
    fig = plt.figure()
    fig.suptitle('Algorithm Comparison')
    ax = fig.add_subplot(111)

    Am I doing something wrong with the LabelEncoding process?
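(For comparison, the usual LabelEncoder pattern on the target is fit, then transform, as sketched below with made-up labels standing in for the KDD classes. Note that string-valued feature columns in X, such as protocol_type, would each need their own encoding as well; LabelEncoder applied only to Y does not cover them.)

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels standing in for the KDD attack classes.
Y = ['normal', 'attack', 'normal', 'probe']

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)          # learn the label vocabulary
label_encoded_y = label_encoder.transform(Y)  # map each label to an integer

print(list(label_encoder.classes_))  # ['attack', 'normal', 'probe']
print(list(label_encoded_y))         # [1, 0, 1, 2]
```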

  43. Dan January 14, 2017 at 4:56 am #

    Hi, I’m running a bit of a different setup than yours.

    The modules and version of python I’m using are more recent releases:

    Python: 3.5.2 |Anaconda 4.2.0 (32-bit)| (default, Jul 5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]
    scipy: 0.18.1
    numpy: 1.11.3
    matplotlib: 1.5.3
    pandas: 0.19.2
    sklearn: 0.18.1

    And I’ve gotten SVM as the best algorithm in terms of accuracy at 0.991667 (0.025000).

    Would you happen to know why this is, considering more recent versions?

    I also happened to get a rather different boxplot but I’ll leave it at what I’ve said thus far.
