Your First Machine Learning Project in Python Step-By-Step

Do you want to do machine learning using Python, but you’re having trouble getting started?

In this post, you will complete your first machine learning project using Python.

In this step-by-step tutorial you will:

  1. Download and install Python SciPy and get the most useful package for machine learning in Python.
  2. Load a dataset and understand its structure using statistical summaries and data visualization.
  3. Create 6 machine learning models, pick the best and build confidence that the accuracy is reliable.

If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.

Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started!

  • Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
  • Update Mar/2017: Added links to help setup your Python environment.
  • Update Apr/2018: Added some helpful links about randomness and predicting.
  • Update Sep/2018: Added link to my own hosted version of the dataset.
  • Update Feb/2019: Updated for sklearn v0.20, also updated plots.
  • Update Oct/2019: Added links at the end to additional tutorials to continue on.
  • Update Nov/2019: Added full code examples for each section.
  • Update Dec/2019: Updated examples to remove warnings due to API changes in v0.22.
  • Update Jan/2020: Updated to remove the snippet for the test harness.

Photo by Daniel Bernard. Some rights reserved.

How Do You Start Machine Learning in Python?

The best way to learn machine learning is by designing and completing small projects.

Python Can Be Intimidating When Getting Started

Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use both for research and development and for building production systems.

There are also a lot of modules and libraries to choose from, providing multiple ways to do each task. It can feel overwhelming.

The best way to get started using Python for machine learning is to complete a project.

  • It will force you to install and start the Python interpreter (at the very least).
  • It will give you a bird’s-eye view of how to step through a small project.
  • It will give you confidence, maybe to go on to your own small projects.

Beginners Need A Small End-to-End Project

Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.

When you are applying machine learning to your own datasets, you are working on a project.

A machine learning project may not be linear, but it has a number of well known steps:

  1. Define Problem.
  2. Prepare Data.
  3. Evaluate Algorithms.
  4. Improve Results.
  5. Present Results.

The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing data, evaluating algorithms, and making some predictions.

If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and improving results, later, once you have more confidence.

Hello World of Machine Learning

The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).

This is a good project because it is so well understood.

  • Attributes are numeric so you have to figure out how to load and handle data.
  • It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
  • It is a multi-class classification problem (multinomial) that may require some specialized handling.
  • It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
  • All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.

Let’s get started with your hello world machine learning project in Python.

Machine Learning in Python: Step-By-Step Tutorial
(start here)

In this section, we are going to work through a small machine learning project end-to-end.

Here is an overview of what we are going to cover:

  1. Installing the Python and SciPy platform.
  2. Loading the dataset.
  3. Summarizing the dataset.
  4. Visualizing the dataset.
  5. Evaluating some algorithms.
  6. Making some predictions.

Take your time. Work through each step.

Try to type in the commands yourself or copy-and-paste the commands to speed things up.

If you have any questions at all, please leave a comment at the bottom of the post.


1. Downloading, Installing and Starting Python SciPy

Get the Python and SciPy platform installed on your system if it is not already.

I do not want to cover this in great detail, because others already have. This is already pretty straightforward, especially if you are a developer. If you do need help, ask a question in the comments.

1.1 Install SciPy Libraries

This tutorial assumes Python version 3.6+.

There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries required for this tutorial:

  • scipy
  • numpy
  • matplotlib
  • pandas
  • sklearn

There are many ways to install these libraries. My best advice is to pick one method then be consistent in installing each library.

The scipy installation page provides excellent instructions for installing the above libraries on multiple platforms, such as Linux, macOS, and Windows. If you have any doubts or questions, refer to this guide; it has been followed by thousands of people.

  • On Mac OS X, you can use homebrew to install newer versions of Python 3 and these libraries. For more information on homebrew, see the homepage.
  • On Linux you can use your package manager, such as yum on Fedora to install RPMs.

If you are on Windows or you are not confident, I would recommend installing the free version of Anaconda that includes everything you need.

Note: This tutorial assumes you have scikit-learn version 0.20 or higher installed.

Need more help? See one of these tutorials:

1.2 Start Python and Check Versions

It is a good idea to make sure your Python environment was installed successfully and is working as expected.

The script below will help you test out your environment. It imports each library required in this tutorial and prints the version.

Open a command line and start the python interpreter:
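
python

Depending on how Python was installed, the command may be python3 on your system.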

I recommend working directly in the interpreter, or writing your scripts and running them on the command line, rather than using big editors and IDEs. Keep things simple and focus on the machine learning, not the toolchain.

Type or copy and paste the following script:
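
Below is a minimal version-check sketch; it assumes the standard import name for each library (the __version__ attribute is common to all of them):

# Check the versions of libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))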

Here is the output I get on my OS X workstation:
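
The numbers below are illustrative only; your exact versions will differ:

Python: 3.6.9 (default, ...)
scipy: 1.3.1
numpy: 1.17.3
matplotlib: 3.1.1
pandas: 0.25.1
sklearn: 0.21.3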

Compare the above output to your versions.

Ideally, your versions should match or be more recent. The APIs do not change quickly, so do not be too concerned if you are a few versions behind. Everything in this tutorial will very likely still work for you.

If you get an error, stop. Now is the time to fix it.

If you cannot run the above script cleanly you will not be able to complete this tutorial.

My best advice is to Google search for your error message or post a question on Stack Exchange.

2. Load The Data

We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics by pretty much everyone.

The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.

You can learn more about this dataset on Wikipedia.

In this step we are going to load the iris data from a CSV file URL.

2.1 Import libraries

First, let’s import all of the modules, functions and objects we are going to use in this tutorial.
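
A sketch of the imports used in the rest of this tutorial, assuming scikit-learn 0.20+ and a recent pandas, per the notes above:

# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC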

Everything should load without error. If you have an error, stop. You need a working SciPy environment before continuing. See the advice above about setting up your environment.

2.2 Load Dataset

We can load the data directly from the UCI Machine Learning repository.

We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.

Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.
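
A minimal loading sketch; the URL below is assumed to point at a hosted copy of the iris dataset (per the Sep/2018 update above), and the column names follow the UCI attribute order:

# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)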

The dataset should load without incident.

If you do have network problems, you can download the iris.csv file into your working directory and load it using the same method, changing the URL to the local file name.

3. Summarize the Dataset

Now it is time to take a look at the data.

In this step we are going to take a look at the data a few different ways:

  1. Dimensions of the dataset.
  2. Peek at the data itself.
  3. Statistical summary of all attributes.
  4. Breakdown of the data by the class variable.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
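
For example:

# shape
print(dataset.shape)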

You should see 150 instances and 5 attributes:
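
(150, 5)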

3.2 Peek at the Data

It is also always a good idea to actually eyeball your data.
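
For example, with the head() function:

# head
print(dataset.head(20))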

You should see the first 20 rows of the data:

3.3 Statistical Summary

Now we can take a look at a summary of each attribute.

This includes the count, mean, the min and max values as well as some percentiles.
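
For example:

# descriptions
print(dataset.describe())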

We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.

3.4 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.
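
For example:

# class distribution
print(dataset.groupby('class').size())

which should show:

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64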

We can see that each class has the same number of instances (50 or 33% of the dataset).

3.5 Complete Example

For reference, we can tie all of the previous elements together into a single script.

The complete example is listed below.
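
A sketch tying the looks above into one script (same assumed dataset URL as in section 2.2):

# summarize the iris dataset
from pandas import read_csv
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# shape
print(dataset.shape)
# head
print(dataset.head(20))
# descriptions
print(dataset.describe())
# class distribution
print(dataset.groupby('class').size())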

4. Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

  1. Univariate plots to better understand each attribute.
  2. Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

Given that the input variables are numeric, we can create box and whisker plots of each.
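
For example, using the pandas plot() function (pyplot imported as plt in section 2.1):

# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.show()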

This gives us a much clearer idea of the distribution of the input attributes:

Box and Whisker Plots for Each Input Variable for the Iris Flowers Dataset

We can also create a histogram of each input variable to get an idea of the distribution.
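
For example:

# histograms
dataset.hist()
plt.show()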

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

Histogram Plots for Each Input Variable for the Iris Flowers Dataset

4.2 Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
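
For example, using the scatter_matrix() function imported in section 2.1:

# scatter plot matrix
scatter_matrix(dataset)
plt.show()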

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

Scatter Matrix Plot for Each Input Variable for the Iris Flowers Dataset

4.3 Complete Example

For reference, we can tie all of the previous elements together into a single script.

The complete example is listed below.
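
A sketch combining the three plots into one script (same assumed dataset URL as before):

# visualize the iris dataset
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot as plt
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.show()
# histograms
dataset.hist()
plt.show()
# scatter plot matrix
scatter_matrix(dataset)
plt.show()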

5. Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

  1. Separate out a validation dataset.
  2. Set up the test harness to use 10-fold cross-validation.
  3. Build multiple different models to predict species from flower measurements.
  4. Select the best model.

5.1 Create a Validation Dataset

We need to know that the model we created is good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train, evaluate and select among our models, and 20% that we will hold back as a validation dataset.
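
A minimal sketch using train_test_split(), with random_state fixed so the split is reproducible:

# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]
y = array[:, 4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)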

You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later.

Notice that we used a Python slice to select the columns in the NumPy array. If this is new to you, you might want to check out this post:

5.2 Test Harness

We will use stratified 10-fold cross validation to estimate model accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits.

Stratified means that each fold or split of the dataset will aim to have the same distribution of examples by class as exists in the whole training dataset.

For more on the k-fold cross-validation technique, see the tutorial:

We set the random seed via the random_state argument to a fixed number to ensure that each algorithm is evaluated on the same splits of the training dataset.

The specific random seed does not matter; learn more about pseudorandom number generators here:

We are using the metric of 'accuracy' to evaluate models.

This is the ratio of the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will use the scoring argument when we build and evaluate each model next.

5.3 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use.

We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s test 6 different algorithms:

  • Logistic Regression (LR)
  • Linear Discriminant Analysis (LDA)
  • K-Nearest Neighbors (KNN)
  • Classification and Regression Trees (CART)
  • Gaussian Naive Bayes (NB)
  • Support Vector Machines (SVM)

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB, and SVM) algorithms.

Let’s build and evaluate our models:
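
A sketch of the spot-check loop, assuming the imports from section 2.1; a StratifiedKFold with the same random_state is created for each algorithm so they are all evaluated on the same splits:

# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))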

5.4 Select Best Model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

Running the example above, we get the following raw results:
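
Output similar to the following (mean and standard deviation of accuracy for each algorithm; these numbers are illustrative, see the note below):

LR: 0.960897 (0.052113)
LDA: 0.973974 (0.040110)
KNN: 0.957191 (0.043263)
CART: 0.957191 (0.043263)
NB: 0.948858 (0.056322)
SVM: 0.983974 (0.032083)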

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

What scores did you get?
Post your results in the comments below.

In this case, we can see that it looks like Support Vector Machines (SVM) has the largest estimated accuracy score at about 0.98 or 98%.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (via 10-fold cross-validation).

A useful way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.
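
For example (on very recent Matplotlib the labels argument of boxplot() is named tick_labels):

# Compare Algorithms
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()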

We can see that the box and whisker plots are squashed at the top of the range, with many evaluations achieving 100% accuracy, and some pushing down into the high 80% accuracies.

Box and Whisker Plot Comparing Machine Learning Algorithms on the Iris Flowers Dataset

5.5 Complete Example

For reference, we can tie all of the previous elements together into a single script.

The complete example is listed below.
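
A sketch combining the steps in this section into one script:

# compare algorithms on the iris dataset
from pandas import read_csv
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]
y = array[:, 4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
# Compare Algorithms
plt.boxplot(results, labels=names)
plt.title('Algorithm Comparison')
plt.show()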

6. Make Predictions

We must choose an algorithm to use to make predictions.

The results in the previous section suggest that the SVM was perhaps the most accurate model. We will use this model as our final model.

Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both of these issues will result in an overly optimistic result.

6.1 Make Predictions

We can fit the model on the entire training dataset and make predictions on the validation dataset.
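
A minimal sketch, fitting the SVM chosen above on the whole training set:

# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)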

You might also like to make predictions for single rows of data. For examples on how to do that, see the tutorial:

You might also like to save the model to file and load it later to make predictions on new data. For examples on how to do this, see the tutorial:

6.2 Evaluate Predictions

We can evaluate the predictions by comparing them to the expected results in the validation set, then calculate classification accuracy, as well as a confusion matrix and a classification report.
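
For example, using the metric functions imported in section 2.1:

# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))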

We can see that the accuracy is 0.966 or about 96% on the hold out dataset.

The confusion matrix provides an indication of the errors made.

Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).

6.3 Complete Example

For reference, we can tie all of the previous elements together into a single script.

The complete example is listed below.
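
A sketch combining the prediction steps into one script (same assumed dataset URL as before):

# make predictions on the iris dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:, 0:4]
y = array[:, 4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))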

You Can Do Machine Learning in Python

Work through the tutorial above. It will take you 5-to-10 minutes, max!

You do not need to understand everything (at least not right now). Your goal is to run through the tutorial end-to-end and get a result. You do not need to understand everything on the first pass. List down your questions as you go. Make heavy use of the help("FunctionName") syntax in Python to learn about all of the functions that you’re using.

You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.

You do not need to be a Python programmer. The syntax of the Python language can be intuitive if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = “b”). This will get you most of the way. You are a developer, you know how to pick up the basics of a language real fast. Just get started and dive into the details later.

You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.

What about other steps in a machine learning project? We did not cover all of the steps in a machine learning project because this is your first project and we need to focus on the key steps: loading data, looking at the data, evaluating some algorithms, and making some predictions. In later tutorials we can look at other data preparation and result improvement tasks.

Summary

In this post, you discovered step-by-step how to complete your first machine learning project in Python.

You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.

Your Next Step

Did you work through the tutorial?

  1. Work through the above tutorial.
  2. List any questions you have.
  3. Search-for or research the answers.
  4. Remember, you can use help("FunctionName") in Python to get help on any function.

Do you have a question?
Post it in the comments below.

More Tutorials?

Looking to continue practicing your machine learning skills? Take a look at some of these tutorials:


2,011 Responses to Your First Machine Learning Project in Python Step-By-Step

  1. DR Venugopala Rao Manneni June 11, 2016 at 5:58 pm #

    Awesome… But in your Blog please introduce SOM ( Self Organizing maps) for unsupervised methods and also add printing parameters ( Coefficients )code.

    • Jason Brownlee June 14, 2016 at 8:17 am #

      I generally don’t cover unsupervised methods like clustering and projection methods.

      This is because I mainly focus on and teach predictive modeling (e.g. classification and regression) and I just don’t find unsupervised methods that useful.

      • Rajesh January 21, 2018 at 5:33 pm #

        Jason,
        Can you elaborate what you don’t find unsupervised methods useful?

        • Jason Brownlee January 22, 2018 at 4:42 am #

          Because my focus is predictive modeling.

          • hamdy November 19, 2018 at 8:04 am #

            DeprecationWarning: the imp module is deprecated in favour of importlib; see the module’s documentation for alternative uses
            what is the error?

          • Jason Brownlee November 19, 2018 at 2:19 pm #

            You can ignore this warning for now.

          • Haider June 16, 2019 at 7:23 pm #

            Can you please help, where i’m doing mistake???

            # Spot Check Algorithms
            models = []
            models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
            models.append(('LDA', LinearDiscriminantAnalysis()))
            models.append(('KNN', KNeighborsClassifier()))
            models.append(('CART', DecisionTreeClassifier()))
            models.append(('NB', GaussianNB()))
            models.append(('SVM', SVC(gamma='auto')))
            # evaluate each model in turn
            results = []
            names = []
            for name, model in models:
                kfold = model_selection.KFold(n_splits=10, random_state=seed)
                cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
                results.append(cv_results)
                names.append(name)
                msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
                print(msg)

            ValueError Traceback (most recent call last)
            in
            13 for name, model in models:
            14 kfold = model_selection.KFold(n_splits=10, random_state=seed)
            ---> 15 cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)

            [... long traceback through sklearn and joblib internals omitted ...]

            ValueError: Unknown label type: 'continuous'

          • Vaisakh Nair January 5, 2022 at 6:14 pm #

            Thanks jason ur teachings r really helpful more power to u thanks a ton…learning lots of predictive modelling from ur pages!!!

          • James Carmichael January 6, 2022 at 10:51 am #

            Thank you for your kind words and feedback, Vaisakh!

        • Rasmi Bhattarai June 3, 2020 at 4:16 pm #

          RandomForestClassifier : 1.0

      • Aishwarya April 11, 2018 at 1:49 pm #

        I got quite different results though i used same seed and splits

        Svm : 0.991667 (0.025) with highest accuracy
        KNN : 0.9833
        CART : 0.9833
        Why ?

        • Aishwarya April 11, 2018 at 1:59 pm #

          Im getting error saying

          Cannot perform reduce with flexible type

          While comparing algos using boxplots

          • Jason Brownlee April 11, 2018 at 4:26 pm #

            Sorry, I have not seen this error before. Are you able to confirm that your environment is up to date?

          • Ycyusa August 5, 2018 at 9:31 am #

            I followed your steps and I got the similar result as Aishwarya

            SVM: 0.991667 (0.025000)
            KNN: 0.983333 (0.033333)
            CART: 0.975000 (0.038188)

        • Jason Brownlee April 11, 2018 at 4:25 pm #

          The API may have changed since I wrote this post. This in turn may have resulted in small changes in predictions that are perhaps not statistically significant.

          • Aishwarya April 11, 2018 at 10:50 pm #

            Ive done this on kaggle.
            Under ML kernal

            http://Www.kaggle.com/aishuvenkat09

          • Aishwarya April 11, 2018 at 10:54 pm #

            Sorry

            http://Www.kaggle.com/aishwarya09

          • Jason Brownlee April 12, 2018 at 8:43 am #

            Well done!

          • manohar April 23, 2018 at 6:49 pm #

            Hi ,
            I have same issues with above our friends discussed
            LR: 0.966667 (0.040825)
            LDA: 0.975000 (0.038188)
            KNN: 0.983333 (0.033333)
            CART: 0.983333 (0.033333)
            NB: 0.975000 (0.053359)
            SVM: 0.991667 (0.025000)

            In that svm has more accuracy when comapre to rest
            so i go ahead svm

          • Jason Brownlee April 24, 2018 at 6:26 am #

            Yes.

        • Ali May 10, 2018 at 8:58 am #

          Yes. I got the same. Dr. Jason had mentioned that results might vary.

        • Sai Prasad September 14, 2018 at 5:08 pm #

          I also have the same result.
          LR: 0.966667 (0.040825)
          LDA: 0.975000 (0.038188)
          KNN: 0.983333 (0.033333)
          CART: 0.983333 (0.033333)
          NB: 0.975000 (0.053359)
          SVM: 0.991667 (0.025000)

      • bharat May 19, 2018 at 9:45 pm #

        cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)

        sir I am getting an error in this line of code. What should I do?

        • Jason Brownlee May 20, 2018 at 6:38 am #

          What error?

          • sawsen November 12, 2019 at 8:38 pm #

            File “”, line 1, in
            NameError: name ‘model’ is not defined

          • Jason Brownlee November 13, 2019 at 5:40 am #

            Looks like you may have missed a few lines of code.

            Perhaps try copy-pasting the complete example at the end of each section?

        • AVNEESH UPADHAYAY June 25, 2018 at 5:00 am #

          I think cv may be equal to the number of times you want to perform k-fold cross validation for e.g. 10,20etc. and in scoring parameter, you need to mention which type of scoring parameter you want to use for example ‘accuracy’.
          Hope this might help….

        • Ved Anshu September 21, 2018 at 4:20 pm #

          Bro kindly use train_test_split() in the place of model_selection

        • David H. October 17, 2019 at 10:36 am #

          Try this
          cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=None)

          It worked for me!

        • Bibhu Das December 11, 2019 at 12:57 am #

          put the kfold = , and cv_results = , part inside the for loop it will work fine.

      • Mohammed March 25, 2019 at 2:54 pm #

        thank you so much really its very useful

        in the last step you are used KNN to make predictions why you are used KNN can we use SVM
        and can we make compare with all the models in predictions ?

        • Jason Brownlee March 26, 2019 at 7:58 am #

          It is just an example, you can make predictions with any model you wish.

          Often we prefer simpler models (like knn) over more complex models (like svm).

      • TAPSOBA Abdou March 20, 2020 at 11:17 pm #

        Hi Jason
        I followed your steps but I’m getting error. What should I do? Best regards
        >>> # Spot Check Algorithms
        ... models = []
        >>> models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
        >>> models.append(('LDA', LinearDiscriminantAnalysis()))
        >>> models.append(('KNN', KNeighborsClassifier()))
        >>> models.append(('CART', DecisionTreeClassifier()))
        >>> models.append(('NB', GaussianNB()))
        >>> models.append(('SVM', SVC(gamma='auto')))
        >>> # evaluate each model in turn
        ... results = []
        >>> names = []
        >>> for name, model in models:
        ... kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
        File "", line 2
        kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
        ^
        IndentationError: expected an indented block
        >>> cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
        Traceback (most recent call last):
        File "", line 1, in
        NameError: name 'model' is not defined
        >>> results.append(cv_results)
        Traceback (most recent call last):
        File "", line 1, in
        NameError: name 'cv_results' is not defined

      • Dario Gomez January 3, 2021 at 3:25 pm #

        Could you elaborate a bit more about the difference between prediction and projection?

        For example I got a data set that I collected throughout a year, and I would like to predict/project what will happen next year.

      • Shantanu Bhayre March 22, 2021 at 3:27 am #

        sir i want to work on crop prices data for crop price pridiction project for my minor project but the crop price data does not find plese help me sir and send me crop price csv file link

      • Sophie May 4, 2021 at 4:39 am #

        Hello Jason,
        Thank you for this amazing tutorial, it helped me to gain confidence:
        Please see my results:
        LR: 0.941667 (0.065085)
        LDA: 0.975000 (0.038188)
        KNN: 0.958333 (0.041667)
        NB: 0.950000 (0.055277)
        SVM: 0.983333 (0.033333)

        predictions: [‘Iris-setosa’ ‘Iris-versicolor’ ‘Iris-versicolor’ ‘Iris-setosa’
        ‘Iris-virginica’ ‘Iris-versicolor’ ‘Iris-virginica’ ‘Iris-setosa’
        ‘Iris-setosa’ ‘Iris-virginica’ ‘Iris-versicolor’ ‘Iris-setosa’
        ‘Iris-virginica’ ‘Iris-versicolor’ ‘Iris-versicolor’ ‘Iris-setosa’
        ‘Iris-versicolor’ ‘Iris-versicolor’ ‘Iris-setosa’ ‘Iris-setosa’
        ‘Iris-versicolor’ ‘Iris-versicolor’ ‘Iris-virginica’ ‘Iris-setosa’
        ‘Iris-virginica’ ‘Iris-versicolor’ ‘Iris-setosa’ ‘Iris-setosa’
        ‘Iris-versicolor’ ‘Iris-virginica’]
        0.9666666666666667
        [[11 0 0]
        [ 0 12 1]
        [ 0 0 6]]
        precision recall f1-score support

        Iris-setosa 1.00 1.00 1.00 11
        Iris-versicolor 1.00 0.92 0.96 13
        Iris-virginica 0.86 1.00 0.92 6

        accuracy 0.97 30
        macro avg 0.95 0.97 0.96 30
        weighted avg 0.97 0.97 0.97 30

      • Stone Bridge August 10, 2021 at 4:24 pm #

        The program runs through, but the calculated result is that CART and SVM have the highest accuracy
        LR: 0.966667 (0.040825)
        LDA: 0.975000 (0.053359)
        KNN: 0.983333 (0.050000)
        CART: 0.991667 (0.025000)
        NB: 0.975000 (0.038188)
        SVM: 0.991667 (0.025000)

        • Adrian Tam August 11, 2021 at 6:39 am #

          Nice work. Thanks.

    • Hasnain July 8, 2017 at 8:55 pm #

      I have installed all libraries that were in your How to Setup Python environment… blog. All went fine but when I run the starting imports code I get an error at the first line: "ModuleNotFoundError: No module named 'pandas'". But I did install it using the "pip install pandas" command. I am working on a Windows machine.

      • Jason Brownlee July 9, 2017 at 10:53 am #

        Sorry to hear that. Consider rebooting your machine?

        • Sheila Dawn August 9, 2017 at 5:43 am #

          I had the same problem initially, because I made 2 python files.. one for loading the libraries, and another for loading the iris dataset.

          Then I decided to put the two commands in one python file, it solved problem. 🙂

          • Jason Brownlee August 9, 2017 at 6:43 am #

            Yes, all commands go in the one file. Sorry for the confusion.

      • Dan Fiorino July 16, 2017 at 2:37 am #

        Hasnain, try setting the environment variable PYTHON_PATH and PATH to include the path to the site packages of the version of python you have permission to alter

        export PYTHONPATH=”$PYTHONPATH:/path/to/Python/2.7/site-packages/”
        export PATH=”$PATH:/path/to/Python/2.7/site-packages/”

        obviously replacing “/path/to” with the actual path. My system Python is in my /Users//Library folder but I’m on a Mac.

        You can add the export lines to a script that runs when you open a terminal (“~/.bash_profile” if you use BASH).

        That might not be 100% right, but it should help you on your way.

        • Jason Brownlee July 16, 2017 at 8:00 am #

          Thanks for posting the tip Dan, I hope it helps.

          • Jason Robinette September 7, 2017 at 11:16 am #

            got it to work have no idea how but it worked! I am like the kid at t-ball that closes his eyes and takes a swing!

          • Jason Brownlee September 7, 2017 at 12:58 pm #

            I’m glad to hear that!

      • Tanya September 30, 2017 at 11:08 am #

        I am starting at square 0, and after clearing a first few hurdles, I was not even able to install the libraries at all… (as a newb), I didn’t see where I even GO to import this:
        # Load libraries
        import pandas
        from pandas.tools.plotting import scatter_matrix
        import matplotlib.pyplot as plt
        from sklearn import model_selection
        from sklearn.metrics import classification_report
        from sklearn.metrics import confusion_matrix
        from sklearn.metrics import accuracy_score
        from sklearn.linear_model import LogisticRegression
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.naive_bayes import GaussianNB
        from sklearn.svm import SVC

        • Jason Brownlee October 1, 2017 at 9:04 am #

          Perhaps this step-by-step tutorial will help you set up your environment:
          https://machinelearningmastery.mystagingwebsite.com/setup-python-environment-machine-learning-deep-learning-anaconda/

        • KASINATH PS December 7, 2017 at 8:16 pm #

          if u r using python 3

          save all the commands as a py file
          then in a pythin shell enter

          exec(open("[path to file with name]").read())

          if u open shell in the same path as the saved thing
          then u only need to enter the filename alone

          ex:
          lets say i saved it as load.py

          then

          exec(open("load.py").read())

          this will execute all commands in the current shell

        • Rahul December 7, 2017 at 10:28 pm #

          Hi Tanya,
          This tutorial is so intuitive that I went through this tutorial with a breeze.
          Install PyCharm from JetBrains available here https://www.jetbrains.com/pycharm/download/download-thanks.html?platform=windows&code=PCC
          Install PIP (The de-facto python package manager) and then click “Terminal” in PyCharm to bring up the interactive DOS like terminal. Once you have installed PIP then there you can issue the following commands:
          pip install numpy
          pip install scipy
          pip install matplotlib
          pip install pandas
          pip install sklearn
          All other steps in the tutorial are valid and do not need a single line of change apart from where its mentioned

          from pandas.tools.plotting import scatter_matrix , change it to

          from pandas.plotting import scatter_matrix

          • Jason Brownlee December 8, 2017 at 5:39 am #

            Thanks for the tips Rahul.

          • Murtaza December 17, 2017 at 11:05 am #

            For a beginner i believe Anacondas Jupyter notebooks would be the best option. As they can include markdown for future reference which is essential as beginner (backpropogation :p). But again varies person to person

          • Jason Brownlee December 18, 2017 at 5:19 am #

            I find notebooks confuse beginners more than help.

            Running a Python script on the command line is so much simpler.

          • Jason March 1, 2018 at 4:18 pm #

            Except for me, on Debian Stretch with pandas 0.19.2, I had to use

            from pandas.tools.plotting import scatter_matrix

          • Jason Brownlee March 2, 2018 at 5:30 am #

            You must update your version of Pandas.

        • avanish March 25, 2018 at 7:11 pm #

          use jupyter notebook …there all the essential libraries are preinstalled

        • Anmoldeep1509 October 31, 2018 at 6:50 am #

          I also did a similar mistake, I am also a newbie to python, and wrote those import statements in the separate file, and imported the created file, without knowing how imports work…after your reply realized my mistake and now back on track thanks!

      • Tushar June 22, 2018 at 4:50 am #

        I also had problems installing modules on windows. Although, there was no error of any kind if installed from PyCharm IDE.
        Also, use 32-bit python interpreter if you wanna use NLTK. It can be done even on 64-bit version, but was not worth the time it would it need.

      • Karan sing March 26, 2019 at 8:28 pm #

        If you are working on virtual environment then you have to make script first and run it by activating the virtual environment,
        If you are not working on virtual environment then run your scripts on time

    • Yuvraj July 13, 2018 at 1:56 am #

      Could you please go into the mathematical concept behind KNN and why the accuracy resulted in the highest score? Thank you

    • Mario October 4, 2018 at 8:13 pm #

      I like your tutorial for the machine learning in python but at this moment I am stuck. Here is where I am
      # Compare Algorithms
      fig = plt.figure()
      fig.suptitle('Algorithm Comparison')
      ax = fig.add_subplot(111)
      plt.boxplot(results)
      ax.set_xticklabels(names)
      plt.show()

      This is the answer I am getting from it

      TypeError Traceback (most recent call last)
      in ()
      3 fig.suptitle('Algorithm Comparison')
      4 ax = fig.add_subplot(111)
      ----> 5 plt.boxplot(results)
      6 ax.set_xticklabels(names)
      7 plt.show()

      [... long traceback through matplotlib and numpy internals omitted ...]

      TypeError: cannot perform reduce with flexible type

      HOW CAN I FIX THIS?

      • Jason Brownlee October 5, 2018 at 5:33 am #

        Perhaps post your code and error to stackoverflow.com?

      • Brandon January 23, 2019 at 4:37 pm #

        I also got a traceback on this section:
        TypeError: cannot perform reduce with flexible type

        Quick check on stackoverflow shows that plt.boxplot() cannot accept strings. Personally, I had an error in section 5.4 line 15.

        Wrong code: results.append(results)
        Correct: results.append(cv_results)

        woohoo for tracebacks and wrong data-types. Hope someone finds this helpful.

        • Jason Brownlee January 24, 2019 at 6:40 am #

          Are you able to confirm that your python libraries are up to date?

    • Ademola November 27, 2018 at 7:49 am #

      Well done

    • Meca April 1, 2021 at 12:38 am #

      Thank you sir!

  2. Jan de Lange June 20, 2016 at 10:43 pm #

    Nice work Jason. Of course there is a lot more to tell about the code and the Models applied if this is intended for people starting out with ML (like me). Rather than telling which “button to press” to make work, it would be nice to know why also. I looked at a sample of you book (advanced) if you are covering the why also, but it looks like it’s limited?

    On this particular example, in my case SVM reached 99.2% and was thus the best Model. I gather this is because the test and training sets are drawn randomly from the data.

    • Jason Brownlee June 21, 2016 at 7:04 am #

      This tutorial and the book are laser focused on how to use Python to complete machine learning projects.

      They already assume you know how the algorithms work.

      If you are looking for background on machine learning algorithms, take a look at this book:
      https://machinelearningmastery.mystagingwebsite.com/master-machine-learning-algorithms/

      • Alan July 26, 2017 at 10:50 pm #

        Jan de Lange and Jason,

        Before anything else, I truly like to thank Jason for this wonderful, concise and practical guideline on using ML for solving a predictive problem.

        In terms of the example you have provided, I can confirm ‘Jan de Lange’ ‘s outcome. I’ve got the same accuracy result for SVM (0.991667 to be precise). I’ve just upgraded the Canopy version I had installed on my machine to version 2.1.3.3542 (64 bit) and your reasoning makes sense that this discrepancy could be because of its random selection of data. But this procedure could open up a new ‘can of warm’ as some say. since the selection of best model is on the line.

        Thank you again Jason for this practical article on ML.

    • Per December 15, 2017 at 7:36 pm #

      Got it working too, changing the scatter_matrix import like Rahul did.
      But I also had to install tkinter first (yum install tkinter).

      Very nice tutorial, Jason!

  3. Nil June 25, 2016 at 12:42 am #

    Awesome, I have tested the code it is impressive. But how could I use the model to predict if it is Iris-setosa or Iris-versicolor or Iris-virginica when I am given some values representing sepal-length, sepal-width, petal-length and petal-width attributes?

    • Jason Brownlee June 25, 2016 at 5:09 am #

      Great question. You can call model.predict() with some new data.

      For an example, see Part 6 in the above post.

      • JamieFox March 28, 2017 at 6:38 am #

        Dear Jason Brownlee, I was thinking about the same question as Nil. To be precise, I was wondering how I can know, after having seen that my model has a good fit, which values of sepal-length, sepal-width, petal-length and petal-width correspond to Iris-setosa etc.
        For instance, if I have p predictors and two classes, how can I know which values of the predictors blend to one class or the other. Knowing the value of predictors allows me to use the model in the daily operativity. Thx

        • Jason Brownlee March 28, 2017 at 8:27 am #

          Not knowing the statistical relationship between inputs and outputs is one of the down sides of using neural networks.

          • JamieFox March 29, 2017 at 7:03 am #

            Hi Mr Jason Brownlee, thanks for your answer. So do all algorithms, such as SVM, LDA, random forest, have this drawback? Can you suggest something else?
            Because logistic regression is not like this, or am I wrong?

          • Jason Brownlee March 29, 2017 at 9:14 am #

            All algorithms have limitations and assumptions. For example, Logistic Regression makes assumptions about the distribution of variates (Gaussian) and more:
            https://en.wikipedia.org/wiki/Logistic_regression

            Nevertheless, we can make useful models (skillful) even when breaking assumptions or pushing past limitations.

  4. Sujon September 6, 2016 at 8:19 am #

    Dear Sir,

    It seems I’m in the right place in right time! I’m doing my master thesis in machine learning from Stockholm University. Could you give me some references for laughter audio conversation to CSV file? You can send me anything on [email protected]. Thanks a lot and wish your very best and will keep in touch.

  5. Sujon September 6, 2016 at 8:32 am #

    Sorry I mean laughter audio to CSV conversion.

    • Jason Brownlee September 6, 2016 at 9:49 am #

      Sorry, I have not seen any laughter audio to CSV conversion tools/techniques.

      • Sujon May 10, 2017 at 1:02 pm #

        Hi again, do you have any publication of this article “Your First Machine Learning Project in Python Step-By-Step”? Or any citation if you know? Thanks.

  6. Roberto U September 19, 2016 at 9:17 am #

    Sweet way of condensing monstrous amount of information in a one-way street. Thanks!

    Just a small thing, you are creating the Kfold inside the loop in the cross validation. Then, you use the same seed to keep the comparison across predictors constant.

    That works, but I think it would be better to take it out of the loop. Not only is more efficient, but it is also much immediately clearer that all predictors are using the same Kfold.

    You can still justify the use of the seeds in terms of replicability; readers getting the same results on their machines.

    Thanks again!

  7. Francisco September 20, 2016 at 2:02 am #

    Hello Jason.
    Thank you so much for your help with Machine Learning and congratulations for your excellent website.

    I am a beginner in ML and DeepLearning. Should I download Python 2 or Python 3?

    Thank you very much.

    Francisco

    • Jason Brownlee September 20, 2016 at 8:33 am #

      I use Python 2 for all my work, but my students report that most of my examples work in Python 3 with little change.

  8. ShawnJ October 11, 2016 at 5:24 am #

    Jason,

    Thank you so much for putting this together. I am been a software developer for almost two decades and am getting interested in machine learning. Found this tutorial accurate, easy to follow and very informative.

  9. Wendy G October 14, 2016 at 5:37 am #

    Jason,

    Thanks for the great post! I am trying to follow this post by using my own dataset, but I keep getting this error “Unknown label type: array ([some numbers from my dataset])”. So what’s the problem on earth, any possible solutions?

    Thanks,

    • Jason Brownlee October 14, 2016 at 9:08 am #

      Hi Wendy,

      Carefully check your data. Maybe print it on the screen and inspect it. You may have some string values that you may need to convert to numbers using data preparation.

  10. fara October 20, 2016 at 7:15 am #

    hi thanks for great tutorial, i’m also new to ML…this really helps but i was wondering what if we have non-numeric values? i have mixture of numeric and non-numeric data and obviously this only works for numeric. do you also have a tutorial for that or would you please send me a source for it? thank you

    • Jason Brownlee October 20, 2016 at 8:41 am #

      Great question fara.

      We need to convert everything to numeric. For categorical values, you can convert them to integers (label encoding) and then to new binary features (one hot encoding).

  11. Mazhar Dootio October 23, 2016 at 9:14 pm #

    Hello Jason
    Thank you for publishing this great machine learning tutorial.
    It is really awesome awesome awesome………..!
    I test your tutorial on python-3 and it works well but what I face here is to load my data set from my local drive. I followed your given instructions but couldn’t be successful.
    My syntax is as under:

    import unicodedata
    url = open(r'C:\Users\mazhar\Anaconda3\Lib\site-packages\sindhi2.csv', encoding='utf-8').readlines()
    names = ['class', 'sno', 'gender', 'morphology', 'stem', 'fword']
    dataset = pandas.read_csv(url, names=names)

    python-3 jupyter notebook does not load this. Kindly help me in this regard.

  12. Mazhar Dootio October 25, 2016 at 3:22 am #

    Dear Jason
    Thank you for response
    I am using Python 3 with anaconda jupyter notebook
    so which python version you would like to suggest me and kindly write here syntax of opening local dataset file from local drive that how can I load utf-8 dataset file from my local drive.

    • Jason Brownlee October 25, 2016 at 8:32 am #

      Hi Mazhar, I teach using Python 2.7 with examples from the command line.

      Many of my students report that the code works in Python 3 and in notebooks with little or no changes.

    • Kenny October 11, 2017 at 3:50 am #

      try with this command:

      df = pd.read_csv(file, encoding='latin-1')  # if your CSV uses a separator such as ';' or '|', also pass e.g. sep=';'
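
      For Mazhar's original problem above, the usual fix is to pass the file path straight to read_csv rather than the result of open(...).readlines(); a sketch using his path:

      import pandas as pd
      names = ['class', 'sno', 'gender', 'morphology', 'stem', 'fword']
      dataset = pd.read_csv(r'C:\Users\mazhar\Anaconda3\Lib\site-packages\sindhi2.csv', names=names, encoding='utf-8')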

  13. Andy October 27, 2016 at 11:59 pm #

    Great tutorial but perhaps I’m missing something here. Let’s assume I already know what model to use (perhaps because I know the data well… for example).

    knn = KNeighborsClassifier()
    knn.fit(X_train, Y_train)

    I then use the models to predict:
    print(knn.predict(an array of variables of a record I want to classify))

    Is this where the whole ML happens?
    knn.fit(X_train, Y_train)

    What's the difference between this and, say, a non-ML model/algorithm? Is it that in a non-ML model I have to find the coefficients/parameters myself by statistical methods, and in the ML model the machine does that itself?
    If this is the case, then it seems to me that a researcher/coder did most of the work for me and wrapped it in a nice function. Am I missing something? What is special here?

    • Jason Brownlee October 28, 2016 at 9:14 am #

      Hi Andy,

      Yes, your comment is generally true.

      The work is in the library and choice of good libraries and training on how to use them well on your project can take you a very long way very quickly.

      Stats is really about small data and understanding the domain (descriptive models). Machine learning, at least in common practice, is leaning towards automation with larger datasets and making predictions (predictive modeling) at the expense of model interpretation/understandability. Prediction performance trumps traditional goals of stats.

      Because of the automation, the focus shifts more toward data quality, problem framing, feature engineering, automatic algorithm tuning and ensemble methods (combining predictive models), with the algorithms themselves taking more of a backseat role.

      Does that make sense?

      • Andy November 3, 2016 at 10:36 pm #

        It does make sense.
        You mentioned ‘data quality’. That’s currently my field of work. I’ve been doing this statistically until now, and very keen to try a different approach. As a practical example how would you use ML to spot an error/outlier using ML instead of stats?
        Let's say I have a large dataset containing trees: each tree record contains a species, height, location, crown size, age, etc… (ah! suspiciously similar to the iris flowers dataset 🙂 Is ML a viable method for finding incorrect data and replacing it with an “estimated” value? The answer I guess is yes. For species I could use an almost identical method to what you presented here; BUT what about continuous values such as tree height?

        • Jason Brownlee November 4, 2016 at 9:08 am #

          Hi Andy,

          Maybe “outliers” are instances that cannot be easily predicted or assigned ambiguous predicted probabilities.

          Instance values can be “fixed” by estimating new values, but whole instances can also be pulled out if data is cheap.
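
          As a rough illustration of that idea, a hedged sketch on made-up tree data: predict the height column from the other columns and flag rows with unusually large prediction errors.

          import numpy as np
          from sklearn.ensemble import RandomForestRegressor

          rng = np.random.RandomState(7)
          X = rng.rand(200, 3)                          # made-up features, e.g. crown size, age, girth
          height = 10 * X[:, 0] + 5 * X[:, 1] + 0.1 * rng.randn(200)
          height[0] = 99.0                              # plant one obviously bad record

          model = RandomForestRegressor(n_estimators=100, random_state=7).fit(X, height)
          residuals = np.abs(height - model.predict(X))
          print(np.argmax(residuals))                   # the planted outlier (row 0) stands out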

  14. Shailendra Khadayat October 30, 2016 at 2:23 pm #

    Awesome work Jason. This was very helpful and expect more tutorials in the future.

    Thanks.

  15. Shuvam Ghosh November 16, 2016 at 12:13 am #

    Awesome work. Students need to know what the end results will look like. They need to get motivated to learn, and one of the most effective means of getting motivated is being able to see and experience the wonderful end results. Honestly, if I were made to study algorithms and understand them I would get bored. But now, since I know what amazing results they give, they will serve as driving forces for me to get into the details and do more research. This is where I hate the orthodox college way of teaching: first get the theory right, then apply it. No way. I need to see things first to get motivated.

    • Jason Brownlee November 16, 2016 at 9:29 am #

      Thanks Shuvam,

      I’m glad my results-first approach gels with you. It’s great to have you here.

  16. Puneet November 17, 2016 at 12:08 am #

    Thanks Jason,

    While I am trying to complete this:

    # Spot Check Algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

    it is showing the below error:

    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    ^
    IndentationError: expected an indented block

  17. Puneet November 17, 2016 at 12:30 am #

    Thanks Jason,

    I am new to ML and need your help so I can run this.

    I have followed the steps, but when trying to build and evaluate the 5 models using this:

    —————————————-
    # Spot Check Algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    ————————————————————————————————

    I am facing the below-mentioned issue:
    File “”, line 13
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    ^
    IndentationError: expected an indented block

    —————————————
    Kindly help.

    • Martin November 18, 2016 at 5:18 am #

      Puneet, you need to indent the block (tab or four spaces to the right). That is the way of building a block in Python
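
      i.e. something like this (the same code, with the loop body indented):

      for name, model in models:
          kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
          cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
          results.append(cv_results)
          names.append(name)
          msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
          print(msg)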

      • Casey December 2, 2018 at 3:58 am #

        I am also having this problem, I have indented the code as instructed but nothing executes. It seems to be waiting for more input. I have googled different script endings but nothing happens. Is there something I am missing to execute this script?

        >>> for name, model in models:
        … kfold = model_selection.KFold(n_splits=10, random_state=seed)
        … cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        … results.append(cv_results)
        … names.append(name)
        … msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        … print(msg)

  18. george soilis November 17, 2016 at 10:00 pm #

    Just another Python noob here, sending many regards and thanks to Jason :):)

  19. sergio November 22, 2016 at 3:29 pm #

    Does this tutorial work with other datasets? I'm trying to work on a small assignment and I want to use Python.

    • Jason Brownlee November 23, 2016 at 8:50 am #

      It should provide a great template for new projects sergio.

      • Brian February 28, 2018 at 4:10 am #

        I tried to use another dataset. I am not sure what I imported, but even after changing the names, I still get the petal stuff as output. All of it. I commented out that part of the code and even then it gives me those old outputs.

  20. Albert November 26, 2016 at 1:55 am #

    A very awesome step-by-step for me! Even though I am a beginner in Python, this gave me many things about machine learning ~ supervised ML. I appreciate your sharing!!

  21. Umar Yusuf November 27, 2016 at 4:04 am #

    Thank you for the step-by-step instructions. This will go a long way for newbies like me getting started with machine learning.

    • Jason Brownlee November 27, 2016 at 10:21 am #

      You’re welcome, I’m glad you found the post useful Umar.

      • Shiva Andure March 18, 2019 at 3:08 pm #

        Hello Jason,

        from __future__ import division
        models = []
        models.append((‘LR’, LogisticRegression(solver=’liblinear’, multi_class=’ovr’)))
        models.append((‘LDA’, LinearDiscriminantAnalysis()))
        models.append((‘KNN’, KNeighborsClassifier()))
        models.append((‘CART’, DecisionTreeClassifier()))
        models.append((‘NB’, GaussianNB()))
        models.append((‘SVM’, SVC(gamma=’auto’)))
        # evaluate each model in turn
        results = []
        names = []
        for name, model in models:
        kfold = model_selection.KFold(n_splits=10, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
        print(msg)

        I am getting an error of “ZeroDivisionError: float division by zero”.

  22. Mike P November 30, 2016 at 6:29 pm #

    Hi Jason,

    Really nice tutorial. I had one question which has me confused. Once you choose your best model (in this instance KNN), you then train a new model to be used to make predictions against the validation set. Should one not perform k-fold cross-validation on this model to ensure we don't overfit?

    If this is correct, how would you implement it? From my understanding, cross_val_score will not allow one to generate a confusion matrix.

    I think this is the only thing that I have struggled with in using scikit-learn; if you could help me it would be much appreciated.

    • Jason Brownlee December 1, 2016 at 7:26 am #

      Hi Mike. No.

      Cross-validation is just a method to estimate the skill of a model on new data. Once you have the estimate you can get on with things, like confirming you have not fooled yourself (hold out validation dataset) or make predictions on new data.

      The skill you report is the cross val skill with the mean and stdev to give some idea of confidence or spread.

      Does that make sense?

      • Mike December 2, 2016 at 1:30 am #

        Hi Jason,

        Thanks for the quick response. So to make sure I understand: one would use cross validation to get an estimate of the skill of a model (the mean of the cross val scores), or to choose the correct hyperparameters for a particular model.

        Once you have this information you can just go ahead and train the chosen model with the full training set and test it against the validation set or new data?

        • Jason Brownlee December 2, 2016 at 8:17 am #

          Hi Mike. Correct.

          Additionally, if the validation result confirms your expectations, you can go ahead and train the model on all data you have including the validation dataset and then start using it in production.

          This is a very important topic. I think I’ll write a post about it.
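
          A minimal sketch of that final step (the iris data stands in for “all the data you have”):

          from sklearn.datasets import load_iris
          from sklearn.neighbors import KNeighborsClassifier

          X, Y = load_iris(return_X_y=True)    # all available data
          final_model = KNeighborsClassifier().fit(X, Y)
          print(final_model.predict(X[:3]))    # stand-in for new, unseen rows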

  23. Sahana Venkatesh November 30, 2016 at 8:15 pm #

    This is amazing 🙂 You boosted my morale

  24. Jhon November 30, 2016 at 8:27 pm #

    Hi,
    While doing data visualization and running the dataset.plot(……..) commands, I am getting the following error. Kindly tell me how to fix it:

    array([[,
    ],
    [,
    ]], dtype=object)

    • Jason Brownlee December 1, 2016 at 7:28 am #

      Looks like no data Jhon. It also looks like it’s printing out an object.

      Are you running in a notebook or on the command line? The code was intended to be run directly (e.g. command line).

  25. Brendon A. Kay December 1, 2016 at 4:20 am #

    Hi Jason,

    Great tutorial. I am a developer with a computer science degree and a heavy interest in machine learning and mathematics, although I don’t quite have the academic background for the latter except for what was required in college. So, this website has really sparked my interest as it has allowed me to learn the field in sort of the “opposite direction”.

    I did notice when executing your code that there was a deprecation warning for the sklearn.cross_validation module. They recommend switching to sklearn.model_selection.

    When switching the modules I adjusted the following line…

    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)

    to…

    kfold = model_selection.KFold(n_folds=num_folds, random_state=seed)

    … and it appears to be working okay. Of course, I had switched all other instances of cross_validation as well, but it seemed to be that the KFold() method dropped the n (number of instances) parameter, which caused a runtime error. Also, I dropped the num_instances variable.

    I could have missed something here, so please let me know if this is not a valid replacement, but thought I’d share!

    Once again, great website!

    • Jason Brownlee December 1, 2016 at 7:33 am #

      Thanks for the support and the kind words Brendon. I really appreciate it (you made my day!)

      Yes, the API has changed/is changing and your updates to the tutorial look good to me, except I think n_folds has become n_splits.

      I will update this example for the new API very soon.

  26. Sergio December 1, 2016 at 3:41 pm #

    I’m still having a little trouble understanding step 5.1. I’m trying to apply this tutorial to a new data set but, when I try to evaluate the models from 5.3 I don’t get a result.

    • Jason Brownlee December 2, 2016 at 8:13 am #

      What is the problem exactly Sergio?

      Step 5.1 should create a validation dataset. You can confirm the dataset by printing it out.

      Step 5.3 should print the result of each algorithm as it is trained and evaluated.

      Perhaps check for a copy-paste error or something?

      • sergio December 2, 2016 at 9:13 am #

        Does this tutorial work the exact same way for other datasets? Because I'm not using the Hello World dataset.

        • Jason Brownlee December 3, 2016 at 8:23 am #

          The project template is quite transferable.

          You will need to adapt it for your data and for the types of algorithms you want to test.

  27. Jean-Baptiste Hubert December 11, 2016 at 12:17 am #

    Hi Sir,
    Thank you for the information.
    I am currently a student, in Engineering school in France.
    I am working on a data mining project; indeed, I have a lot of data (40 GB) about the prices of the stocks of many companies in the CAC40.
    My goal is to predict the evolution of the yields, and I think that a neural network could be useful.
    My idea is: I take for X the yields from “t=0” to “t=n” and for Y the yields from “t=1” to “t=n”, and the program should find a relation between the data.
    Is that possible? Is it a good way to predict the evolution of the yield?
    Thank you for your time
    Hubert
    Jean-Baptiste

  28. Ernest Bonat December 15, 2016 at 5:33 pm #

    Hi Jason,

    If I include a new item in the models array as:

    models.append(('LNR – Linear Regression', LinearRegression()))

    with the library:

    from sklearn.linear_model import LinearRegression

    I got an error in \sklearn\utils\validation.py, line 529, in check_X_y:
    y = y.astype(np.float64)

    as:

    ValueError: could not convert string to float: 'Iris-setosa'

    Let me know how best to fix that! As you can see from my code, I would like to include the linear regression algorithm in my model array too!

    Thank you for your help,

    Ernest

    • Jason Brownlee December 16, 2016 at 5:39 am #

      Hi Ernest, it is a classification problem. We cannot use LinearRegression.

      Try adding another classification algorithm to the list.
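
      For example, one hedged option is to add a random forest to the spot-check list:

      from sklearn.ensemble import RandomForestClassifier
      models.append(('RF', RandomForestClassifier(n_estimators=100)))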

      • oumaima December 9, 2017 at 11:29 am #

        Hi Jason,
        I am new to ML and need your help so I can run this.

        >>> from matplotlib import pyplot
        Traceback (most recent call last):
        File “”, line 1, in
        File “c:\python27\lib\site-packages\matplotlib\pyplot.py”, line 29, in
        import matplotlib.colorbar
        File “c:\python27\lib\site-packages\matplotlib\colorbar.py”, line 32, in
        import matplotlib.artist as martist
        File “c:\python27\lib\site-packages\matplotlib\artist.py”, line 16, in
        from .path import Path
        File “c:\python27\lib\site-packages\matplotlib\path.py”, line 25, in
        from . import _path, rcParams
        ‘ImportError: DLL load failed: %1 n\x92est pas une application Win32 valide.\n’

  29. Gokul Iyer December 20, 2016 at 2:29 pm #

    Great tutorial! Quick question: when we create the models, we do models.append((name of algorithm, algorithm function)); is models an array? Because it seems like a dictionary, since we have a key-value mapping (algorithm name and algorithm function). Thank you!

    • Jason Brownlee December 20, 2016 at 2:47 pm #

      It is a list of tuples where each tuple contains a string name and a model object.
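
      In miniature (a sketch):

      from sklearn.naive_bayes import GaussianNB
      from sklearn.neighbors import KNeighborsClassifier

      models = [('KNN', KNeighborsClassifier()), ('NB', GaussianNB())]
      for name, model in models:   # tuple unpacking: a string name and a model object
          print(name, model)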

  30. Sasanka ghosh December 21, 2016 at 4:55 am #

    Hi Jason / any gurus,
    Good post, and I will follow it, but my question may be a little off track.
    I am asking this question as I am a data modeller / aspiring data architect.

    I feel that as a guru/gurus you can clarify my doubt. The question is at the end.

    In current Data management environment

    1. Data architecture /Physical implementation and choosing appropriate tools,back end,storage,no sql, SQL, MPP, sharding, columnar ,,scale up/out ,distributed processing etc .

    2. In addition to DB based procedural languages proficiency at at least one of the following i.e. Java/Python/Scala etc.

    3. Then comes this AI,Machine learning ,neural Networks etc .

    My question is regarding point 3 .

    I believe those are algorithms which need deep functional knowledge and years of experience to add any value to a business.

    Those are independent of data models and their physical implementation, and are part of the business user domain, not the data architecture domain.

    If I take your above example, say now 10k users are trying to do a similar kind of thing, then points 1 and 2 will be the data architect's domain and point 3 will be the business analyst's domain; maybe point 2 can overlap between them to some extent.

    A data architect need not be hands-on/proficient in algorithms, i.e. should have just some basic idea, as the data architect's job is not to invent business logic but to implement the business logic physically to satisfy business users/analysts.

    Am I correct in my assumption, as I find that certain things are nearly mutually exclusive, and expectations/benchmarks should be set right?

    Regards
    sasanka ghosh

    • Jason Brownlee December 21, 2016 at 8:46 am #

      Hi Sasanka, sorry, I don’t really follow.

      Are you able to simplify your question?

      • Sasanka ghosh December 21, 2016 at 9:25 pm #

        Hi Jason,
        Many thanks that you bothered to reply.

        I tried to rephrase and be concise, but it is still verbose; apologies for that.

        Is it expected from a data architect to be algorithm expert as well as data model/database expert?

        Algorithms are business centric as well as specific to particular domain of business most of the times.

        Giving you an example, i.e. SHORTEST PATH (take it as just an example in making my point):
        An organization is providing an app to provide that service.
        An organization is providing an app to provide that service .

        CAVEAT: Someone from a computer science department may say that it is the basic thing you learn, but I feel it is still an algorithm, not a data structure.

        if we take the above scenario in simplistic term the requirement is as follows

        1. there will be, say, a million registered users
        2. one can say at least 10% are using the app at the same time
        3. at any time they can change their direction as per contingency, like a military op, so dumping the partial weighted graph to their device is not an option, i.e. users will be connected to the main server/server cluster.
        4. the challenge is storing the spatial data in the DB in the correct data model, with scale out and fault tolerance.
        5. implement the shortest path algo and display it using Python/Java/Cypher/Oracle Spatial/Titan etc. dynamically.

        My question is: can a data architect work on this project who does not know the shortest path algorithm, but has sufficient knowledge in other areas, with the algorithm provided to him/her in verbose terms to implement?

        I am asking this question as nowadays people are offering ready-made courses, i.e. machine learning, NLP, data scientist etc., and the scenario is confusing.
        I feel it is misleading, as no one can become an expert in a science overnight, and vice versa.

        I feel algorithms are pure science, a separate discipline.
        But to implement them at large scale, scientists/programmers/architects need to work in tandem with minimal overlap but continuous discussion.

        Last but not least, if I make some sense, what learning curve should I follow to try to be a data architect for unstructured data in general?

        regards
        sasanka ghosh

        • Jason Brownlee December 22, 2016 at 6:35 am #

          Really this depends on the industry and the job. I cannot give you good advice for the general case.

          You can get valuable results without being an expert, this applies to most fields.

          Algorithms are a tool, use them as such. They can also be a science, but we practitioners don’t have the time.

          I hope that helps.

          • Sasanka ghosh December 22, 2016 at 7:00 pm #

            Thanks Jason.

            I appreciate your time and response .

            I just wanted to validate this with a real techie/guru like you, as the confusion, or lack of a perfect answer, is being exploited by management/HR to their own advantage, practicing a use-and-throw policy and making people sycophants/redundant without following basic management principles.

            The tech guys, except a few geniuses, are always toiling, while management is popping the cork and enjoying at the same time.

            Regards
            sasanka ghosh

  31. Raveen Sachintha December 21, 2016 at 8:51 pm #

    Hello Jason,
    Thank you very much for these tutorials. I am new to ML and I find it very encouraging to do an end-to-end project to get started with, rather than reading and reading without seeing an end. This really helped me.

    One question: when I tried this, I got the highest accuracy for SVM.

    LR: 0.966667 (0.040825)
    LDA: 0.975000 (0.038188)
    KNN: 0.983333 (0.033333)
    CART: 0.983333 (0.033333)
    NB: 0.975000 (0.053359)
    SVM: 0.991667 (0.025000)

    So I decided to try that out too.

    svm = SVC()
    svm.fit(X_train, Y_train)
    prediction = svm.predict(X_validation)

    these were my results using SVM,

    0.933333333333
    [[ 7 0 0]
    [ 0 10 2]
    [ 0 0 11]]
    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 7
    Iris-versicolor 1.00 0.83 0.91 12
    Iris-virginica 0.85 1.00 0.92 11

    avg / total 0.94 0.93 0.93 30

    I am still learning to read these results, but can you tell me why this happened? Why did I get higher accuracy for SVM instead of KNN? Have I done anything wrong, or is this possible?

    • Jason Brownlee December 22, 2016 at 6:33 am #

      The results reported are a mean estimated score with some variance (spread).

      It is an estimate on the performance on new data.

      When you apply the method on new data, the performance may be in that range. It may be lower if the method has overfit the training data.

      Overfitting is a challenge and developing a robust test harness to ensure we don’t fool/mislead ourselves during model development is important work.

      I hope that helps as a start.

  32. inzar December 25, 2016 at 7:04 am #

    I want to buy your book.
    I tried this tutorial and the result is very awesome.

    I want to learn from you.

    thanks….

  33. lou December 25, 2016 at 7:29 am #

    Why the leading comma in X = array[:,0:4]?

  34. Thinh December 26, 2016 at 5:05 am #

    In section 1.2, you should warn readers to install scikit-learn.

    • Jason Brownlee December 26, 2016 at 7:49 am #

      Thanks for the note.

      Please see section 1.1 Install SciPy Libraries where it says:


      There are 5 key libraries that you will need to install… sklearn

  35. Tijo L. Peter December 28, 2016 at 10:34 pm #

    Best ML tutorial for Python. Thank you, Jason.

  36. baso December 29, 2016 at 12:38 am #

    When I tried to run it, I got the error message “TypeError: Empty 'DataFrame': no numeric data to plot”. Help me!

    • Jason Brownlee December 29, 2016 at 7:18 am #

      Sorry to hear that.

      Perhaps check that you have loaded the data as you expect and that the loaded values are numeric and not strings. Perhaps print the first few rows: print(df.head(5))
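
      For example (a small sketch, assuming the loaded DataFrame is called df):

      print(df.dtypes)    # the feature columns should be numeric, not object/string
      print(df.head(5))   # eyeball the first few rows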

      • baso December 29, 2016 at 1:05 pm #

        thanks very much Jason for your time

        It worked. This tutorial helps me a lot; I'm new to machine learning. But could you explain your simple project above? Because I did not see X_test and the target.

        Regards in advance

  37. Andrea January 5, 2017 at 1:42 am #

    Thank you for sharing this. I bumped into some installation problems.
    Eventually, to get all dependencies installed on MacOS 10.11.6, I had to run this:

    brew install python
    pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose scikit-learn
    export PATH=$PATH:~/Library/Python/2.7/bin

    • Jason Brownlee January 5, 2017 at 9:21 am #

      Thanks for sharing Andrea.

      I’m a macports guy myself, here’s my recipe:

  38. Sohib January 6, 2017 at 6:26 pm #

    Hi Jason,
    I am following this page as a beginner and have installed Anaconda as recommended.
    As I am on win 10, I installed Anaconda 4.2.0 For Windows Python 2.7 version (x64) and
    I am using Anaconda’s Spyder (python 2.7) IDE.

    I checked all the versions of libraries (as shown in 1.2 Start Python and Check Versions) and got results like below:

    Python: 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
    scipy: 0.18.1
    numpy: 1.11.1
    matplotlib: 1.5.3
    pandas: 0.18.1
    sklearn: 0.17.1

    At the 2.1 Import libraries section, I imported all of them and tried to load data as shown in
    2.2 Load Dataset. But when I run it, it doesn’t show an output, instead, there is an error:

    Traceback (most recent call last):
    File “C:\Users\gachon\.spyder\temp.py”, line 4, in
    from sklearn import model_selection
    ImportError: cannot import name model_selection

    Below is my code snippet:

    import pandas
    from pandas.tools.plotting import scatter_matrix
    import matplotlib.pyplot as plt
    from sklearn import model_selection
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    url = “https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data”
    names = [‘sepal-length’, ‘sepal-width’, ‘petal-length’, ‘petal-width’, ‘class’]
    dataset = pandas.read_csv(url, names=names)
    print(dataset.shape)

    When I delete the “from sklearn import model_selection” line, I get the expected result (150, 5).

    Am I missing something here?

    Thank you for your time and endurance!

    • Jason Brownlee January 7, 2017 at 8:23 am #

      Hi Sohib,

      You must have scikit-learn version 0.18 or higher installed.

      Perhaps Anaconda has documentation on how to update sklearn?

      • Sohib January 10, 2017 at 12:15 pm #

        Thank you for reply.

        I updated scikit-learn version to 0.18.1 and it helped.
        The error disappeared, the result is shown, but one statement

        ‘import sitecustomize’ failed; use -v for traceback

        is printed above the result.
        I tried to find out why, but apparently I cannot find the reason.
        Is it going to be a problem in my further steps?
        How do I solve this?

        Thank you in advance!

        • Jason Brownlee January 11, 2017 at 9:25 am #

          I’m glad to hear it fixed your problem.

          Sorry, I don’t know what “import sitecustomize” is or why you need it.

  39. Vishakha January 7, 2017 at 10:10 pm #

    Can I get the same tutorial with Java?

  40. Abhinav January 8, 2017 at 8:27 pm #

    Hi Jason,

    Nice tutorial.

    In the univariate plots, you mentioned the Gaussian distribution.

    According to the univariate plots, sepal-width has a Gaussian distribution. You said there are two variables with a Gaussian distribution. Please tell me the other.

    Thanks

    • Jason Brownlee January 9, 2017 at 7:49 am #

      The distribution of the others may be multi-modal. Perhaps a double Gaussian.

  41. Thinh January 13, 2017 at 5:07 am #

    Hi Jason. Could you please tell me the reason why you chose KNN in the example above?

    • Jason Brownlee January 13, 2017 at 9:16 am #

      Hi Thinh,

      No reason other than it is an easy algorithm to run and understand and good algorithm for a first tutorial.

  42. Scott P January 13, 2017 at 10:25 pm #

    Hi Jason,

    I'm trying to use this code with the KDD Cup '99 dataset, and I am having trouble label-encoding my dataset into numerical values.

    #Modules
    import pandas
    import numpy
    from pandas.tools.plotting import scatter_matrix
    import matplotlib.pyplot as plt
    from sklearn import preprocessing
    from sklearn import cross_validation
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.preprocessing import LabelEncoder
    #new
    from collections import defaultdict
    #

    #Load KDD dataset
    data_set = “NSL-KDD/KDDTrain+.txt”
    names = [‘duration’,’protocol_type’,’service’,’flag’,’src_bytes’,’dst_bytes’,’land’,’wrong_fragment’,’urgent’,’hot’,’num_failed_logins’,’logged_in’,’num_compromised’,’su_attempted’,’num_root’,’num_file_creations’,
    ‘num_shells’,’num_access_files’,’num_outbound_cmds’,’is_host_login’,’is_guest_login’,’count’,’srv_count’,’serror_rate’,’srv_serror_rate’,’rerror_rate’,’srv_rerror_rate’,’same_srv_rate’,’diff_srv_rate’,’srv_diff_host_rate’,
    ‘dst_host_count’,’dst_host_srv_count’,’dst_host_same_srv_rate’,’dst_host_diff_srv_rate’,’dst_host_same_src_port_rate’,’dst_host_srv_diff_host_rate’,’dst_host_serror_rate’,’dst_host_srv_serror_rate’,’dst_host_rerror_rate’,
    ‘dst_host_srv_rerror_rate’,’class’]

    #Diabetes Dataset
    #data_set = “Datasets/pima-indians-diabetes.data”
    #names = [‘preg’, ‘plas’, ‘pres’, ‘skin’, ‘test’, ‘mass’, ‘pedi’, ‘age’, ‘class’]
    #data_set = “Datasets/iris.data”
    #names = [‘sepal_length’,’sepal_width’,’petal_length’,’petal_width’,’class’]
    dataset = pandas.read_csv(data_set, names=names)

    array = dataset.values
    X = array[:,0:40]
    Y = array[:,40]

    label_encoder = LabelEncoder()
    label_encoder = label_encoder.fit(Y)
    label_encoded_y = label_encoder.transform(Y)

    validation_size = 0.20
    seed = 7
    X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(X, label_encoded_y, test_size=validation_size, random_state=seed)

    # Test options and evaluation metric
    num_folds = 7
    num_instances = len(X_train)
    seed = 7
    scoring = ‘accuracy’

    # Algorithms
    models = []
    models.append((‘LR’, LogisticRegression()))
    models.append((‘LDA’, LinearDiscriminantAnalysis()))
    models.append((‘KNN’, KNeighborsClassifier()))
    models.append((‘CART’, DecisionTreeClassifier()))
    models.append((‘NB’, GaussianNB()))
    models.append((‘SVM’, SVC()))

    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = “%s: %f (%f)” % (name, cv_results.mean()*100, cv_results.std()*100)#multiplying by 100 to show percentage
    print(msg)

    # Compare Algorithms
    fig = plt.figure()
    fig.suptitle(‘Algorithm Comparison’)
    ax = fig.add_subplot(111)
    plt.boxplot(results)
    ax.set_xticklabels(Y)
    plt.show()

    Am I doing something wrong with the LabelEncoding process?

    • MegO_Bonus June 4, 2017 at 7:15 pm #

      Hi. Change all symbols like “ to " and ’ to '. LabelEncoder will then work correctly, but not the whole network. I am trying to create a neural network for NSL-KDD too. Do you have any good examples?

    • Rajnish July 17, 2019 at 8:21 am #

      How come it is concluded that the KNN algorithm is the accurate model, when the mean value for the SVM algorithm is closer to 1 in comparison to KNN?

      • Jason Brownlee July 17, 2019 at 8:32 am #

        Either algorithm would be effective on the dataset.

  43. Dan January 14, 2017 at 4:56 am #

    Hi, I’m running a bit of a different setup than yours.

    The modules and version of python I’m using are more recent releases:

    Python: 3.5.2 |Anaconda 4.2.0 (32-bit)| (default, Jul 5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]
    scipy: 0.18.1
    numpy: 1.11.3
    matplotlib: 1.5.3
    pandas: 0.19.2
    sklearn: 0.18.1

    And I’ve gotten SVM as the best algorithm in terms of accuracy at 0.991667 (0.025000).

    Would you happen to know why this is, considering more recent versions?

    I also happened to get a rather different boxplot but I’ll leave it at what I’ve said thus far.

  44. Duncan Carr January 17, 2017 at 1:44 am #

    Hi Jason

    I can’t tell you how grateful I am … I have been trawling through lots of ML stuff to try to get started with a “toy” example. Finally I have found the tutorial I was looking for. Anaconda had old sklearn: 0.17.1 for Windows – which caused an error “ImportError: cannot import name ‘model_selection'”. That was fixed by running “pip install -U scikit-learn” from the Anaconda command-line prompt. Now upgraded to 0.18. Now everything in your imports was fine.

    All other tutorials were either too simple or too complicated. Usually the latter!

    Thank you again 🙂

    • Jason Brownlee January 17, 2017 at 7:39 am #

      Glad to hear it Duncan.

      Thanks for the tip for Anaconda uses.

      I’m here to help if you have questions!

  45. Malathi January 17, 2017 at 3:13 am #

    Hi Jason,

    Wonderful service. All of your tutorials are very helpful
    to me. Easy to understand.

    Expecting more tutorials on deep neural networks.

    Malathi

    • Jason Brownlee January 17, 2017 at 7:40 am #

      You’re very welcome Malathi, glad to hear it.

  46. Duncan Carr January 17, 2017 at 7:32 pm #

    Hi Jason

    I managed to get it all working – I am chuffed to bits.

    I get exactly the same numbers in the classification report as you do … however, when I changed both seeds to 8 (from 7), then ALL of the numbers end up being 1. Is this good, or bad? I am a bit confused.

    Thanks again.

    • Jason Brownlee January 18, 2017 at 10:14 am #

      Well done Duncan!

      What do you mean all the numbers end up being one?

  47. Duncan Carr January 18, 2017 at 8:02 pm #

    Hi Jason

    I’ve output the “accuracy_score”, “confusion_matrix” & “classification_report” for seeds 7, 9 & 10. Why am I getting a perfect score with seed=9? Many thanks.

    (seed=7)

    0.9

    [[10 0 0]
    [ 0 8 1]
    [ 0 2 9]]

    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 10
    Iris-versicolor 0.80 0.89 0.84 9
    Iris-virginica 0.90 0.82 0.86 11

    avg / total 0.90 0.90 0.90 30

    (seed=9)

    1.0

    [[13 0 0]
    [ 0 9 0]
    [ 0 0 8]]

    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 13
    Iris-versicolor 1.00 1.00 1.00 9
    Iris-virginica 1.00 1.00 1.00 8

    avg / total 1.00 1.00 1.00 30

    (seed=10)

    0.9666666666666667

    [[10 0 0]
    [ 0 12 1]
    [ 0 0 7]]

    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 10
    Iris-versicolor 1.00 0.92 0.96 13
    Iris-virginica 0.88 1.00 0.93 7

    avg / total 0.97 0.97 0.97 30

  48. shivani January 20, 2017 at 8:40 pm #

    from sklearn import model_selection
    is showing ImportError: cannot import model_selection.

    • Jason Brownlee January 21, 2017 at 10:25 am #

      You need to update your version of sklearn to 0.18 or higher.

  49. Jim January 22, 2017 at 5:06 pm #

    Jason

    Excellent tutorial. I am new to Python and set a New Year's resolution to try to understand ML. This tutorial was a great start.

    I struck the issue of the sklearn version. I am using Ubuntu 16.04LTS which comes with python-sklearn version 0.17. To update to latest I used the site:
    http://neuro.debian.net/install_pkg.html?p=python-sklearn
    Which gives the commands to add the neuro repository and pull down the 0.18 version.

    Also I would like to note there is an error in section 3.1 Dimensions of the Dataset. Your text states 120 Instances when in fact 150 are returned, which you have in the Printout box.

    Keep up the good work.

    Jim

    • Jason Brownlee January 23, 2017 at 8:37 am #

      I’m glad to hear you worked around the version issue Jim, nice work!

      Thanks for the note on the typo, fixed!

  50. Raphael January 23, 2017 at 4:15 pm #

    Hi Jason, nice work here. I'm new to your blog. What does the y-axis in the box plots represent?

    • Jason Brownlee January 24, 2017 at 11:01 am #

      Hi Raphael,

      The y-axis in the box-and-whisker plots is the scale or distribution of each variable.

  51. Kayode January 23, 2017 at 8:42 pm #

    Thank you for this wonderful tutorial.

  52. Raphael January 26, 2017 at 2:28 am #

    hi Jason,

    In this line

    dataset.groupby(‘class’).size()

    What other variable than size could I use? I replaced size with count and got something similar, but not quite. I got key errors for the other things I tried. Is size just a standard command?

  53. Scott January 26, 2017 at 10:35 pm #

    Jason,

    I'm trying to use a different dataset (KDD Cup '99) with the above code, but when I try to run the code after modifying “names” and the array to account for the new features, it will not run, giving me the error “could not convert string to float”.

    In my dataset there are 3 columns that are text and the rest are integers and floats. I have tried label encoding, but it gives me the same error. Do you know how I can resolve this?

    • Jason Brownlee January 27, 2017 at 12:08 pm #

      Hi Scott,

      If the values are indeed strings, perhaps you can use a method that supports strings instead of numbers, perhaps like a decision tree.

      If there are only a few string values for the column, a label encoding as integers may be useful.

      Alternatively, perhaps you could try removing those string features from the dataset.

      I hope that helps, let me know how you go.
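
      A hedged sketch of that label-encoding option, converting just the string (object) columns of a pandas DataFrame (assumed here to be named dataset, as in the code above):

      from sklearn.preprocessing import LabelEncoder

      for col in dataset.columns:
          if dataset[col].dtype == object:
              dataset[col] = LabelEncoder().fit_transform(dataset[col])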

  54. Weston Gross January 31, 2017 at 10:41 am #

    I would like a chart to see the grand scope of everything that Python can do for data science.

    You list 6 basic steps. For example, in the visualization step, I would like to know what all the charts are, what they are used for, and which Python library each comes from.

    I am extremely new to all this, and understand that some steps have to happen, for example:

    1. Get Data
    2. Validate Data
    3. Missing Data
    4. Machine Learning
    5. Display Findings

    So for missing data, there are techniques to restore the data; what are they, and what libraries are used?

    • Jason Brownlee February 1, 2017 at 10:36 am #

      You can handle missing data in a few ways such as:

      1. Remove rows with missing data.
      2. Impute missing data (e.g. use the Imputer class in sklearn)
      3. Use methods that support missing data (e.g. decision trees)

      I hope that helps.
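
      A hedged sketch of options 1 and 2 (the Imputer class mentioned above has since been renamed SimpleImputer in scikit-learn 0.20+):

      import numpy as np
      import pandas as pd
      from sklearn.impute import SimpleImputer

      df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
      dropped = df.dropna()                                        # 1. remove rows with missing values
      imputed = SimpleImputer(strategy='mean').fit_transform(df)   # 2. fill with column means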

  55. Mohammed February 1, 2017 at 1:11 am #

    Hi Jason,

    I am a non-tech data analyst and have used SPSS extensively on academic/business data over the last 6 years.

    I understand the above example very easily.

    I want to work on Search – Language Translation and develop apps.

    What's the best way forward…

    Do you also provide Skype training / project mentoring?

    Thanks in advance.

    • Jason Brownlee February 1, 2017 at 10:51 am #

      Thanks Mohammed.

      Sorry, I don’t have good advice for language translation applications.

  56. Mohammed February 1, 2017 at 1:14 am #

    I don't have any development/coding background.

    However, following your guidelines, I downloaded SciPy and tested the code.

    Everything worked perfectly fine.

    Looking forward to going all in…

  57. Purvi February 1, 2017 at 7:31 am #

    Hi Jason,

    I am new to machine learning and am trying out the tutorial. I have the following environment:

    >>> import sys
    >>> print(‘Python: {}’.format(sys.version))
    Python: 2.7.10 (default, Jul 13 2015, 12:05:58)
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
    >>> import scipy
    >>> print(‘scipy: {}’.format(scipy.__version__))
    scipy: 0.18.1
    >>> import numpy
    >>> print(‘numpy: {}’.format(numpy.__version__))
    numpy: 1.12.0
    >>> import matplotlib
    >>> print(‘matplotlib: {}’.format(matplotlib.__version__))
    matplotlib: 2.0.0
    >>> import pandas
    >>> print(‘pandas: {}’.format(pandas.__version__))
    pandas: 0.19.2
    >>> import sklearn
    >>> print(‘sklearn: {}’.format(sklearn.__version__))
    sklearn: 0.18.1

    When I try to load the iris dataset, it loads up fine and prints dataset.shape, but then my Python interpreter hangs. I tried it 3-4 times, and every time it hangs after I run a couple of commands on the dataset.
    >>> url = “https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data”
    >>> names = [‘sepal-length’, ‘sepal-width’, ‘petal-length’, ‘petal-width’, ‘class’]
    >>> dataset = pandas.read_csv(url, names=names)
    >>> print(dataset.shape)
    (150, 5)
    >>> print(dataset.head(20))
    sepal-length sepal-width petal-length petal-width class
    0 5.1 3.5 1.4 0.2 Iris-setosa
    1 4.9 3.0 1.4 0.2 Iris-setosa
    2 4.7 3.2 1.3 0.2 Iris-setosa
    3 4.6 3.1 1.5 0.2 Iris-setosa
    4 5.0 3.6 1.4 0.2 Iris-setosa
    5 5.4 3.9 1.7 0.4 Iris-setosa
    6 4.6 3.4 1.4 0.3 Iris-setosa
    7 5.0 3.4 1.5 0.2 Iris-setosa
    8 4.4 2.9 1.4 0.2 Iris-setosa
    9 4.9 3.1 1.5 0.1 Iris-setosa
    10 5.4 3.7 1.5 0.2 Iris-setosa
    11 4.8 3.4 1.6 0.2 Iris-setosa
    12 4.8 3.0 1.4 0.1 Iris-setosa
    13 4.3 3.0 1.1 0.1 Iris-setosa
    14 5.8 4.0 1.2 0.2 Iris-setosa
    15 5.7 4.4 1.5 0.4 Iris-setosa
    16 5.4 3.9 1.3 0.4 Iris-setosa
    17 5.1 3.5 1.4 0.3 Iris-setosa
    18 5.7 3.8 1.7 0.3 Iris-setosa
    19 5.1 3.8 1.5 0.3 Iris-setosa
    >>> print(datase

    It does not let me type anything further.
    I would appreciate your help.

    Thanks,
    Purvi

    • Jason Brownlee February 1, 2017 at 10:55 am #

      Hi Purvi, sorry to hear that.

      Perhaps you’re able to comment out the first parts of the tutorial and see if you can progress?

  58. sam February 5, 2017 at 9:24 am #

    Hi Jason

    I am planning to use Python to predict customer attrition. I have a current list of attrited customers with their attributes. I would like to use them as test data and use them to predict for any new customers. Can you please help me approach the problem in Python?

    my test data :

    customer1 attribute1 attribute2 attribute3 … attrited

    my new data

    customer N, attribute 1,…… ?

    Thanks for your help in advance.

  59. Kiran Prajapati February 7, 2017 at 6:31 pm #

    Hello Sir, I want to check how accurate my data is (what %). In my data, I have 4 columns:

    Taluka , Total_yield, Rain(mm) , types_of soil

    Nasik 12555 63.0 dark black
    Igatpuri 1560 75.0 shallow

    So on,
    First, I have to check whether the data is accurate or not, and the next step is to find the predicted yield using a regression model.
    Here is my model: Total_yield = Rain + types_of_soil

    I use 0 and 1 binary variable for types_of soil.

    Can you please help me: how do I calculate whether the data is accurate, and to what %?
    And how do I find the predicted yield?

  60. Saby February 15, 2017 at 9:11 am #

    # Load dataset
    url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
    names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
    dataset = pandas.read_csv(url, names=names)

    The dataset should load without incident.

    If you do have network problems, you can download the iris.data file into your working directory and load it using the same method, changing url to the local file name.

    I am a very beginner Python learner (trying to learn ML as well). I tried to load the data from my local file but could not succeed. Will you help me out with how exactly the code should be written to open the data from a local file?

    • Jason Brownlee February 15, 2017 at 11:39 am #

      Sure.

      Download the file as iris.data into your current working directory (where your python file is located and where you are running the code from).

      Then load it as:
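
      Something like this (a sketch, assuming the file was saved as iris.data in the working directory; the original snippet was stripped from this page):

      import pandas
      names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
      dataset = pandas.read_csv('iris.data', names=names)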

  61. ant February 15, 2017 at 9:54 pm #

    Hi Jason, first of all thanks so much for this amazing lesson.

    Just out of curiosity, I have computed all the values obtained with dataset.describe() in Excel, and for the 25% value of petal-length I get 1.57500 instead of 1.60000. I have googled for formatting describe() output unsuccessfully. Is there an explanation? Thanks

    • Jason Brownlee February 16, 2017 at 11:07 am #

      Not sure, perhaps you could look into the Pandas source code?
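
      (One possible explanation, offered here as an assumption rather than something confirmed in the thread: pandas and Excel can use different quantile conventions. pandas' describe() uses linear interpolation, the same as numpy's default:)

      import numpy as np
      data = [1.0, 1.4, 1.5, 1.6, 5.1]    # made-up values
      print(np.percentile(data, 25))       # describe()'s 25% matches this 'linear' method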

      • ant February 17, 2017 at 12:23 am #

        OK, I will do.

  62. jacques February 16, 2017 at 4:42 pm #

    HI Jason

    I don't quite follow the KFold section.

    We started off with 150 data entries (rows).

    We then use an 80/20 split for training/validation, which leaves us with 120.

    The split of 10 boggles me. Does it take 10 items from each class and train with 9?
    What does the 1 left over do then?

    • Jason Brownlee February 17, 2017 at 9:52 am #

      Hi jacques,

      The 120 records are split into 10 folds. The model is trained on the first 9 folds and evaluated on the records in the 10th. This is repeated so that each fold is given a chance to be the hold out set. 10 models are trained, 10 scores collected and we report the mean of those scores as an estimate of the performance of the model on unseen data.

      Does that help?
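
      In code, the splitting looks something like this (a sketch with 120 dummy rows):

      import numpy as np
      from sklearn.model_selection import KFold

      X = np.arange(120).reshape(120, 1)
      for train_idx, test_idx in KFold(n_splits=10).split(X):
          print(len(train_idx), len(test_idx))   # 108 to train on, 12 held out, ten times over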

  63. Alhassan February 17, 2017 at 4:02 pm #

    I am trying to integrate machine learning into a PHP website I have created. Is there any way I can do that using the guidelines you provided above?

    • Jason Brownlee February 18, 2017 at 8:34 am #

      I have not done this Alhassan.

      Generally, I would advise developing a separate service that could be called using REST calls or similar.

      If you are working on a prototype, you may be able to call out to a program or script from cgi-bin, but this would require careful engineering to be secure in a production environment.

  64. Simão Gonçalves February 20, 2017 at 1:27 am #

    Hi Jason! This tutorial was a great help; I'm truly grateful for it, so thank you.

    I have one question about the tutorial though: in the scatterplot matrix, I can't understand how we make the dots in the graphs whose variables have no relationship between them (like sepal-length with petal-width).

    Could you or someone explain that please? How do you make a dot that represents the relationship between a certain sepal-length and a certain petal-width?

    • Jason Brownlee February 20, 2017 at 9:30 am #

      Hi Simão,

      The x-axis is taken for the values of the first variable (e.g. sepal_length) and the y-axis is taken for the second variable (e.g. petal_width).

      Does that help?

    • Yopo February 21, 2017 at 4:35 am #

      You match each iris instance's length and width with each other. For example, iris instance number one is represented by a dot, and the dot's values are the iris length and width! So when you take all these values and put them on a graph, you are basically checking to see if there is a relation. As you can see, in some of these plots the dots are scattered all around, but the petal width – petal length graph looks linear! This means that those two properties are clearly related. Hope this helped!

  65. Sébastien February 20, 2017 at 9:34 pm #

    Hi Jason,

    from France and just to say you “Thank you for this very clear tutorial!”

    Sébastien

  66. Raj February 27, 2017 at 2:53 am #

    Hi Jason,
    I am new to ML & Python. Your post is encouraging and straight to the point of execution. Anyhow, I am facing the below error:

    >>> validataion_size = 0.20
    >>> X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state = seed)
    Traceback (most recent call last):
    File “”, line 1, in
    NameError: name ‘validation_size’ is not defined

    What could be the miss out? I din’t get any errors in previous steps.

    My Environment details:
    OS: Windows 10
    Python : 3.5.2
    scipy : 0.18.1
    numpy : 1.11.1
    sklearn : 0.18.1
    matplotlib : 0.18.1

    • Jason Brownlee February 27, 2017 at 5:54 am #

      Hi Raj,

      Double check you have the code from section “5.1 Create a Validation Dataset” where validation_size is defined.

      I hope that helps.

  67. Roy March 2, 2017 at 7:38 am #

    Hey Jason,

    Can you please explain what precision, recall, f1-score, and support actually refer to?
    Also, what do the numbers in a confusion matrix refer to?
    [[ 7 0 0]
    [ 0 11 1]
    [ 0 2 9]]
    Thanks.

  68. santosh March 3, 2017 at 7:29 am #

    What code should I use to load data from my working directory?

  69. David March 7, 2017 at 8:27 am #

    Hi Jason,

    I have a ValueError and I don't know how I can solve this problem.

    My problem is like this:

    ValueError: could not convert string to float: ‘2013-06-27 11:30:00.0000000’

    Can you give some information about fixing this problem?

    Thank you

    • Jason Brownlee March 7, 2017 at 9:39 am #

      It looks like you are trying to load a date-time. You might need to write a custom function to parse the date-time when loading or try removing this column from your dataset.

  70. Saugata De March 8, 2017 at 6:11 am #

    >>> for name, model in models:
    … kfold=model_selection.Kfold(n_splits=10, random_state=seed)
    … cv_results =model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    … results.append(cv_results)
    … names.append(name)
    … msg=”%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
    … print(msg)

    After typing this piece of code, it is giving me this error. Can you please help me out, Jason? Since I am new to ML, I don't have much idea about the error.

    Traceback (most recent call last):
    File “”, line 2, in
    AttributeError: module ‘sklearn.model_selection’ has no attribute ‘Kfold’

    • Asad Ali July 23, 2017 at 12:59 pm #

      The KFold function is case-sensitive. It is "model_selection.KFold(…)", not "model_selection.Kfold(…)".
      Update this line:
      kfold = model_selection.KFold(n_splits=10, random_state=seed)

      • ibtssam February 12, 2018 at 9:17 pm #

        THANK U

  71. Ojas March 10, 2017 at 10:58 am #

    Hello Jason ,
    Thanks for writing such a nice and explanatory article for beginners like me, but I have one concern. I tried finding this out on other websites as well but could not come up with any solution.
    Whatever I am writing inside the code editor (Jupyter QtConsole in my case), can this not be saved as a .py file and shared with my other members over GitHub, maybe? I found some hacks, but I think there must be some proper way of sharing the code written in the editor, like without the outputs or plots in between.

    • Jason Brownlee March 11, 2017 at 7:55 am #

      You can write Python code in a text editor and save it as a myfile.py file. You can then run it on the command line as follows:
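
      Something like (the original snippet was stripped from this page, but it was presumably just):

      python myfile.py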

      Consider picking up a book on Python.

  72. manoj maracheea March 11, 2017 at 9:37 pm #

    Hello Jason,

    Nice tutorials I done this today.

    I didn't really understand everything. { I will follow your advice, do it again, write all the questions down, and use the help function. }

    The tutorial just works. It took me around 2 hours, typing every single line,
    installing all the dependencies, and running each block to check.

    Thanks, I'll be visiting your blog from time to time.

    Regards,

    • Jason Brownlee March 12, 2017 at 8:23 am #

      Well done, and thanks for your support.

      Post any questions you have as comments or email me using the “contact” page.

  73. manoj maracheea March 11, 2017 at 9:38 pm #

    I am just a beginner too; I am using Visual Studio Code.

    Looks good.

  74. Vignesh R March 13, 2017 at 9:59 pm #

    What exactly is a confusion matrix?

  75. Dan R. March 14, 2017 at 7:09 am #

    Can I ask what the reason for this problem is? Thanks for your answer 🙂
    (My code is just the section where I import all the needed libraries.)
    I have all libraries up to date, but it still gives me this error:

    File “C:\Users\64dri\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py”, line 32, in
    from ..utils.fixes import rankdata

    ImportError: cannot import name ‘rankdata’

    ( scipy: 0.18.1
    numpy: 1.11.1
    matplotlib: 1.5.3
    pandas: 0.18.1
    sklearn: 0.17.1)

    • Jason Brownlee March 14, 2017 at 8:31 am #

      Sorry, I have not seen this issue Dan, consider searching or posting to StackOverflow.

  76. Avatar
    Cameron March 15, 2017 at 5:28 am #

    Jason,

    You’re a rockstar, thank you so much for this tutorial and for your books! It’s been hugely helpful in getting me started on machine learning. I was curious, is it possible to add a non-number property column, or will the algorithms only accept numbers?

    For example, if there were a “COLOR” column in the iris dataset, and all Iris-setosa were blue, how could I get this program to accept and process that COLOR column? I’ve tried a few things and they all seem to fail.

    • Avatar
      Jason Brownlee March 15, 2017 at 8:16 am #

      Great question Cameron!

      sklearn requires all input data to be numbers.

      You can encode labels like colors as integers and model that.

      Further, you can convert the integers to a binary/one-hot encoding, which may be more suitable if there is no ordinal relationship between the labels.
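
      For example, a minimal sketch with pandas (the COLOR values here are made up):

      import pandas as pd
      # hypothetical categorical column
      df = pd.DataFrame({'COLOR': ['blue', 'red', 'blue', 'green']})
      # integer encoding: one code per distinct color
      df['COLOR_CODE'] = df['COLOR'].astype('category').cat.codes
      # one-hot encoding: one binary column per color
      print(pd.get_dummies(df['COLOR'], prefix='COLOR'))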

      • Avatar
        Cameron March 15, 2017 at 2:19 pm #

        Jason, thanks so much for replying! That makes a lot of sense. When you say binary/one-hot encoding, I assume you mean (continuing with the colors example) adding a column for each color (R, O, Y, G, B, V) and, for each flower, putting a 1 in the column of its color and a 0 in all of the other color columns?
        That’s feasible for 6 colors (adding six columns), but how would I manage if I wanted to choose between 100 or 1000 colors? Are there other libraries that could help deal with that?

  77. Avatar
    James March 19, 2017 at 6:54 am #

    for name, model in models:
    ... kfold = cross_vaalidation.KFold(n=num_instances,n_folds=num_folds,random_state=seed)
    ... cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    File "<stdin>", line 3
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    ^
    SyntaxError: invalid syntax
    >>> cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    NameError: name 'model' is not defined
    >>> cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    NameError: name 'kfold' is not defined
    >>> cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    NameError: name 'kfold' is not defined
    >>> names.append(name)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    NameError: name 'name' is not defined

    I am new to Python and got these errors after running the section 5.3 models. Please help me.

    • Avatar
      Jason Brownlee March 19, 2017 at 9:12 am #

      It looks like you might not have copied all of the code required for the example.

  78. Avatar
    Mier March 20, 2017 at 10:26 am #

    Hi, I went through your tutorial. It is super great!
    I wonder whether you can recommend a dataset similar to the iris dataset for me to practice on?

  79. Avatar
    Medine H. March 23, 2017 at 2:56 am #

    Hi Jason,

    That’s an amazing tutorial, quite clear and useful.

    Thanks a bunch!

  80. Avatar
    Sean March 23, 2017 at 9:54 am #

    Hi Jason,

    Can you let me know how I can start with fraud detection algorithms for a retail website?

    Thanks,
    Sean

  81. Avatar
    Raja March 24, 2017 at 11:08 am #

    You are doing great with your work.

    I need your suggestion: I am working on my thesis, where I need to use machine learning.
    Training: positive, negative, others
    Test: unknown data
    I want to train the machine on the training data and test it on unknown data using SVM, naive Bayes, and KNN.

    How should I format the training and test data?
    And how do I use those algorithms on it
    so that I can get the TP, TN, FP, and FN?
    Thanking you..

  82. Avatar
    Sey March 26, 2017 at 12:38 am #

    I’m new to machine learning and this was a really helpful tutorial. I have a maybe-stupid question: I wanted to plot the predictions against the validation values to make a visual comparison, but it doesn’t seem like I really understood how to plot it.
    Can you please send me a piece of code, with some explanations, for doing this?

    thank you very much

    • Avatar
      Jason Brownlee March 26, 2017 at 6:13 am #

      You can use matplotlib, for example:
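
      A minimal sketch (assuming the predictions and Y_validation arrays from the tutorial are in scope):

      import matplotlib.pyplot as plt
      # plot actual and predicted class labels side by side, by sample index
      plt.plot(Y_validation, 'bo', label='actual')
      plt.plot(predictions, 'rx', label='predicted')
      plt.legend()
      plt.show()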

  83. Avatar
    Kamol Roy March 26, 2017 at 7:25 am #

    Thanks a lot. It was very helpful.

  84. Avatar
    Rajneesh March 29, 2017 at 11:31 pm #

    Hi

    Sorry for a dumb question.

    Can you briefly describe what the end result means (i.e., what the program has predicted)?

    • Avatar
      Jason Brownlee March 30, 2017 at 8:53 am #

      Given an input description of flower measurements, what species of flower is it?

      We are predicting the iris flower species as one of 3 known species.

  85. Avatar
    Anusha Vidapanakal March 30, 2017 at 3:58 am #

    LR: 0.966667 (0.040825)
    LDA: 0.975000 (0.038188)
    KNN: 0.983333 (0.033333)
    CART: 0.975000 (0.038188)
    NB: 0.975000 (0.053359)
    SVM: 0.991667 (0.025000)

    Why am I getting the highest accuracy for SVM?

    I’m a beginner; there was a similar query above but I couldn’t quite understand your reply.

    Could you please help me out? Have I made a mistake somewhere?

    • Avatar
      Jason Brownlee March 30, 2017 at 8:56 am #

      “Why” is a very hard question to answer.

      Our role is to find what works, ensure the results are robust, then figure out how we can use the model operationally.

      • Avatar
        Anusha Vidapanakal March 30, 2017 at 11:33 pm #

        Okay. Thanks a lot for the prompt response!

        The tutorial was very helpful.

  86. Avatar
    Vinay March 31, 2017 at 11:10 pm #

    Great tutorial Jason!
    My question is, if I want some new data from a user, how do I do that? If in future I develop my own machine learning algorithm, how do I use it to get some new data?
    What steps are taken to develop it?
    And thanks for this tutorial.

    • Avatar
      Jason Brownlee April 1, 2017 at 5:56 am #

      Not sure I understand. Collect new data from your domain and store it in a CSV or write code to collect it.

  87. Avatar
    walid barakeh April 2, 2017 at 6:31 pm #

    Hi Jason,
    I have a question regarding the step after training: once we know the better algorithm for our case, how can we get the rule/formula that the algorithm produced, for future use?

    And thanks for the tutorial, it’s really helpful.

    • Avatar
      Jason Brownlee April 4, 2017 at 9:06 am #

      You can extract the weights if you like. I’m not sure I understand why you want the formula for the model; it would be complex and generally unreadable.

      You can finalize the model and save the weights and topology for later use if you like.
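
      For example, a minimal persistence sketch using pickle (assumes a fitted model named knn, as in the tutorial):

      import pickle
      # save the fitted model to disk ('model.pkl' is an arbitrary file name)
      with open('model.pkl', 'wb') as f:
          pickle.dump(knn, f)
      # load it back later and make predictions
      with open('model.pkl', 'rb') as f:
          loaded = pickle.load(f)
      print(loaded.predict(X_validation))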

  88. Avatar
    Divya April 4, 2017 at 4:58 pm #

    Thank you so much, this document really helped me a lot. I had been searching for such a document for a long time. It gave me an actual view of how machine learning is implemented in Python. Books and courses are really difficult to understand completely, and it is hard to begin developing a project on such a vast concept from them: books and videos gave me lots of snippets, but I was not understanding how they all fit together.

  89. Avatar
    Divya April 4, 2017 at 5:00 pm #

    Can I get more such tutorials for a more detailed understanding? It would be really helpful.

  90. Avatar
    Gav April 11, 2017 at 5:17 pm #

    I can’t load the iris dataset, either through the URL or copied to the working folder, without getting NameError: name 'pandas' is not defined.

  91. Avatar
    Ursula April 13, 2017 at 7:33 pm #

    Hi Jason,

    Your tutorial is fantastic!
    I’m trying to follow it but gets stuck on 5.3 Build Models

    When I copy your code for this section I get a few Errors
    IndentationError: expected an indented block
    NameError: name ‘model’ is not defined
    NameError: name ‘cv_results’ is not defined
    NameError: name ‘name’ is not defined

    Could you please help me find what I’m doing wrong?
    Thanks!

    see the code and my “results” below:

    >>> # Spot Check Algorithms
    ... models = []
    >>> models.append(('LR', LogisticRegression()))
    >>> models.append(('LDA', LinearDiscriminantAnalysis()))
    >>> models.append(('KNN', KNeighborsClassifier()))
    >>> models.append(('CART', DecisionTreeClassifier()))
    >>> models.append(('NB', GaussianNB()))
    >>> models.append(('SVM', SVC()))
    >>> # evaluate each model in turn
    ... results = []
    >>> names = []
    >>> for name, model in models:
    ... kfold = model_selection.KFold(n_splits=10, random_state=seed)
    File "<stdin>", line 2
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    ^
    IndentationError: expected an indented block
    >>> cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    NameError: name 'model' is not defined
    >>> results.append(cv_results)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    NameError: name 'cv_results' is not defined
    >>> names.append(name)
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    NameError: name 'name' is not defined
    >>> msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    NameError: name 'name' is not defined
    >>> print(msg)

    • Avatar
      Jason Brownlee April 14, 2017 at 8:43 am #

      Make sure you have the same tab indenting as in the example. Maybe re-add the tabs yourself after you copy-paste the code.

      • Avatar
        Nathan Wilson March 26, 2018 at 11:16 am #

        I’m having this same problem. How would I add the indentation after I paste the code? Whenever I paste the code, it automatically executes.

        • Avatar
          Jason Brownlee March 26, 2018 at 2:27 pm #

          How to copy code from the tutorial:

          1. Click the copy button on the code example (top right of the code box, second from the end). This will select all of the code in the box.
          2. Copy the code to the clipboard (Control-C on Windows, Command-C on Mac, or right-click and click Copy).
          3. Open your text editor.
          4. Paste the code from the clipboard.

          This will preserve all of the white space.

          Does that help?

  92. Avatar
    Davy April 14, 2017 at 10:14 pm #

    Hi, one beginner question. What do we get once training is completed in supervised learning, for a classification problem? Do we get weights? How do I then use the trained model in the field, say for a real classification application? I didn’t get what happens after training is completed. I tried this example: https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py and it printed the accuracy and loss on the test data. What now?

  93. Avatar
    Manikandan April 14, 2017 at 11:36 pm #

    Wow… it’s really great stuff, man… Thank you…

  94. Avatar
    Wes April 15, 2017 at 3:16 am #

    As a complete beginner, it sounds so cool to predict the future. Then I saw all these models and complicated stuff and wondered how I could even begin. Thank you for this. It is really great!

  95. Avatar
    Manjushree Aithal April 16, 2017 at 7:41 am #

    Hello Jason,

    I just started following your step-by-step tutorial for machine learning. In the importing-libraries step I followed every step you specified and installed all the libraries via conda, but I’m still getting the following error.

    Traceback (most recent call last):
    File "C:/Users/dell/PycharmProjects/machine-learning/load_data.py", line 13, in <module>
    from sklearn.linear_model import LogisticRegression
    File "C:\Users\dell\Anaconda2\lib\site-packages\sklearn\linear_model\__init__.py", line 15, in <module>
    from .least_angle import (Lars, LassoLars, lars_path, LarsCV, LassoLarsCV,
    File "C:\Users\dell\Anaconda2\lib\site-packages\sklearn\linear_model\least_angle.py", line 24, in <module>
    from ..utils import arrayfuncs, as_float_array, check_X_y
    ImportError: DLL load failed: Access is denied.

    Can you please help me with this?

    Thank You!

    • Avatar
      Jason Brownlee April 16, 2017 at 9:33 am #

      I have not seen this error, and I don’t know about Windows, sorry.

      It looks like you might not have admin permissions on your workstation.

  96. Avatar
    Olah Data Semarang April 17, 2017 at 3:03 pm #

    Tutorial DEAP Version 2.1
    https://www.youtube.com/watch?v=drd11htJJC0
    A Data Envelopment Analysis (Computer) Program. This page describes the computer program Tutorial DEAP Version 2.1 which was written by Tim Coelli.

  97. Avatar
    Federico Carmona April 18, 2017 at 4:41 am #

    Good afternoon, Dr. Jason. Could you help me with the following problem: how could you modify the KNN algorithm to detect the most relevant variables?

    • Avatar
      Jason Brownlee April 18, 2017 at 8:34 am #

      You can use feature importance scores from bagged trees or gradient boosting.

      Consider using sklearn to calculate and plot feature importance.
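
      For example, a minimal sketch (assumes the X_train and Y_train arrays from the tutorial):

      from sklearn.ensemble import RandomForestClassifier
      # fit an ensemble of trees and report how much each input feature contributes
      rf = RandomForestClassifier(n_estimators=100)
      rf.fit(X_train, Y_train)
      print(rf.feature_importances_)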

  98. Avatar
    Bharath April 18, 2017 at 10:09 pm #

    Thank u…

  99. Avatar
    Amal April 26, 2017 at 6:14 pm #

    Hi Jason

    Thanks for the great tutorial you provided.
    I’m also new to ML and Python. I tried to use my own CSV file the way you used the iris dataset. Though it successfully loads the dataset, it gives the following error.

    could not convert string to float: LipCornerDepressor

    LipCornerDepressor holds normal values such as 0.32145, in an Excel sheet taken from SQL Server.

    Here is the code without library files.

    # Load dataset
    url = "F:\FINAL YEAR PROJECT\Amila\FTdata.csv"
    names = ['JawLower', 'BrowLower', 'BrowRaiser', 'LipCornerDepressor', 'LipRaiser', 'LipStretcher', 'Emotion_Id']
    dataset = pandas.read_csv(url, names=names)

    # shape
    print(dataset.shape)

    # class distribution
    print(dataset.groupby('Emotion_Id').size())

    # Split-out validation dataset
    array = dataset.values
    X = array[:,0:4]
    Y = array[:,4]
    validation_size = 0.20
    seed = 7
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

    # Test options and evaluation metric
    seed = 7
    scoring = 'accuracy'

    # Spot Check Algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))
    models.append(('KNN', KNeighborsClassifier()))
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))

    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        kfold = model_selection.KFold(n_splits=10, random_state=seed)
        cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)

    • Avatar
      Jason Brownlee April 27, 2017 at 8:37 am #

      This error might be specific to your data.

      Consider double checking that your data is loaded as you expect. Maybe print some raw data or plots to confirm.
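
      For example, a minimal sketch (assumes the dataset DataFrame from the example):

      # eyeball the first rows and the column types to confirm the load
      print(dataset.head(5))
      print(dataset.dtypes)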

  100. Avatar
    Chanaka April 27, 2017 at 6:31 am #

    Thank you very much for the easy to follow tutorial.

  101. Avatar
    Sonali Deshmukh April 27, 2017 at 7:07 pm #

    Hi, Jason

    Your posts are really good…
    I’m very new to Python and machine learning.
    Can you please suggest good reads for getting the basics of machine learning clear?

  102. Avatar
    lanndo April 28, 2017 at 2:26 am #

    Outstanding work on this. I am curious how to export results that show which records were matched to what by the predictor; when I print(predictions) it does not show which records they are paired with. Thanks!

    • Avatar
      Jason Brownlee April 28, 2017 at 7:51 am #

      Thanks!

      The index can be used to align predictions with inputs. For example, the first prediction is for the first input, and so on.
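
      For example, a minimal sketch (assumes X_validation and predictions from the tutorial):

      # pair each input row with its prediction, in order
      for row, pred in zip(X_validation, predictions):
          print(row, '=>', pred)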

  103. Avatar
    NAVKIRAN KAUR April 29, 2017 at 4:28 pm #

    When I apply all the models and print the message, it shows me an error saying it cannot convert string to float. How do I resolve this error? My dataset is related to fake news: title, text, label.

    • Avatar
      Jason Brownlee April 30, 2017 at 5:27 am #

      Ensure you have converted your text data to numerical values.

  104. Avatar
    Shravan May 1, 2017 at 6:29 am #

    Awesome tutorial on basics of machine learning using Python. Thank you Jason!

  105. Avatar
    Shravan May 1, 2017 at 6:36 am #

    I am using Anaconda Python and was writing all the commands/programs at the ‘python’ command line. I am trying to find a way to save this program to a file. I have tried ‘%save’, but it errored out. Any thoughts?

    • Avatar
      Jason Brownlee May 2, 2017 at 5:51 am #

      You can write your programs in a text file then run them on the command line as follows:
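
      For example, assuming your program is saved as yourfile.py:

      python yourfile.py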

  106. Avatar
    Jason May 1, 2017 at 2:05 pm #

    Thank you for the help and insight you provide. When I run the actual validation data through the algorithms, I get a different feel for which one may be the best fit.

    Validation Test Accuracy:
    LR…….0.80
    LDA…..0.97
    KNN….0.90
    CART..0.87
    NB…….0.83
    SVM….0.93

    My question is, should this influence my choice of algorithm?

    Thank you again for providing such a wealth of information on your blog.

  107. Avatar
    rahman May 3, 2017 at 11:09 pm #

    # Split-out validation dataset
    array = dataset.values
    X = array[:,0:4]
    Y = array[:,4]

    From my dataset, when I give Y = array[:,1] it works, but if I give 2, 3, or 4 instead of 1, it gives the following error,
    even though all the columns hold similar kinds of data.

    Traceback (most recent call last):
    File "/alok/c-analyze/analyze.py", line 390, in <module>
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    File "/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py", line 140, in cross_val_score
    for train, test in cv_iter)
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
    self.results = batch()
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
    File "/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py", line 238, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    File "/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py", line 468, in fit
    self._solve_svd(X, y)
    File "/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py", line 378, in _solve_svd
    fac = 1. / (n_samples - n_classes)

    ZeroDivisionError: float division by zero

    • Avatar
      Jason Brownlee May 4, 2017 at 8:08 am #

      Perhaps take a closer look at your data.

      • Avatar
        rahman May 4, 2017 at 4:29 pm #

        But the data is very similar in all the columns.

        • Avatar
          rahman May 4, 2017 at 4:37 pm #

          I meant there is not much difference between the data in each column, but still it works only for the first column! It gives the above error for any other column I choose.

          • Avatar
            rahman May 4, 2017 at 4:46 pm #

            Have a look at the data :

            index,1column,2column,3column,...,8column
            0,238,240,1103,409,1038,4,67,0
            1,41,359,995,467,1317,8,71,0
            2,102,616,1168,480,1206,7,59,0
            3,0,34,994,181,1115,4,68,0
            4,88,1419,1175,413,1060,8,71,0
            5,826,10886,1316,6885,2086,263,119,0
            6,88,472,1200,652,1047,7,64,0
            7,0,322,957,533,1062,11,73,0
            8,0,200,1170,421,1038,5,63,0
            9,103,1439,1085,1638,1151,29,66,0
            10,0,1422,1074,4832,1084,27,74,0
            11,1828,754,11030,263845,1209,10,79,0
            12,340,1644,11181,175099,4127,13,136,0
            13,71,1018,1029,2480,1276,18,66,1
            14,0,3077,1116,1696,1129,6,62,0

            ...
            (105 data records in total)

            But the above error does not occur for column 1, that is, when Y = column 1.
            The same error happens when I choose any other column: 2, 3, or 4.

  108. Avatar
    hairo May 3, 2017 at 11:13 pm #

    How do I plot a graph of the actual values against the predicted values here?

    And how do I save these plots and view them again later, from the terminal itself?

    • Avatar
      Jason Brownlee May 4, 2017 at 8:08 am #

      It would make for a dull graph, as this is a classification problem.

      You might be better off reviewing the confusion matrix for a set of predictions.

  109. Avatar
    Sudarshan May 5, 2017 at 12:18 pm #

    How can this be applied to predict a value when a statistical dataset is given?
    Say I am given the past 10 years of house prices; now I want to predict the value of a house in the next one or two years.

    Can you help me out with this?

    I am an amateur in ML.

    Thanks for this tutorial;
    it gave me a good kickstart into ML.

    I am waiting for your reply.

    • Avatar
      Jason Brownlee May 6, 2017 at 7:30 am #

      This is called a time series forecasting problem.

      You can learn more about how to work through time series forecasting problems here:
      https://machinelearningmastery.mystagingwebsite.com/start-here/#timeseries

      • Avatar
        Sudarshan May 6, 2017 at 3:15 pm #

        I am getting into trouble doing that; please help me out with a simple example.

        For example, I have a dataset of plumber work, say with the
        attributes
        experience_level, date, rating, price/hour.
        I want to predict the price/hour for the next date based on experience level and average rating. Can you please help me with this?

  110. Avatar
    Bane May 8, 2017 at 4:30 am #

    Great job with the tutorial, it was really helpful.

    I want to ask how I can use the techniques above with a dataset that is not just one line with a few values, but an Nx3 matrix with multiple values (measurements from an accelerometer). Is there a tutorial? Where can I look this up?

    • Avatar
      Jason Brownlee May 8, 2017 at 7:46 am #

      Each feature would be a different input variable as in the example above.

  111. Avatar
    Shud May 9, 2017 at 12:04 am #

    Hey Jason,

    I have built a linear regression model. The y-intercept is abnormally high (0.3 million) and the adjusted R-squared is 0.94. I would like to know what a high intercept means.

    • Avatar
      Jason Brownlee May 9, 2017 at 7:45 am #

      Think of the intercept as the bias term.

      Many books have been written on linear regression and much is known about how to analyze these models effectively. I would recommend diving into the statistics literature.

  112. Avatar
    MK May 11, 2017 at 12:19 am #

    Excellent tutorial. I am moving from PHP to Python and taking baby steps. I used the Thonny IDE (http://thonny.org/), which is also very useful for Python beginners.

  113. Avatar
    Tmoe May 14, 2017 at 4:31 am #

    Thank you so much, Jason! I’m new to machine learning and python but found your tutorial extremely helpful and easy to follow – thank you for posting!

  114. Avatar
    melody12ab May 15, 2017 at 6:07 pm #

    Thanks for all,now I am starting use ML!!!

  115. Avatar
    smith May 15, 2017 at 9:36 pm #

    # Spot Check Algorithms
    models = []
    models.append(('LR', LogisticRegression()))
    models.append(('LDA', LinearDiscriminantAnalysis()))

    When I print models, this is the output:

    [('LR', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
    intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
    penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
    verbose=0, warm_start=False)), ('LDA', LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
    solver='svd', store_covariance=False, tol=0.0001)), ('KNN', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
    metric_params=None, n_jobs=1, n_neighbors=5, p=2,
    weights='uniform'))]

    What are these extra values inside LogisticRegression(...) and all the other algorithms?

    How did they get appended?

  116. Avatar
    pasha May 15, 2017 at 9:45 pm #

    When I print kfold:

    KFold(n_splits=7, random_state=7, shuffle=False)

    What is shuffle? How did this value get added, as we had only done this:

    kfold = model_selection.KFold(n_splits=10, random_state=seed)

    • Avatar
      Jason Brownlee May 16, 2017 at 8:44 am #

      Whether or not to shuffle the dataset prior to splitting into folds.
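
      Parameters you do not set are filled in with defaults by sklearn, which is why shuffle=False shows up. To shuffle first, you could write, for example:

      kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)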

      • Avatar
        pasha May 16, 2017 at 3:17 pm #

        Now I understand. Jason, thanks for the amazing tutorials. Just one suggestion: along with the code, give a link to a detailed reference about these topics!

  117. Avatar
    sita May 15, 2017 at 9:48 pm #

    Hello Jason,

    This is an amazing blog; thank you for all the posts.

    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)

    What is scoring here? Can you explain the model_selection.cross_val_score line in detail, please?

  118. Avatar
    rahman May 15, 2017 at 10:27 pm #

    Please help me with this error, Jason.

    ERROR :

    Traceback (most recent call last):
    File "/rahman/c-analyze/analyze.py", line 390, in <module>
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    File "/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py", line 140, in cross_val_score
    for train, test in cv_iter)
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
    self.results = batch()
    File "/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
    File "/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py", line 238, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    File "/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py", line 468, in fit
    self._solve_svd(X, y)
    File "/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py", line 378, in _solve_svd
    fac = 1. / (n_samples - n_classes)

    ZeroDivisionError: float division by zero

    My code:

    # Split-out validation dataset
    array = dataset.values
    X = array[:,0:4]

    if field == "rh": # no error if I select this col
        Y = array[:,0]

    elif field == "rm": # gives the above error
        Y = array[:,1]

    elif field == "wh": # gives the above error
        Y = array[:,2]

    elif field == "wm": # gives the above error
        Y = array[:,3]

    Have a look at the data :

    index,1column,2column,3column,...,8column
    0,238,240,1103,409,1038,4,67,0
    1,41,359,995,467,1317,8,71,0
    2,102,616,1168,480,1206,7,59,0
    3,0,34,994,181,1115,4,68,0
    4,88,1419,1175,413,1060,8,71,0
    5,826,10886,1316,6885,2086,263,119,0
    6,88,472,1200,652,1047,7,64,0
    7,0,322,957,533,1062,11,73,0
    8,0,200,1170,421,1038,5,63,0
    9,103,1439,1085,1638,1151,29,66,0
    10,0,1422,1074,4832,1084,27,74,0
    11,1828,754,11030,263845,1209,10,79,0
    12,340,1644,11181,175099,4127,13,136,0
    13,71,1018,1029,2480,1276,18,66,1
    14,0,3077,1116,1696,1129,6,62,0

    ...
    (105 data records in total)

    But the above error does not occur for column 1, that is, when Y = column 1.

    The same error happens when I choose any other column: 2, 3, or 4.

    • Avatar
      Jason Brownlee May 16, 2017 at 8:45 am #

      Perhaps try scaling your data?
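
      For example, a minimal scaling sketch (assumes the X array from your code):

      from sklearn.preprocessing import StandardScaler
      # standardize each column to zero mean and unit variance
      X_scaled = StandardScaler().fit_transform(X)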

      Perhaps try another algorithm?

  119. Avatar
    suma May 16, 2017 at 12:05 am #

    fac = 1. / (n_samples - n_classes)

    ZeroDivisionError: float division by zero

    What is this error in fac = 1. / (n_samples - n_classes)?

    Where are n_samples and n_classes used?

    What may be the possible reason for this error?

  120. Avatar
    bob May 22, 2017 at 6:46 pm #

    Thank you, Dr Jason, it is really very helpful. 🙂

  121. Avatar
    Krithika May 24, 2017 at 12:24 am #

    Hi Jason
    Great starting tutorial to get the whole picture. Thank you:)
    I am a newbie to machine learning. Could you please tell me why you specifically chose these 6 models?

    • Avatar
      Jason Brownlee May 24, 2017 at 4:57 am #

      No specific reason, just a demonstration of spot checking a suite of methods on the problem.

  122. Avatar
    Ram Gour May 25, 2017 at 8:24 pm #

    Hi Jason, I am new to Python but found this blog really helpful. I tried executing the code and it returned all the results you mention above, except a few graphs.
    The scatter-matrix graph and the evaluation of the 6 algorithms did not open on my machine, but they show results on my colleague’s machine. I checked all the versions, and they are the same as or higher than those you mention in the blog.
    Can you help me resolve this issue on my machine?

    • Avatar
      Jason Brownlee June 2, 2017 at 11:44 am #

      Perhaps check the configuration of matplotlib and ensure you can create simple graphs on your machine?

  123. Avatar
    sridhar May 25, 2017 at 8:50 pm #

    Great tutorial.

    How do I approach it when the dataset is not of any classification type and there are just 2 attributes: 1 is the input and the other is the output?

    Say I have the number of processes as input and CPU usage as output…
    The dataset looks like [10, 5], [15, 7], etc.

    • Avatar
      Jason Brownlee June 2, 2017 at 11:45 am #

      If the output is real-valued, it would be a regression problem. You would need to use a loss function like MSE.
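
      For example, a minimal regression sketch on made-up [processes, cpu] pairs like the ones above:

      from sklearn.linear_model import LinearRegression
      # X holds the input (number of processes), y the output (cpu usage)
      X = [[10], [15], [20]]
      y = [5, 7, 9]
      model = LinearRegression()
      model.fit(X, y)
      print(model.predict([[12]]))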

  124. Avatar
    pierre May 27, 2017 at 9:45 pm #

    Many thanks for this; I already got a lot out of it. I feel like a monkey, though, because yesterday I was neither familiar enough with Python nor had any clue about the back alleys of ML. Today I can see plots on my screen, and even if I have no clue what I’m looking at, this is where I wanted to be, so thanks!

    A few minor suggestions to make this perhaps even more dummy-proof:

    – I’m on Mac and I used python3 because python2 is weirdly set up out of the box and you can’t easily update the libraries needed. I understand you rightfully link to external installation instructions, so just to say: this stuff works in python3, if you needed further testimony.

    – When drawing plots, I started freaking out because the terminal became unresponsive. So if you just made an (unessential) suggestion to run plt.ion() first, linking to, for example, https://matplotlib.org/faq/usage_faq.html#what-is-interactive-mode, it might help dummies like me not give up too easily. (BTW, I find your philosophy of using the command line and not letting toolsets get in the way a great one indeed!)

    – There seems to be some ‘hack’ involved when defining the dataset: suppose there are no headers and so on… how do you get to load your dataset with an insightful name vector in the first place (you don’t…)? So just a hint of clarification would help here, giving the feeling that we can trust we are doing the right thing in this case because the data is well understood (I mean, this is not really a big deal, eh, it’s all par for the course, but if I didn’t have similar experience in R I’d feel completely lost, I think).

    I was a bit puzzled by the following sentence in 3.3:

    “We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.”

    Well, just looking at the table, I actually can’t see any of this. There is in fact really nothing telling this to us in the snippet, right? The sentence is a comment based on prior understanding of the dataset. Maybe this could be clarified so clueless readers don’t agonise over whether they are missing some magical power of insight.

    – Overall, I could run this and to some extent adapt it quickly to a different dataset until it became relevant what the data was like. I’m stumbling on the data manipulation for 5.1. I suppose it is both because I don’t know python structures and also because I have no clue what is being done in the selection step.

    I think in answer to a previous comment you link to doc for the relevant selection function, perhaps it would still be useful to have an extra, ‘for dummies’, detailed explanation of

    X = array[:,0:4]
    Y = array[:,4]

    in the context of the iris dataset. This is what I have to figure out, I think, in order to apply it to say, a 11 column dataset and it would be useful to known what I’m trying to do.

    The rest of the difficulties I have are with regard to interpretation of the output, and it is fair to say this is outside the scope of your tutorial, which puts dummies like me in a very good position to try to understand while being able to fiddle with a bit of code. All the above comments are extremely minor and really about polishing the readability for ultimate noobs; they are not really important, and your tutorial is a great and efficient resource.

    Thanks again!
    Pierre

  125. Avatar
    Shaksham Kapoor June 6, 2017 at 4:18 am #

    I’m not able to figure out what errors the confusion matrix represents, or what each column (precision, recall, f1-score, support) in the classification report signifies.

    And last but not least, thanks a lot, Sir, for this easy-to-use and wonderful tutorial. Words are not enough to express my gratitude; you have made a daunting task a hell of a lot easier for every ML enthusiast!!!

  126. Avatar
    Brian June 6, 2017 at 11:11 pm #

    Is this machine learning? What does the machine learn in this example? This is just plain statistics, used in a weird way…

    • Avatar
      Jason Brownlee June 7, 2017 at 7:14 am #

      Yes, it is.

      Nominally, statistics is about understanding the data, machine learning about making predictions at the cost of understanding.

    • Avatar
      Raj June 9, 2017 at 2:22 am #

      your question can be answered like this…

      Consider the formula for the area of a triangle: 1/2 x base x height. When you learn this formula, you understand it and apply it many times to different triangles, BUT you did not learn anything ABOUT the formula itself. For instance, how many people care that the formula has 2 variables (base and height), that there is no CONSTANT (like PI) in it, and many such things about the formula itself? Applying the formula does not teach anything about the nature of the formula itself.

      A lot of program execution in computers happens much the same way: data is a thing to be modified, applied, or used, but not necessarily understood. When you introduce some techniques to understand data, then the computer or ‘machine’ necessarily ‘learns’ that there are characteristics of that data, and that, at the least, some relationship exists among the data in the dataset. This learning is not explicitly programmed but inferred, although, confusingly, the algorithms themselves are explicitly programmed to infer the meaning of the dataset. The learning is then transferred to the end of the cycle: making predictions based on the gained understanding of the data.

      But, like you pointed out, it is still statistics and all its domain techniques. As a statistician, though, do you not ‘learn’ more about data than merely use it, unlike your counterparts who see data more as a commodity to be consumed? Because most computer systems do the latter (consumption) rather than the former (understanding), a system that understands data (with prediction used as proof of learning) can be called ‘machine learning’.

  127. Avatar
    Alex June 7, 2017 at 6:04 am #

    Thanks for good tutorial Jason.

    The only issue I encountered is the following error during the cross-validation score calculation for the KNeighborsClassifier() model:

    AttributeError: 'NoneType' object has no attribute 'issparse'

    Has anybody else got the same error? How can it be solved?

    I have installed the following versions of the tools:
    Python: 2.7.13 |Anaconda custom (64-bit)| (default, Dec 19 2016, 13:29:36) [MSC v.1500 64 bit (AMD64)]
    scipy: 0.19.0
    numpy: 1.12.1
    matplotlib: 2.0.0
    pandas: 0.19.2
    sklearn: 0.18.1

    Thanks,
    Alex

    • Avatar
      Jason Brownlee June 7, 2017 at 7:27 am #

      Ouch, sorry I have not seen this issue. Perhaps search on stackoverflow?

  128. Avatar
    thanda June 8, 2017 at 6:31 pm #

    HI, Jason!
    How can I get the xgboost algorithm in pseudocode or in code?

  129. Avatar
    Shaksham Kapoor June 9, 2017 at 1:14 am #

    Sir, I’ve been working on the banknote authentication dataset, and after applying the above procedure carefully, the results were 100% accuracy (on both the training and validation datasets) using the SVM and KNN models. Is 100% accuracy possible, or have I done something wrong?

    • Avatar
      Jason Brownlee June 9, 2017 at 6:27 am #

      That sounds great.

      If I were to get surprising results, I would be skeptical of my code/models.

      Work hard to ensure your system is not fooling you. Challenge surprising results.

      • Avatar
        Shaksham Kapoor June 9, 2017 at 3:10 pm #

        Sir, I’ve considered various other aspects like f1-score, recall, and support, but in each case the result is the same: 100%. How can I make sure that my system is not fooling me? What other procedure can I apply to check the accuracy on my dataset?

        • Avatar
          Jason Brownlee June 10, 2017 at 8:13 am #

          Get more data and see if the model can make accurate predictions.

  130. Avatar
    Rejeesh R June 9, 2017 at 7:27 pm #

    Hi, Jason!
    I am new to Python as well as ML, so I am getting the below error while running your code; please help me bring the code up.

    File "sample1.py", line 73, in <module>
    predictions = knn.predict(X_validation)
    File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/classification.py", line 143, in predict
    X = check_array(X, accept_sparse='csr')
    File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 407, in check_array
    _assert_all_finite(array)
    File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
    " or a value too large for %r." % X.dtype)
    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

    and my config

    Python: 2.7.6 (default, Oct 26 2016, 20:30:19)
    [GCC 4.8.4]
    scipy: 0.13.3
    numpy: 1.8.2
    matplotlib: 1.3.1
    pandas: 0.13.1
    sklearn: 0.18.1
    running in Ubuntu Terminal.

    • Avatar
      Jason Brownlee June 10, 2017 at 8:20 am #

      You may have a NaN value in your dataset. Check your data file.

  131. Avatar
    Sats S June 10, 2017 at 5:27 am #

    Hello. This is really an amazing tutorial. I got through everything, but when selecting the best model I hit a snag. Can you help out?

    Traceback (most recent call last):
    File "/Users/sahityasehgal/Desktop/py/machinetest.py", line 77, in <module>
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/model_selection/_validation.py", line 140, in cross_val_score
    for train, test in cv_iter)
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
    while self.dispatch_one_batch(iterator):
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
    self._dispatch(tasks)
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
    result = ImmediateResult(func)
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
    self.results = batch()
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/model_selection/_validation.py", line 238, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/linear_model/logistic.py", line 1173, in fit
    order="C")
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/utils/validation.py", line 526, in check_X_y
    y = column_or_1d(y, warn=True)
    File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/utils/validation.py", line 562, in column_or_1d
    raise ValueError("bad input shape {0}".format(shape))
    ValueError: bad input shape (94, 4)

  132. Avatar
    Rene June 11, 2017 at 1:25 am #

    Very insightful Jason, thank you for the post!

    I was wondering if the models can be saved to/loaded from file, to avoid re-training a model each time we wish to make a prediction.

    Thanks,

    Rene

  133. Avatar
    Richard Bruning June 12, 2017 at 11:42 am #

    Mr. Brownlee,

    This is, by far, the most effective applied-technology tutorial I have used.

    You get right to the point and still have readers actually working with Python, Python libraries, IDE options, and of course machine learning. I am an electromechanical engineer with embedded C experience. Until now, I had been bogged down trying to traipse through Python wizards’ idiosyncratic coding styles and verbose machine learning theory, knowing there exists a friendlier path.

    Thank you for showing me the way!

    Rich

    • Avatar
      Jason Brownlee June 13, 2017 at 8:13 am #

      Thanks Rich, you made my day! I’m glad it helped.

  134. Avatar
    Praver Vats June 13, 2017 at 7:21 pm #

    This was very informative… thank you!

    Actually, I am working on a Twitter-analysis project in Python where I extract user interests from their tweets. I was thinking of using the naive Bayes classifier in the TextBlob Python library, training the classifier with different types of pre-labeled tweets for different categories like politics, sports, etc.
    My only concern is whether it will be accurate, as I tried passing about 10 tweets in the training set and, based on that, tried classifying my test set. I am getting some false cases, and the accuracy is around 85%.

  135. Avatar
    Kush Singh Kushwaha June 14, 2017 at 4:14 am #

    Hi Jason,

    This was a great example. I had been looking for something similar on the internet all this time; glad I found this link. I wanted to run an ML project end-to-end and see that my basic infrastructure was ready before starting the actual course work. As you said, from here we can learn more about each algorithm in detail. It would be great if you could start a YouTube channel and upload some easy-to-learn videos related to ML, deep learning, and neural networks.

    Regards,
    Kush Singh

    • Avatar
      Jason Brownlee June 14, 2017 at 8:51 am #

      Thanks.

      Take a look at the rest of my blog and my books. I am dedicated to this mission.

  136. Avatar
    Shaksham Kapoor June 14, 2017 at 4:34 am #

    I’ve been working on a dataset whose first column contains [Male, Female, Infant] as entries; the rest of the columns are integers. How can I replace [Male, Female, Infant] with a notation like [0, 1, 2] or something similar? What is the most efficient way to do it?

  137. Avatar
    Dev June 14, 2017 at 12:52 pm #

    Sir, while loading the dataset we have given the URL, but what if we already have the file and want to load it from disk?

  138. Avatar
    Vincent June 18, 2017 at 2:26 am #

    Hi,

    Nice tutorial, thanks!
    Just a little precision in case someone encounters the same issue as me:
    if you get the error “This application failed to start because it could not find or load the Qt platform plugin ‘windows’” when you are trying to see your data visualizations, it is maybe (like in my case) because you are using PySide rather than PyQt.
    In that case, add these lines before the import matplotlib.pyplot as plt line:

    import matplotlib
    matplotlib.use('Qt4Agg')
    matplotlib.rcParams['backend.qt4'] = 'PySide'

    Hope this will help

  139. Avatar
    Danielle June 25, 2017 at 5:43 pm #

    Fantastic tutorial! Running it today I noticed two changes from the tutorial above (undoubtedly because time has passed since it was created). New users might find the following observations useful:

    #1 – Future Warning

    Ran on OS X, Python 3.6.1, in a jupyter notebook, anaconda 4.4.0 installed:
    scipy: 0.19.0
    numpy: 1.12.1
    matplotlib: 2.0.2
    pandas: 0.20.1
    sklearn: 0.18.1

    I replaced this line in the # Load libraries code block:
    from pandas.tools.plotting import scatter_matrix

    With this:
    from pandas.plotting import scatter_matrix

    …because a FutureWarning popped up:
    /Users/xxx/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'pandas.tools.plotting.scatter_matrix' is deprecated, import 'pandas.plotting.scatter_matrix' instead.

    Note: it does run perfectly even without this fix; this may become more of an issue in the future.

    #2 – SVM wins!

    In the build models section, the results were:
    LR: 0.966667 (0.040825)
    LDA: 0.975000 (0.038188)
    KNN: 0.983333 (0.033333)
    CART: 0.966667 (0.040825)
    NB: 0.975000 (0.053359)
    SVM: 0.991667 (0.025000)

    … which means SVM was better here. I added the following code block based on the KNN one:
    # Make predictions on validation dataset
    svm = SVC()
    svm.fit(X_train, Y_train)
    predictions = svm.predict(X_validation)
    print(accuracy_score(Y_validation, predictions))
    print(confusion_matrix(Y_validation, predictions))
    print(classification_report(Y_validation, predictions))

    which gets these results:
    0.933333333333
    [[ 7 0 0]
    [ 0 10 2]
    [ 0 0 11]]
    precision recall f1-score support

    Iris-setosa 1.00 1.00 1.00 7
    Iris-versicolor 1.00 0.83 0.91 12
    Iris-virginica 0.85 1.00 0.92 11

    avg / total 0.94 0.93 0.93 30

    I did also run the unmodified KNN block – # Make predictions on validation dataset – and got the exact results that were in the tutorial.

    Excellent tutorial, very clear, and easy to modify 🙂

    • Avatar
      Jason Brownlee June 26, 2017 at 6:06 am #

      Thanks for sharing Danielle.

      • Avatar
        abhilash April 2, 2020 at 12:34 am #

        precision recall f1-score support

        Iris-setosa 1.00 1.00 1.00 7
        Iris-versicolor 1.00 0.83 0.91 12
        Iris-virginica 0.85 1.00 0.92 11

        How do I relate this result to the input? I mean, can I interactively provide values for sepal-length, sepal-width, petal-length, and petal-width and get a result saying which class it is?

  140. Avatar
    mr. disapointed June 26, 2017 at 10:06 pm #

    So this intro shows how to set everything up, but not the actually interesting bit: how to use it?

  141. Avatar
    Aditya June 28, 2017 at 4:48 pm #

    Excellent tutorial, sir. I love your tutorials, and I am starting deep learning with Keras.
    I would love it if you could provide a tutorial for a sequence-to-sequence model using Keras and a relevant dataset.
    Also, I would be obliged if you could point me in some direction towards named entity recognition using seq2seq.

  142. Avatar
    RATNA June 30, 2017 at 4:19 am #

    Hi Jason,

    Awesome tutorial. I am working on the PIMA dataset, and while using the following command
    # head
    print(dataset.head(20))

    I am getting NaN. HELP ME.

    • Avatar
      Jason Brownlee June 30, 2017 at 8:18 am #

      Confirm you downloaded the dataset and that the file contains CSV data with nothing extra or corrupted.

      • Avatar
        RATNA June 30, 2017 at 4:14 pm #

        Hi Jason,

        I downloaded the dataset from UCI, which is a CSV file, but I still get NaN.

        # Load dataset
        url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

        Thanks..

        • Avatar
          Jason Brownlee July 1, 2017 at 6:27 am #

          Sorry, I do not see how this could be. Perhaps there is an issue with your environment?

  143. Avatar
    Deepak July 2, 2017 at 1:50 am #

    Hello Jason,
    Thank you for a great tutorial.

    I have noticed something which I would like to share with you.

    I tried with random_state = 4:
    X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=0.2, random_state=4)

    and surprisingly now LDA has the best accuracy.

    LR: 0.966667 (0.040825)
    LDA: 0.991667 (0.025000)
    KNN: 0.975000 (0.038188)
    CART: 0.958333 (0.055902)
    NB: 0.950000 (0.055277)
    SVM: 0.983333 (0.033333)

    Any thoughts on this?

  144. Avatar
    Rui July 3, 2017 at 12:31 pm #

    Hi Jason,

    Thanks for your great example; this is really helpful. This end-to-end project is the best way to learn ML, much better than textbooks, which focus only on separate concepts, not the whole forest. Will you please do more examples like this and explain them in detail next time?

    Thanks,

    Rui

  145. Avatar
    Vaibhav July 4, 2017 at 4:33 pm #

    __init__() got an unexpected keyword argument 'n_splites'

    I am getting this error while running the code up to the print(msg) command.
    Can you please help me remove it?

  146. Avatar
    Fahad Ahmed July 5, 2017 at 12:31 am #

    This is a beautiful tutorial for starters.
    I am a lover of machine learning and want to do some projects and research on it.
    I would really appreciate your help and guidance from time to time.

    Regards,
    Fahad

  147. Avatar
    Neal Valiant July 12, 2017 at 9:08 am #

    Hi Jason,
    Love the article; it gave me a good start on understanding machine learning. One thing I would like to ask is: what is the predicted outcome? Is it which type or “class” of flower comes next? I assume that, switching things up, I could use this same outline to get predictions for the other columns involved?

    • Avatar
      Jason Brownlee July 12, 2017 at 9:55 am #

      Yes, the prediction is a number that maps to a specific class of flower (string).

      Correct, from the class and other measures you could predict width or something.

      • Avatar
        Neal July 13, 2017 at 3:50 am #

        Hi again Jason,
        Diving deeper into this tutorial and analyzing more, I found something that piqued my interest; maybe you can shed some light on it. Based on the seed of 7, you get a higher accuracy percentage for the KNN algorithm after using k-fold, but the LDA algorithm has a higher percentage in accuracy_score after predicting with it. What could this mean?

        • Avatar
          Jason Brownlee July 13, 2017 at 9:59 am #

          Machine learning algorithms are stochastic.

          It is important to develop a robust estimate of the performance of machine learning models on unseen data using repeats. See this post:
          https://machinelearningmastery.mystagingwebsite.com/evaluate-skill-deep-learning-models/

          • Avatar
            Neal July 13, 2017 at 11:22 am #

            Another great read Jason. This whole site is full of great pieces and it gives me a good answer on my question. I want to thank you for your time and effort into making such a great place for all this knowledge.

          • Avatar
            Jason Brownlee July 13, 2017 at 4:54 pm #

            Thanks, I’m glad it helps Neal. Stick with it!

  148. Avatar
    Thomas July 14, 2017 at 8:10 pm #

    Hello Jason,

    At the beginning of your tutorial you write: “If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.”
    No offense, but in this regard your tutorial is not doing a very good job.
    You don’t really go into detail, so that we can understand what is being done and why. The explanations are rather weak.
    Wrong expectations set, I believe.

    Cheers,

    Thomas

    • Avatar
      Jason Brownlee July 15, 2017 at 9:43 am #

      It is a starting point, not a panacea.

      Sorry that it’s not a good fit for you.

  149. Avatar
    Mariah July 15, 2017 at 7:11 am #

    Hi Jason! I am trying to adapt this for a purely binary dataset, however I’m running into this problem:
    # evaluate each model in turn
    results = []
    name = []
    for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s:%f(%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

    I get the error:

    raise ValueError("Unknown label type: %r" % y_type)

    ValueError: Unknown label type: 'unknown'

    Am I missing something? Any help would be great!

    • Avatar
      Mariah July 15, 2017 at 7:12 am #

      All the necessary indentation is correct; it just pasted incorrectly.

    • Avatar
      Jason Brownlee July 15, 2017 at 9:46 am #

      Sorry, the fault is not obvious to me.

    • Avatar
      Daniel September 12, 2017 at 1:14 am #

      Hello Mariah,

      Did you ever get a solution to this problem?

      Jason..great guide here..THANKS!

  150. Avatar
    Sreeram July 16, 2017 at 10:09 pm #

    Hi. What should I do to make predictions based on my own test set? Say I need to predict the category of a flower with the data [5.2, 1.8, 1.6, 0.2], i.e., I want to change my X_test to that array, and the prediction should be something like “setosa”.

    What changes should I make? I tried giving that value directly to predict(), but it crashes.

    • Avatar
      Jason Brownlee July 17, 2017 at 8:47 am #

      Correct.

      Fit the model on all available data. This is called creating a final model:
      https://machinelearningmastery.mystagingwebsite.com/train-final-machine-learning-model/

      Then make your prediction on new data where you do not know the answer/outcome.

      Does that help?

      • Avatar
        Sreeram July 18, 2017 at 2:35 am #

        Yes, it helped. Can you show example code for this?

        • Avatar
          Jason Brownlee July 18, 2017 at 8:46 am #

          Sure:
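
          A sketch of what that could look like, using scikit-learn's built-in copy of the iris data (the tutorial loads it from CSV instead):

          from sklearn.datasets import load_iris
          from sklearn.neighbors import KNeighborsClassifier

          # Fit a final model on ALL available data (no train/test split)
          iris = load_iris()
          model = KNeighborsClassifier()
          model.fit(iris.data, iris.target)

          # Predict the class of one new, unseen flower measurement
          row = [5.2, 1.8, 1.6, 0.2]
          pred = model.predict([row])
          print(iris.target_names[pred[0]])  # e.g. 'setosa'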

  151. Avatar
    Joe July 18, 2017 at 7:49 am #

    Hi Jason, I'm from Peru and I have this script, written on a Mac:

    # Set up for the neural network
    fechantinicio = '1970-01-01'
    fechantfinal = '1974-12-31'
    capasinicio = TodasEstaciones.ix[fechantinicio:fechantfinal].as_matrix()[:,[0,2,5]]
    capasalida = TodasEstaciones.ix[fechantinicio:fechantfinal].as_matrix()[:,1]

    # Build the neural network
    from sknn.mlp import Regressor, Layer

    neurones = 8
    tasaaprendizaje = 0.0001
    numiteraciones = 7000

    # Definition of the training for the neural network
    redneural = Regressor(
        layers=[
            Layer("ExpLin", units=neurones),
            Layer("ExpLin", units=neurones),
            Layer("Linear")],
        learning_rate=tasaaprendizaje,
        n_iter=numiteraciones)
    redneural.fit(capasinicio, capasalida)

    # Get the prediction for the training set
    valortest = []

    for i in range(capasinicio.shape[0]):
        prediccion = redneural.predict(np.array([capasinicio[i,:].tolist()]))
        valortest.append(prediccion[0][0])

    and then when I run it I get:

    ModuleNotFoundError                       Traceback (most recent call last)
    <ipython-input> in <module>()
          1 # Build the neural network
          2
    ----> 3 from sknn.mlp import Regressor, Layer
          4
          5

    ModuleNotFoundError: No module named 'sknn'
    I have installed Python on Windows 7, and I changed the script to:

    # Build the neural network
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Definition of the training for the neural network
    redneural = MLPRegressor(
        hidden_layer_sizes=(100,), activation='relu', solver='adam', alpha=0.001, batch_size='auto',
        learning_rate='constant', learning_rate_init=0.01, power_t=0.5, max_iter=1000, shuffle=True,
        random_state=0, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True,
        early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08)

    redneural.fit(capasinicio, capasalida)

    Then I hit Shift+Enter, but the run never ends.

    Thanks for your time.
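
    A hedged note on the two problems above: the ModuleNotFoundError means the scikit-neuralnetwork package, which provides the sknn module, is not installed (pip install scikit-neuralnetwork should resolve it, assuming pip is available). The scikit-learn run that "never ends" may simply be training silently, since MLPRegressor defaults to verbose=False; turning on progress output shows whether training is advancing or actually hung:

    # Same model as above, but print the loss at every iteration
    # so slow training is distinguishable from a hang
    redneural = MLPRegressor(hidden_layer_sizes=(100,), max_iter=1000,
                             verbose=True, random_state=0)
    redneural.fit(capasinicio, capasalida)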

  152. Avatar
    Angel July 18, 2017 at 6:06 pm #

    Hello Jason, this is a fantastic tutorial! I am using it as a template to experiment with a dataset that has 0 or 1 as the value for each attribute, and I keep running into an error. Here is my code:

    # Load libraries
    import numpy
    from matplotlib import pyplot
    from pandas import read_csv
    from pandas import set_option
    from pandas.plotting import scatter_matrix
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import KFold
    from sklearn.model_selection import cross_val_score
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    from sklearn.metrics import accuracy_score
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.ensemble import ExtraTreesClassifier
    # Load dataset
    filename = 'ML.csv'
    names = ['Cities', 'Entertainment', 'RegionalFood', 'WestMiss', 'NFLTeam', 'Coastal', 'WarmWinter', 'SuperBowl', 'Manufacturing']
    data = read_csv(filename, names=names)
    print(data.shape)
    # types
    set_option('display.max_rows', 500)
    print(data.dtypes)
    # head
    set_option('display.width', 100)
    print(data.head(20))
    # descriptions, change precision to 3 places
    set_option('precision', 3)
    print(data.describe())
    # class distribution
    print(data.groupby('Cities').size())
    # histograms
    data.hist(sharex=False, sharey=False, xlabelsize=1, ylabelsize=1)
    pyplot.show()
    # correlation matrix
    fig = pyplot.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(data.corr(), vmin=-1, vmax=1, interpolation='none')
    fig.colorbar(cax)
    pyplot.show()
    # Split-out validation dataset
    array = data.values
    X = array[:,1:8]
    Y = array[:,8]
    validation_size = 0.20
    seed = 7
    X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y,
        test_size=validation_size, random_state=seed)
    # Test options and evaluation metric
    num_folds = 3
    seed = 7
    scoring = 'accuracy'
    # Spot-check algorithms
    models = []
    models.append(('CART', DecisionTreeClassifier()))
    models.append(('NB', GaussianNB()))
    models.append(('SVM', SVC()))
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        kfold = KFold(n_splits=3, random_state=seed)
        cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)

    I get the following error:

    File "C:\Users\Giselle\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py", line 172, in check_classification_targets
    raise ValueError("Unknown label type: %r" % y_type)

    ValueError: Unknown label type: 'unknown'

    runfile('C:/Users/Giselle/.spyder-py3/temp.py', wdir='C:/Users/Giselle/.spyder-py3')
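
    A hedged observation on the code above: because the 'Cities' column holds strings, data.values comes back as a NumPy array of dtype object, so Y = array[:,8] is object-dtype too, which is exactly what triggers this ValueError. One possible fix is to cast the slices explicitly:

    # data.values is dtype=object because of the string 'Cities' column;
    # cast features and labels to concrete numeric types
    X = array[:, 1:8].astype('float')
    Y = array[:, 8].astype('int')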

  153. Avatar
    machine learning guy July 18, 2017 at 9:15 pm #

    Hey Jason,

    Awesome, detailed blog, man. I always love your method of explanation: so clean and easy. I started machine learning with R, but now I'm doing it with Python too.

    Regards

    Kuldeep

  154. Avatar
    Aayush A July 18, 2017 at 9:17 pm #

    Hey Jason,

    Your sample code is amazing to get started with ML.

    When I tried to run the code myself, I got an error.

    Can you please help me rectify this?

  155. Avatar
    Marco Roque July 19, 2017 at 7:01 am #

    Jason

    Thanks for your help! The blog is super useful. Do you have another place you recommend to learn more about the topic? Thanks!

    Best

    Marco

  156. Avatar
    Yug July 20, 2017 at 2:59 am #

    Hi Jason,
    Great tutorial! Very helpful!

    I am getting an error executing the piece of code below. Can you help?
    # evaluate each model in turn
    results = []
    names = []
    for name, model in models:
        kfold = ms.KFold(n_splits=10, random_state=seed)
        cv_results = ms.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
        results.append(cv_results)
        names.append(name)
        msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
        print(msg)

    Error that I am getting:
    TypeError: get_params() missing 1 required positional argument: 'self'

    • Avatar
      Jason Brownlee July 20, 2017 at 6:22 am #

      Sorry, I have not seen that error before. Perhaps confirm that your environment is installed correctly?

      Also confirm that you have all of the code without extra spaces?

      • Avatar
        Yug July 20, 2017 at 8:02 am #

        Yeah, environment is installed correctly. I made sure that there are no extra spaces in the code. It is still erroring out.

    • Avatar
      Sal August 2, 2018 at 1:07 am #

      For anyone with this issue, the problem is a missing parenthesis in the line models.append(('LR', LogisticRegression()))
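
      In code, the difference Sal is describing looks like this:

      # Wrong: appends the class itself; cross_val_score later tries to call
      # get_params() on the class and fails with "missing ... 'self'"
      models.append(('LR', LogisticRegression))

      # Right: appends an instance of the classifier
      models.append(('LR', LogisticRegression()))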