# Your First Machine Learning Project in Python Step-By-Step

Last Updated on August 19, 2020

Do you want to do machine learning using Python, but you’re having trouble getting started?

In this post, you will complete your first machine learning project using Python.

In this step-by-step tutorial you will:

1. Download and install Python SciPy and get the most useful package for machine learning in Python.
2. Load a dataset and understand its structure using statistical summaries and data visualization.
3. Create 6 machine learning models, pick the best and build confidence that the accuracy is reliable.

If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.

Let’s get started!

• Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
• Update Sep/2018: Added link to my own hosted version of the dataset.
• Update Feb/2019: Updated for sklearn v0.20, also updated plots.
• Update Nov/2019: Added full code examples for each section.
• Update Dec/2019: Updated examples to remove warnings due to API changes in v0.22.
• Update Jan/2020: Updated to remove the snippet for the test harness.

## How Do You Start Machine Learning in Python?

The best way to learn machine learning is by designing and completing small projects.

### Python Can Be Intimidating When Getting Started

Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use both for research and development and for building production systems.

There are also a lot of modules and libraries to choose from, providing multiple ways to do each task. It can feel overwhelming.

The best way to get started using Python for machine learning is to complete a project.

• It will force you to install and start the Python interpreter (at the very least).
• It will give you a bird’s eye view of how to step through a small project.
• It will give you confidence, maybe to go on to your own small projects.

### Beginners Need A Small End-to-End Project

Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.

When you are applying machine learning to your own datasets, you are working on a project.

A machine learning project may not be linear, but it has a number of well known steps:

1. Define Problem.
2. Prepare Data.
3. Evaluate Algorithms.
4. Improve Results.
5. Present Results.

The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing data, evaluating algorithms and making some predictions.

If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps such as further data preparation and improving result tasks later, once you have more confidence.

### Hello World of Machine Learning

The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).

This is a good project because it is so well understood.

• Attributes are numeric so you have to figure out how to load and handle data.
• It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
• It is a multi-class classification problem (multinomial) that may require some specialized handling.
• It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
• All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.

Let’s get started with your hello world machine learning project in Python.

## Machine Learning in Python: Step-By-Step Tutorial (start here)

In this section, we are going to work through a small machine learning project end-to-end.

Here is an overview of what we are going to cover:

1. Installing the Python and SciPy platform.
2. Loading the dataset.
3. Summarizing the dataset.
4. Visualizing the dataset.
5. Evaluating some algorithms.
6. Making some predictions.

Take your time. Work through each step.

Try to type in the commands yourself or copy-and-paste the commands to speed things up.

## 1. Install the Python and SciPy Platform

Get the Python and SciPy platform installed on your system if it is not already.

I do not want to cover this in great detail, because others already have. This is already pretty straightforward, especially if you are a developer. If you do need help, ask a question in the comments.

### 1.1 Install SciPy Libraries

This tutorial assumes Python version 2.7 or 3.6+.

There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries required for this tutorial:

• scipy
• numpy
• matplotlib
• pandas
• sklearn

There are many ways to install these libraries. My best advice is to pick one method then be consistent in installing each library.

The scipy installation page provides excellent instructions for installing the above libraries on multiple platforms, such as Linux, Mac OS X and Windows. If you have any doubts or questions, refer to this guide. It has been followed by thousands of people.

• On Mac OS X, you can use macports to install Python 3.6 and these libraries. For more information on macports, see the homepage.
• On Linux you can use your package manager, such as yum on Fedora to install RPMs.

If you are on Windows or you are not confident, I would recommend installing the free version of Anaconda that includes everything you need.

Note: This tutorial assumes you have scikit-learn version 0.20 or higher installed.

Need more help? See one of these tutorials:

### 1.2 Start Python and Check Versions

It is a good idea to make sure your Python environment was installed successfully and is working as expected.

The script below will help you test out your environment. It imports each library required in this tutorial and prints the version.

Open a command line and start the python interpreter:

I recommend working directly in the interpreter or writing your scripts and running them on the command line rather than big editors and IDEs. Keep things simple and focus on the machine learning not the toolchain.

Type or copy and paste the following script:
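The script itself did not survive in this copy of the post, so here is a minimal sketch of what such a check might look like. It assumes the five libraries listed in section 1.1 and wraps each import so that a missing library is reported rather than stopping the script. The `report_versions` helper is a hypothetical name used only for this illustration.

```python
# Check the versions of the libraries used in this tutorial
import importlib
import sys


def report_versions(module_names):
    """Return a dict mapping each module name to its version string.

    Hypothetical helper: a module that cannot be imported is reported
    as "not installed" instead of raising an error.
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
            versions[name] = str(getattr(module, "__version__", "unknown"))
        except ImportError:
            versions[name] = "not installed"
    return versions


print("Python: {}".format(sys.version))
versions = report_versions(["scipy", "numpy", "matplotlib", "pandas", "sklearn"])
for name, version in versions.items():
    print("{}: {}".format(name, version))
```

If any line prints "not installed", fix that library before moving on.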

Here is the output I get on my OS X workstation:

Compare the above output to your versions.

Ideally, your versions should match or be more recent. The APIs do not change quickly, so do not be too concerned if you are a few versions behind. Everything in this tutorial will very likely still work for you.

If you get an error, stop. Now is the time to fix it.

If you cannot run the above script cleanly you will not be able to complete this tutorial.

My best advice is to Google search for your error message or post a question on Stack Exchange.

## 2. Load the Data

We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics by pretty much everyone.

The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.

In this step we are going to load the iris data from a CSV file URL.

### 2.1 Import libraries

First, let’s import all of the modules, functions and objects we are going to use in this tutorial.

Everything should load without error. If you have an error, stop. You need a working SciPy environment before continuing. See the advice above about setting up your environment.

### 2.2 Load Dataset

We can load the data directly from the UCI Machine Learning repository.

We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.

Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.

The dataset should load without incident.

If you do have network problems, you can download the iris.csv file into your working directory and load it using the same method, changing URL to the local file name.
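As a rough sketch of the loading step, the snippet below reads CSV text with pandas and names the columns explicitly. The column names are my assumption (the post only says there are four measurements and a species column), and a three-row in-memory sample stands in for the real file so the example runs without a network connection. To load the real data, pass the URL or a local path such as `iris.csv` to `read_csv` in place of `sample`.

```python
# Load CSV data with pandas, naming the columns explicitly
from io import StringIO

from pandas import read_csv

# Assumed column names: four measurements plus the species label
names = ["sepal-length", "sepal-width", "petal-length", "petal-width", "class"]

# Stand-in for the real file: three rows in the layout described above
sample = StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# names= tells pandas the file has no header row and supplies the labels
dataset = read_csv(sample, names=names)
print(dataset)
```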

## 3. Summarize the Dataset

Now it is time to take a look at the data.

In this step we are going to take a look at the data a few different ways:

1. Dimensions of the dataset.
2. Peek at the data itself.
3. Statistical summary of all attributes.
4. Breakdown of the data by the class variable.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

### 3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

You should see 150 instances and 5 attributes:

### 3.2 Peek at the Data

It is also always a good idea to actually eyeball your data.

You should see the first 20 rows of the data:

### 3.3 Statistical Summary

Now we can take a look at a summary of each attribute.

This includes the count, mean, the min and max values as well as some percentiles.

We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.

### 3.4 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

We can see that each class has the same number of instances (50 or 33% of the dataset).

### 3.5 Complete Example

For reference, we can tie all of the previous elements together into a single script.

The complete example is listed below.
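The listing is missing from this copy, so what follows is a hedged reconstruction rather than the original script. It assumes the copy of iris bundled with scikit-learn (so no download is needed) and column names of my own choosing. The bundled copy labels the classes `setosa`, `versicolor` and `virginica` rather than `Iris-setosa` and so on, and a couple of its values may differ slightly from the CSV version.

```python
# Summarize the iris dataset: dimensions, first rows, statistics, class counts
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: build the DataFrame from scikit-learn's bundled iris data
iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = [iris.target_names[i] for i in iris.target]

# dimensions: (rows, columns)
print(dataset.shape)

# peek at the first 20 rows
print(dataset.head(20))

# count, mean, std, min, max and percentiles for each numeric attribute
print(dataset.describe())

# number of rows that belong to each class
class_counts = dataset.groupby("class").size()
print(class_counts)
```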

## 4. Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

1. Univariate plots to better understand each attribute.
2. Multivariate plots to better understand the relationships between attributes.

### 4.1 Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

Given that the input variables are numeric, we can create box and whisker plots of each.

This gives us a much clearer idea of the distribution of the input attributes:

Box and Whisker Plots for Each Input Variable for the Iris Flowers Dataset

We can also create a histogram of each input variable to get an idea of the distribution.

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

Histogram Plots for Each Input Variable for the Iris Flowers Dataset

### 4.2 Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

Scatter Matrix Plot for Each Input Variable for the Iris Flowers Dataset

### 4.3 Complete Example

For reference, we can tie all of the previous elements together into a single script.

The complete example is listed below.
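Since the listing is not present in this copy, here is a minimal sketch of the three plot types under a few assumptions: it uses scikit-learn's bundled iris data, my own column names, and an off-screen matplotlib backend that writes PNG files to the system temporary directory. The file names and the `save_current_figure` helper are hypothetical. For interactive windows, remove the `matplotlib.use("Agg")` line and call `plt.show()` instead of saving.

```python
# Visualize the iris dataset with univariate and multivariate plots
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen and save to files
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
from sklearn.datasets import load_iris

iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
features = pd.DataFrame(iris.data, columns=names)

out_dir = tempfile.gettempdir()


def save_current_figure(filename):
    """Hypothetical helper: save the active figure as a PNG and close it."""
    path = os.path.join(out_dir, filename)
    plt.savefig(path)
    plt.close("all")
    return path


# box and whisker plots, one panel per attribute
features.plot(kind="box", subplots=True, layout=(2, 2), sharex=False, sharey=False)
box_path = save_current_figure("iris_box.png")

# histograms, one panel per attribute
hist_axes = features.hist()
hist_path = save_current_figure("iris_hist.png")

# scatter plot matrix of every pair of attributes
scatter_axes = scatter_matrix(features)
scatter_path = save_current_figure("iris_scatter.png")

print(box_path, hist_path, scatter_path)
```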

## 5. Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

1. Separate out a validation dataset.
2. Set up the test harness to use 10-fold cross validation.
3. Build multiple different models to predict species from flower measurements.
4. Select the best model.

### 5.1 Create a Validation Dataset

We need to know that the model we created is good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train, evaluate and select among our models, and 20% that we will hold back as a validation dataset.

You now have training data in X_train and Y_train for preparing models, and validation data in X_validation and Y_validation that we can use later.
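The split described above might look like the sketch below. It assumes scikit-learn's bundled iris data, my own column names, and `random_state=1`. The seed value is not stated in this section, although `random_state=1` does appear in snippets quoted in the comments.

```python
# Hold back 20% of the rows as a validation dataset
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]
dataset = pd.DataFrame(iris.data, columns=names)
dataset["class"] = [iris.target_names[i] for i in iris.target]

array = dataset.values
X = array[:, 0:4]  # all rows, first four columns (the measurements)
y = array[:, 4]    # all rows, fifth column (the class)

# 80% for training and model selection, 20% held back for validation
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)
print(X_train.shape, X_validation.shape)
```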

Notice that we used a Python slice to select the columns in the NumPy array. If this is new to you, you might want to check out this post:

### 5.2 Test Harness

We will use stratified 10-fold cross validation to estimate model accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.

Stratified means that each fold or split of the dataset will aim to have the same distribution of examples by class as exists in the whole training dataset.
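To make stratification concrete, the sketch below splits the full 150-row iris data (scikit-learn's bundled copy, an assumption) into 10 stratified folds and counts the classes in each test fold. In the tutorial the folds are drawn from the 120-row training set, where the class counts are not perfectly equal, so the balance there is only approximate. The `random_state=1` seed is likewise an assumption.

```python
# Show the class balance inside each fold of stratified 10-fold cross validation
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

fold_counts = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X, y), start=1):
    # count how many rows of each class landed in this test fold
    counts = dict(sorted(Counter(y[test_idx].tolist()).items()))
    fold_counts.append(counts)
    print("fold %d: train=%d test=%d classes=%s" % (fold, len(train_idx), len(test_idx), counts))
```

With 50 rows per class and 10 folds, every test fold holds 5 rows of each class.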

For more on the k-fold cross-validation technique, see the tutorial:

We set the random seed via the random_state argument to a fixed number to ensure that each algorithm is evaluated on the same splits of the training dataset.

We are using the metric of 'accuracy' to evaluate models.

This is the ratio of the number of correctly predicted instances to the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the scoring variable when we build and evaluate each model next.

### 5.3 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use.

We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s test 6 different algorithms:

• Logistic Regression (LR)
• Linear Discriminant Analysis (LDA)
• K-Nearest Neighbors (KNN)
• Classification and Regression Trees (CART)
• Gaussian Naive Bayes (NB)
• Support Vector Machines (SVM)

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms.

Let’s build and evaluate our models:
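The code for this step is missing from this copy, so the following is a sketch pieced together from the snippets quoted in the comments below, not the original listing. It assumes scikit-learn's bundled iris data and `random_state=1`. One deliberate substitution: the quoted snippets pass `solver='liblinear'` and `multi_class='ovr'` to `LogisticRegression`, but recent scikit-learn releases deprecate the `multi_class` argument, so this sketch uses the default solver with a higher `max_iter` instead.

```python
# Spot check six algorithms with stratified 10-fold cross validation
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

models = [
    ("LR", LogisticRegression(max_iter=1000)),  # substitution, see the note above
    ("LDA", LinearDiscriminantAnalysis()),
    ("KNN", KNeighborsClassifier()),
    ("CART", DecisionTreeClassifier()),
    ("NB", GaussianNB()),
    ("SVM", SVC(gamma="auto")),
]

# evaluate each model in turn on the same splits of the training data
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
```

Each printed line shows the mean accuracy across the 10 folds, with the standard deviation in parentheses.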

### 5.4 Select Best Model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

Running the example above, we get the following raw results:

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

What scores did you get?

In this case, we can see that it looks like Support Vector Machines (SVM) has the largest estimated accuracy score at about 0.98 or 98%.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (via 10-fold cross validation).

A useful way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.
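One way to draw such a comparison is sketched below. It repeats the cross validation in compact form so that the snippet runs on its own, and it assumes scikit-learn's bundled iris data, `random_state=1`, default `LogisticRegression` settings with a higher `max_iter`, and an off-screen matplotlib backend that saves a PNG to the system temporary directory. The output file name is hypothetical.

```python
# Compare algorithms with one box and whisker plot per model
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # assumption: render off-screen and save to a file
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, _, Y_train, _ = train_test_split(X, y, test_size=0.20, random_state=1)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "KNN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(),
    "NB": GaussianNB(),
    "SVM": SVC(gamma="auto"),
}

# ten accuracy scores per model, one for each fold
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
scores = {
    name: cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
    for name, model in models.items()
}

# one box per model, labelled with the model's short name
fig, ax = plt.subplots()
ax.boxplot(list(scores.values()))
ax.set_xticks(range(1, len(scores) + 1))
ax.set_xticklabels(list(scores.keys()))
ax.set_title("Algorithm Comparison")
ax.set_ylabel("Accuracy")

plot_path = os.path.join(tempfile.gettempdir(), "iris_algorithm_comparison.png")
fig.savefig(plot_path)
plt.close(fig)
print(plot_path)
```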

We can see that the box and whisker plots are squashed at the top of the range, with many evaluations achieving 100% accuracy, and some pushing down into the high 80% accuracies.

Box and Whisker Plot Comparing Machine Learning Algorithms on the Iris Flowers Dataset

### 5.5 Complete Example

For reference, we can tie all of the previous elements together into a single script.

The complete example is listed below.

## 6. Make Predictions

We must choose an algorithm to use to make predictions.

The results in the previous section suggest that the SVM was perhaps the most accurate model. We will use this model as our final model.

Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both of these issues will result in an overly optimistic result.

### 6.1 Make Predictions

We can fit the model on the entire training dataset and make predictions on the validation dataset.

You might also like to make predictions for single rows of data. For examples on how to do that, see the tutorial:

You might also like to save the model to file and load it later to make predictions on new data. For examples on how to do this, see the tutorial:
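Both ideas can be sketched briefly. The example below is an illustration under assumptions, not the linked tutorials' code: it fits an SVM on scikit-learn's bundled iris data, predicts a single hand-typed row, and round-trips the fitted model through Python's standard `pickle` module in memory. To persist a model to disk you would use `pickle.dump` and `pickle.load` with a file opened in binary mode.

```python
# Predict a single row, then save and restore the fitted model with pickle
import pickle

from sklearn.datasets import load_iris
from sklearn.svm import SVC

iris = load_iris()
model = SVC(gamma="auto")
model.fit(iris.data, iris.target)

# a single row must still be 2D: one row, four measurements
new_row = [[5.1, 3.5, 1.4, 0.2]]
predicted = model.predict(new_row)
print("Predicted class index:", predicted[0])
print("Predicted class name:", iris.target_names[predicted[0]])

# serialize the fitted model to bytes and load it back
saved_bytes = pickle.dumps(model)
restored = pickle.loads(saved_bytes)
restored_prediction = restored.predict(new_row)
print("Restored model agrees:", restored_prediction[0] == predicted[0])
```

Only unpickle files you trust, and load them with the same scikit-learn version that created them where possible.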

### 6.2 Evaluate Predictions

We can evaluate the predictions by comparing them to the expected results in the validation set, then calculate classification accuracy, as well as a confusion matrix and a classification report.

We can see that the accuracy is 0.967 or about 97% on the hold out dataset.

The confusion matrix provides an indication of the errors made.

Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).

### 6.3 Complete Example

For reference, we can tie all of the previous elements together into a single script.

The complete example is listed below.
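The listing is absent from this copy, so here is a minimal sketch of the final step rather than the original script. It assumes scikit-learn's bundled iris data, an 80/20 split with `random_state=1`, and an SVM with `gamma="auto"` as seen in the snippets quoted in the comments below.

```python
# Fit the chosen model on the training data and evaluate it on the validation set
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

# train on the entire training dataset
model = SVC(gamma="auto")
model.fit(X_train, Y_train)

# predict the held-back rows
predictions = model.predict(X_validation)

# compare the predictions to the expected classes
accuracy = accuracy_score(Y_validation, predictions)
matrix = confusion_matrix(Y_validation, predictions)
print(accuracy)
print(matrix)
print(classification_report(Y_validation, predictions))
```

In the confusion matrix, rows are the true classes and columns are the predicted classes, so correct predictions sit on the diagonal.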

## You Can Do Machine Learning in Python

Work through the tutorial above. It will take you 5-to-10 minutes, max!

You do not need to understand everything (at least not right now). Your goal is to run through the tutorial end-to-end and get a result. You do not need to understand everything on the first pass. List your questions as you go. Make heavy use of the help("FunctionName") syntax in Python to learn about all of the functions that you’re using.

You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.

You do not need to be a Python programmer. The syntax of the Python language can be intuitive if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = “b”). This will get you most of the way. You are a developer, you know how to pick up the basics of a language real fast. Just get started and dive into the details later.

You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.

What about other steps in a machine learning project? We did not cover all of the steps in a machine learning project because this is your first project and we need to focus on the key steps. Namely, loading data, looking at the data, evaluating some algorithms and making some predictions. In later tutorials we can look at other data preparation and result improvement tasks.

## Summary

In this post, you discovered step-by-step how to complete your first machine learning project in Python.

You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.

Did you work through the tutorial?

1. Work through the above tutorial.
2. List any questions you have.
3. Search for or research the answers.
4. Remember, you can use help("FunctionName") in Python to get help on any function.

Do you have a question?
Post it in the comments below.

### 2,004 Responses to Your First Machine Learning Project in Python Step-By-Step

1. DR Venugopala Rao Manneni June 11, 2016 at 5:58 pm #

Awesome… But in your Blog please introduce SOM ( Self Organizing maps) for unsupervised methods and also add printing parameters ( Coefficients )code.

• Jason Brownlee June 14, 2016 at 8:17 am #

I generally don’t cover unsupervised methods like clustering and projection methods.

This is because I mainly focus on and teach predictive modeling (e.g. classification and regression) and I just don’t find unsupervised methods that useful.

• Rajesh January 21, 2018 at 5:33 pm #

Jason,
Can you elaborate what you don’t find unsupervised methods useful?

• Jason Brownlee January 22, 2018 at 4:42 am #

Because my focus is predictive modeling.

• hamdy November 19, 2018 at 8:04 am #

DeprecationWarning: the imp module is deprecated in favour of importlib; see the module’s documentation for alternative uses
what is the error?

• Jason Brownlee November 19, 2018 at 2:19 pm #

You can ignore this warning for now.

• Haider June 16, 2019 at 7:23 pm #

# Spot Check Algorithms
models = []
models.append((‘LR’, LogisticRegression(solver=’liblinear’, multi_class=’ovr’)))
models.append((‘LDA’, LinearDiscriminantAnalysis()))
models.append((‘KNN’, KNeighborsClassifier()))
models.append((‘CART’, DecisionTreeClassifier()))
models.append((‘NB’, GaussianNB()))
models.append((‘SVM’, SVC(gamma=’auto’)))
# evaluate each model in turn
results = []
names = []
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=seed)
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
print(msg)

ValueError Traceback (most recent call last)
in
13 for name, model in models:
14 kfold = model_selection.KFold(n_splits=10, random_state=seed)
—> 15 cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)

~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
400 fit_params=fit_params,
401 pre_dispatch=pre_dispatch,
–> 402 error_score=error_score)
403 return cv_results[‘test_score’]
404

~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
238 return_times=True, return_estimator=return_estimator,
239 error_score=error_score)
–> 240 for train, test in cv.split(X, y, groups))
241
242 zipped_scores = list(zip(*scores))

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
915 # remaining jobs.
916 self._iterating = False
–> 917 if self.dispatch_one_batch(iterator):
918 self._iterating = self._original_iterator is not None
919

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
760 return True
761

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
–> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 “””Schedule a func to be run”””
–> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
547 # Don’t delay the application, to avoid keeping the input
548 # arguments in memory
–> 549 self.results = batch()
550
551 def get(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
–> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in (.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
–> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):

~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
526 estimator.fit(X_train, **fit_params)
527 else:
–> 528 estimator.fit(X_train, y_train, **fit_params)
529
530 except Exception as e:

~\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
1284 X, y = check_X_y(X, y, accept_sparse=’csr’, dtype=_dtype, order=”C”,
1285 accept_large_sparse=solver != ‘liblinear’)
-> 1286 check_classification_targets(y)
1287 self.classes_ = np.unique(y)
1288 n_samples, n_features = X.shape

~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
169 if y_type not in [‘binary’, ‘multiclass’, ‘multiclass-multioutput’,
170 ‘multilabel-indicator’, ‘multilabel-sequences’]:
–> 171 raise ValueError(“Unknown label type: %r” % y_type)
172
173

ValueError: Unknown label type: ‘continuous’

• Vaisakh Nair January 5, 2022 at 6:14 pm #

Thanks jason ur teachings r really helpful more power to u thanks a ton…learning lots of predictive modelling from ur pages!!!

• James Carmichael January 6, 2022 at 10:51 am #

Thank you for your kind words and feedback, Vaisakh!

• Rasmi Bhattarai June 3, 2020 at 4:16 pm #

RandomForestClassifier : 1.0

• Aishwarya April 11, 2018 at 1:49 pm #

I got quite different results though i used same seed and splits

Svm : 0.991667 (0.025) with highest accuracy
KNN : 0.9833
CART : 0.9833
Why ?

• Aishwarya April 11, 2018 at 1:59 pm #

Im getting error saying

Cannot perform reduce with flexible type

While comparing algos using boxplots

• Jason Brownlee April 11, 2018 at 4:26 pm #

Sorry, I have not seen this error before. Are you able to confirm that your environment is up to date?

• Ycyusa August 5, 2018 at 9:31 am #

I followed your steps and I got the similar result as Aishwarya

SVM: 0.991667 (0.025000)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)

• Jason Brownlee April 11, 2018 at 4:25 pm #

The API may have changed since I wrote this post. This in turn may have resulted in small changes in predictions that are perhaps not statistically significant.

• Aishwarya April 11, 2018 at 10:50 pm #

Ive done this on kaggle.
Under ML kernal

http://Www.kaggle.com/aishuvenkat09

• Aishwarya April 11, 2018 at 10:54 pm #
• Jason Brownlee April 12, 2018 at 8:43 am #

Well done!

• manohar April 23, 2018 at 6:49 pm #

Hi ,
I have same issues with above our friends discussed
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.983333 (0.033333)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)

In that svm has more accuracy when comapre to rest

• Jason Brownlee April 24, 2018 at 6:26 am #

Yes.

• Ali May 10, 2018 at 8:58 am #

Yes. I got the same. Dr. Jason had mentioned that results might vary.

• Sai Prasad September 14, 2018 at 5:08 pm #

I also have the same result.
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.983333 (0.033333)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)

• bharat May 19, 2018 at 9:45 pm #

cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)

sir i am getting error in this in of code.What should i do?

• Jason Brownlee May 20, 2018 at 6:38 am #

What error?

• sawsen November 12, 2019 at 8:38 pm #

File “”, line 1, in
NameError: name ‘model’ is not defined

• Jason Brownlee November 13, 2019 at 5:40 am #

Looks like you may have missed a few lines of code.

Perhaps try copy-pasting the complete example at the end of each section?

• AVNEESH UPADHAYAY June 25, 2018 at 5:00 am #

I think cv may be equal to the number of times you want to perform k-fold cross validation for e.g. 10,20etc. and in scoring parameter, you need to mention which type of scoring parameter you want to use for example ‘accuracy’.
Hope this might help….

• Ved Anshu September 21, 2018 at 4:20 pm #

Bro kindly use train_test_split() in the place of model_selection

• David H. October 17, 2019 at 10:36 am #

Try this
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=None)

It worked for me!

• Bibhu Das December 11, 2019 at 12:57 am #

put the kfold = , and cv_results = , part inside the for loop it will work fine.

• Mohammed March 25, 2019 at 2:54 pm #

thank you so much really its very useful

in the last step you are used KNN to make predictions why you are used KNN can we use SVM
and can we make compare with all the models in predictions ?

• Jason Brownlee March 26, 2019 at 7:58 am #

It is just an example, you can make predictions with any model you wish.

Often we prefer simpler models (like knn) over more complex models (like svm).

• TAPSOBA Abdou March 20, 2020 at 11:17 pm #

Hi Jason
I followed your steps but I’m getting error. What should I do? Best regards
>>> # Spot Check Algorithms
… models = []
>>> models.append((‘LR’, LogisticRegression(solver=’liblinear’, multi_class=’ovr’)))
>>> models.append((‘LDA’, LinearDiscriminantAnalysis()))
>>> models.append((‘KNN’, KNeighborsClassifier()))
>>> models.append((‘CART’, DecisionTreeClassifier()))
>>> models.append((‘NB’, GaussianNB()))
>>> models.append((‘SVM’, SVC(gamma=’auto’)))
>>> # evaluate each model in turn
… results = []
>>> names = []
>>> for name, model in models:
… kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
File “”, line 2
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
^
IndentationError: expected an indented block
>>> cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=’accuracy’)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘model’ is not defined
>>> results.append(cv_results)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘cv_results’ is not defined

• Dario Gomez January 3, 2021 at 3:25 pm #

Could you elaborate a bit more about the difference between prediction and projection?

For example I got a data set that I collected throughout a year, and I would like to predict/project what will happen next year.

• Shantanu Bhayre March 22, 2021 at 3:27 am #

sir i want to work on crop prices data for crop price pridiction project for my minor project but the crop price data does not find plese help me sir and send me crop price csv file link

• Sophie May 4, 2021 at 4:39 am #

Hello Jason,
Thank you for this amazing tutorial, it helped me to gain confidence:
LR: 0.941667 (0.065085)
LDA: 0.975000 (0.038188)
KNN: 0.958333 (0.041667)
NB: 0.950000 (0.055277)
SVM: 0.983333 (0.033333)

predictions: [‘Iris-setosa’ ‘Iris-versicolor’ ‘Iris-versicolor’ ‘Iris-setosa’
‘Iris-virginica’ ‘Iris-versicolor’ ‘Iris-virginica’ ‘Iris-setosa’
‘Iris-setosa’ ‘Iris-virginica’ ‘Iris-versicolor’ ‘Iris-setosa’
‘Iris-virginica’ ‘Iris-versicolor’ ‘Iris-versicolor’ ‘Iris-setosa’
‘Iris-versicolor’ ‘Iris-versicolor’ ‘Iris-setosa’ ‘Iris-setosa’
‘Iris-versicolor’ ‘Iris-versicolor’ ‘Iris-virginica’ ‘Iris-setosa’
‘Iris-virginica’ ‘Iris-versicolor’ ‘Iris-setosa’ ‘Iris-setosa’
‘Iris-versicolor’ ‘Iris-virginica’]
0.9666666666666667
[[11 0 0]
[ 0 12 1]
[ 0 0 6]]
precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 11
Iris-versicolor 1.00 0.92 0.96 13
Iris-virginica 0.86 1.00 0.92 6

accuracy 0.97 30
macro avg 0.95 0.97 0.96 30
weighted avg 0.97 0.97 0.97 30

• Stone Bridge August 10, 2021 at 4:24 pm #

The program runs through, but the calculated result is that CART and SVM have the highest accuracy
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.053359)
KNN: 0.983333 (0.050000)
CART: 0.991667 (0.025000)
NB: 0.975000 (0.038188)
SVM: 0.991667 (0.025000)

• Adrian Tam August 11, 2021 at 6:39 am #

Nice work. Thanks.

• Hasnain July 8, 2017 at 8:55 pm #

I have installed all libraries that were in your How to Setup Python environment… blog. All went fine, but when I run the starting imports code I get an error at the first line: “ModuleNotFoundError: No module named ‘pandas’”. But I did install it using the “pip install pandas” command. I am working on a Windows machine.

• Jason Brownlee July 9, 2017 at 10:53 am #

Sorry to hear that. Consider rebooting your machine?

• Sheila Dawn August 9, 2017 at 5:43 am #

Then I decided to put the two commands in one python file, it solved problem. 🙂

• Jason Brownlee August 9, 2017 at 6:43 am #

Yes, all commands go in the one file. Sorry for the confusion.

• Dan Fiorino July 16, 2017 at 2:37 am #

Hasnain, try setting the environment variables PYTHONPATH and PATH to include the path to the site-packages directory of the version of Python you have permission to alter:

export PYTHONPATH="$PYTHONPATH:/path/to/Python/2.7/site-packages/"
export PATH="$PATH:/path/to/Python/2.7/site-packages/"

obviously replacing “/path/to” with the actual path. My system Python is in my /Users//Library folder but I’m on a Mac.

You can add the export lines to a script that runs when you open a terminal (“~/.bash_profile” if you use BASH).

• Jason Brownlee July 16, 2017 at 8:00 am #

Thanks for posting the tip Dan, I hope it helps.
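
A "No module named ‘pandas’" error after a successful pip install often means that pip installed into a different Python than the one running the script. The following is a minimal diagnostic sketch using only the standard library; the helper name describe_module is made up for illustration.

```python
import importlib.util
import sys


def describe_module(name):
    """Report which interpreter is running and whether it can see `name`."""
    spec = importlib.util.find_spec(name)
    return {
        "interpreter": sys.executable,
        "module": name,
        "found": spec is not None,
        "location": getattr(spec, "origin", None),
    }


if __name__ == "__main__":
    info = describe_module("pandas")
    print(info)
    if not info["found"]:
        # Installing through the same interpreter avoids a pip/python mismatch.
        print('Try: "%s" -m pip install pandas' % sys.executable)
```

If the module is reported as not found, running pip through the interpreter path that was printed should install the package where that interpreter looks for it.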

• Jason Robinette September 7, 2017 at 11:16 am #

got it to work have no idea how but it worked! I am like the kid at t-ball that closes his eyes and takes a swing!

• Jason Brownlee September 7, 2017 at 12:58 pm #

• Tanya September 30, 2017 at 11:08 am #

I am starting at square 0, and after clearing a first few hurdles, I was not even able to install the libraries at all… (as a newb), I didn’t see where I even GO to import this:
import pandas
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

• Jason Brownlee October 1, 2017 at 9:04 am #

https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/

• KASINATH PS December 7, 2017 at 8:16 pm #

if u r using python 3

save all the commands as a py file
then in a python shell enter

if u open shell in the same path as the saved thing
then u only need to enter the filename alone

ex:
lets say i saved it as load.py

then

this will execute all commands in the current shell

• Rahul December 7, 2017 at 10:28 pm #

Hi Tanya,
This tutorial is so intuitive that I went through this tutorial with a breeze.
Install PIP (The de-facto python package manager) and then click “Terminal” in PyCharm to bring up the interactive DOS like terminal. Once you have installed PIP then there you can issue the following commands:
pip install numpy
pip install scipy
pip install matplotlib
pip install pandas
pip install scikit-learn
All other steps in the tutorial are valid and do not need a single line of change apart from where it’s mentioned

from pandas.tools.plotting import scatter_matrix , change it to

from pandas.plotting import scatter_matrix
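
For scripts that have to run on both old and new pandas installs, the import change described above can be written defensively. This is a small sketch, assuming the move from pandas.tools.plotting to pandas.plotting happened in pandas 0.20:

```python
try:
    # Location in pandas 0.20 and later.
    from pandas.plotting import scatter_matrix
except ImportError:
    # Fallback for older pandas releases.
    from pandas.tools.plotting import scatter_matrix

print(scatter_matrix.__name__)
```

Upgrading pandas remains the simpler fix where that is possible.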

• Jason Brownlee December 8, 2017 at 5:39 am #

Thanks for the tips Rahul.

• Murtaza December 17, 2017 at 11:05 am #

For a beginner i believe Anacondas Jupyter notebooks would be the best option. As they can include markdown for future reference which is essential as beginner (backpropogation :p). But again varies person to person

• Jason Brownlee December 18, 2017 at 5:19 am #

I find notebooks confuse beginners more than help.

Running a Python script on the command line is so much simpler.

• Jason March 1, 2018 at 4:18 pm #

Except for me, on Debian Stretch with pandas 0.19.2, I had to use

from pandas.tools.plotting import scatter_matrix

• Jason Brownlee March 2, 2018 at 5:30 am #

You must update your version of Pandas.

• avanish March 25, 2018 at 7:11 pm #

use jupyter notebook …there all the essential libraries are preinstalled

• Anmoldeep1509 October 31, 2018 at 6:50 am #

I also did a similar mistake, I am also a newbie to python, and wrote those import statements in the separate file, and imported the created file, without knowing how imports work…after your reply realized my mistake and now back on track thanks!

• Tushar June 22, 2018 at 4:50 am #

I also had problems installing modules on windows. Although, there was no error of any kind if installed from PyCharm IDE.
Also, use a 32-bit Python interpreter if you want to use NLTK. It can be done on the 64-bit version, but it was not worth the time it would need.

• Karan sing March 26, 2019 at 8:28 pm #

If you are working on virtual environment then you have to make script first and run it by activating the virtual environment,
If you are not working on virtual environment then run your scripts on time

• Yuvraj July 13, 2018 at 1:56 am #

Could you please go into the mathematical concept behind KNN and why the accuracy resulted in the highest score? Thank you

• Mario October 4, 2018 at 8:13 pm #

I like your tutorial for the machine learning in python but at this moment I am stuck. Here is where I am
# Compare Algorithms
fig = plt.figure()
fig.suptitle(‘Algorithm Comparison’)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

This is the answer I am getting from it

TypeError Traceback (most recent call last)
in ()
3 fig.suptitle(‘Algorithm Comparison’)
—-> 5 plt.boxplot(results)
6 ax.set_xticklabels(names)
7 plt.show()

~\Anaconda3\lib\site-packages\matplotlib\pyplot.py in boxplot(x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_xticks, autorange, zorder, hold, data)
2846 whiskerprops=whiskerprops,
2847 manage_xticks=manage_xticks, autorange=autorange,
-> 2848 zorder=zorder, data=data)
2849 finally:
2850 ax._hold = washold

~\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
1853 “the Matplotlib list!)” % (label_namer, func.__name__),
1854 RuntimeWarning, stacklevel=2)
-> 1855 return func(ax, *args, **kwargs)
1856

~\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in boxplot(self, x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_xticks, autorange, zorder)
3555
3556 bxpstats = cbook.boxplot_stats(x, whis=whis, bootstrap=bootstrap,
-> 3557 labels=labels, autorange=autorange)
3558 if notch is None:
3559 notch = rcParams[‘boxplot.notch’]

~\Anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in boxplot_stats(X, whis, bootstrap, labels, autorange)
1839
1840 # arithmetic mean
-> 1841 stats[‘mean’] = np.mean(x)
1842
1843 # medians and quartiles

~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py in mean(a, axis, dtype, out, keepdims)
2955
2956 return _methods._mean(a, axis=axis, dtype=dtype,
-> 2957 out=out, **kwargs)
2958
2959

~\Anaconda3\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
68 is_float16_result = True
69
—> 70 ret = umr_sum(arr, axis, dtype, out, keepdims)
71 if isinstance(ret, mu.ndarray):
72 ret = um.true_divide(

TypeError: cannot perform reduce with flexible type

HOW CAN I FIX THIS?

• Jason Brownlee October 5, 2018 at 5:33 am #

Perhaps post your code and error to stackoverflow.com?

• Brandon January 23, 2019 at 4:37 pm #

I also got a traceback on this section:
TypeError: cannot perform reduce with flexible type

A quick check on stackoverflow shows that plt.boxplot() cannot accept strings. Personally, I had an error in section 5.4 line 15.

Wrong code: results.append(results)
Correct: results.append(cv_results)

woohoo for tracebacks and wrong data-types. Hope someone finds this helpful.
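
A quick way to catch this kind of mistake before plotting is to confirm that every entry in results is a numeric array. The sketch below uses made-up fold accuracies, and the helper name numeric_results is hypothetical:

```python
import numpy as np


def numeric_results(results):
    """Return True if every entry in `results` is a numeric array of scores."""
    return all(
        np.issubdtype(np.asarray(scores).dtype, np.number) for scores in results
    )


cv_results = np.array([0.92, 1.00, 0.92])  # hypothetical fold accuracies
names = ["LR"]

good = [cv_results]  # what results.append(cv_results) builds
bad = [names]        # strings ended up where the scores should be

print(numeric_results(good))
print(numeric_results(bad))
```

If the check fails, plt.boxplot(results) is likely to raise the same "cannot perform reduce with flexible type" error.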

• Jason Brownlee January 24, 2019 at 6:40 am #

Are you able to confirm that your python libraries are up to date?

• Ademola November 27, 2018 at 7:49 am #

Well done

• Meca April 1, 2021 at 12:38 am #

Thank you sir!

2. Jan de Lange June 20, 2016 at 10:43 pm #

Nice work Jason. Of course there is a lot more to tell about the code and the models applied if this is intended for people starting out with ML (like me). Rather than telling which “button to press” to make it work, it would be nice to know why also. I looked at a sample of your book (advanced) to see if you are covering the why also, but it looks like it’s limited?

On this particular example, in my case SVM reached 99.2% and was thus the best Model. I gather this is because the test and training sets are drawn randomly from the data.

• Jason Brownlee June 21, 2016 at 7:04 am #

This tutorial and the book are laser focused on how to use Python to complete machine learning projects.

They already assume you know how the algorithms work.

If you are looking for background on machine learning algorithms, take a look at this book:
https://machinelearningmastery.com/master-machine-learning-algorithms/

• Alan July 26, 2017 at 10:50 pm #

Jan de Lange and Jason,

Before anything else, I truly like to thank Jason for this wonderful, concise and practical guideline on using ML for solving a predictive problem.

In terms of the example you have provided, I can confirm Jan de Lange’s outcome. I’ve got the same accuracy result for SVM (0.991667 to be precise). I’ve just upgraded the Canopy version I had installed on my machine to version 2.1.3.3542 (64 bit), and your reasoning makes sense that this discrepancy could be because of the random selection of data. But this procedure could open up a new ‘can of worms’, as some say, since the selection of the best model is on the line.

Thank you again Jason for this practical article on ML.

• Per December 15, 2017 at 7:36 pm #

Got it working too, changing the scatter_matrix import like Rahul did.
But I also had to install tkinter first (yum install tkinter).

Very nice tutorial, Jason!

3. Nil June 25, 2016 at 12:42 am #

Awesome, I have tested the code it is impressive. But how could I use the model to predict if it is Iris-setosa or Iris-versicolor or Iris-virginica when I am given some values representing sepal-length, sepal-width, petal-length and petal-width attributes?

• Jason Brownlee June 25, 2016 at 5:09 am #

Great question. You can call model.predict() with some new data.

For an example, see Part 6 in the above post.
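
As a rough illustration of calling predict() on a single new flower, here is a minimal sketch. It assumes scikit-learn's bundled copy of the iris data (load_iris) instead of the CSV used in the post, so the class names come out as "setosa" and not "Iris-setosa", and it fits on all rows purely to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
model = KNeighborsClassifier()
model.fit(iris.data, iris.target)

# One new flower: sepal-length, sepal-width, petal-length, petal-width (cm).
new_flower = np.array([[5.1, 3.5, 1.4, 0.2]])

# predict() expects a 2D array (one row per flower) and returns one label per row.
predicted_index = model.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_index]
print(predicted_name)
```

The input must have the same four columns, in the same order, as the data the model was trained on.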

• JamieFox March 28, 2017 at 6:38 am #

Dear Jason Brownlee, I was thinking about the same question as Nil. To be precise, I was wondering how I can know, after having seen that my model has a good fit, which values of sepal-length, sepal-width, petal-length and petal-width correspond to Iris-setosa, etc.
For instance, if I have p predictors and two classes, how can I know which values of the predictors belong to one class or the other? Knowing the values of the predictors would allow me to use the model in day-to-day operations. Thx

• Jason Brownlee March 28, 2017 at 8:27 am #

Not knowing the statistical relationship between inputs and outputs is one of the down sides of using neural networks.

• JamieFox March 29, 2017 at 7:03 am #

Hi Mr Jason Brownlee, thks for your answer. So all algorithms, such as SVM, LDA, random forest.. have this drawbacks? Can you suggest me something else?
Because logistic regression is not like this, or am I wrong?

• Jason Brownlee March 29, 2017 at 9:14 am #

All algorithms have limitations and assumptions. For example, Logistic Regression makes assumptions about the distribution of variates (Gaussian) and more:
https://en.wikipedia.org/wiki/Logistic_regression

Nevertheless, we can make useful models (skillful) even when breaking assumptions or pushing past limitations.

4. Sujon September 6, 2016 at 8:19 am #

Dear Sir,

It seems I’m in the right place in right time! I’m doing my master thesis in machine learning from Stockholm University. Could you give me some references for laughter audio conversation to CSV file? You can send me anything on [email protected]. Thanks a lot and wish your very best and will keep in touch.

5. Sujon September 6, 2016 at 8:32 am #

Sorry I mean laughter audio to CSV conversion.

• Jason Brownlee September 6, 2016 at 9:49 am #

Sorry, I have not seen any laughter audio to CSV conversion tools/techniques.

• Sujon May 10, 2017 at 1:02 pm #

Hi again, do you have any publication of this article “Your First Machine Learning Project in Python Step-By-Step”? Or any citation if you know? Thanks.

• Jason Brownlee May 11, 2017 at 8:28 am #

No, you can reference the blog post directly.

6. Roberto U September 19, 2016 at 9:17 am #

Sweet way of condensing monstrous amount of information in a one-way street. Thanks!

Just a small thing, you are creating the Kfold inside the loop in the cross validation. Then, you use the same seed to keep the comparison across predictors constant.

That works, but I think it would be better to take it out of the loop. Not only is it more efficient, but it is also immediately clearer that all predictors are using the same Kfold.

You can still justify the use of the seeds in terms of replicability; readers getting the same results on their machines.

Thanks again!

• Jason Brownlee September 20, 2016 at 8:27 am #

Great suggestion, thanks Roberto.
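
Roberto's suggestion could look something like the sketch below, which builds the splitter once and reuses it for every model. It assumes a recent scikit-learn and the bundled load_iris data in place of the tutorial's CSV, and it compares only two of the models for brevity.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("KNN", KNeighborsClassifier()),
]

# Created once, outside the loop: every model is scored on the same folds.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

results = {}
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results[name] = scores
    print("%s: %f (%f)" % (name, scores.mean(), scores.std()))
```

Because random_state is a fixed integer, each call to the splitter produces the same folds, so the comparison stays fair and repeatable.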

7. Francisco September 20, 2016 at 2:02 am #

Hello Jason.
Thank you so much for your help with Machine Learning and congratulations for your excellent website.

I am a beginner in ML and DeepLearning. Should I download Python 2 or Python 3?

Thank you very much.

Francisco

• Jason Brownlee September 20, 2016 at 8:33 am #

I use Python 2 for all my work, but my students report that most of my examples work in Python 3 with little change.

8. ShawnJ October 11, 2016 at 5:24 am #

Jason,

Thank you so much for putting this together. I have been a software developer for almost two decades and am getting interested in machine learning. Found this tutorial accurate, easy to follow and very informative.

• Jason Brownlee October 11, 2016 at 7:24 am #

Thanks ShawnJ, I’m glad you found it useful.

9. Wendy G October 14, 2016 at 5:37 am #

Jason,

Thanks for the great post! I am trying to follow this post by using my own dataset, but I keep getting this error “Unknown label type: array ([some numbers from my dataset])”. So what’s the problem on earth, any possible solutions?

Thanks,

• Jason Brownlee October 14, 2016 at 9:08 am #

Hi Wendy,

Carefully check your data. Maybe print it on the screen and inspect it. You may have some string values that you may need to convert to numbers using data preparation.

10. fara October 20, 2016 at 7:15 am #

hi thanks for great tutorial, i’m also new to ML…this really helps but i was wondering what if we have non-numeric values? i have mixture of numeric and non-numeric data and obviously this only works for numeric. do you also have a tutorial for that or would you please send me a source for it? thank you

• Jason Brownlee October 20, 2016 at 8:41 am #

Great question fara.

We need to convert everything to numeric. For categorical values, you can convert them to integers (label encoding) and then to new binary features (one hot encoding).
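
The two encodings mentioned above can be sketched as follows. The tiny DataFrame and its column names are invented for illustration. LabelEncoder is documented mainly for target labels, so treat its use on a feature column here as a convenience and not a recommendation.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mixed data: one categorical column, one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": [1.0, 2.5, 3.0, 2.0],
})

# Label encoding: each category becomes an integer (classes are sorted alphabetically).
encoder = LabelEncoder()
df["color_code"] = encoder.fit_transform(df["color"])

# One hot encoding: each category becomes its own binary column.
one_hot = pd.get_dummies(df["color"], prefix="color")

print(df)
print(one_hot)
```

One hot encoding avoids implying an order between categories, which integer codes alone can suggest to some algorithms.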

11. Mazhar Dootio October 23, 2016 at 9:14 pm #

Hello Jason
Thank you for publishing this great machine learning tutorial.
It is really awesome awesome awesome………..!
I tested your tutorial on Python 3 and it works well, but the problem I face is loading my dataset from my local drive. I followed your given instructions but was not successful.
My syntax is as under:

import unicodedata
names = [‘class’, ‘sno’, ‘gender’, ‘morphology’, ‘stem’,’fword’]

The Python 3 Jupyter notebook does not load this. Kindly help me in this regard.

• Jason Brownlee October 24, 2016 at 7:05 am #

Hi Mazhar, thanks.

Are you able to load the file on the command line away from the notebook?

Perhaps the notebook environment is causing trouble?

• Kenny October 11, 2017 at 3:43 am #

Mazhar try this:

import pandas as pd
.
.
.

file = "namefile.csv"  # or c:/____/___/

in Jupyter

12. Mazhar Dootio October 25, 2016 at 3:22 am #

Dear Jason
Thank you for response
I am using Python 3 with anaconda jupyter notebook
So which Python version would you suggest, and could you kindly write here the syntax for opening a local dataset file, i.e. how I can load a UTF-8 dataset file from my local drive?

• Jason Brownlee October 25, 2016 at 8:32 am #

Hi Mazhar, I teach using Python 2.7 with examples from the command line.

Many of my students report that the code works in Python 3 and in notebooks with little or no changes.

• Kenny October 11, 2017 at 3:50 am #

try with this command:

df = pd.read_csv(file, encoding='latin-1')  # if your CSV uses a separator other than ",", such as ";" or "|", pass it with sep, e.g. sep='|'
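
For a UTF-8 file on a local drive, a minimal sketch could look like this. The column names are taken from Mazhar's comment. The file name, the sample row and the temporary directory are made up so that the example can run on its own.

```python
import os
import tempfile

import pandas as pd

names = ["class", "sno", "gender", "morphology", "stem", "fword"]

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a real local file such as C:/data/dataset.csv.
    path = os.path.join(tmp, "dataset.csv")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("noun,1,m,simple,café,cafés\n")

    # encoding tells pandas how the bytes on disk map to characters.
    dataset = pd.read_csv(path, names=names, encoding="utf-8")

print(dataset.shape)
print(dataset.head())
```

If pandas raises a UnicodeDecodeError, the file is probably saved in a different encoding, and that encoding has to be passed in place of "utf-8".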

13. Andy October 27, 2016 at 11:59 pm #

Great tutorial but perhaps I’m missing something here. Let’s assume I already know what model to use (perhaps because I know the data well… for example).

knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)

I then use the models to predict:
print(knn.predict(an array of variables of a record I want to classify))

Is this where the whole ML happens?
knn.fit(X_train, Y_train)

What’s the difference between this and say a non ML model/algorithm? Is it that in a non ML model I have to find the coefficients/parameters myself by statistical methods?; and in the ML model the machine does that itself?
If this is the case then to me it seems that a researcher/coder did most of the work for me and wrap it in a nice function. Am I missing something? What is special here?

• Jason Brownlee October 28, 2016 at 9:14 am #

Hi Andy,

Yes, your comment is generally true.

The work is in the library and choice of good libraries and training on how to use them well on your project can take you a very long way very quickly.

Stats is really about small data and understanding the domain (descriptive models). Machine learning, at least in common practice, is leaning towards automation with larger datasets and making predictions (predictive modeling) at the expense of model interpretation/understandability. Prediction performance trumps traditional goals of stats.

Because of the automation, the focus shifts more toward data quality, problem framing, feature engineering, automatic algorithm tuning and ensemble methods (combining predictive models), with the algorithms themselves taking more of a backseat role.

Does that make sense?

• Andy November 3, 2016 at 10:36 pm #

It does make sense.
You mentioned ‘data quality’. That’s currently my field of work. I’ve been doing this statistically until now, and I am very keen to try a different approach. As a practical example, how would you use ML instead of stats to spot an error/outlier?
Let’s say I have a large dataset containing trees: each tree record contains a species, height, location, crown size, age, etc. (ah! suspiciously similar to the iris flowers dataset 🙂). Is ML a viable method for finding incorrect data and replacing it with an “estimated” value? The answer, I guess, is yes. For species I could use an almost identical method to what you presented here, BUT what about continuous values such as tree height?

• Jason Brownlee November 4, 2016 at 9:08 am #

Hi Andy,

Maybe “outliers” are instances that cannot be easily predicted or assigned ambiguous predicted probabilities.

Instance values can be “fixed” by estimating new values, but whole instance can also be pulled out if data is cheap.

14. Shailendra Khadayat October 30, 2016 at 2:23 pm #

Awesome work Jason. This was very helpful and expect more tutorials in the future.

Thanks.

15. Shuvam Ghosh November 16, 2016 at 12:13 am #

Awesome work. Students need to know how the end results will look like. They need to get motivated to learn and one of the effective means of getting motivated is to be able to see and experience the wonderful end results. Honestly, if i were made to study algorithms and understand them i would get bored. But now since i know what amazing results they give, they will serve as driving forces in me to get into details of it and do more research on it. This is where i hate the orthodox college ways of teaching. First get the theory right then apply. No way. I need to see things first to get motivated.

• Jason Brownlee November 16, 2016 at 9:29 am #

Thanks Shuvam,

I’m glad my results-first approach gels with you. It’s great to have you here.

16. Puneet November 17, 2016 at 12:08 am #

Thanks Jason,

while i am trying to complete this.

# Spot Check Algorithms
models = []
models.append((‘LR’, LogisticRegression()))
models.append((‘LDA’, LinearDiscriminantAnalysis()))
models.append((‘KNN’, KNeighborsClassifier()))
models.append((‘CART’, DecisionTreeClassifier()))
models.append((‘NB’, GaussianNB()))
models.append((‘SVM’, SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
print(msg)

showing below error.-

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
^
IndentationError: expected an indented block-

17. Puneet November 17, 2016 at 12:30 am #

Thanks Jason,

I am new to ML and need your help so I can run this.

I have followed the steps, but I run into a problem when trying to build and evaluate 5 models using this.

—————————————-
# Spot Check Algorithms
models = []
models.append((‘LR’, LogisticRegression()))
models.append((‘LDA’, LinearDiscriminantAnalysis()))
models.append((‘KNN’, KNeighborsClassifier()))
models.append((‘CART’, DecisionTreeClassifier()))
models.append((‘NB’, GaussianNB()))
models.append((‘SVM’, SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
print(msg)
————————————————————————————————

facing below mentioned issue.
File “”, line 13
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
^
IndentationError: expected an indented block

—————————————
Kindly help.

• Martin November 18, 2016 at 5:18 am #

Puneet, you need to indent the block (tab or four spaces to the right). That is the way of building a block in Python

• Casey December 2, 2018 at 3:58 am #

I am also having this problem, I have indented the code as instructed but nothing executes. It seems to be waiting for more input. I have googled different script endings but nothing happens. Is there something I am missing to execute this script?

>>> for name, model in models:
… kfold = model_selection.KFold(n_splits=10, random_state=seed)
… cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
… results.append(cv_results)
… names.append(name)
… msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
… print(msg)
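
Both problems in this thread come down to how Python marks the body of a loop. Here is a minimal sketch with placeholder scores and no scikit-learn, to show only the indentation:

```python
# Hypothetical (name, score) pairs standing in for the real models.
models = [("LR", 0.94), ("KNN", 0.96)]

results = []
names = []
for name, score in models:
    # Every line of the loop body is indented by four spaces.
    results.append(score)
    names.append(name)
    print("%s: %f" % (name, score))

# Back at the left margin: the loop has ended.
print(names)
```

When typing a loop at the interactive >>> prompt, the interpreter keeps showing "..." until it receives an empty line. Pressing Enter once more on a blank line ends the block and runs it. Saving the code in a .py file and running that file avoids the issue.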

18. george soilis November 17, 2016 at 10:00 pm #

just another Python noob here,sending many regards and thanks to Jason :):)

• Jason Brownlee November 18, 2016 at 8:22 am #

Thanks george, stick with it!

19. sergio November 22, 2016 at 3:29 pm #

Does this tutorial work with other data sets? I’m trying to work on a small assignment and I want to use python

• Jason Brownlee November 23, 2016 at 8:50 am #

It should provide a great template for new projects sergio.

• Brian February 28, 2018 at 4:10 am #

I tried to use another dataset. I am not sure what I imported, but even after changing the names, I still get the petal stuff as output. All of it. I commented out that part of the code and even then it gives me those old outputs.

20. Albert November 26, 2016 at 1:55 am #

Very Awesome step by step for me ! Even I am beginner of python , this gave me many things about Machine learning ~ supervised ML. Appreciate of your sharing !!

• Jason Brownlee November 26, 2016 at 10:38 am #

I’m glad to hear that Albert.

21. Umar Yusuf November 27, 2016 at 4:04 am #

Thank you for the step by step instructions. This will go a long way for newbies like me getting started with machine learning.

• Jason Brownlee November 27, 2016 at 10:21 am #

You’re welcome, I’m glad you found the post useful Umar.

• Shiva Andure March 18, 2019 at 3:08 pm #

Hello Jason,

from __future__ import division
models = []
models.append((‘LR’, LogisticRegression(solver=’liblinear’, multi_class=’ovr’)))
models.append((‘LDA’, LinearDiscriminantAnalysis()))
models.append((‘KNN’, KNeighborsClassifier()))
models.append((‘CART’, DecisionTreeClassifier()))
models.append((‘NB’, GaussianNB()))
models.append((‘SVM’, SVC(gamma=’auto’)))
# evaluate each model in turn
results = []
names = []
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=seed)
cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
print(msg)

I am getting the error “ZeroDivisionError: float division by zero”

22. Mike P November 30, 2016 at 6:29 pm #

Hi Jason,

Really nice tutorial. I had one question which has had me confused. Once you chose your best model, (in this instance KNN) you then train a new model to be used to make predictions against the validation set. should one not perform K-fold cross-validation on this model to ensure we don’t overfit?

if this is correct how would you implement this, from my understanding cross_val_score will not allow one to generate a confusion matrix.

I think this is the only thing that I have struggled with in using scikit learn if you could help me it would be much appreciated?

• Jason Brownlee December 1, 2016 at 7:26 am #

Hi Mike. No.

Cross-validation is just a method to estimate the skill of a model on new data. Once you have the estimate you can get on with things, like confirming you have not fooled yourself (hold out validation dataset) or make predictions on new data.

The skill you report is the cross val skill with the mean and stdev to give some idea of confidence or spread.

Does that make sense?

• Mike December 2, 2016 at 1:30 am #

Hi Jason,

Thanks for the quick response. So to make sure I understand, one would use cross validation to get an estimate of the skill of a model (mean of cross val scores) or choose the correct hyper parameters for a particular model.

Once you have this information you can just go ahead and train the chosen model with the full training set and test it against the validation set or new data?

• Jason Brownlee December 2, 2016 at 8:17 am #

Hi Mike. Correct.

Additionally, if the validation result confirms your expectations, you can go ahead and train the model on all data you have including the validation dataset and then start using it in production.

This is a very important topic. I think I’ll write a post about it.
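
The workflow discussed in this thread can be sketched roughly as follows: estimate skill with cross-validation on the training split, then fit once on that split and check the hold-out set, which is where the confusion matrix comes from. This assumes a recent scikit-learn and uses load_iris in place of the tutorial's CSV.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=1
)

model = KNeighborsClassifier()

# Step 1: cross-validation on the training split only, to estimate skill.
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
cv_scores = cross_val_score(model, X_train, Y_train, cv=kfold, scoring="accuracy")
print("CV accuracy: %f (%f)" % (cv_scores.mean(), cv_scores.std()))

# Step 2: fit on the whole training split and check the hold-out set.
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))

matrix = confusion_matrix(Y_validation, predictions, labels=[0, 1, 2])
print(matrix)
```

cross_val_score fits and scores clones of the estimator internally, so the final fit call is still needed before predicting.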

23. Sahana Venkatesh November 30, 2016 at 8:15 pm #

This is amazing 🙂 You boosted my morale

• Jason Brownlee December 1, 2016 at 7:26 am #

I’m so glad to hear that Sahana.

24. Jhon November 30, 2016 at 8:27 pm #

Hi
While doing the data visualization and running the dataset.plot(……..) commands, I am getting the following error. Kindly tell me how to fix it.

array([[,
],
[,
]], dtype=object)

• Jason Brownlee December 1, 2016 at 7:28 am #

Looks like no data Jhon. It also looks like it’s printing out an object.

Are you running in a notebook or on the command line? The code was intended to be run directly (e.g. command line).

25. Brendon A. Kay December 1, 2016 at 4:20 am #

Hi Jason,

Great tutorial. I am a developer with a computer science degree and a heavy interest in machine learning and mathematics, although I don’t quite have the academic background for the latter except for what was required in college. So, this website has really sparked my interest as it has allowed me to learn the field in sort of the “opposite direction”.

I did notice when executing your code that there was a deprecation warning for the sklearn.cross_validation module. They recommend switching to sklearn.model_selection.

When switching the modules I adjusted the following line…

kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)

to…

kfold = model_selection.KFold(n_folds=num_folds, random_state=seed)

… and it appears to be working okay. Of course, I had switched all other instances of cross_validation as well, but it seemed to be that the KFold() method dropped the n (number of instances) parameter, which caused a runtime error. Also, I dropped the num_instances variable.

I could have missed something here, so please let me know if this is not a valid replacement, but thought I’d share!

Once again, great website!

• Jason Brownlee December 1, 2016 at 7:33 am #

Thanks for the support and the kind words Brendon. I really appreciate it (you made my day!)

Yes, the API has changed/is changing and your updates to the tutorial look good to me, except I think n_folds has become n_splits.

I will update this example for the new API very soon.
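
For reference, a KFold call under the model_selection API might look like the sketch below. The toy array, seed and fold count are placeholders. To my knowledge, recent scikit-learn releases also raise an error when random_state is set while shuffle is left at its default of False, so shuffle=True is included.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # toy data: 10 rows, 2 columns
seed = 7
num_folds = 5

# n_splits replaces the old n_folds; the number of instances is no longer passed.
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=seed)

# The data is supplied later, when split() is called.
fold_sizes = [len(test_index) for _, test_index in kfold.split(X)]
print(fold_sizes)
```

The splitter no longer needs to know the dataset size up front, which is why the num_instances variable could be dropped.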

• Brendon A. Kay December 1, 2016 at 8:01 am #

🙂 Now on to more tutorials for me!

• Jason Brownlee December 2, 2016 at 8:11 am #

You can access more here Brendon:
https://machinelearningmastery.com/start-here/

• Doug March 9, 2018 at 5:56 am #

Jason, is everything on your website on that page? or is there another site map?

thanks!

P.S. your code ran flawlessly on my Jupyter Notebook fwiw. Although I did get a different result with SVM coming out on top with 99.1667. So I ran the validation set with SVM and came out with 94 93 93 30 fwiw.

• Jason Brownlee March 9, 2018 at 6:29 am #

No, not everything, just a small and useful sample.

https://machinelearningmastery.com/randomness-in-machine-learning/

• Doug March 9, 2018 at 6:46 am #

26. Sergio December 1, 2016 at 3:41 pm #

I’m still having a little trouble understanding step 5.1. I’m trying to apply this tutorial to a new data set but, when I try to evaluate the models from 5.3 I don’t get a result.

• Jason Brownlee December 2, 2016 at 8:13 am #

What is the problem exactly Sergio?

Step 5.1 should create a validation dataset. You can confirm the dataset by printing it out.

Step 5.3 should print the result of each algorithm as it is trained and evaluated.

Perhaps check for a copy-paste error or something?

• sergio December 2, 2016 at 9:13 am #

Does this tutorial work the exact same way for other data sets? because I’m not using the Hello World dataset

• Jason Brownlee December 3, 2016 at 8:23 am #

The project template is quite transferable.

You will need to adapt it for your data and for the types of algorithms you want to test.

27. Jean-Baptiste Hubert December 11, 2016 at 12:17 am #

Hi Sir,
Thank you for the information.
I am currently a student, in Engineering school in France.
I am working on a data mining project; I have a lot of data (40 GB) on the stock prices of many companies in the CAC 40.
My goal is to predict the evolution of the yields and I think that a neural network could be useful.
My idea is: I take for X the yields from “t=0” to “t=n” and for Y the yields from “t=1” to “t=n”, and the program should find a relation between the data.
Is that possible? Is it a good way to predict the evolution of the yield?
Hubert
Jean-Baptiste

28. Ernest Bonat December 15, 2016 at 5:33 pm #

Hi Jason,

If I include a new item in the models array as:

models.append(('LNR - Linear Regression', LinearRegression()))

with the library:

from sklearn.linear_model import LinearRegression

I get an error in "\sklearn\utils\validation.py", line 529, in check_X_y
y = y.astype(np.float64)

as:

ValueError: could not convert string to float: 'Iris-setosa'

Let me know how best to fix that! As you can see from my code, I would like to include the Linear Regression algorithm in my models array too!

Ernest

• Jason Brownlee December 16, 2016 at 5:39 am #

Hi Ernest, it is a classification problem. We cannot use LinearRegression.

Try adding another classification algorithm to the list.

• oumaima December 9, 2017 at 11:29 am #

Hi Jason,
I am new to ML and need your help so I can run this.

>>> from matplotlib import pyplot
Traceback (most recent call last):
File “”, line 1, in
File “c:\python27\lib\site-packages\matplotlib\pyplot.py”, line 29, in
import matplotlib.colorbar
File “c:\python27\lib\site-packages\matplotlib\colorbar.py”, line 32, in
import matplotlib.artist as martist
File “c:\python27\lib\site-packages\matplotlib\artist.py”, line 16, in
from .path import Path
File “c:\python27\lib\site-packages\matplotlib\path.py”, line 25, in
from . import _path, rcParams
‘ImportError: DLL load failed: %1 n’est pas une application Win32 valide.\n’
(In English: “%1 is not a valid Win32 application.”)

29. Gokul Iyer December 20, 2016 at 2:29 pm #

Great tutorial! Quick question: when we create the models, we do models.append((name of algorithm, algorithm function)). Is models an array? It seems like a dictionary, since we have a key-value mapping (algorithm name and algorithm function). Thank you!

• Jason Brownlee December 20, 2016 at 2:47 pm #

It is a list of tuples where each tuple contains a string name and a model object.
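
A small sketch of that structure, using two of the tutorial's classifiers. The `models_by_name` dictionary is a hypothetical addition, shown only to contrast the list of tuples with a real key-value mapping.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

models = []
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))

# Each element is a (name, model) tuple, so it unpacks directly in a loop.
for name, model in models:
    print(name, type(model).__name__)

# If lookup by name is wanted, the same pairs convert to a dict.
models_by_name = dict(models)
```

The list keeps the models in the order they were appended, which is all the evaluation loop needs.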

30. Sasanka ghosh December 21, 2016 at 4:55 am #

Hi Jason / any gurus,
Good post and I will follow it, but my question may be a little off track.
I am asking this question as I am a data modeller and aspiring data architect.

I feel that as a guru you can clarify my doubt. The question is at the end.

In the current data management environment:

1. Data architecture/physical implementation and choosing appropriate tools: back end, storage, NoSQL, SQL, MPP, sharding, columnar, scale up/out, distributed processing, etc.

2. In addition to DB-based procedural languages, proficiency in at least one of the following: Java, Python, Scala, etc.

3. Then comes AI, machine learning, neural networks, etc.

My question is regarding point 3.

I believe those are algorithms which need deep functional knowledge and years of experience to add any value to a business.

They are independent of data models and their physical implementation, and part of the business user domain, not the data architecture domain.

If I take your above example, and say 10k users are now trying to do a similar kind of thing, then points 1 and 2 will be the data architect’s domain and point 3 will be the business analyst’s domain. Point 2 may overlap between them to some extent.

A data architect need not be hands-on/proficient in algorithms, i.e. should have just a basic idea, as a data architect’s job is not to invent business logic but to implement the business logic physically to satisfy business users/analysts.

Am I correct in my assumption? I find that certain things are nearly mutually exclusive, and expectations/benchmarks should be set right.

Regards
sasanka ghosh

• Jason Brownlee December 21, 2016 at 8:46 am #

Hi Sasanka, sorry, I don’t really follow.

Are you able to simplify your question?

• Sasanka ghosh December 21, 2016 at 9:25 pm #

Hi Jason ,
Many thanks that u bothered to reply .

I tried to rephrase and make it concise, but it is still verbose. Apologies for that.

Is it expected from a data architect to be algorithm expert as well as data model/database expert?

Algorithms are business centric as well as specific to particular domain of business most of the times.

Giving u an example i.e. SHORTEST PATH ( take it as just an example in making my point)
An organization is providing an app to provide that service .

CAVEAT:Someone may say from comp science dept that it is the basic thing u learn but i feel it is still an algorithm not a data structure .

If we take the above scenario in simplistic terms, the requirements are as follows:

1. There will be, say, a million registered users.
2. One can say at least 10% are using the app at the same time.
3. At any time they can change their direction as per contingency, like a military op, so dumping the partial weighted graph to their device is not an option, i.e. users will be connected to the main server/server cluster.
4. The challenge is storing the spatial data in the DB in the correct data model, with scale-out and fault tolerance.
5. Implement the shortest path algorithm and display it dynamically using Python/Java/Cypher/Oracle Spatial/Titan etc.

My question is: can a data architect work on this project who does not know the shortest path algorithm but has sufficient knowledge in other areas, if the algorithm is described to him/her in detail to implement?

I m asking this question as now a days people are offering ready made courses etc i.e. machine learning ,NLP,Data scientist etc. and the scenario is confusing
i feel it is misleading as no one can get expert in science overnight and vice versa.

I feel Algorithms are pure science that is a separate discipline .
But to implement it in large scale Scientists/programmers/architects needs to work in tandem with minimal overlapping but continuous discussion.

Last but not least, if I make some sense, what learning path should I follow to try to become a data architect for unstructured data in general?

regards
sasanka ghosh

• Jason Brownlee December 22, 2016 at 6:35 am #

Really this depends on the industry and the job. I cannot give you good advice for the general case.

You can get valuable results without being an expert, this applies to most fields.

Algorithms are a tool, use them as such. They can also be a science, but we practitioners don’t have the time.

I hope that helps.

• Sasanka ghosh December 22, 2016 at 7:00 pm #

Thanks Jason.

I appreciate your time and response .

I just wanted to validate from a real techie/guru like u as the confusion or no perfect answer are being exploited by management/HR to their own advantage and practice use and throw policy or make people sycophants/redundant without following the basic management principle,

The tech guys except ” few geniuses” are always toiling and management is opening the cork, enjoying at the same time .

Regards
sasanka ghosh

31. Raveen Sachintha December 21, 2016 at 8:51 pm #

Hello Jason,
Thank you very much for these tutorials. I am new to ML and I find it very encouraging to do an end-to-end project to get started with, rather than reading and reading without seeing an end. This really helped me.

One question, when i tried this i got the highest accuracy for SVM.

LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.983333 (0.033333)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)

So I decided to try that out too:

svm = SVC()
svm.fit(X_train, Y_train)
prediction = svm.predict(X_validation)

these were my results using SVM,

0.933333333333
[[ 7  0  0]
 [ 0 10  2]
 [ 0  0 11]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00         7
Iris-versicolor       1.00      0.83      0.91        12
 Iris-virginica       0.85      1.00      0.92        11

    avg / total       0.94      0.93      0.93        30

I am still learning to read these results, but can you tell me why this happened? why did i get high accuracy for SVM instead of KNN?? have i done anything wrong? or is it possible?

• Jason Brownlee December 22, 2016 at 6:33 am #

The results reported are a mean estimated score with some variance (spread).

It is an estimate on the performance on new data.

When you apply the method on new data, the performance may be in that range. It may be lower if the method has overfit the training data.

Overfitting is a challenge and developing a robust test harness to ensure we don’t fool/mislead ourselves during model development is important work.

I hope that helps as a start.

32. inzar December 25, 2016 at 7:04 am #

i try this tutorial and the result is very awesome

i want to learn from you

thanks….

33. lou December 25, 2016 at 7:29 am #

Why the leading comma in X = array[:,0:4]?

34. Thinh December 26, 2016 at 5:05 am #

In 1.2 , should warn to install scikit-learn

• Jason Brownlee December 26, 2016 at 7:49 am #

Thanks for the note.

Please see section 1.1 Install SciPy Libraries where it says:

There are 5 key libraries that you will need to install… sklearn

35. Tijo L. Peter December 28, 2016 at 10:34 pm #

Best ML tutorial for Python. Thank you, Jason.

36. baso December 29, 2016 at 12:38 am #

When I tried to run it, I got the error message “TypeError: Empty ‘DataFrame’: no numeric data to plot”. Please help me.

• Jason Brownlee December 29, 2016 at 7:18 am #

Sorry to hear that.

Perhaps check that you have loaded the data as you expect and that the loaded values are numeric and not strings. Perhaps print the first few rows: print(df.head(5))
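
A minimal sketch of that check, using a made-up two-row sample rather than the real file. The column names and the `?` placeholder value are assumptions for illustration; a stray non-numeric value like this makes pandas read the whole column as text.

```python
import io
import pandas

# Hypothetical sample: the second column contains a non-numeric "?".
csv_text = "5.1,3.5,Iris-setosa\n4.9,?,Iris-setosa\n"
df = pandas.read_csv(io.StringIO(csv_text), header=None,
                     names=['a', 'b', 'class'])

print(df.head(5))
# Numeric columns show as float64/int64; text columns show as object
# (or a string dtype in newer pandas releases).
print(df.dtypes)

# Coerce a text column to numbers; unparseable values become NaN.
df['b'] = pandas.to_numeric(df['b'], errors='coerce')
print(df.dtypes)
```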

• baso December 29, 2016 at 1:05 pm #

thanks very much Jason for your time

It worked. This tutorial is very helpful for me. I’m new to machine learning; could you explain your simple project above? I did not see X_test and target.

37. Andrea January 5, 2017 at 1:42 am #

Thank you for sharing this. I bumped into some installation problems.
Eventually, to get all dependencies installed on MacOS 10.11.6 I had to run this:

brew install python
pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose scikit-learn
export PATH=$PATH:~/Library/Python/2.7/bin

• Jason Brownlee January 5, 2017 at 9:21 am #

Thanks for sharing Andrea.

I’m a macports guy myself, here’s my recipe:

38. Sohib January 6, 2017 at 6:26 pm #

Hi Jason,
I am following this page as a beginner and have installed Anaconda as recommended.
As I am on win 10, I installed Anaconda 4.2.0 For Windows Python 2.7 version (x64) and
I am using Anaconda’s Spyder (python 2.7) IDE.

I checked all the versions of libraries (as shown in 1.2 Start Python and Check Versions) and got results like below:

Python: 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
scipy: 0.18.1
numpy: 1.11.1
matplotlib: 1.5.3
pandas: 0.18.1
sklearn: 0.17.1

At the 2.1 Import libraries section, I imported all of them and tried to load data as shown in
2.2 Load Dataset. But when I run it, it doesn’t show an output, instead, there is an error:

Traceback (most recent call last):
File “C:\Users\gachon\.spyder\temp.py”, line 4, in
from sklearn import model_selection
ImportError: cannot import name model_selection

Below is my code snippet:

import pandas
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
url = “https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data”
names = [‘sepal-length’, ‘sepal-width’, ‘petal-length’, ‘petal-width’, ‘class’]
dataset = pandas.read_csv(url, names=names)
print(dataset.shape)

When I delete “from sklearn import model_selection” line I get expected results (150, 5).

Am I missing something here?

Thank you for your time and endurance!

• Jason Brownlee January 7, 2017 at 8:23 am #

Hi Sohib,

You must have scikit-learn version 0.18 or higher installed.

Perhaps Anaconda has documentation on how to update sklearn?

• Sohib January 10, 2017 at 12:15 pm #

I updated scikit-learn version to 0.18.1 and it helped.
The error disappeared, the result is shown, but one statement

‘import sitecustomize’ failed; use -v for traceback

is executed above the result.
I tried to find out why, but I could not find the reason.
Is it going to be a problem in my further steps?
How to solve this?

• Jason Brownlee January 11, 2017 at 9:25 am #

Sorry, I don’t know what “import sitecustomize” is or why you need it.

39. Vishakha January 7, 2017 at 10:10 pm #

Can i get the same tutorial with java

40. Abhinav January 8, 2017 at 8:27 pm #

Hi Jason,

Nice tutorial.

In the univariate plots, you mentioned the Gaussian distribution.

According to the univariate plots, sepal-width has a Gaussian distribution. You said there are 2 variables with a Gaussian distribution. Please tell me the other.

Thanks

• Jason Brownlee January 9, 2017 at 7:49 am #

The distribution of the others may be multi-modal. Perhaps a double Gaussian.

41. Thinh January 13, 2017 at 5:07 am #

Hi, Jason. Could you please tell me the reason Why you choose KNN in example above ?

• Jason Brownlee January 13, 2017 at 9:16 am #

Hi Thinh,

No reason other than that it is an easy algorithm to run and understand, and a good algorithm for a first tutorial.

42. Scott P January 13, 2017 at 10:25 pm #

Hi Jason,

I’m trying to use this code with the KDD Cup ’99 dataset, and I am having trouble label encoding my dataset into numerical values.

#Modules
import pandas
import numpy
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
#new
from collections import defaultdict
#

data_set = “NSL-KDD/KDDTrain+.txt”
‘dst_host_count’,’dst_host_srv_count’,’dst_host_same_srv_rate’,’dst_host_diff_srv_rate’,’dst_host_same_src_port_rate’,’dst_host_srv_diff_host_rate’,’dst_host_serror_rate’,’dst_host_srv_serror_rate’,’dst_host_rerror_rate’,
‘dst_host_srv_rerror_rate’,’class’]

#Diabetes Dataset
#data_set = “Datasets/pima-indians-diabetes.data”
#names = [‘preg’, ‘plas’, ‘pres’, ‘skin’, ‘test’, ‘mass’, ‘pedi’, ‘age’, ‘class’]
#data_set = “Datasets/iris.data”
#names = [‘sepal_length’,’sepal_width’,’petal_length’,’petal_width’,’class’]

array = dataset.values
X = array[:,0:40]
Y = array[:,40]

label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)

validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(X, label_encoded_y, test_size=validation_size, random_state=seed)

# Test options and evaluation metric
num_folds = 7
num_instances = len(X_train)
seed = 7
scoring = ‘accuracy’

# Algorithms
models = []
models.append((‘LR’, LogisticRegression()))
models.append((‘LDA’, LinearDiscriminantAnalysis()))
models.append((‘KNN’, KNeighborsClassifier()))
models.append((‘CART’, DecisionTreeClassifier()))
models.append((‘NB’, GaussianNB()))
models.append((‘SVM’, SVC()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = “%s: %f (%f)” % (name, cv_results.mean()*100, cv_results.std()*100) # multiplying by 100 to show percentage
    print(msg)

# Compare Algorithms
fig = plt.figure()
fig.suptitle(‘Algorithm Comparison’)
plt.boxplot(results)
ax.set_xticklabels(Y)
plt.show()

Am I doing something wrong with the LabelEncoding process?

• MegO_Bonus June 4, 2017 at 7:15 pm #

Hi. Change all curly quotes like “ ” and ‘ ’ to straight quotes. LabelEncoder will then work correctly, but not the whole network. I am trying to create a neural network for NSL-KDD too. Do you have any good examples?

• Rajnish July 17, 2019 at 8:21 am #

How is it concluded that KNN is the most accurate model when the mean value for SVM is closer to 1 than the mean for KNN?

• Jason Brownlee July 17, 2019 at 8:32 am #

Either algorithm would be effective on the dataset.

43. Dan January 14, 2017 at 4:56 am #

Hi, I’m running a bit of a different setup than yours.

The modules and version of python I’m using are more recent releases:

Python: 3.5.2 |Anaconda 4.2.0 (32-bit)| (default, Jul 5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]
scipy: 0.18.1
numpy: 1.11.3
matplotlib: 1.5.3
pandas: 0.19.2
sklearn: 0.18.1

And I’ve gotten SVM as the best algorithm in terms of accuracy at 0.991667 (0.025000).

Would you happen to know why this is, considering more recent versions?

I also happened to get a rather different boxplot but I’ll leave it at what I’ve said thus far.

• Jason Brownlee January 15, 2017 at 5:26 am #

Hi Dan,

You may get differing results for a variety of reasons. Small changes in the code will affect the result. This is why we often report mean and stdev algorithm performance rather than one number, to give a range of expected performance.

This post on randomness in ml algorithms might also help:
https://machinelearningmastery.com/randomness-in-machine-learning/

44. Duncan Carr January 17, 2017 at 1:44 am #

Hi Jason

I can’t tell you how grateful I am … I have been trawling through lots of ML stuff to try to get started with a “toy” example. Finally I have found the tutorial I was looking for. Anaconda had old sklearn: 0.17.1 for Windows – which caused an error “ImportError: cannot import name ‘model_selection'”. That was fixed by running “pip install -U scikit-learn” from the Anaconda command-line prompt. Now upgraded to 0.18. Now everything in your imports was fine.

All other tutorials were either too simple or too complicated. Usually the latter!

Thank you again 🙂

• Jason Brownlee January 17, 2017 at 7:39 am #

Thanks for the tip for Anaconda users.

I’m here to help if you have questions!

45. Malathi January 17, 2017 at 3:13 am #

Hi Jason,

to me. Easy to understand.

Expecting more tutorials on deep neural networks.

Malathi

• Jason Brownlee January 17, 2017 at 7:40 am #

You’re very welcome Malathi, glad to hear it.

46. Duncan Carr January 17, 2017 at 7:32 pm #

Hi Jason

I managed to get it all working – I am chuffed to bits.

I get exactly the same numbers in the classification report as you do … however, when I changed both seeds to 8 (from 7), then ALL of the numbers end up being 1. Is this good, or bad? I am a bit confused.

Thanks again.

• Jason Brownlee January 18, 2017 at 10:14 am #

Well done Duncan!

What do you mean all the numbers end up being one?

47. Duncan Carr January 18, 2017 at 8:02 pm #

Hi Jason

I’ve output the “accuracy_score”, “confusion_matrix” & “classification_report” for seeds 7, 9 & 10. Why am I getting a perfect score with seed=9? Many thanks.

(seed=7)

0.9

[[10  0  0]
 [ 0  8  1]
 [ 0  2  9]]

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       0.80      0.89      0.84         9
 Iris-virginica       0.90      0.82      0.86        11

    avg / total       0.90      0.90      0.90        30

(seed=9)

1.0

[[13  0  0]
 [ 0  9  0]
 [ 0  0  8]]

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        13
Iris-versicolor       1.00      1.00      1.00         9
 Iris-virginica       1.00      1.00      1.00         8

    avg / total       1.00      1.00      1.00        30

(seed=10)

0.9666666666666667

[[10  0  0]
 [ 0 12  1]
 [ 0  0  7]]

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        10
Iris-versicolor       1.00      0.92      0.96        13
 Iris-virginica       0.88      1.00      0.93         7

    avg / total       0.97      0.97      0.97        30

48. shivani January 20, 2017 at 8:40 pm #

from sklearn import model_selection
showing Import Error: can not import model_selection

• Jason Brownlee January 21, 2017 at 10:25 am #

You need to update your version of sklearn to 0.18 or higher.

49. Jim January 22, 2017 at 5:06 pm #

Jason

Excellent Tutorial. New to Python and set a New Years Resolution to try to understand ML. This tutorial was a great start.

I struck the issue of the sklearn version. I am using Ubuntu 16.04LTS which comes with python-sklearn version 0.17. To update to latest I used the site:
http://neuro.debian.net/install_pkg.html?p=python-sklearn
Which gives the commands to add the neuro repository and pull down the 0.18 version.

Also I would like to note there is an error in section 3.1 Dimensions of the Dataset. Your text states 120 Instances when in fact 150 are returned, which you have in the Printout box.

Keep up the good work.

Jim

• Jason Brownlee January 23, 2017 at 8:37 am #

I’m glad to hear you worked around the version issue Jim, nice work!

Thanks for the note on the typo, fixed!

50. Raphael January 23, 2017 at 4:15 pm #

Hi Jason. Nice work here. I’m new to your blog. What does the y-axis in the box plots represent?

• Jason Brownlee January 24, 2017 at 11:01 am #

Hi Raphael,

The y-axis in the box-and-whisker plots is the scale or distribution of each variable.

51. Kayode January 23, 2017 at 8:42 pm #

Thank you for this wonderful tutorial.

52. Raphael January 26, 2017 at 2:28 am #

hi Jason,

In this line

dataset.groupby(‘class’).size()

What other method besides size could I use? I replaced size with count and got something similar but not quite the same. I got key errors for the other things I tried. Is size just a standard command?

53. Scott January 26, 2017 at 10:35 pm #

Jason,

I’m trying to use a different data set (KDD Cup ’99) with the above code, but when I run the code after modifying “names” and the array to account for the new features, it will not run and gives me the error: “cannot convert string to a float”.

In my data set, there are 3 columns that are text and the rest are integers and floats, I have tried LabelEncoding but it gives me the same error, do you know how I can resolve this?

• Jason Brownlee January 27, 2017 at 12:08 pm #

Hi Scott,

If the values are indeed strings, perhaps you can use a method that supports strings instead of numbers, perhaps like a decision tree.

If there are only a few string values for the column, a label encoding as integers may be useful.

Alternatively, perhaps you could try removing those string features from the dataset.

I hope that helps, let me know how you go.
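
The label-encoding option can be sketched as follows, assuming a tiny made-up frame in place of the real dataset (the column names and values here are hypothetical). Note that if only the class column is encoded, any text columns left in X would still raise the same string-to-float error, so each text column gets its own encoder.

```python
import pandas
from sklearn.preprocessing import LabelEncoder

# Hypothetical miniature of a dataset with mixed text and numeric columns.
df = pandas.DataFrame({
    'protocol': ['tcp', 'udp', 'tcp', 'icmp'],
    'duration': [0, 12, 3, 0],
    'label': ['normal', 'attack', 'normal', 'attack'],
})

# Encode every text column separately and keep each encoder so the
# integer codes can be mapped back with inverse_transform later.
encoders = {}
for column in ['protocol', 'label']:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

X = df[['protocol', 'duration']].values
y = df['label'].values
print(X)
print(y)
```

Integer codes impose an arbitrary ordering on the categories, which some algorithms may treat as meaningful; that is a design trade-off to keep in mind.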

54. Weston Gross January 31, 2017 at 10:41 am #

I would like a chart to see the grand scope of everything for data science that python can do.

You list 6 basic steps. For example in the visualizing step, I would like to know what all the charts are, what they are used for, and what python library it comes from.

I am extremely new to all this, and understand that some steps have to happen for example

1. Get Data
2. Validate Data
3. Missing Data
4. Machine Learning
5. Display Findings

So for missing data, there are techniques to restore the data, what are they and what libraries are used?

• Jason Brownlee February 1, 2017 at 10:36 am #

You can handle missing data in a few ways such as:

1. Remove rows with missing data.
2. Impute missing data (e.g. use the Imputer class in sklearn)
3. Use methods that support missing data (e.g. decision trees)

I hope that helps.
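
A minimal sketch of the first two options, on a made-up three-row frame. It uses `SimpleImputer` from `sklearn.impute`, which is an assumption about the reader's environment: it is available in scikit-learn 0.20 and later, and the older `Imputer` class mentioned above was removed in 0.22.

```python
import numpy as np
import pandas
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column.
df = pandas.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})

# Option 1: drop any row that contains a missing value.
dropped = df.dropna()

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(strategy='mean')
imputed = imputer.fit_transform(df)

print(dropped)
print(imputed)
```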

55. Mohammed February 1, 2017 at 1:11 am #

Hi Jason,

I am a Non Tech Data Analyst and use SPSS extensively on Academic / Business Data over the last 6 years.

I understand the above example very easily.

I want to work on Search – Language Translation and develop apps.

Whats the best way forward …

Do you also provide Skype Training / Project Mentoring..

• Jason Brownlee February 1, 2017 at 10:51 am #

Thanks Mohammed.

Sorry, I don’t have good advice for language translation applications.

56. Mohammed February 1, 2017 at 1:14 am #

I dont have any Development / Coding Background.

Everything worked perfectly fine.

Looking forward to go all in…

• Jason Brownlee February 1, 2017 at 10:51 am #

I’m glad to hear that Mohammed

57. Purvi February 1, 2017 at 7:31 am #

Hi Jason,

I am new to Machine learning and am trying out the tutorial. I have following environment :

>>> import sys
>>> print(‘Python: {}’.format(sys.version))
Python: 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
>>> import scipy
>>> print(‘scipy: {}’.format(scipy.__version__))
scipy: 0.18.1
>>> import numpy
>>> print(‘numpy: {}’.format(numpy.__version__))
numpy: 1.12.0
>>> import matplotlib
>>> print(‘matplotlib: {}’.format(matplotlib.__version__))
matplotlib: 2.0.0
>>> import pandas
>>> print(‘pandas: {}’.format(pandas.__version__))
pandas: 0.19.2
>>> import sklearn
>>> print(‘sklearn: {}’.format(sklearn.__version__))
sklearn: 0.18.1

When I try to load the iris dataset, it loads up fine and prints dataset.shape but then my python interpreter hangs. I tried it out 3-4 times and everytime it hangs after I run couple of commands on dataset.
>>> url = “https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data”
>>> names = [‘sepal-length’, ‘sepal-width’, ‘petal-length’, ‘petal-width’, ‘class’]
>>> print(dataset.shape)
(150, 5)
    sepal-length  sepal-width  petal-length  petal-width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
>>> print(datase

It does not let me type anything further.

Thanks,
Purvi

• Jason Brownlee February 1, 2017 at 10:55 am #

Hi Purvi, sorry to hear that.

Perhaps you’re able to comment out the first parts of the tutorial and see if you can progress?

58. sam February 5, 2017 at 9:24 am #

Hi Jason

I am planning to use Python to predict customer attrition. I have a current list of attrited customers with their attributes. I would like to use them as test data and use them to predict for any new customers. Can you please help me approach the problem in Python?

my test data :

customer1 attribute1 attribute2 attribute3 … attrited

my new data

customer N, attribute 1,…… ?

59. Kiran Prajapati February 7, 2017 at 6:31 pm #

Hello Sir, I want to check what percentage of my data is accurate. In my data, I have 4 columns:

Taluka, Total_yield, Rain(mm), types_of soil

Nasik, 12555, 63.0, dark black
Igatpuri, 1560, 75.0, shallow

and so on.
First, I have to check whether the data is accurate or not, and the next step is to find the predicted yield using a regression model.
Here is my model: Total_yield = Rain + types_of soil

I use a 0/1 binary variable for types_of soil.

How do I find the predicted yield?

60. Saby February 15, 2017 at 9:11 am #

url = “https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data”
names = [‘sepal-length’, ‘sepal-width’, ‘petal-length’, ‘petal-width’, ‘class’]

The dataset should load without incident.

If you do have network problems, you can download the iris.data file into your working directory and load it using the same method, changing url to the local file name.

I am a complete beginner in Python (and trying to learn ML as well). I tried to load the data from my local file but was not successful. Could you show me exactly how the code should be written to open the data from a local file?

• Jason Brownlee February 15, 2017 at 11:39 am #

Sure.

Download the file as iris.data into your current working directory (where your python file is located and where you are running the code from).
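
Loading then looks roughly like this. It is a sketch that assumes the file is saved as `iris.data` next to the script; the `load_iris_file` helper is a hypothetical wrapper, not part of the tutorial.

```python
import pandas

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

def load_iris_file(path='iris.data'):
    """Load the CSV at `path`. The file has no header row, so the
    column names are supplied explicitly."""
    return pandas.read_csv(path, names=names)

# Example usage, assuming iris.data is in the current working directory:
# dataset = load_iris_file('iris.data')
# print(dataset.shape)
```

If the file lives elsewhere, pass its full path instead of the bare file name.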

61. ant February 15, 2017 at 9:54 pm #

Hi, Jason, first of all thank so much for this amazing lesson.

Just for curiosity I have computed all the values obtained with dataset.describe() with excel and for the 25% value of petal-length I get 1.57500 instead of 1.60000. I have googled for formatting describe() output unsuccessfully. Is there an explanation? Tnx

• Jason Brownlee February 16, 2017 at 11:07 am #

Not sure, perhaps you could look into the Pandas source code?

• ant February 17, 2017 at 12:23 am #

OK, I will do.
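
The thread does not settle this. One possible cause, offered as an assumption rather than a confirmed diagnosis, is that tools use different percentile conventions. By default pandas interpolates linearly at position (n - 1) * q, while some spreadsheet quartile functions (the "exclusive" variants) use (n + 1) * q. The sketch below implements both by hand on made-up values to show that they can disagree.

```python
import pandas

def quantile_inclusive(values, q):
    """Linear interpolation at 0-indexed position (n - 1) * q,
    which matches the pandas default."""
    data = sorted(values)
    pos = (len(data) - 1) * q
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (pos - lo) * (data[hi] - data[lo])

def quantile_exclusive(values, q):
    """Linear interpolation at 1-indexed position (n + 1) * q."""
    data = sorted(values)
    pos = (len(data) + 1) * q - 1  # convert to a 0-indexed position
    pos = min(max(pos, 0), len(data) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (pos - lo) * (data[hi] - data[lo])

sample = [1.0, 2.0, 3.0, 4.0]  # made-up values, not the iris data
print(quantile_inclusive(sample, 0.25))
print(quantile_exclusive(sample, 0.25))
print(pandas.Series(sample).quantile(0.25))
```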

62. jacques February 16, 2017 at 4:42 pm #

HI Jason

I don’t quite follow the KFold section.

We started off with 150 data entries (rows).

We then use an 80/20 split for training/validation, which leaves us with 120.

The split of 10 boggles me.
Does it take 10 items from each class and train with 9? What does the 1 left over do then?

• Jason Brownlee February 17, 2017 at 9:52 am #

Hi jacques,

The 120 records are split into 10 folds. The model is trained on the first 9 folds and evaluated on the records in the 10th. This is repeated so that each fold is given a chance to be the hold out set. 10 models are trained, 10 scores collected and we report the mean of those scores as an estimate of the performance of the model on unseen data.

Does that help?
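
The fold sizes can be checked with a short sketch. The zero-filled array is a placeholder with the same shape as the 120-row training split, and `shuffle=True` with `random_state=7` is an illustrative choice. Plain `KFold` splits by row position and does not look at the class labels.

```python
import numpy as np
from sklearn.model_selection import KFold

# Placeholder with the same shape as the 120-row training split.
X_train = np.zeros((120, 4))

kfold = KFold(n_splits=10, shuffle=True, random_state=7)
fold_sizes = []
for train_idx, test_idx in kfold.split(X_train):
    # Each round trains on 9 folds and scores on the one fold held out.
    fold_sizes.append((len(train_idx), len(test_idx)))

print(fold_sizes)
```

With 120 rows and 10 folds, every round uses 108 rows for training and 12 for evaluation.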

63. Alhassan February 17, 2017 at 4:02 pm #

I am trying to integrate machine learning into a PHP website I have created. Is there any way I can do that using the guidelines you provided above?

• Jason Brownlee February 18, 2017 at 8:34 am #

I have not done this Alhassan.

Generally, I would advise developing a separate service that could be called using REST calls or similar.

If you are working on a prototype, you may be able to call out to a program or script from cgi-bin, but this would require careful engineering to be secure in a production environment.

64. Simão Gonçalves February 20, 2017 at 1:27 am #

Hi Jason! This tutorial was a great help, I’m truly grateful for this, so thank you.

I have one question about the tutorial though: in the scatter plot matrix, I can’t understand how we can make the dots in the graphs whose variables have no relationship between them (like sepal-length with petal-width).

Could you or someone explain that please? How do you make a dot that represents the relationship between a certain sepal-length and a certain petal-width?

• Jason Brownlee February 20, 2017 at 9:30 am #

Hi Simão,

The x-axis is taken for the values of the first variable (e.g. sepal_length) and the y-axis is taken for the second variable (e.g. petal_width).

Does that help?

• Yopo February 21, 2017 at 4:35 am #

You match each iris instance’s length and width with each other. For example, iris instance number one is represented by a dot, and the dot’s coordinates are that iris’s length and width. So when you take all these values and put them on a graph, you are basically checking to see if there is a relation. As you can see, in some of these plots the dots are scattered all around, but when we look at the petal-width vs. petal-length graph it seems to be linear! This means that those two properties are clearly related. Hope this helped!

65. Sébastien February 20, 2017 at 9:34 pm #

Hi Jason,

from France and just to say you “Thank you for this very clear tutorial!”

Sébastien

• Jason Brownlee February 21, 2017 at 9:34 am #

I’m glad you found it useful Sébastien.

66. Raj February 27, 2017 at 2:53 am #

Hi Jason,
I am new to ML & Python. Your post is encouraging and straight to the point of execution. However, I am facing the error below when I run this step:

>>> validataion_size = 0.20
>>> X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state = seed)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘validation_size’ is not defined

What could I have missed? I didn’t get any errors in previous steps.

My Environment details:
OS: Windows 10
Python : 3.5.2
scipy : 0.18.1
numpy : 1.11.1
sklearn : 0.18.1
matplotlib : 0.18.1

• Jason Brownlee February 27, 2017 at 5:54 am #

Hi Raj,

Double check you have the code from section “5.1 Create a Validation Dataset” where validation_size is defined.

I hope that helps.

67. Roy March 2, 2017 at 7:38 am #

Hey Jason,

Can you please explain what precision, recall, f1-score and support actually refer to?
Also, what do the numbers in a confusion matrix refer to?
[[ 7 0 0]
[ 0 11 1]
[ 0 2 9]]
Thanks.

68. santosh March 3, 2017 at 7:29 am #

what code should i use to load data from my working directory??

69. David March 7, 2017 at 8:27 am #

Hi Jason,

I have a ValueError and I don’t know how to solve it.

The error is:

ValueError: could not convert string to float: ‘2013-06-27 11:30:00.0000000’

Can you give some information about fixing this problem?

Thank you

• Jason Brownlee March 7, 2017 at 9:39 am #

It looks like you are trying to load a date-time. You might need to write a custom function to parse the date-time when loading or try removing this column from your dataset.
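
A minimal sketch of both options, assuming pandas is used for loading. The column names and the two rows of data are made up for illustration, and turning the date-time into hour and day-of-week columns is just one possible choice.

```python
import io
import pandas as pd

# Hypothetical CSV with a date-time column (stand-in for the real file)
csv_text = (
    "timestamp,value,label\n"
    "2013-06-27 11:30:00,1.5,a\n"
    "2013-06-28 14:45:00,2.5,b\n"
)

# Option 1: parse the column as a date-time while loading,
# then derive numeric columns that a model can accept
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])
df["hour"] = df["timestamp"].dt.hour
df["dayofweek"] = df["timestamp"].dt.dayofweek

# Option 2: simply remove the date-time column from the inputs
features = df.drop(columns=["timestamp", "label"])
print(features)
```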

70. Saugata De March 8, 2017 at 6:11 am #

>>> for name, model in models:
… kfold=model_selection.Kfold(n_splits=10, random_state=seed)
… cv_results =model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
… results.append(cv_results)
… names.append(name)
… msg=”%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
… print(msg)

After typing this piece of code, I get the error below. Can you please help me out, Jason? Since I am new to ML, I don’t have much idea what the error means.

Traceback (most recent call last):
File “”, line 2, in
AttributeError: module ‘sklearn.model_selection’ has no attribute ‘Kfold’

• Asad Ali July 23, 2017 at 12:59 pm #

the KFold function is case-sensitive. It is ” model_selection.KFold(…) ” not ” model_selection.Kfold(…) ”
update this line:
kfold=model_selection.KFold(n_splits=10, random_state=seed)

• ibtssam February 12, 2018 at 9:17 pm #

THANK U

71. Ojas March 10, 2017 at 10:58 am #

Hello Jason ,
Thanks for writing such a nice and explanatory article for beginners like me, but I have one concern. I tried looking for an answer on other websites as well but could not find a solution.
Can whatever I write inside the code editor (Jupyter Qtconsole in my case) be saved as a .py file and shared with my other team members, over GitHub maybe? I found some hacks, but I think there must be a proper way of sharing the code written in the editor, for example without the outputs or plots in between.

• Jason Brownlee March 11, 2017 at 7:55 am #

You can write Python code in a text editor and save it as a myfile.py file. You can then run it on the command line as follows:

python myfile.py

Consider picking up a book on Python.

72. manoj maracheea March 11, 2017 at 9:37 pm #

Hello Jason,

Nice tutorial, I did it today.

I didn’t really understand everything. {I will follow your advice: do it again, write all my questions down, and use the help function.}

The tutorial just works. It took me around 2 hours to do it, typing every single line,
installing all the dependencies, and running each block to check.

Thanks, I will be visiting your blog from time to time.

Regards,

• Jason Brownlee March 12, 2017 at 8:23 am #

Well done, and thanks for your support.

Post any questions you have as comments or email me using the “contact” page.

73. manoj maracheea March 11, 2017 at 9:38 pm #

I am a beginner too, and I am using Visual Studio Code.

It looks good.

74. Vignesh R March 13, 2017 at 9:59 pm #

What exactly is confusion matrix?

75. Dan R. March 14, 2017 at 7:09 am #

Can I ask what the reason for this problem is? Thanks for the answer 🙂
(My code is just the section where I import all the needed libraries.)
I have all libraries up to date, but it still gives me this error:

File “C:\Users\64dri\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py”, line 32, in
from ..utils.fixes import rankdata

ImportError: cannot import name ‘rankdata’

( scipy: 0.18.1
numpy: 1.11.1
matplotlib: 1.5.3
pandas: 0.18.1
sklearn: 0.17.1)

• Jason Brownlee March 14, 2017 at 8:31 am #

Sorry, I have not seen this issue Dan, consider searching or posting to StackOverflow.

76. Cameron March 15, 2017 at 5:28 am #

Jason,

You’re a rockstar, thank you so much for this tutorial and for your books! It’s been hugely helpful in getting me started on machine learning. I was curious, is it possible to add a non-number property column, or will the algorithms only accept numbers?

For example, if there were a “COLOR” column in the iris dataset, and all Iris-Setosa were blue. how could I get this program to accept and process that COLOR column? I’ve tried a few things and they all seem to fail.

• Jason Brownlee March 15, 2017 at 8:16 am #

Great question Cameron!

sklearn requires all input data to be numbers.

You can encode labels like colors as integers and model that.

Further, you can convert the integers to a binary encoding/one-hot encoding, which may be more suitable if there is no ordinal relationship between the labels.
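
Both encodings can be sketched as follows. The COLOR values are made up, and using `LabelEncoder` and `pandas.get_dummies` is one possible choice among several.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical COLOR column
colors = ["blue", "red", "blue", "green"]

# Integer encoding: each distinct label becomes an integer
# (labels are numbered in sorted order: blue=0, green=1, red=2)
encoder = LabelEncoder()
integer_encoded = encoder.fit_transform(colors).tolist()
print(integer_encoded)

# One-hot encoding: one 0/1 column per distinct label
one_hot = pd.get_dummies(pd.Series(colors), prefix="color").astype(int)
print(one_hot)
```

The one-hot columns are generated automatically from the distinct values in the data, so they do not need to be added by hand.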

• Cameron March 15, 2017 at 2:19 pm #

Jason, thanks so much for replying! That makes a lot of sense. When you say binary/one-hot encoding I assume you mean (continuing to use the colors example) adding a column for each color (R,O,Y,G,B,V) and for each flower putting a 1 in the column of its color and a 0 for all of the other colors?
That’s feasible for 6 colors (adding six columns) but how would I manage if I wanted to choose between 100 colors or 1000 colors? Are there other libraries that could help deal with that?

77. James March 19, 2017 at 6:54 am #

for name, model in models:
… kfold = cross_vaalidation.KFold(n=num_instances,n_folds=num_folds,random_state=seed)
… cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
File “”, line 3
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
^
SyntaxError: invalid syntax
>>> cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘model’ is not defined
>>> cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘kfold’ is not defined
>>> cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv =
kfold, scoring = scoring)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘kfold’ is not defined
>>> names.append(name)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘name’ is not defined

• Jason Brownlee March 19, 2017 at 9:12 am #

It looks like you might not have copied all of the code required for the example.

78. Mier March 20, 2017 at 10:26 am #

Hi, I went through your tutorial. It is super great!
I wonder whether you can recommend a data set that is similar to Iris classification for me to practice?

79. Medine H. March 23, 2017 at 2:56 am #

Hi Jason,

That’s an amazing tutorial, quite clear and useful.

Thanks a bunch!

80. Sean March 23, 2017 at 9:54 am #

Hi Jason,

Can you let me know how can I start with Fraud Detection algorithms for a retail website ?

Thanks,
Sean

81. Raja March 24, 2017 at 11:08 am #

You are doing great with your work.

I need your suggestion. I am working on my thesis, where I need to use machine learning.
Training: positive, negative, others
Test: unknown data
I want to train the machine with the training data and test it with unknown data using SVM, Naive Bayes and KNN.

How should I format the training and test data?
And how do I use those algorithms on it,
so that I can get the TP, TN, FP and FN?
Thank you.

82. Sey March 26, 2017 at 12:38 am #

I’m new to machine learning and this was a really helpful tutorial. I have what may be a stupid question: I wanted to plot the predictions and the validation values to compare them visually, but I don’t really understand how to plot them.
Can you please send me the piece of code with some explanation of how to do it?

thank you very much

• Jason Brownlee March 26, 2017 at 6:13 am #

You can use matplotlib, for example:
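
The sketch below is one possible way to do this, not the only one. It assumes the iris data bundled with scikit-learn (where the three species are already encoded as 0, 1 and 2), an arbitrary KNN model, and a made-up output file name.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line to use plt.show()
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7)

model = KNeighborsClassifier()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

# One marker per validation record: circles for actual, crosses for predicted.
# Where a circle and a cross do not overlap, the prediction was wrong.
positions = range(len(Y_validation))
fig, ax = plt.subplots()
ax.scatter(positions, Y_validation, marker="o", label="actual")
ax.scatter(positions, predictions, marker="x", label="predicted")
ax.set_xlabel("validation record")
ax.set_ylabel("class (encoded as 0, 1, 2)")
ax.legend()
fig.savefig("predictions_vs_actual.png")
```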

83. Kamol Roy March 26, 2017 at 7:25 am #

Thanks a lot. It was very helpful.

• Jason Brownlee March 27, 2017 at 7:51 am #

You’re welcome Kamol, I’m glad to hear it.

84. Rajneesh March 29, 2017 at 11:31 pm #

Hi

Sorry for a dumb question.

Can you briefly describe what the end result means (i.e. what the program has predicted)?

• Jason Brownlee March 30, 2017 at 8:53 am #

Given an input description of flower measurements, what species of flower is it?

We are predicting the iris flower species as one of 3 known species.
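
For instance, a prediction for a single new flower could look like the sketch below. It assumes the iris data bundled with scikit-learn and an arbitrary KNN model trained on all 150 records; the measurements of the new flower are chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
model = KNeighborsClassifier()
model.fit(iris.data, iris.target)

# Measurements of one flower: sepal length, sepal width,
# petal length, petal width (cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]

# The model returns a class index, which maps to a species name
predicted_class = model.predict(new_flower)[0]
species = iris.target_names[predicted_class]
print(species)
```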

85. Anusha Vidapanakal March 30, 2017 at 3:58 am #

LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)

Why am I getting the highest accuracy for SVM?

I’m a beginner, there was a similar query above but I couldn’t quite understand your reply.

• Jason Brownlee March 30, 2017 at 8:56 am #

Why is a very hard question to answer.

Our role is to find what works, ensure the results are robust, then figure out how we can use the model operationally.

• Anusha Vidapanakal March 30, 2017 at 11:33 pm #

Okay. Thanks a lot for the prompt response!

86. Vinay March 31, 2017 at 11:10 pm #

Great tutorial Jason!
My question is, if I want some new data from a user, how do I do that? If in future I develop my own machine learning algorithm, how do I use it to get some new data?
What steps are taken to develop it?
And thanks for this tutorial.

• Jason Brownlee April 1, 2017 at 5:56 am #

Not sure I understand. Collect new data from your domain and store it in a CSV or write code to collect it.

87. walid barakeh April 2, 2017 at 6:31 pm #

Hi Jason,
I have a question regarding the step after training on the data and finding the best algorithm for our case: how can we find out the rules or formula that the algorithm produced, for future use?

And thanks for the tutorial, it’s really helpful.

• Jason Brownlee April 4, 2017 at 9:06 am #

You can extract the weights if you like. Not sure I understand why you want the formula for the network. It would be complex and generally unreadable.

You can finalize the model, save the weights and topology for later use if you like.
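
For a scikit-learn model, one common approach is to serialize the fitted model with Python’s `pickle` module. If the model is a decision tree, its learned rules can also be printed with `export_text`. The sketch below assumes the iris data bundled with scikit-learn and a decision tree; the feature names are the ones used in the tutorial.

```python
import pickle
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
feature_names = ["sepal-length", "sepal-width", "petal-length", "petal-width"]

# Finalize the model by fitting it on all available data
model = DecisionTreeClassifier(random_state=7)
model.fit(X, y)

# Print the learned if/else rules of the tree
rules = export_text(model, feature_names=feature_names)
print(rules)

# Save the fitted model to bytes and load it back
# (pickle.dump / pickle.load do the same with a file)
saved = pickle.dumps(model)
restored = pickle.loads(saved)
same = (restored.predict(X) == model.predict(X)).all()
print(same)
```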

• walid barakeh April 5, 2017 at 7:40 pm #

The best algorithm for my use case was “Classification and Regression Trees (CART)”, so how can I find out the rules that the algorithm created for my use case?
How can I extract the weights and use them to evaluate new data?

88. Divya April 4, 2017 at 4:58 pm #

Thank you so much…this document really helped me a lot…..i was searching for such a document since a long time…this document gave the actual view of how machine learning is implemented through python….Books and courses are really difficult to understand completely and begin with development of project on such a vast concept… books n videos gave me lots of snippets, but i was not understanding how they all fit together.

89. Divya April 4, 2017 at 5:00 pm #

Can I get more tutorials like this for a more detailed understanding? It would be really helpful.

90. Gav April 11, 2017 at 5:17 pm #

Can’t load the iris dataset either through the url or copied to working folder without the NameError: name ‘pandas’ is not defined

• Jason Brownlee April 12, 2017 at 7:51 am #

You need to install the Pandas library.

• Gavin April 12, 2017 at 9:53 pm #

I’ve already installed Anaconda with Python 3.6 and the panda libraries are listed when I run versions.py. Everything has been fine up till trying to load the iris library. Do I need to use a different terminal within Anaconda?

• Jason Brownlee April 13, 2017 at 10:01 am #

You may need to close and re-open the terminal window, or maybe restart your system after installation.

• Sunil June 4, 2017 at 2:31 am #

import pandas
at the top

91. Ursula April 13, 2017 at 7:33 pm #

Hi Jason,

I’m trying to follow it but I get stuck on 5.3 Build Models.

When I copy your code for this section I get a few errors:
IndentationError: expected an indented block
NameError: name ‘model’ is not defined
NameError: name ‘cv_results’ is not defined
NameError: name ‘name’ is not defined

Thanks!

see the code and my “results” below:

>>> # Spot Check Algorithms
… models = []
>>> models.append((‘LR’, LogisticRegression()))
>>> models.append((‘LDA’, LinearDiscriminantAnalysis()))
>>> models.append((‘KNN’, KNeighborsClassifier()))
>>> models.append((‘CART’, DecisionTreeClassifier()))
>>> models.append((‘NB’, GaussianNB()))
>>> models.append((‘SVM’, SVC()))
>>> # evaluate each model in turn
… results = []
>>> names = []
>>> for name, model in models:
… kfold = model_selection.KFold(n_splits=10, random_state=seed)
File “”, line 2
kfold = model_selection.KFold(n_splits=10, random_state=seed)
^
IndentationError: expected an indented block
>>> cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘model’ is not defined
>>> results.append(cv_results)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘cv_results’ is not defined
>>> names.append(name)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘name’ is not defined
>>> msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘name’ is not defined
>>> print(msg)

• Jason Brownlee April 14, 2017 at 8:43 am #

Make sure you have the same tab indenting as in the example. Maybe re-add the tabs yourself after you copy-paste the code.
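
For reference, here is a sketch of the loop with its indentation intact. It is not a copy of the tutorial listing: it assumes the iris data bundled with scikit-learn, evaluates on all 150 records, and passes `shuffle=True` and `max_iter=1000`, which are choices made here for recent scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

models = [
    ("LR", LogisticRegression(max_iter=1000)),
    ("LDA", LinearDiscriminantAnalysis()),
    ("KNN", KNeighborsClassifier()),
    ("CART", DecisionTreeClassifier()),
    ("NB", GaussianNB()),
    ("SVM", SVC()),
]

results = []
names = []
for name, model in models:
    # Every line in the loop body is indented by the same four spaces
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")
    results.append(cv_results)
    names.append(name)
    print("%s: %f (%f)" % (name, cv_results.mean(), cv_results.std()))
```

Saving the code to a file and running the file avoids the problem of the interactive prompt executing each pasted line immediately.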

• Nathan Wilson March 26, 2018 at 11:16 am #

I’m having this same problem. How would I add the Indentations after I paste the code? Whenever I paste the code, it automatically executes the code.

• Jason Brownlee March 26, 2018 at 2:27 pm #

How to copy code from the tutorial:

1. Click the copy button on the code example (top right of code box, second from the end). This will select all code in the box.
2. Copy the code to the clipboard (control-c on Windows, command-c on Mac, or right click and click copy).
3. Paste the code from the clipboard.

This will preserve all white space.

Does that help?

92. Davy April 14, 2017 at 10:14 pm #

Hi, one beginner question. What do we get after training is completed in supervised learning, for classification problem ? Do we get weights? How do i use the trained model after that in field, for real classification application lets say? I didn’t get the concept what happens if training is completed. I tried this example: https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py and it printed me accuracy and loss of test data. Then what now?

93. Manikandan April 14, 2017 at 11:36 pm #

Wow… It’s really great stuff man…. Thanks you….

94. Wes April 15, 2017 at 3:16 am #

As a complete beginner, it sounds so cool to predict the future. Then I saw all these model and complicated stuff, how do I even begin. Thank you for this. It is really great!

95. Manjushree Aithal April 16, 2017 at 7:41 am #

Hello Jason,

I just started following your step by step tutorial for machine learning. In importing libraries step I followed each and every steps you specified, install all libraries via conda, but still I’m getting the following error.

Traceback (most recent call last):
from sklearn.linear_model import LogisticRegression
File “C:\Users\dell\Anaconda2\lib\site-packages\sklearn\linear_model\__init__.py”, line 15, in
from .least_angle import (Lars, LassoLars, lars_path, LarsCV, LassoLarsCV,
File “C:\Users\dell\Anaconda2\lib\site-packages\sklearn\linear_model\least_angle.py”, line 24, in
from ..utils import arrayfuncs, as_float_array, check_X_y
ImportError: DLL load failed: Access is denied.

Thank You!

• Jason Brownlee April 16, 2017 at 9:33 am #

I have not seen this error and I don’t know about windows sorry.

It looks like you might not have admin permissions on your workstation.

96. Olah Data Semarang April 17, 2017 at 3:03 pm #

Tutorial DEAP Version 2.1
A Data Envelopment Analysis (Computer) Program. This page describes the computer program Tutorial DEAP Version 2.1 which was written by Tim Coelli.

97. Federico Carmona April 18, 2017 at 4:41 am #

Good afternoon Dr. Jason, could you help me with the following problem? How could the KNN algorithm be modified to detect the most relevant variables?

• Jason Brownlee April 18, 2017 at 8:34 am #

You can use feature importance scores from bagged trees or gradient boosting.

Consider using sklearn to calculate and plot feature importance.
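
A minimal sketch of that idea, assuming a random forest (an ensemble of bagged decision trees) and the iris data bundled with scikit-learn; the number of trees and the seed are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=7)
model.fit(iris.data, iris.target)

# One score per input variable; the scores sum to 1
importances = model.feature_importances_
ranked = sorted(zip(iris.feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for feature_name, score in ranked:
    print("%s: %.3f" % (feature_name, score))
```

The highest-scoring variables could then be used as the inputs to KNN.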

98. Bharath April 18, 2017 at 10:09 pm #

Thank u…

99. Amal April 26, 2017 at 6:14 pm #

Hi Jason

Thanks for the great tutorial you provided.
I’m also new to ML and Python. I tried to use my own CSV file in the way you used the iris dataset. Although the dataset loaded successfully, I get the following error:

could not convert string to float: LipCornerDepressor

LipCornerDepressor holds ordinary values such as 0.32145 in the Excel sheet taken from SQL Server.

Here is the code without library files.

url = “F:\FINAL YEAR PROJECT\Amila\FTdata.csv”
names = [‘JawLower’, ‘BrowLower’, ‘BrowRaiser’, ‘LipCornerDepressor’, ‘LipRaiser’,’LipStretcher’,’Emotion_Id’]

# shape
print(dataset.shape)

# class distribution
print(dataset.groupby(‘Emotion_Id’).size())

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)

# Test options and evaluation metric
seed = 7
scoring = ‘accuracy’

# Spot Check Algorithms
models = []
models.append((‘LR’, LogisticRegression()))
models.append((‘LDA’, LinearDiscriminantAnalysis()))
models.append((‘KNN’, KNeighborsClassifier()))
models.append((‘CART’, DecisionTreeClassifier()))
models.append((‘NB’, GaussianNB()))
models.append((‘SVM’, SVC()))

# evaluate each model in turn
results = []
names = []
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=seed)
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = “%s: %f (%f)” % (name, cv_results.mean(), cv_results.std())
print(msg)

• Jason Brownlee April 27, 2017 at 8:37 am #

This error might be specific to your data.

Consider double checking that your data is loaded as you expect. Maybe print some raw data or plots to confirm.
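
One way to do that check is sketched below. It also shows one possible cause, which is only a guess because the CSV file itself is not shown: if the file has a header row and `names=` is passed without `header=0`, pandas reads the header row as data. The three-column CSV here is a made-up stand-in.

```python
import io
import pandas as pd

# Hypothetical file contents with a header row
csv_text = (
    "JawLower,LipCornerDepressor,Emotion_Id\n"
    "0.10,0.32145,1\n"
    "0.20,0.41000,2\n"
)
names = ["JawLower", "LipCornerDepressor", "Emotion_Id"]

# Passing names= alone treats the header row as the first data row
suspect = pd.read_csv(io.StringIO(csv_text), names=names)
print(suspect.head())   # the first row contains the column names as strings
print(suspect.dtypes)   # the columns are not numeric

# header=0 tells pandas to replace the existing header with names
fixed = pd.read_csv(io.StringIO(csv_text), header=0, names=names)
print(fixed.dtypes)
```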

100. Chanaka April 27, 2017 at 6:31 am #

Thank you very much for the easy to follow tutorial.

• Jason Brownlee April 27, 2017 at 8:48 am #

I’m glad you found it useful.

101. Sonali Deshmukh April 27, 2017 at 7:07 pm #

Hi, Jason

I’m very naive to Python and Machine Learning.
Can you please suggest good reads to get basic clear for machine learning.

102. lanndo April 28, 2017 at 2:26 am #

Outstanding work on this. I am curious how to port out results that show which records were matched to what in the predictor, when I print(predictions) it does not show what records they are paired with. Thanks!

• Jason Brownlee April 28, 2017 at 7:51 am #

Thanks!

The index can be used to align predictions with inputs. For example, the first prediction is for the first input, and so on.
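
As a small sketch of that alignment, assuming the iris data bundled with scikit-learn and an arbitrary KNN model:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.20, random_state=7)

model = KNeighborsClassifier().fit(X_train, Y_train)
predictions = model.predict(X_validation)

# predictions[i] belongs to X_validation[i], so zip pairs them up
paired = list(zip(X_validation.tolist(), predictions.tolist()))
for row, predicted in paired[:5]:
    print(row, "->", predicted)
```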

103. NAVKIRAN KAUR April 29, 2017 at 4:28 pm #

When I apply all the models and print the message, I get an error saying it cannot convert string to float. How do I resolve this error? My dataset is related to fake news: title, text, label.

• Jason Brownlee April 30, 2017 at 5:27 am #

Ensure you have converted your text data to numerical values.
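
One common way to do this is a bag-of-words style encoding, sketched below with `TfidfVectorizer`. The four short texts and their labels are made up, and the choice of vectorizer and of a naive Bayes model is an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up examples standing in for the text column and the label column
texts = [
    "aliens built the city hall overnight",
    "the council approved the new budget",
    "miracle pill cures every disease",
    "the team won the match on saturday",
]
labels = ["fake", "real", "fake", "real"]

# Turn each text into a row of numbers (one column per known word)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
print(X.shape)

# The numeric matrix can now be passed to a model
model = MultinomialNB().fit(X, labels)

# New text must go through the same fitted vectorizer
prediction = model.predict(vectorizer.transform(["the council won the budget"]))
print(prediction[0])
```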

104. Shravan May 1, 2017 at 6:29 am #

Awesome tutorial on basics of machine learning using Python. Thank you Jason!

105. Shravan May 1, 2017 at 6:36 am #

I am using Anaconda Python and I was writing all the commands for the program at the ‘python’ command line. I am trying to find a way to save this program to a file. I have tried ‘%save’, but it errored out. Any thoughts?

• Jason Brownlee May 2, 2017 at 5:51 am #

You can write your programs in a text file (for example, myfile.py) and then run them on the command line as follows:

python myfile.py

106. Jason May 1, 2017 at 2:05 pm #

Thank you for the help and insight you provide. When I run the actual validation data through the algorithms, I get a different feel for which one may be the best fit.

Validation Test Accuracy:
LR…….0.80
LDA…..0.97
KNN….0.90
CART..0.87
NB…….0.83
SVM….0.93

My question is, should this influence my choice of algorithm?

Thank you again for providing such a wealth of information on your blog.

107. rahman May 3, 2017 at 11:09 pm #

# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]

With my dataset, when I use Y = array[:,1] it works, but if I use 2, 3 or 4 instead of 1 it gives the following error!
But all the columns have a similar kind of data.

Traceback (most recent call last):
File “/alok/c-analyze/analyze.py”, line 390, in
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
File “/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py”, line 140, in cross_val_score
for train, test in cv_iter)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 758, in __call__
while self.dispatch_one_batch(iterator):
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 608, in dispatch_one_batch
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 109, in apply_async
result = ImmediateResult(func)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 326, in __init__
self.results = batch()
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File “/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py”, line 238, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File “/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py”, line 468, in fit
self._solve_svd(X, y)
File “/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py”, line 378, in _solve_svd
fac = 1. / (n_samples – n_classes)

ZeroDivisionError: float division by zero

• Jason Brownlee May 4, 2017 at 8:08 am #

Perhaps take a closer look at your data.

• rahman May 4, 2017 at 4:29 pm #

But the data is very similar in all the columns.

• rahman May 4, 2017 at 4:37 pm #

I meant there is not much difference in the data between the columns, but it still only works for the first column! It gives the above error for any other column I choose.

• rahman May 4, 2017 at 4:46 pm #

Have a look at the data :

index,1column,2 column,3column,….,8column
0,238,240,1103,409,1038,4,67,0
1,41,359,995,467,1317,8,71,0
2,102,616,1168,480,1206,7,59,0
3,0,34,994,181,1115,4,68,0
4,88,1419,1175,413,1060,8,71,0
5,826,10886,1316,6885,2086,263,119,0
6,88,472,1200,652,1047,7,64,0
7,0,322,957,533,1062,11,73,0
8,0,200,1170,421,1038,5,63,0
9,103,1439,1085,1638,1151,29,66,0
10,0,1422,1074,4832,1084,27,74,0
11,1828,754,11030,263845,1209,10,79,0
12,340,1644,11181,175099,4127,13,136,0
13,71,1018,1029,2480,1276,18,66,1
14,0,3077,1116,1696,1129,6,62,0

“”””””
‘”””””
Total 105 data records

But the above error does not occur for 1 column , that is when Y = 1 column,
But the above same error happens when i choose any other column 2 , 3 or 4 .

108. hairo May 3, 2017 at 11:13 pm #

How to plot the graph for actual value against the predicted value here ?

How to save this plotted graphs and again view them back when required from terminal itself ?

• Jason Brownlee May 4, 2017 at 8:08 am #

It would make for a dull graph as this is a classification problem.

You might be better off reviewing the confusion matrix of a set of predictions.

109. Sudarshan May 5, 2017 at 12:18 pm #

How can this be applied to predict a value if a statistical dataset is given?
Say I am given the past 10 years of house prices and I want to predict the price of a house in the next one or two years.

Can you help me out with this?

I am an amateur in ML.

Thanks for this tutorial.
It gave me a good kickstart to ML.

• Jason Brownlee May 6, 2017 at 7:30 am #

This is called a time series forecasting problem.

https://machinelearningmastery.com/start-here/#timeseries

• Sudarshan May 6, 2017 at 3:15 pm #

For example, I have a dataset containing plumber work. Say the
attributes are
experience_level, date, rating, price/hour
I want to predict the price/hour for the next date based on experience level and average rating. Can you please help me with this?

• Jason Brownlee May 7, 2017 at 5:34 am #

Sorry, I cannot write an example for you.

110. Bane May 8, 2017 at 4:30 am #

Great job with the tutorial, it was really helpful.

I want to ask, how can I use the techniques above with a dataset that is not just one line with a few values, but an N×3 matrix with multiple values (measurements from an accelerometer)? Is there a tutorial? Where can I look it up?

• Jason Brownlee May 8, 2017 at 7:46 am #

Each feature would be a different input variable as in the example above.

111. Shud May 9, 2017 at 12:04 am #

Hey Jason,

I have built a linear regression model. y intercept is abnormally high (0.3 million) and adjusted r2 = 0.94. I would like to know what does high intercept mean?

• Jason Brownlee May 9, 2017 at 7:45 am #

Think of the intercept as the bias term.

Many books have been written on linear regression and much is known about how to analyze these models effectively. I would recommend diving into the statistics literature.
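
As a tiny illustration of the intercept as a bias term, the sketch below fits a line to made-up points generated from y = 2x + 5. The intercept is the predicted value when every input is zero, so a large intercept by itself mainly reflects the scale and offset of the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 2x + 5
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X[:, 0] + 5.0

model = LinearRegression().fit(X, y)
slope = model.coef_[0]
intercept = model.intercept_
print(slope, intercept)

# The prediction at x = 0 equals the intercept
at_zero = model.predict(np.array([[0.0]]))[0]
print(at_zero)
```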

112. MK May 11, 2017 at 12:19 am #

Excellent tutorial, i am moving from PHP to Python and taking baby steps. I used the Thonny IDE (http://thonny.org/) which is also very useful for python beginners.

113. Tmoe May 14, 2017 at 4:31 am #

Thank you so much, Jason! I’m new to machine learning and python but found your tutorial extremely helpful and easy to follow – thank you for posting!

• Jason Brownlee May 14, 2017 at 7:32 am #

Thanks Tmoe, I’m really glad to hear that!

114. melody12ab May 15, 2017 at 6:07 pm #

Thanks for everything, now I am starting to use ML!!!

115. smith May 15, 2017 at 9:36 pm #

# Spot Check Algorithms
models = []
models.append((‘LR’, LogisticRegression()))
models.append((‘LDA’, LinearDiscriminantAnalysis()))

When i print models , this is the output :

[(‘LR’, LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class=’ovr’, n_jobs=1,
penalty=’l2′, random_state=None, solver=’liblinear’, tol=0.0001,
verbose=0, warm_start=False)), (‘LDA’, LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
solver=’svd’, store_covariance=False, tol=0.0001)), (‘KNN’, KNeighborsClassifier(algorithm=’auto’, leaf_size=30, metric=’minkowski’,
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights=’uniform’))

What are these extra values inside LogisticRegression (…) and for all the other algorithms ?

How did they get appended ?

116. pasha May 15, 2017 at 9:45 pm #

When i print kfold :

KFold(n_splits=7, random_state=7, shuffle=False)

What is shuffle ? How did this value get added , as we had only done this :

kfold = model_selection.KFold(n_splits=10, random_state=seed)

• Jason Brownlee May 16, 2017 at 8:44 am #

Whether or not to shuffle the dataset prior to splitting into folds.
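
The effect can be seen on a toy dataset of 10 records, as in the sketch below. The sizes and the seed are arbitrary. Note that `shuffle` appears in the printed KFold even though it was never passed, because printing an estimator or splitter shows its default parameter values too.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)

# shuffle=False (the default): folds are consecutive blocks of records
ordered = [test.tolist() for _, test in KFold(n_splits=5).split(X)]
print(ordered)

# shuffle=True: records are assigned to folds in a random order,
# made repeatable by random_state
shuffled = [test.tolist()
            for _, test in KFold(n_splits=5, shuffle=True, random_state=7).split(X)]
print(shuffled)
```

In recent scikit-learn versions, `random_state` is only accepted together with `shuffle=True`.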

• pasha May 16, 2017 at 3:17 pm #

Now I understand. Jason, thanks for the amazing tutorials. Just one suggestion: along with the code, give a link to a detailed reference about these topics!

117. sita May 15, 2017 at 9:48 pm #

Hello jason

This is an amazing blog , Thank you for all the posts .

cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)

What is scoring here? Can you explain this “model_selection.cross_val_score” line in detail, please?

118. rahman May 15, 2017 at 10:27 pm #

ERROR :

Traceback (most recent call last):
File “/rahman/c-analyze/analyze.py”, line 390, in
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
File “/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py”, line 140, in cross_val_score
for train, test in cv_iter)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 758, in __call__
while self.dispatch_one_batch(iterator):
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 608, in dispatch_one_batch
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 109, in apply_async
result = ImmediateResult(func)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 326, in __init__
self.results = batch()
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File “/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py”, line 238, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File “/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py”, line 468, in fit
self._solve_svd(X, y)
File “/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py”, line 378, in _solve_svd
fac = 1. / (n_samples – n_classes)

ZeroDivisionError: float division by zero

# Split-out validation dataset

My code :

array = dataset.values
X = array[:,0:4]

if field == “rh”: #No error if i select this col
Y = array[:,0]

elif field == “rm”: #gives the above error
Y = array[:,1]

elif field == “wh”: #gives the above error
Y = array[:,2]

elif field == “wm”: #gives the above error
Y = array[:,3]

Have a look at the data :

index,1column,2 column,3column,….,8column
0,238,240,1103,409,1038,4,67,0
1,41,359,995,467,1317,8,71,0
2,102,616,1168,480,1206,7,59,0
3,0,34,994,181,1115,4,68,0
4,88,1419,1175,413,1060,8,71,0
5,826,10886,1316,6885,2086,263,119,0
6,88,472,1200,652,1047,7,64,0
7,0,322,957,533,1062,11,73,0
8,0,200,1170,421,1038,5,63,0
9,103,1439,1085,1638,1151,29,66,0
10,0,1422,1074,4832,1084,27,74,0
11,1828,754,11030,263845,1209,10,79,0
12,340,1644,11181,175099,4127,13,136,0
13,71,1018,1029,2480,1276,18,66,1
14,0,3077,1116,1696,1129,6,62,0

“”””””
‘”””””
Total 105 data records

But the above error does not occur for 1 column , that is when Y = 1 column,

But the above same error happens when i choose any other column 2 , 3 or 4 .

• Jason Brownlee May 16, 2017 at 8:45 am #

Perhaps try another algorithm?

119. suma May 16, 2017 at 12:05 am #

fac = 1. / (n_samples - n_classes)

ZeroDivisionError: float division by zero

What is this error: fac = 1. / (n_samples - n_classes)?

Where is n_samples and n_classes used ?

What may be the possible reason for this error ?
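
One plausible reading of this error, offered as a sketch rather than a confirmed diagnosis: LDA is a classifier, so it treats every distinct value in Y as a separate class. If a numeric column holds a different value in nearly every row, the number of classes can equal the number of samples, and `n_samples - n_classes` becomes zero. The column values below are made up for illustration.

```python
import numpy as np

# Hypothetical target column: every row holds a different value,
# as can happen when a continuous measurement is used as the class label.
y_all_unique = np.array([240, 359, 616, 34, 1419, 10886])

# Hypothetical target column with repeated values.
y_repeated = np.array([238, 41, 0, 0, 88, 88])


def samples_minus_classes(y):
    """Return n_samples - n_classes, the denominator from the traceback."""
    n_samples = len(y)
    n_classes = len(np.unique(y))
    return n_samples - n_classes


print(samples_minus_classes(y_all_unique))  # 0 -> dividing by this fails
print(samples_minus_classes(y_repeated))    # 2 -> division is fine
```

If the target really is a continuous quantity, the problem is a regression problem and a classifier such as LDA is probably not a good fit.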

120. bob May 22, 2017 at 6:46 pm #

thank you Dr Jason it is really very helpful. 🙂

• Jason Brownlee May 23, 2017 at 7:50 am #

You’re welcome bob, I’m glad to hear that!

121. Krithika May 24, 2017 at 12:24 am #

Hi Jason
Great starting tutorial to get the whole picture. Thank you:)
I am a newbie to machine learning. Could you please tell why you have specifically chosen these 6 models?

• Jason Brownlee May 24, 2017 at 4:57 am #

No specific reason, just a demonstration of spot checking a suite of methods on the problem.

122. Ram Gour May 25, 2017 at 8:24 pm #

Hi Jason, I am new to Python, but found this blog really helpful. I tried executing the code and it returned all the results you mention above, except for a few graphs.
The scatter matrix graph and the evaluation of the 6 algorithms did not open on my machine, but they show up on my colleague’s machine. I checked all the versions and they are the same as or higher than the ones you mention in the blog.
Can you help me resolve this issue on my machine?

• Jason Brownlee June 2, 2017 at 11:44 am #

Perhaps check the configuration of matplotlib and ensure you can create simple graphs on your machine?

123. sridhar May 25, 2017 at 8:50 pm #

Great tutorial.

How do I approach a data set that is not a classification problem and has just 2 attributes, where 1 is the input and the other is the output?

say I have number of processes as input and cpu usage as output..
data set looks like [10, 5] [15, 7] etc…

• Jason Brownlee June 2, 2017 at 11:45 am #

If the output is real-valued, it would be a regression problem. You would need to use a loss function like MSE.
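
As a rough illustration of the regression framing, the sketch below fits a straight line to a handful of (processes, CPU usage) pairs and scores it with mean squared error. The first two pairs come from the question; the rest are invented, and NumPy's least-squares line fit stands in for whichever regression model you would actually choose.

```python
import numpy as np

# (number of processes, cpu usage) pairs; only the first two are from the
# question above, the others are made-up values for illustration.
data = np.array([[10, 5], [15, 7], [20, 9], [25, 12], [30, 14]], dtype=float)
X = data[:, 0]  # input: number of processes
y = data[:, 1]  # output: cpu usage (real-valued, so this is regression)

# Fit y = slope * X + intercept by ordinary least squares.
slope, intercept = np.polyfit(X, y, deg=1)

# Mean squared error: average squared gap between prediction and truth.
predictions = slope * X + intercept
mse = np.mean((y - predictions) ** 2)

print("slope=%.3f intercept=%.3f mse=%.4f" % (slope, intercept, mse))
```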

124. pierre May 27, 2017 at 9:45 pm #

Many thanks for this — I already got a lot out of this. I feel like a monkey though because I was neither familiar enough with python nor had any clue of ML back alleys yesterday. Today I can see plots on my screen and even if I have no clue what I’m looking at, this is where I wanted to be, so thanks!

A few minor suggestions to make this perhaps even more dummy-proof:

– I’m on Mac and I used python3 because python2 is weirdly set up out of the box and you can’t update easily the libraries needed. I understand you link, rightfully to external installation instructions, so just to say, this stuff works in python3 if you needed further testimony.

– when drawing plots, I started freaking out because the terminal became unresponsive. So if you just made an (unessential) suggestion to run plt.ion() first, linking to, for example: https://matplotlib.org/faq/usage_faq.html#what-is-interactive-mode, it might help dummies like me to not give up too easily. (BTW I find your use command line philosophy and don’t let toolsets get in the way a great one indeed!)

– There seems to be some ‘hack’ involved when defining the dataset, suppose there are no headers and so on… how do you get to load your dataset with an insightful name vector in the first place (you don’t…) So just a hint of clarification would help here feeling we can trust that we do the right thing in this case because the data is well understood (I mean, this is not really a big deal eh it’s all par for the course but if I didn’t have similar experience in R I’d feel completely lost I think).

I was a bit puzzled by the following sentence in 3.3:

“We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.”

Well, just looking at the table, I actually can’t see any of this. There is in fact really nothing telling this to us in the snippet, right? The sentence is a comment based on prior understanding of the dataset. Maybe this could be clarified so clueless readers don’t agonise over whether they are missing some magical power of insight.

– Overall, I could run this and to some extent adapt it quickly to a different dataset until it became relevant what the data was like. I’m stumbling on the data manipulation for 5.1. I suppose it is both because I don’t know python structures and also because I have no clue what is being done in the selection step.

I think in answer to a previous comment you link to doc for the relevant selection function, perhaps it would still be useful to have an extra, ‘for dummies’, detailed explanation of

X = array[:,0:4]
Y = array[:,4]

in the context of the iris dataset. This is what I have to figure out, I think, in order to apply it to, say, an 11-column dataset, and it would be useful to know what I’m trying to do.

The rest of the difficulties I have are with regards to interpretation of the output and it is fair to say this is outside of the scope of your tutorial which puts dummies like me in a very good position to try to understand while being able to fiddle with a bit of code. All the above comments are extremely minor and really about polishing the readability for ultimate noobs, they are not really important and your tutorial is a great and efficient resource.

Thanks again!
Pierre

• Jason Brownlee June 2, 2017 at 12:04 pm #

Wonderful feedback pierre, thank you so much!
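
The slicing that pierre asks about can be shown on a tiny array. In `array[rows, columns]`, a bare `:` means "all rows", `0:4` means columns 0, 1, 2 and 3 (the end index is excluded), and `4` means the single column at index 4. The three rows below are iris-style values typed in by hand, and the 11-column case assumes the class label sits in the last column.

```python
import numpy as np

# Three iris-style rows: four measurements followed by the class label.
array = np.array([
    [5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
    [7.0, 3.2, 4.7, 1.4, "Iris-versicolor"],
    [6.3, 3.3, 6.0, 2.5, "Iris-virginica"],
], dtype=object)

X = array[:, 0:4]  # all rows, columns 0..3 -> the input features
Y = array[:, 4]    # all rows, column 4     -> the class to predict

print(X.shape)  # (3, 4)
print(Y.shape)  # (3,)

# For a hypothetical 11-column dataset whose label is the last column,
# the same idea becomes: 10 feature columns, then column index 10.
wide = np.zeros((3, 11))
X_wide = wide[:, 0:10]
Y_wide = wide[:, 10]
print(X_wide.shape, Y_wide.shape)  # (3, 10) (3,)
```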

125. Shaksham Kapoor June 6, 2017 at 4:18 am #

I’m not able to figure out what errors the confusion matrix represents, or what each column (precision, recall, f1-score, support) in the classification report signifies.

And last but not the least thanks a lot Sir for this easy to use and wonderful tutorial. Even words are not enough to express my gratitude, you have made a daunting task for every ML Enthusiast a hell lot easier !!!

• Jason Brownlee June 6, 2017 at 10:07 am #

https://machinelearningmastery.com/confusion-matrix-machine-learning/

• Shaksham Kapoor June 7, 2017 at 3:39 am #

Thanks a lot Sir. Please suggest some datasets from the UCI repository on which I can practice some small projects…

• Jason Brownlee June 7, 2017 at 7:26 am #

• Shaksham Kapoor June 7, 2017 at 6:48 pm #

How do you classify a problem into different categories? For example, the Iris dataset was a classification problem and pima-indian-diabetes a binary problem. How can we figure out which category a problem belongs to and which model to apply to it?

• Jason Brownlee June 8, 2017 at 7:40 am #

By careful evaluation of the output variable.

126. Brian June 6, 2017 at 11:11 pm #

Is this machine learning? What does the machine learn in this example? This is just plain Statistics, used in a weird way…

• Jason Brownlee June 7, 2017 at 7:14 am #

Yes, it is.

Nominally, statistics is about understanding the data, machine learning about making predictions at the cost of understanding.

• Raj June 9, 2017 at 2:22 am #

consider the formula for area of triangle 1/2 x base x height. When you learn this formula, you understand it and apply it many times for different triangles. BUT you did not learn anything ABOUT the formula itself. For instance, how many people care that the formula has 2 variables (base and height) and that there is no CONSTANT (like PI) in the formula and many such things about the formula itself? Applying the formula does not teach anything about the nature of the formula itself.

A lot of program execution in computers happen much the same way…data is a thing to be modified, applied or used, but not necessarily understood. When you introduce some techniques to understand data, then necessarily the computer or the ‘Machine’ ‘learns’ that there are characteristics about that data, and that at the least, there exists some relationship amongst data in their dataset. This learning is not explicitly programmed rather inferenced, although confusingly, the algorithms themselves are explicitly programmed to infer the meaning of the dataset. The learning is then transferred to the end cycle of making prediction based on the gained understanding of data.

but like you pointed out, it is still statistics and all its domain techniques, but as a statistician do you not ‘learn’ more about data than merely use it, unlike your counterparts who see data more as a commodity to be consumed? Because most computer systems do the latter (consumption) rather than the former (data understanding), a system that understands data (with prediction used as a proof of learning) can be called ‘Machine Learning’.

127. Alex June 7, 2017 at 6:04 am #

Thanks for good tutorial Jason.

The only issue I encountered is the following error while calculating the cross-validation score for the KNeighborsClassifier() model:

AttributeError: 'NoneType' object has no attribute 'issparse'

Has anybody else got the same error? How can it be solved?

I have installed the following versions of the tools:
Python: 2.7.13 |Anaconda custom (64-bit)| (default, Dec 19 2016, 13:29:36) [MSC v.1500 64 bit (AMD64)]
scipy: 0.19.0
numpy: 1.12.1
matplotlib: 2.0.0
pandas: 0.19.2
sklearn: 0.18.1

Thanks,
Alex

• Jason Brownlee June 7, 2017 at 7:27 am #

Ouch, sorry I have not seen this issue. Perhaps search on stackoverflow?

128. thanda June 8, 2017 at 6:31 pm #

HI, Jason!
How can i get the xgboost algorithm in pseudo code or in code?

129. Shaksham Kapoor June 9, 2017 at 1:14 am #

Sir,I’ve been working on bank_note authentication dataset and after applying the above procedure carefully the results were 100% accuracy(both on trained and validation dataset) using SVM and KNN models. Is 100% accuracy possible or have I done something wrong ?

• Jason Brownlee June 9, 2017 at 6:27 am #

That sounds great.

If I were to get surprising results, I would be skeptical of my code/models.

Work hard to ensure your system is not fooling you. Challenge surprising results.

• Shaksham Kapoor June 9, 2017 at 3:10 pm #

Sir, I’ve considered various other aspects like f1-score, recall, support ; but in each case the result is same 100%. How can I make sure that my system is not fooling me ? What other procedure can I apply to check the accuracy of my dataset ?

• Jason Brownlee June 10, 2017 at 8:13 am #

Get more data and see if the model can make accurate predictions.

130. Rejeesh R June 9, 2017 at 7:27 pm #

Hi, Jason!
I am new to Python as well as ML, and I am getting the error below while running your code. Please help me get the code running.

File "sample1.py", line 73, in <module>
predictions = knn.predict(X_validation)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/classification.py", line 143, in predict
X = check_array(X, accept_sparse='csr')
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 407, in check_array
_assert_all_finite(array)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 58, in _assert_all_finite
" or a value too large for %r." % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

and my config

Python: 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4]
scipy: 0.13.3
numpy: 1.8.2
matplotlib: 1.3.1
pandas: 0.13.1
sklearn: 0.18.1
running in Ubuntu Terminal.

• Jason Brownlee June 10, 2017 at 8:20 am #

You may have a NaN value in your dataset. Check your data file.
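
A quick way to look for such values before fitting, sketched with NumPy on a small made-up array (the real check would run on your own `X`):

```python
import numpy as np

# Made-up feature matrix with one missing value and one infinite value.
X = np.array([
    [5.1, 3.5, 1.4],
    [4.9, np.nan, 1.4],
    [4.7, 3.2, np.inf],
])

# np.isfinite is False for NaN, +inf and -inf.
bad = ~np.isfinite(X)

print("any bad values:", bad.any())
print("bad values per column:", bad.sum(axis=0))
print("rows containing bad values:", np.where(bad.any(axis=1))[0])

# One simple option: keep only the fully finite rows.
X_clean = X[~bad.any(axis=1)]
print(X_clean.shape)  # (1, 3)
```

Dropping rows is only one option; whether to drop or fill in missing values depends on the dataset.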

131. Sats S June 10, 2017 at 5:27 am #

Hello. This is really an amazing tutorial. I got down to everything but when selecting the best model i hit a snag. Can you help out?

Traceback (most recent call last):
File "/Users/sahityasehgal/Desktop/py/machinetest.py", line 77, in <module>
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/model_selection/_validation.py", line 140, in cross_val_score
for train, test in cv_iter)
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 758, in __call__
while self.dispatch_one_batch(iterator):
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 608, in dispatch_one_batch
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 109, in apply_async
result = ImmediateResult(func)
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 326, in __init__
self.results = batch()
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py", line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/model_selection/_validation.py", line 238, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/linear_model/logistic.py", line 1173, in fit
order="C")
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/utils/validation.py", line 526, in check_X_y
y = column_or_1d(y, warn=True)
File "/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/utils/validation.py", line 562, in column_or_1d
ValueError: bad input shape (94, 4)
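
The last line of this traceback says the target passed to the model has shape (94, 4), that is, four columns, whereas the classifier expects one label per row. A likely cause (an assumption, since the code is not shown) is a slice such as `array[:,0:4]` being used for Y instead of a single column index. The two shapes can be compared on a dummy array:

```python
import numpy as np

# Dummy data with the same number of rows as in the traceback.
array = np.zeros((94, 5))

Y_wrong = array[:, 0:4]  # a slice keeps four columns -> 2-D target
Y_right = array[:, 4]    # a single index keeps one column -> 1-D target

print(Y_wrong.shape)  # (94, 4)
print(Y_right.shape)  # (94,)
```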

132. Rene June 11, 2017 at 1:25 am #

Very insightful Jason, thank you for the post!

I was wondering if the models can be saved to/loaded from file, to avoid re-training a model each time we wish to make a prediction.

Thanks,

Rene
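
One common way to do this is Python's built-in `pickle` module, which can serialise a fitted scikit-learn model to disk. The sketch below assumes scikit-learn's bundled copy of the iris data and a KNN model; the file name is arbitrary.

```python
import os
import pickle
import tempfile

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Fit a model once.
X, y = load_iris(return_X_y=True)
model = KNeighborsClassifier()
model.fit(X, y)

# Save the fitted model to a file (a temporary directory is used here).
path = os.path.join(tempfile.mkdtemp(), "knn_iris.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later, possibly in another script: load it and predict without re-training.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(X[:3]))
```

It is generally safest to load a pickled model with the same scikit-learn version that saved it.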

133. Richard Bruning June 12, 2017 at 11:42 am #

Mr. Brownlee,

This is, by far, the most effective applied technology tutorial I have utilized.

You get right to the point and still have readers actually working with python, python libraries, IDE options, and of course machine learning. I am an electromechanical engineer with embedded C experience. Until now, I have been bogged down trying to traipse through python wizards’ idiosyncratic coding styles and verbose machine learning theory knowing there exists a friendlier path.

Thank you for showing me the way!

Rich

• Jason Brownlee June 13, 2017 at 8:13 am #

134. Praver Vats June 13, 2017 at 7:21 pm #

This was very informative….Thank You !

Actually I was working on a project on twitter analysis using python where I am extracting user interests through their tweets. I was thinking of using naive bayes classifier in textblob python library for training classifier with different type of pre-labeled tweets or different categories like politics,sports etc.
My only concern is whether it will be accurate: I tried passing about 10 tweets in the training set and, based on that, classifying my test set. I am getting some false cases and accuracy is around 85.

• Jason Brownlee June 14, 2017 at 8:44 am #

Good question, I’d suggest try it and see.

135. Kush Singh Kushwaha June 14, 2017 at 4:14 am #

Hi Jason,

This was great example. I was looking for something similar on internet all this time,glad I found this link. I wanted to compile a ML code end-to-end and see my basic infra is ready to start with the actual course work. As you said, from here we can learn more about each algorithm in detail. It would be great if you can start a Youtube channel and upload some easy to learn videos as well related to ML, Deep learning and Neural Networks.

Regards,
Kush Singh

• Jason Brownlee June 14, 2017 at 8:51 am #

Thanks.

Take a look at the rest of my blog and my books. I am dedicated to this mission.

136. Shaksham Kapoor June 14, 2017 at 4:34 am #

I’ve been working on a dataset which contains [Male,Female,Infant] as entries in first column rest all columns are integers. How can I replace [Male,Female,Infant] with a similar notation like [0,1,2] or something like that ? What is the most efficient way to do it ?

• Jason Brownlee June 14, 2017 at 8:51 am #

Excellent question.

I’m sure I have tutorials on this on my blog, try the blog search.
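
One simple option, sketched with a plain Python dictionary (the column values are made up and the particular integer codes are an arbitrary choice):

```python
# Made-up first column of the dataset.
sex = ["Male", "Female", "Infant", "Female", "Male"]

# Explicit mapping from category to integer code.
codes = {"Male": 0, "Female": 1, "Infant": 2}

encoded = [codes[value] for value in sex]
print(encoded)  # [0, 1, 2, 1, 0]
```

With a pandas DataFrame the same dictionary can be passed to `Series.map`; scikit-learn's `LabelEncoder` is another option, though it chooses the codes for you.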

137. Dev June 14, 2017 at 12:52 pm #

• Jason Brownlee June 15, 2017 at 8:42 am #

Change the URL to a filename and path.
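
A minimal sketch of loading from a local file, assuming the data has already been saved as a CSV. Here two rows are written to a temporary file only so that the example is self-contained; the column names follow the tutorial.

```python
import os
import tempfile

import pandas as pd

# Write two iris rows to a local file so the example is self-contained;
# in practice this would be the CSV file you downloaded.
path = os.path.join(tempfile.mkdtemp(), "iris.csv")
with open(path, "w") as f:
    f.write("5.1,3.5,1.4,0.2,Iris-setosa\n")
    f.write("4.9,3.0,1.4,0.2,Iris-setosa\n")

names = ["sepal-length", "sepal-width", "petal-length", "petal-width", "class"]

# Pass the local path where the tutorial passes the URL.
dataset = pd.read_csv(path, names=names)
print(dataset.shape)  # (2, 5)
```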

138. Vincent June 18, 2017 at 2:26 am #

Hi,

Nice tutorial, thanks!
Just a small clarification in case someone encounters the same issue as me:
if you get the error “This application failed to start because it could not find or load the Qt platform plugin “windows” in “”.” when you are trying to see your data visualizations, it may be (as in my case) because you are using PySide rather than PyQt.
In that case, add these lines before the “import matplotlib.pyplot as plt”:

import matplotlib
matplotlib.use('Qt4Agg')
matplotlib.rcParams['backend.qt4']='PySide'

Hope this will help

139. Danielle June 25, 2017 at 5:43 pm #

Fantastic tutorial! Running it today I noticed two changes from the tutorial above (undoubtedly because time has passed since it was created). New users might find the following observations useful:

#1 – Future Warning

Ran on OS X, Python 3.6.1, in a jupyter notebook, anaconda 4.4.0 installed:
scipy: 0.19.0
numpy: 1.12.1
matplotlib: 2.0.2
pandas: 0.20.1
sklearn: 0.18.1

I replaced this line in the #Load libraries code block:
from pandas.tools.plotting import scatter_matrix

With this:
from pandas.plotting import scatter_matrix

…because a FutureWarning popped up:
/Users/xxx/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: ‘pandas.tools.plotting.scatter_matrix’ is deprecated, import ‘pandas.plotting.scatter_matrix’ instead.

Note: it does run perfectly even without this fix, this may be more of an issue in the future

#2 – SVM wins!

In the build models section, the results were:
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.966667 (0.040825)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)

… which means SVM was better here. I added the following code block based on the KNN one:
# Make predictions on validation dataset
svm = SVC()
svm.fit(X_train, Y_train)
predictions = svm.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

which gets these results:
0.933333333333
[[ 7 0 0]
[ 0 10 2]
[ 0 0 11]]
precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 7
Iris-versicolor 1.00 0.83 0.91 12
Iris-virginica 0.85 1.00 0.92 11

avg / total 0.94 0.93 0.93 30

I did also run the unmodified KNN block – # Make predictions on validation dataset – and got the exact results that were in the tutorial.

Excellent tutorial, very clear, and easy to modify 🙂

• Jason Brownlee June 26, 2017 at 6:06 am #

Thanks for sharing Danielle.
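
The report in Danielle's comment can be reproduced by hand from the confusion matrix she posted, which may help when reading such output. Rows are the true classes and columns the predicted classes (scikit-learn's convention); precision divides each diagonal entry by its column total, and recall divides it by its row total.

```python
import numpy as np

# Confusion matrix from the comment above.
# Row/column order: Iris-setosa, Iris-versicolor, Iris-virginica.
cm = np.array([
    [7, 0, 0],
    [0, 10, 2],
    [0, 0, 11],
])

correct = np.diag(cm)                 # correctly classified, per class
precision = correct / cm.sum(axis=0)  # of those predicted as a class, how many were right
recall = correct / cm.sum(axis=1)     # of the true members of a class, how many were found
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)              # number of true samples per class
accuracy = correct.sum() / cm.sum()   # 28 correct out of 30

print("precision:", np.round(precision, 2))  # about 1.00, 1.00, 0.85
print("recall:   ", np.round(recall, 2))     # about 1.00, 0.83, 1.00
print("f1-score: ", np.round(f1, 2))         # about 1.00, 0.91, 0.92
print("support:  ", support)                 # 7, 12, 11
print("accuracy: ", accuracy)                # about 0.9333
```

The two off-diagonal entries in the second row are the two Iris-versicolor flowers that were predicted as Iris-virginica, which is why versicolor recall and virginica precision fall below 1.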

• abhilash April 2, 2020 at 12:34 am #

precision recall f1-score support

Iris-setosa 1.00 1.00 1.00 7
Iris-versicolor 1.00 0.83 0.91 12
Iris-virginica 0.85 1.00 0.92 11

How do I relate this result to the input? I mean, can I interactively provide the values for 'sepal-length', 'sepal-width', 'petal-length' and 'petal-width' and get back which class it is?

140. mr. disapointed June 26, 2017 at 10:06 pm #

So this intro shows how to set everything up but not the actual interesting bit how to use it?

141. Aditya June 28, 2017 at 4:48 pm #

Excellent tutorial sir, I love your tutorials and I am starting with deep learning with keras.
I would love if you could provide a tutorial for sequence to sequence model using keras and a relevant dataset.
Also I would be obliged if you could point me in some direction towards named entity recognition using seq2seq

142. RATNA June 30, 2017 at 4:19 am #

Hi Jason,

Awesome tutorial. I am working on PIMA dataset and while using the following command

I am getting NaN. Help me.

• Jason Brownlee June 30, 2017 at 8:18 am #

Confirm you downloaded the dataset and that the file contains CSV data with nothing extra or corrupted.

• RATNA June 30, 2017 at 4:14 pm #

Hi Jason,

I downloaded the dataset from UCI which is a CSV file but still I get NAN.

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"

Thanks..

• Jason Brownlee July 1, 2017 at 6:27 am #

Sorry, I do not see how this could be. Perhaps there is an issue with your environment?

143. Deepak July 2, 2017 at 1:50 am #

Hello Jason,
Thank you for a great tutorial.

I have noticed something , which I would like to share with you.

I have tried with random_state = 4
“X_train,X_validation,Y_train,Y_validation = model_selection.train_test_split(X,Y, test_size = 0.2, random_state = 4)”

and surprisingly now “LDA” has the best accuracy.

LR: 0.966667 (0.040825)
LDA: 0.991667 (0.025000)
KNN: 0.975000 (0.038188)
CART: 0.958333 (0.055902)
NB: 0.950000 (0.055277)
SVM: 0.983333 (0.033333)

Any thoughts on this?

144. Rui July 3, 2017 at 12:31 pm #

Hi Jason,

Thanks for your great example, this is really helpful. This end-to-end project is the best way to learn ML, much better than textbooks, which only focus on the separate concepts and not the whole forest. Will you please do more examples like this and explain them in detail next time?

Thanks,

Rui

145. Vaibhav July 4, 2017 at 4:33 pm #

__init__() got an unexpected keyword argument 'n_splites'

I am getting this error while running the code up to the “print(msg)” command.

• Jason Brownlee July 6, 2017 at 10:12 am #

Update your version of sklearn to 0.18 or higher.

146. Fahad Ahmed July 5, 2017 at 12:31 am #

This is beautiful tutorial for the starters..
I am a lover of machine learning and want to do some projects and research on it.
I would really need your help and guideline time to time.

Regards,

147. Neal Valiant July 12, 2017 at 9:08 am #

Hi Jason,
Love the article. gave me a good start of understanding machine learning. One thing i would like to ask is what is the predicted outcome? Is it which type or “class” of flower that will happen next? i assume switching things up I could use this same outline as a way of getting a prediction on the other columns involved?

• Jason Brownlee July 12, 2017 at 9:55 am #

Yes, the prediction is a number that maps to a specific class of flower (string).

Correct, from the class and other measures you could predict width or something.

• Neal July 13, 2017 at 3:50 am #

Hi again Jason,
Diving deeper into this tutorial and analyzing more, I found something that piqued my interest; maybe you can shed light on it. Based on the seed of 7 you get a higher accuracy percentage on the KNN algorithm after using kfold, but when showing the information for the LDA algorithm, it has a higher percentage in accuracy_score after predicting on it. What could this mean?

• Jason Brownlee July 13, 2017 at 9:59 am #

Machine learning algorithms are stochastic.

It is important to develop a robust estimate of the performance of machine learning models on unseen data using repeats. See this post:
https://machinelearningmastery.com/evaluate-skill-deep-learning-models/
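
One way to get a steadier estimate, sketched with scikit-learn's `RepeatedStratifiedKFold` on its bundled iris data (the choice of 10 folds, 3 repeats and the two models compared is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation repeated 3 times with different shuffles,
# giving 30 accuracy scores per model instead of 10.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

results = {}
for name, model in [("LDA", LinearDiscriminantAnalysis()),
                    ("KNN", KNeighborsClassifier())]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print("%s: %f (%f)" % (name, scores.mean(), scores.std()))
```

When the mean accuracies of two models differ by less than their standard deviations, as in several of the result lists posted in this thread, the ranking can easily flip with a different seed or split.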

• Neal July 13, 2017 at 11:22 am #

Another great read Jason. This whole site is full of great pieces and it gives me a good answer on my question. I want to thank you for your time and effort into making such a great place for all this knowledge.

• Jason Brownlee July 13, 2017 at 4:54 pm #

Thanks, I’m glad it helps Neal. Stick with it!

148. Thomas July 14, 2017 at 8:10 pm #

Hello Jason,

At the beginning of your tutorial you write: “If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.”
No offense, but in this regard your tutorial is not doing a very good job.
You don’t really go into detail so that we can understand what is being done and why. The explanations are rather weak.
Wrong expectations set, I believe.

Cheers,

Thomas

• Jason Brownlee July 15, 2017 at 9:43 am #

It is a starting point, not a panacea.

Sorry that it’s not a good fit for you.

149. Mariah July 15, 2017 at 7:11 am #

Hi Jason! I am trying to adapt this for a purely binary dataset, however I’m running into this problem:
# evaluate each model in turn
results = []
name = []
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=seed)
cv_results = model_selection.cross_val_score(model, X_train, Y_train,cv = kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = "%s:%f(%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)

I get the error:

raise ValueError("Unknown label type: %r" % y_type)

ValueError: Unknown label type: 'unknown'

Am I missing something, any help would be great!

• Mariah July 15, 2017 at 7:12 am #

All necessary indentations are correct, it just pasted incorrectly

• Jason Brownlee July 15, 2017 at 9:46 am #

You can wrap pasted code in pre tags.

• Jason Brownlee July 15, 2017 at 9:46 am #

Sorry, the fault is not obvious to me.

• Daniel September 12, 2017 at 1:14 am #

Hello Mariah,

Did you ever get a solution to this problem?

Jason..great guide here..THANKS!
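
A frequent cause of this error, offered here as a guess since the data is not shown: when a DataFrame mixes column types, `dataset.values` returns an array of dtype `object`, and scikit-learn cannot tell what kind of labels an `object` array of numbers holds. Casting the label column to integers usually resolves it. The array below is made up.

```python
import numpy as np

# Made-up binary dataset stored as dtype=object, as can happen when
# dataset.values is taken from a DataFrame with mixed column types.
array = np.array([
    [0.5, 1.2, 1],
    [0.1, 0.7, 0],
    [0.9, 1.5, 1],
], dtype=object)

X = array[:, 0:2].astype(float)
Y = array[:, 2]
print(Y.dtype)  # object

# Cast the 0/1 labels to a proper integer type before fitting.
Y = Y.astype(int)
print(Y.dtype)  # an integer dtype
```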

150. Sreeram July 16, 2017 at 10:09 pm #

Hi. What should I do to make predictions based on my own test set? Say I need to predict the category of a flower with data [5.2, 1.8, 1.6, 0.2], i.e. I want to change my X_test to that array, and the prediction should be like “setosa”.

What changes should I make? I tried giving that value directly to predict(), but it crashes.

• Jason Brownlee July 17, 2017 at 8:47 am #

Correct.

Fit the model on all available data. This is called creating a final model:
https://machinelearningmastery.com/train-final-machine-learning-model/

Then make your prediction on new data where you do not know the answer/outcome.

Does that help?

• Sreeram July 18, 2017 at 2:35 am #

Yes it helped. Can u show an example code for the same.?

• Jason Brownlee July 18, 2017 at 8:46 am #

Sure:
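
A minimal sketch of that workflow, assuming scikit-learn's bundled copy of the iris data and a KNN model (an illustration, not necessarily the code originally posted in this reply). Note that `predict()` expects a 2-D input with one row per sample, so a single flower goes in as a list containing one list; passing a flat list of four numbers is a likely reason for the crash described above.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Fit the final model on all available data.
iris = load_iris()
model = KNeighborsClassifier()
model.fit(iris.data, iris.target)

# One new flower: sepal length, sepal width, petal length, petal width.
# The outer list makes this a 2-D input with a single row.
new_sample = [[5.2, 1.8, 1.6, 0.2]]

# The bundled dataset encodes classes as integers; map back to a name.
predicted_index = model.predict(new_sample)[0]
print(iris.target_names[predicted_index])
```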

151. Joe July 18, 2017 at 7:49 am #

Hi Jason, I’m from Peru and I have this script, written on a Mac:
# Configure for the neural network
fechantinicio = '1970-01-01'
fechantfinal = '1974-12-31'
capasinicio = TodasEstaciones.ix[fechantinicio:fechantfinal].as_matrix()[:,[0,2,5]]
capasalida = TodasEstaciones.ix[fechantinicio:fechantfinal].as_matrix()[:,1]
# Build the neural network

from sknn.mlp import Regressor, Layer

neurones = 8
tasaaprendizaje = 0.0001
numiteraciones = 7000

# Definition of the training for the neural network
redneural = Regressor(
    layers=[
        Layer("ExpLin", units=neurones),
        Layer("ExpLin", units=neurones), Layer("Linear")],
    learning_rate=tasaaprendizaje,
    n_iter=numiteraciones)
redneural.fit(capasinicio, capasalida)

# Get the prediction for the train set
valortest = ([])

for i in range(capasinicio.shape[0]):
    prediccion = redneural.predict(np.array([capasinicio[i,:].tolist()]))
    valortest.append(prediccion[0][0])

and when I run it I get:
ModuleNotFoundError Traceback (most recent call last)
in ()
1 # Build the neural network
2
----> 3 from sknn.mlp import Regressor, Layer
4
5

ModuleNotFoundError: No module named 'sknn'
I have installed Python on Windows 7 and I changed the script like so:

# Build the neural network
import numpy as np
from sklearn.neural_network import MLPRegressor

# Definition of the training for the neural network

redneural = MLPRegressor(