Last Updated on August 19, 2020
Do you want to do machine learning using Python, but you’re having trouble getting started?
In this post, you will complete your first machine learning project using Python.
In this step-by-step tutorial you will:
- Download and install Python SciPy and get the most useful package for machine learning in Python.
- Load a dataset and understand its structure using statistical summaries and data visualization.
- Create 6 machine learning models, pick the best and build confidence that the accuracy is reliable.
If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.
Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started!
- Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
- Update Mar/2017: Added links to help setup your Python environment.
- Update Apr/2018: Added some helpful links about randomness and predicting.
- Update Sep/2018: Added link to my own hosted version of the dataset.
- Update Feb/2019: Updated for sklearn v0.20, also updated plots.
- Update Oct/2019: Added links at the end to additional tutorials to continue on.
- Update Nov/2019: Added full code examples for each section.
- Update Dec/2019: Updated examples to remove warnings due to API changes in v0.22.
- Update Jan/2020: Updated to remove the snippet for the test harness.

Your First Machine Learning Project in Python Step-By-Step
Photo by cosmoflash, some rights reserved.
How Do You Start Machine Learning in Python?
The best way to learn machine learning is by designing and completing small projects.
Python Can Be Intimidating When Getting Started
Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use both for research and development and for building production systems.
There are also a lot of modules and libraries to choose from, providing multiple ways to do each task. It can feel overwhelming.
The best way to get started using Python for machine learning is to complete a project.
- It will force you to install and start the Python interpreter (at the very least).
- It will give you a bird’s eye view of how to step through a small project.
- It will give you confidence, maybe to go on to your own small projects.
Beginners Need A Small End-to-End Project
Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.
When you are applying machine learning to your own datasets, you are working on a project.
A machine learning project may not be linear, but it has a number of well known steps:
- Define Problem.
- Prepare Data.
- Evaluate Algorithms.
- Improve Results.
- Present Results.
The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing data, evaluating algorithms, and making some predictions.
If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and improving results, later, once you have more confidence.
Hello World of Machine Learning
The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).
This is a good project because it is so well understood.
- Attributes are numeric so you have to figure out how to load and handle data.
- It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
- It is a multi-class classification problem (multinomial) that may require some specialized handling.
- It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
- All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.
Let’s get started with your hello world machine learning project in Python.
Machine Learning in Python: Step-By-Step Tutorial
(start here)
In this section, we are going to work through a small machine learning project end-to-end.
Here is an overview of what we are going to cover:
- Installing the Python and SciPy platform.
- Loading the dataset.
- Summarizing the dataset.
- Visualizing the dataset.
- Evaluating some algorithms.
- Making some predictions.
Take your time. Work through each step.
Try to type in the commands yourself or copy-and-paste the commands to speed things up.
If you have any questions at all, please leave a comment at the bottom of the post.
1. Downloading, Installing and Starting Python SciPy
Get the Python and SciPy platform installed on your system if it is not already.
I do not want to cover this in great detail, because others already have. This is already pretty straightforward, especially if you are a developer. If you do need help, ask a question in the comments.
1.1 Install SciPy Libraries
This tutorial assumes Python version 2.7 or 3.6+.
There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries required for this tutorial:
- scipy
- numpy
- matplotlib
- pandas
- sklearn
There are many ways to install these libraries. My best advice is to pick one method and then be consistent when installing each library.
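For example, if you use pip, a single command can install all five (note that the sklearn package is published on PyPI under the name scikit-learn):

pip install scipy numpy matplotlib pandas scikit-learn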
The scipy installation page provides excellent instructions for installing the above libraries on multiple different platforms, such as Linux, macOS, and Windows. If you have any doubts or questions, refer to this guide; it has been followed by thousands of people.
- On macOS, you can use MacPorts to install Python 3.6 and these libraries. For more information on MacPorts, see the homepage.
- On Linux, you can use your package manager, such as yum on Fedora, to install RPMs.
If you are on Windows or you are not confident, I would recommend installing the free version of Anaconda that includes everything you need.
Note: This tutorial assumes you have scikit-learn version 0.20 or higher installed.
Need more help? See one of these tutorials:
- How to Setup a Python Environment for Machine Learning with Anaconda
- How to Create a Linux Virtual Machine For Machine Learning With Python 3
1.2 Start Python and Check Versions
It is a good idea to make sure your Python environment was installed successfully and is working as expected.
The script below will help you test out your environment. It imports each library required in this tutorial and prints the version.
Open a command line and start the python interpreter:
python
I recommend working directly in the interpreter, or writing your scripts and running them on the command line, rather than using big editors and IDEs. Keep things simple and focus on the machine learning, not the toolchain.
Type or copy and paste the following script:
# Check the versions of libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
Here is the output I get on my OS X workstation:
Python: 3.6.11 (default, Jun 29 2020, 13:22:26)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]
scipy: 1.5.2
numpy: 1.19.1
matplotlib: 3.3.0
pandas: 1.1.0
sklearn: 0.23.2
Compare the above output to your versions.
Ideally, your versions should match or be more recent. The APIs do not change quickly, so do not be too concerned if you are a few versions behind; everything in this tutorial will very likely still work for you.
If you get an error, stop. Now is the time to fix it.
If you cannot run the above script cleanly you will not be able to complete this tutorial.
My best advice is to Google search for your error message or post a question on Stack Exchange.
2. Load The Data
We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics by pretty much everyone.
The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.
You can learn more about this dataset on Wikipedia.
In this step we are going to load the iris data from a CSV file URL.
2.1 Import libraries
First, let’s import all of the modules, functions and objects we are going to use in this tutorial.
# Load libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
...
Everything should load without error. If you have an error, stop. You need a working SciPy environment before continuing. See the advice above about setting up your environment.
2.2 Load Dataset
We can load the data directly from the UCI Machine Learning repository.
We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.
Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.
...
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
The dataset should load without incident.
If you do have network problems, you can download the iris.csv file into your working directory and load it using the same method, changing the URL to the local file name.
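A minimal sketch of that, assuming you saved the file as iris.csv in your working directory:

# load the dataset from a local file instead of the URL
from pandas import read_csv
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv('iris.csv', names=names)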
3. Summarize the Dataset
Now it is time to take a look at the data.
In this step we are going to take a look at the data a few different ways:
- Dimensions of the dataset.
- Peek at the data itself.
- Statistical summary of all attributes.
- Breakdown of the data by the class variable.
Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.
3.1 Dimensions of Dataset
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
...
# shape
print(dataset.shape)
You should see 150 instances and 5 attributes:
(150, 5)
3.2 Peek at the Data
It is also always a good idea to actually eyeball your data.
...
# head
print(dataset.head(20))
You should see the first 20 rows of the data:
   sepal-length  sepal-width  petal-length  petal-width        class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
5           5.4          3.9           1.7          0.4  Iris-setosa
6           4.6          3.4           1.4          0.3  Iris-setosa
7           5.0          3.4           1.5          0.2  Iris-setosa
8           4.4          2.9           1.4          0.2  Iris-setosa
9           4.9          3.1           1.5          0.1  Iris-setosa
10          5.4          3.7           1.5          0.2  Iris-setosa
11          4.8          3.4           1.6          0.2  Iris-setosa
12          4.8          3.0           1.4          0.1  Iris-setosa
13          4.3          3.0           1.1          0.1  Iris-setosa
14          5.8          4.0           1.2          0.2  Iris-setosa
15          5.7          4.4           1.5          0.4  Iris-setosa
16          5.4          3.9           1.3          0.4  Iris-setosa
17          5.1          3.5           1.4          0.3  Iris-setosa
18          5.7          3.8           1.7          0.3  Iris-setosa
19          5.1          3.8           1.5          0.3  Iris-setosa
3.3 Statistical Summary
Now we can take a look at a summary of each attribute.
This includes the count, mean, the min and max values as well as some percentiles.
...
# descriptions
print(dataset.describe())
We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.
       sepal-length  sepal-width  petal-length  petal-width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
3.4 Class Distribution
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.
...
# class distribution
print(dataset.groupby('class').size())
We can see that each class has the same number of instances (50 or 33% of the dataset).
class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
3.5 Complete Example
For reference, we can tie all of the previous elements together into a single script.
The complete example is listed below.
# summarize the data
from pandas import read_csv
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# shape
print(dataset.shape)
# head
print(dataset.head(20))
# descriptions
print(dataset.describe())
# class distribution
print(dataset.groupby('class').size())
4. Data Visualization
We now have a basic idea about the data. We need to extend that with some visualizations.
We are going to look at two types of plots:
- Univariate plots to better understand each attribute.
- Multivariate plots to better understand the relationships between attributes.
4.1 Univariate Plots
We start with some univariate plots, that is, plots of each individual variable.
Given that the input variables are numeric, we can create box and whisker plots of each.
...
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()
This gives us a much clearer idea of the distribution of the input attributes:

Box and Whisker Plots for Each Input Variable for the Iris Flowers Dataset
We can also create a histogram of each input variable to get an idea of the distribution.
...
# histograms
dataset.hist()
pyplot.show()
It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

Histogram Plots for Each Input Variable for the Iris Flowers Dataset
4.2 Multivariate Plots
Now we can look at the interactions between the variables.
First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
...
# scatter plot matrix
scatter_matrix(dataset)
pyplot.show()
Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

Scatter Matrix Plot for Each Input Variable for the Iris Flowers Dataset
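If you want to put a number on that grouping, a quick sketch (assuming the dataset loaded above) prints the pairwise correlations of the four numeric columns:

...
# correlations between the numeric input attributes
print(dataset[['sepal-length', 'sepal-width', 'petal-length', 'petal-width']].corr())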
4.3 Complete Example
For reference, we can tie all of the previous elements together into a single script.
The complete example is listed below.
# visualize the data
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()
# histograms
dataset.hist()
pyplot.show()
# scatter plot matrix
scatter_matrix(dataset)
pyplot.show()
5. Evaluate Some Algorithms
Now it is time to create some models of the data and estimate their accuracy on unseen data.
Here is what we are going to cover in this step:
- Separate out a validation dataset.
- Set up the test harness to use 10-fold cross validation.
- Build multiple different models to predict species from flower measurements.
- Select the best model.
5.1 Create a Validation Dataset
We need to know that the model we created is good.
Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.
That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.
We will split the loaded dataset into two, 80% of which we will use to train, evaluate and select among our models, and 20% that we will hold back as a validation dataset.
...
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later.
Notice that we used a Python slice to select the columns in the NumPy array. If this is new to you, you might want to check out this post:
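As a quick illustration of that slicing syntax on a small, made-up array:

...
import numpy as np
data = np.array([[1, 2, 3], [4, 5, 6]])
print(data[:, 0:2]) # all rows, first two columns
print(data[:, 2])   # all rows, last column only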
5.2 Test Harness
We will use stratified 10-fold cross validation to estimate model accuracy.
This will split our dataset into 10 parts, train on 9 parts and test on 1, and repeat for all combinations of train-test splits.
Stratified means that each fold or split of the dataset will aim to have the same distribution of examples by class as exists in the whole training dataset.
For more on the k-fold cross-validation technique, see the tutorial:
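If you want to see the stratification for yourself, here is a small sketch (assuming the X_train and Y_train arrays from the previous step) that prints the class counts in each test fold:

...
# count the examples of each class in every test fold
from collections import Counter
from sklearn.model_selection import StratifiedKFold
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
for _, test_index in kfold.split(X_train, Y_train):
    print(Counter(Y_train[test_index]))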
We set the random seed via the random_state argument to a fixed number to ensure that each algorithm is evaluated on the same splits of the training dataset.
The specific random seed does not matter. Learn more about pseudorandom number generators here:
We are using the metric of ‘accuracy’ to evaluate models.
This is the ratio of the number of correctly predicted instances to the total number of instances, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the scoring argument when we build and evaluate each model next.
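As a tiny worked example of the metric with made-up labels:

...
from sklearn.metrics import accuracy_score
actual = ['a', 'a', 'b', 'b']
predicted = ['a', 'b', 'b', 'b']
print(accuracy_score(actual, predicted)) # 3 of 4 correct = 0.75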
5.3 Build Models
We don’t know which algorithms would be good on this problem or what configurations to use.
We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.
Let’s test 6 different algorithms:
- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN).
- Classification and Regression Trees (CART).
- Gaussian Naive Bayes (NB).
- Support Vector Machines (SVM).
This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms.
Let’s build and evaluate our models:
...
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
5.4 Select Best Model
We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
Running the example above, we get the following raw results:
LR: 0.960897 (0.052113)
LDA: 0.973974 (0.040110)
KNN: 0.957191 (0.043263)
CART: 0.957191 (0.043263)
NB: 0.948858 (0.056322)
SVM: 0.983974 (0.032083)
Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.
What scores did you get?
Post your results in the comments below.
In this case, we can see that it looks like Support Vector Machines (SVM) has the largest estimated accuracy score at about 0.98 or 98%.
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (via 10-fold cross validation).
A useful way to compare the samples of results for each algorithm is to create a box and whisker plot for each distribution and compare the distributions.
...
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()
We can see that the box and whisker plots are squashed at the top of the range, with many evaluations achieving 100% accuracy and some pushing down into the high 80s.

Box and Whisker Plot Comparing Machine Learning Algorithms on the Iris Flowers Dataset
5.5 Complete Example
For reference, we can tie all of the previous elements together into a single script.
The complete example is listed below.
# compare algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1, shuffle=True)
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
# Compare Algorithms
pyplot.boxplot(results, labels=names)
pyplot.title('Algorithm Comparison')
pyplot.show()
6. Make Predictions
We must choose an algorithm to use to make predictions.
The results in the previous section suggest that the SVM was perhaps the most accurate model. We will use this model as our final model.
Now we want to get an idea of the accuracy of the model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both of these issues will result in an overly optimistic result.
6.1 Make Predictions
We can fit the model on the entire training dataset and make predictions on the validation dataset.
...
# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
You might also like to make predictions for single rows of data. For examples on how to do that, see the tutorial:
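As a minimal sketch, assuming the model fit above and using a made-up row of measurements:

...
# predict the species for a single new flower (hypothetical values)
row = [[5.1, 3.5, 1.4, 0.2]]
print(model.predict(row))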
You might also like to save the model to file and load it later to make predictions on new data. For examples on how to do this, see the tutorial:
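One common approach is the pickle module from the standard library; here is a minimal sketch, assuming the fitted model from above:

...
import pickle
# save the fitted model to file
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
# later: load the model and make predictions on new data
with open('model.pkl', 'rb') as f:
    loaded = pickle.load(f)
print(loaded.predict([[5.1, 3.5, 1.4, 0.2]]))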
6.2 Evaluate Predictions
We can evaluate the predictions by comparing them to the expected results in the validation set, then calculate classification accuracy, as well as a confusion matrix and a classification report.
...
# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
We can see that the accuracy is about 0.967, or roughly 96.7%, on the hold-out dataset.
The confusion matrix provides an indication of the errors made.
Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).
0.9666666666666667
[[11  0  0]
 [ 0 12  1]
 [ 0  0  6]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        11
Iris-versicolor       1.00      0.92      0.96        13
 Iris-virginica       0.86      1.00      0.92         6

       accuracy                           0.97        30
      macro avg       0.95      0.97      0.96        30
   weighted avg       0.97      0.97      0.97        30
6.3 Complete Example
For reference, we can tie all of the previous elements together into a single script.
The complete example is listed below.
# make predictions
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC
# Load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
y = array[:,4]
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)
# Make predictions on validation dataset
model = SVC(gamma='auto')
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)
# Evaluate predictions
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
You Can Do Machine Learning in Python
Work through the tutorial above. It will take you 5-to-10 minutes, max!
You do not need to understand everything (at least not right now). Your goal is to run through the tutorial end-to-end and get a result; you do not need to understand everything on the first pass. List down your questions as you go. Make heavy use of the help("FunctionName") syntax in Python to learn about all of the functions that you’re using.
You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.
You do not need to be a Python programmer. The syntax of the Python language can be intuitive even if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = "b"). This will get you most of the way. You are a developer; you know how to pick up the basics of a language real fast. Just get started and dive into the details later.
You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.
What about the other steps in a machine learning project? We did not cover all of the steps in a machine learning project, because this is your first project and we need to focus on the key steps: loading data, looking at the data, evaluating some algorithms, and making some predictions. In later tutorials we can look at other data preparation and result improvement tasks.
Summary
In this post, you discovered step-by-step how to complete your first machine learning project in Python.
You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.
Your Next Step
Did you work through the tutorial?
- Work through the above tutorial.
- List any questions you have.
- Search-for or research the answers.
- Remember, you can use help("FunctionName") in Python to get help on any function.
Do you have a question?
Post it in the comments below.
More Tutorials?
Looking to continue practicing your machine learning skills? Take a look at some of these tutorials:
Awesome… But in your Blog please introduce SOM ( Self Organizing maps) for unsupervised methods and also add printing parameters ( Coefficients )code.
I generally don’t cover unsupervised methods like clustering and projection methods.
This is because I mainly focus on and teach predictive modeling (e.g. classification and regression) and I just don’t find unsupervised methods that useful.
Jason,
Can you elaborate on why you don’t find unsupervised methods useful?
Because my focus is predictive modeling.
DeprecationWarning: the imp module is deprecated in favour of importlib; see the module’s documentation for alternative uses
what is the error?
You can ignore this warning for now.
Can you please help? Where am I making a mistake?
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
ValueError Traceback (most recent call last)
in
13 for name, model in models:
14 kfold = model_selection.KFold(n_splits=10, random_state=seed)
---> 15 cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_val_score(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, error_score)
400 fit_params=fit_params,
401 pre_dispatch=pre_dispatch,
--> 402 error_score=error_score)
403 return cv_results['test_score']
404
~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in cross_validate(estimator, X, y, groups, scoring, cv, n_jobs, verbose, fit_params, pre_dispatch, return_train_score, return_estimator, error_score)
238 return_times=True, return_estimator=return_estimator,
239 error_score=error_score)
--> 240 for train, test in cv.split(X, y, groups))
241
242 zipped_scores = list(zip(*scores))
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self, iterable)
915 # remaining jobs.
916 self._iterating = False
--> 917 if self.dispatch_one_batch(iterator):
918 self._iterating = self._original_iterator is not None
919
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in dispatch_one_batch(self, iterator)
757 return False
758 else:
--> 759 self._dispatch(tasks)
760 return True
761
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in _dispatch(self, batch)
714 with self._lock:
715 job_idx = len(self._jobs)
--> 716 job = self._backend.apply_async(batch, callback=cb)
717 # A job can complete so quickly than its callback is
718 # called before we get here, causing self._jobs to
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in apply_async(self, func, callback)
180 def apply_async(self, func, callback=None):
181 """Schedule a func to be run"""
--> 182 result = ImmediateResult(func)
183 if callback:
184 callback(result)
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\_parallel_backends.py in __init__(self, batch)
547 # Don't delay the application, to avoid keeping the input
548 # arguments in memory
--> 549 self.results = batch()
550
551 def get(self):
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in __call__(self)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\Anaconda3\lib\site-packages\sklearn\externals\joblib\parallel.py in (.0)
223 with parallel_backend(self._backend, n_jobs=self._n_jobs):
224 return [func(*args, **kwargs)
--> 225 for func, args, kwargs in self.items]
226
227 def __len__(self):
~\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py in _fit_and_score(estimator, X, y, scorer, train, test, verbose, parameters, fit_params, return_train_score, return_parameters, return_n_test_samples, return_times, return_estimator, error_score)
526 estimator.fit(X_train, **fit_params)
527 else:
--> 528 estimator.fit(X_train, y_train, **fit_params)
529
530 except Exception as e:
~\Anaconda3\lib\site-packages\sklearn\linear_model\logistic.py in fit(self, X, y, sample_weight)
1284 X, y = check_X_y(X, y, accept_sparse='csr', dtype=_dtype, order="C",
1285 accept_large_sparse=solver != 'liblinear')
-> 1286 check_classification_targets(y)
1287 self.classes_ = np.unique(y)
1288 n_samples, n_features = X.shape
~\Anaconda3\lib\site-packages\sklearn\utils\multiclass.py in check_classification_targets(y)
169 if y_type not in ['binary', 'multiclass', 'multiclass-multioutput',
170 'multilabel-indicator', 'multilabel-sequences']:
--> 171 raise ValueError("Unknown label type: %r" % y_type)
172
173
ValueError: Unknown label type: 'continuous'
I have some suggestions here:
https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
Thanks jason ur teachings r really helpful more power to u thanks a ton…learning lots of predictive modelling from ur pages!!!
Thank you for your kind words and feedback, Vaisakh!
RandomForestClassifier : 1.0
I got quite different results though i used same seed and splits
Svm : 0.991667 (0.025) with highest accuracy
KNN : 0.9833
CART : 0.9833
Why ?
I’m getting an error saying “cannot perform reduce with flexible type” while comparing the algorithms using box plots.
Sorry, I have not seen this error before. Are you able to confirm that your environment is up to date?
I followed your steps and I got the similar result as Aishwarya
SVM: 0.991667 (0.025000)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
The API may have changed since I wrote this post. This in turn may have resulted in small changes in predictions that are perhaps not statistically significant.
Ive done this on kaggle.
Under ML kernal
http://Www.kaggle.com/aishuvenkat09
Sorry
http://Www.kaggle.com/aishwarya09
Well done!
Hi ,
I have the same issue our friends discussed above:
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.983333 (0.033333)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)
Of those, SVM has more accuracy compared to the rest, so I will go ahead with SVM.
Yes.
Yes. I got the same. Dr. Jason had mentioned that results might vary.
I also have the same result.
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.983333 (0.033333)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)
Nice.
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
Sir, I am getting an error in this line of code. What should I do?
What error?
File "", line 1, in
NameError: name 'model' is not defined
Looks like you may have missed a few lines of code.
Perhaps try copy-pasting the complete example at the end of each section?
I think cv may be equal to the number of times you want to perform k-fold cross validation for e.g. 10,20etc. and in scoring parameter, you need to mention which type of scoring parameter you want to use for example ‘accuracy’.
Hope this might help….
Correct.
More on how cross validation works here:
https://machinelearningmastery.com/k-fold-cross-validation/
Bro kindly use train_test_split() in the place of model_selection
Try this
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=None)
It worked for me!
put the kfold = , and cv_results = , part inside the for loop it will work fine.
thank you so much really its very useful
In the last step you used KNN to make predictions. Why did you use KNN? Can we use SVM, and can we compare all of the models when making predictions?
It is just an example, you can make predictions with any model you wish.
Often we prefer simpler models (like knn) over more complex models (like svm).
Hi Jason
I followed your steps but I’m getting error. What should I do? Best regards
>>> # Spot Check Algorithms
... models = []
>>> models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
>>> models.append(('LDA', LinearDiscriminantAnalysis()))
>>> models.append(('KNN', KNeighborsClassifier()))
>>> models.append(('CART', DecisionTreeClassifier()))
>>> models.append(('NB', GaussianNB()))
>>> models.append(('SVM', SVC(gamma='auto')))
>>> # evaluate each model in turn
... results = []
>>> names = []
>>> for name, model in models:
... kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
File "", line 2
kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
^
IndentationError: expected an indented block
>>> cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring='accuracy')
Traceback (most recent call last):
File "", line 1, in
NameError: name 'model' is not defined
>>> results.append(cv_results)
Traceback (most recent call last):
File "", line 1, in
NameError: name 'cv_results' is not defined
Sorry to hear that.
Try to copy the complete example at the end of the section into a text file and preserve white space:
https://machinelearningmastery.com/faq/single-faq/how-do-i-copy-code-from-a-tutorial
It’s OK now. I didn’t know Python is sensitive to tabs. It’s wonderful. Thanks, thanks!
You’re welcome.
Could you elaborate a bit more about the difference between prediction and projection?
For example I got a data set that I collected throughout a year, and I would like to predict/project what will happen next year.
Good question, you find a model that performs well on your available data, fit a final model and use it to predict on new data.
It sounds like perhaps your data is a time series, if so perhaps this would be a good place to start:
https://machinelearningmastery.com/start-here/#timeseries
Sir, I want to work on crop price data for a crop price prediction project (my minor project), but I cannot find the crop price data. Please help me, sir, and send me a link to a crop price CSV file.
Perhaps this will help:
https://machinelearningmastery.com/faq/single-faq/where-can-i-get-a-dataset-on-___
Hello Jason,
Thank you for this amazing tutorial, it helped me to gain confidence:
Please see my results:
LR: 0.941667 (0.065085)
LDA: 0.975000 (0.038188)
KNN: 0.958333 (0.041667)
NB: 0.950000 (0.055277)
SVM: 0.983333 (0.033333)
predictions: ['Iris-setosa' 'Iris-versicolor' 'Iris-versicolor' 'Iris-setosa'
 'Iris-virginica' 'Iris-versicolor' 'Iris-virginica' 'Iris-setosa'
 'Iris-setosa' 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa'
 'Iris-virginica' 'Iris-versicolor' 'Iris-versicolor' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-setosa'
 'Iris-virginica' 'Iris-versicolor' 'Iris-setosa' 'Iris-setosa'
 'Iris-versicolor' 'Iris-virginica']
0.9666666666666667
[[11 0 0]
[ 0 12 1]
[ 0 0 6]]
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 11
Iris-versicolor 1.00 0.92 0.96 13
Iris-virginica 0.86 1.00 0.92 6
accuracy 0.97 30
macro avg 0.95 0.97 0.96 30
weighted avg 0.97 0.97 0.97 30
You’re welcome!
Well done!
The program runs through, but the calculated result is that CART and SVM have the highest accuracy
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.053359)
KNN: 0.983333 (0.050000)
CART: 0.991667 (0.025000)
NB: 0.975000 (0.038188)
SVM: 0.991667 (0.025000)
Nice work. Thanks.
I have installed all the libraries that were in your “How to Setup a Python Environment…” blog. All went fine, but when I run the starting imports code I get an error at the first line: “ModuleNotFoundError: No module named 'pandas'”. But I did install it using the “pip install pandas” command. I am working on a Windows machine.
Sorry to hear that. Consider rebooting your machine?
I had the same problem initially, because I made 2 python files.. one for loading the libraries, and another for loading the iris dataset.
Then I decided to put the two commands in one python file, it solved problem. 🙂
Yes, all commands go in the one file. Sorry for the confusion.
Hasnain, try setting the environment variables PYTHONPATH and PATH to include the path to the site packages of the version of Python you have permission to alter:
export PYTHONPATH="$PYTHONPATH:/path/to/Python/2.7/site-packages/"
export PATH="$PATH:/path/to/Python/2.7/site-packages/"
obviously replacing “/path/to” with the actual path. My system Python is in my /Users//Library folder but I’m on a Mac.
You can add the export lines to a script that runs when you open a terminal (“~/.bash_profile” if you use BASH).
That might not be 100% right, but it should help you on your way.
Thanks for posting the tip Dan, I hope it helps.
got it to work have no idea how but it worked! I am like the kid at t-ball that closes his eyes and takes a swing!
I’m glad to hear that!
I am starting at square 0, and after clearing a first few hurdles, I was not even able to install the libraries at all… (as a newb), I didn’t see where I even GO to import this:
# Load libraries
import pandas
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
Perhaps this step-by-step tutorial will help you set up your environment:
https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
If you are using Python 3, save all the commands as a .py file, then in a Python shell enter:
exec(open("[path to file with name]").read())
If you open the shell in the same path as the saved file, then you only need to enter the filename alone. For example, let’s say I saved it as load.py; then:
exec(open("load.py").read())
This will execute all the commands in the current shell.
Hi Tanya,
This tutorial is so intuitive that I went through this tutorial with a breeze.
Install PyCharm from JetBrains available here https://www.jetbrains.com/pycharm/download/download-thanks.html?platform=windows&code=PCC
Install pip (the de-facto Python package manager) and then click “Terminal” in PyCharm to bring up the interactive DOS-like terminal. Once you have installed pip, you can issue the following commands there:
pip install numpy
pip install scipy
pip install matplotlib
pip install pandas
pip install sklearn
All other steps in the tutorial are valid and do not need a single line of change, apart from one import. Change
from pandas.tools.plotting import scatter_matrix
to
from pandas.plotting import scatter_matrix
Thanks for the tips Rahul.
For a beginner, I believe Anaconda’s Jupyter notebooks would be the best option, as they can include markdown for future reference, which is essential as a beginner (backpropagation :p). But again, it varies person to person.
I find notebooks confuse beginners more than help.
Running a Python script on the command line is so much simpler.
Except for me, on Debian Stretch with pandas 0.19.2, I had to use
from pandas.tools.plotting import scatter_matrix
You must update your version of Pandas.
use jupyter notebook …there all the essential libraries are preinstalled
I also made a similar mistake. I am also a newbie to Python, and wrote those import statements in a separate file, and imported the created file, without knowing how imports work… after your reply I realized my mistake and am now back on track. Thanks!
I also had problems installing modules on Windows, although there was no error of any kind when installing from the PyCharm IDE.
Also, use a 32-bit Python interpreter if you want to use NLTK. It can be done even on the 64-bit version, but it was not worth the time it would need.
If you are working in a virtual environment, then you have to create the script first and run it after activating the virtual environment.
If you are not working in a virtual environment, then you can run your scripts at any time.
Could you please go into the mathematical concept behind KNN and why the accuracy resulted in the highest score? Thank you
I like your tutorial for the machine learning in python but at this moment I am stuck. Here is where I am
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
This is the error I am getting from it:
TypeError Traceback (most recent call last)
in ()
3 fig.suptitle('Algorithm Comparison')
4 ax = fig.add_subplot(111)
----> 5 plt.boxplot(results)
6 ax.set_xticklabels(names)
7 plt.show()
~\Anaconda3\lib\site-packages\matplotlib\pyplot.py in boxplot(x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_xticks, autorange, zorder, hold, data)
2846 whiskerprops=whiskerprops,
2847 manage_xticks=manage_xticks, autorange=autorange,
-> 2848 zorder=zorder, data=data)
2849 finally:
2850 ax._hold = washold
~\Anaconda3\lib\site-packages\matplotlib\__init__.py in inner(ax, *args, **kwargs)
1853 "the Matplotlib list!)" % (label_namer, func.__name__),
1854 RuntimeWarning, stacklevel=2)
-> 1855 return func(ax, *args, **kwargs)
1856
1857 inner.__doc__ = _add_data_doc(inner.__doc__,
~\Anaconda3\lib\site-packages\matplotlib\axes\_axes.py in boxplot(self, x, notch, sym, vert, whis, positions, widths, patch_artist, bootstrap, usermedians, conf_intervals, meanline, showmeans, showcaps, showbox, showfliers, boxprops, labels, flierprops, medianprops, meanprops, capprops, whiskerprops, manage_xticks, autorange, zorder)
3555
3556 bxpstats = cbook.boxplot_stats(x, whis=whis, bootstrap=bootstrap,
-> 3557 labels=labels, autorange=autorange)
3558 if notch is None:
3559 notch = rcParams['boxplot.notch']
~\Anaconda3\lib\site-packages\matplotlib\cbook\__init__.py in boxplot_stats(X, whis, bootstrap, labels, autorange)
1839
1840 # arithmetic mean
-> 1841 stats['mean'] = np.mean(x)
1842
1843 # medians and quartiles
~\Anaconda3\lib\site-packages\numpy\core\fromnumeric.py in mean(a, axis, dtype, out, keepdims)
2955
2956 return _methods._mean(a, axis=axis, dtype=dtype,
-> 2957 out=out, **kwargs)
2958
2959
~\Anaconda3\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
68 is_float16_result = True
69
---> 70 ret = umr_sum(arr, axis, dtype, out, keepdims)
71 if isinstance(ret, mu.ndarray):
72 ret = um.true_divide(
TypeError: cannot perform reduce with flexible type
HOW CAN I FIX THIS?
Perhaps post your code and error to stackoverflow.com?
Jason, nice work, but I have a doubt about that species column. We can apply a t-test only for a continuous variable and a categorical variable with only 2 groups; this column has 3 groups, so how do we apply a t-test? Please give me an answer.
The Student’s t-test is for numerical data only, you can learn more here:
https://machinelearningmastery.com/parametric-statistical-significance-tests-in-python/
I also got a traceback on this section:
TypeError: cannot perform reduce with flexible type
A quick check on Stack Overflow shows that plt.boxplot() cannot accept strings. Personally, I had an error in section 5.4, line 15.
Wrong code: results.append(results)
Correct: results.append(cv_results)
Woohoo for tracebacks and wrong data types. Hope someone finds this helpful.
Are you able to confirm that your python libraries are up to date?
Well done
Thank you sir!
Nice work, Jason. Of course, there is a lot more to tell about the code and the models applied if this is intended for people starting out with ML (like me). Rather than telling which “button to press” to make it work, it would be nice to know why also. I looked at a sample of your book (advanced) to see if you cover the why also, but it looks like it’s limited?
On this particular example, in my case SVM reached 99.2% and was thus the best Model. I gather this is because the test and training sets are drawn randomly from the data.
This tutorial and the book are laser focused on how to use Python to complete machine learning projects.
They already assume you know how the algorithms work.
If you are looking for background on machine learning algorithms, take a look at this book:
https://machinelearningmastery.com/master-machine-learning-algorithms/
Jan de Lange and Jason,
Before anything else, I truly like to thank Jason for this wonderful, concise and practical guideline on using ML for solving a predictive problem.
In terms of the example you have provided, I can confirm Jan de Lange’s outcome. I got the same accuracy result for SVM (0.991667 to be precise). I’ve just upgraded the Canopy version I had installed on my machine to version 2.1.3.3542 (64 bit), and your reasoning makes sense that this discrepancy could be because of its random selection of data. But this procedure could open up a new ‘can of worms’, as some say, since the selection of the best model is on the line.
Thank you again Jason for this practical article on ML.
Thanks Alan.
Absolutely. Machine learning algorithms are stochastic. This is a feature, not a bug. It helps us move through the landscape of possible models efficiently.
See this post:
https://machinelearningmastery.com/randomness-in-machine-learning/
And this post on finalizing a model:
https://machinelearningmastery.com/train-final-machine-learning-model/
Does that help?
Got it working too, changing the scatter_matrix import like Rahul did.
But I also had to install tkinter first (yum install tkinter).
Very nice tutorial, Jason!
Glad to hear it!
Awesome, I have tested the code it is impressive. But how could I use the model to predict if it is Iris-setosa or Iris-versicolor or Iris-virginica when I am given some values representing sepal-length, sepal-width, petal-length and petal-width attributes?
Great question. You can call model.predict() with some new data.
For an example, see Part 6 in the above post.
Dear Jason Brownlee, I was thinking about the same question as Nil. To be precise, I was wondering how I can know, after having seen that my model has a good fit, which values of sepal-length, sepal-width, petal-length and petal-width correspond to Iris-setosa etc.
For instance, if I have p predictors and two classes, how can I know which values of the predictors lead to one class or the other? Knowing the values of the predictors allows me to use the model in daily operations. Thx
Not knowing the statistical relationship between inputs and outputs is one of the down sides of using neural networks.
Hi Mr Jason Brownlee, thanks for your answer. So all algorithms, such as SVM, LDA, random forest… have this drawback? Can you suggest me something else?
Because logistic regression is not like this, or am I wrong?
All algorithms have limitations and assumptions. For example, Logistic Regression makes assumptions about the distribution of variates (Gaussian) and more:
https://en.wikipedia.org/wiki/Logistic_regression
Nevertheless, we can make useful models (skillful) even when breaking assumptions or pushing past limitations.
Dear Sir,
It seems I’m in the right place in right time! I’m doing my master thesis in machine learning from Stockholm University. Could you give me some references for laughter audio conversation to CSV file? You can send me anything on [email protected]. Thanks a lot and wish your very best and will keep in touch.
Sorry I mean laughter audio to CSV conversion.
Sorry, I have not seen any laughter audio to CSV conversion tools/techniques.
Hi again, do you have any publication of this article “Your First Machine Learning Project in Python Step-By-Step”? Or any citation if you know? Thanks.
No, you can reference the blog post directly.
Sweet way of condensing monstrous amount of information in a one-way street. Thanks!
Just a small thing, you are creating the Kfold inside the loop in the cross validation. Then, you use the same seed to keep the comparison across predictors constant.
That works, but I think it would be better to take it out of the loop. Not only is more efficient, but it is also much immediately clearer that all predictors are using the same Kfold.
You can still justify the use of the seeds in terms of replicability; readers getting the same results on their machines.
Thanks again!
Great suggestion, thanks Roberto.
Hello Jaso.
Thank you so much for your help with Machine Learning and congratulations for your excellent website.
I am a beginner in ML and DeepLearning. Should I download Python 2 or Python 3?
Thank you very much.
Francisco
I use Python 2 for all my work, but my students report that most of my examples work in Python 3 with little change.
Jason,
Thank you so much for putting this together. I have been a software developer for almost two decades and am getting interested in machine learning. I found this tutorial accurate, easy to follow, and very informative.
Thanks ShawnJ, I’m glad you found it useful.
Jason,
Thanks for the great post! I am trying to follow this post using my own dataset, but I keep getting this error: “Unknown label type: array([some numbers from my dataset])”. So what on earth is the problem? Any possible solutions?
Thanks,
Hi Wendy,
Carefully check your data. Maybe print it on the screen and inspect it. You may have some string values that you may need to convert to numbers using data preparation.
Hi, thanks for the great tutorial. I’m also new to ML… this really helps, but I was wondering: what if we have non-numeric values? I have a mixture of numeric and non-numeric data, and obviously this only works for numeric data. Do you also have a tutorial for that, or could you please send me a source for it? Thank you.
Great question fara.
We need to convert everything to numeric. For categorical values, you can convert them to integers (label encoding) and then to new binary features (one hot encoding).
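A minimal sketch of that two-step encoding, with a hypothetical color feature (note that the exact OneHotEncoder arguments vary a little between scikit-learn versions):

from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
# hypothetical categorical feature
values = array(['red', 'green', 'blue', 'green'])
# label encoding: strings to integers
integers = LabelEncoder().fit_transform(values)
print(integers)
# one hot encoding: integers to binary columns
onehot = OneHotEncoder(sparse=False).fit_transform(integers.reshape(-1, 1))
print(onehot)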
After I posted my comment here I saw this: “DictVectorizer”. I think I can use it for converting non-numeric to numeric, right?
I would recommend the LabelEncoder class followed by the OneHotEncoder class in scikit-learn.
I believe I have tutorials on these here:
https://machinelearningmastery.com/data-preparation-gradient-boosting-xgboost-python/
thank you it’s great
Hello Jason
Thank you for publishing this great machine learning tutorial.
It is really awesome awesome awesome………..!
I tested your tutorial on Python 3 and it works well, but what I am facing here is loading my dataset from my local drive. I followed your given instructions but couldn’t be successful.
My syntax is as follows:
import unicodedata
url = open(r'C:\Users\mazhar\Anaconda3\Lib\site-packages\sindhi2.csv', encoding='utf-8').readlines()
names = ['class', 'sno', 'gender', 'morphology', 'stem', 'fword']
dataset = pandas.read_csv(url, names=names)
The Python 3 Jupyter notebook does not load this. Kindly help me in this regard.
Hi Mazhar, thanks.
Are you able to load the file on the command line away from the notebook?
Perhaps the notebook environment is causing trouble?
Mazhar try this:
import pandas as pd
...
file = "namefile.csv"  # or a full path like c:/____/___/
df = pd.read_csv(file)
in Jupyter
https://www.anaconda.com/download/
https://anaconda.org/anaconda/python
Dear Jason
Thank you for response
I am using Python 3 with anaconda jupyter notebook
So which Python version would you suggest, and could you kindly write here the syntax for opening a dataset file from a local drive, i.e., how can I load a utf-8 dataset file from my local drive?
Hi Mazhar, I teach using Python 2.7 with examples from the command line.
Many of my students report that the code works in Python 3 and in notebooks with little or no changes.
Try this command:
df = pd.read_csv(file, encoding='latin-1')  # if the CSV is separated by "," or ";", set sep accordingly, e.g. sep=';'
Great tutorial, but perhaps I’m missing something here. Let’s assume I already know what model to use (perhaps because I know the data well, for example).
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
I then use the model to predict:
print(knn.predict(an array of variables of a record I want to classify))
Is this where the whole ML happens?
knn.fit(X_train, Y_train)
What’s the difference between this and, say, a non-ML model/algorithm? Is it that in a non-ML model I have to find the coefficients/parameters myself by statistical methods, while in the ML model the machine does that itself?
If this is the case, then it seems to me that a researcher/coder did most of the work for me and wrapped it in a nice function. Am I missing something? What is special here?
Hi Andy,
Yes, your comment is generally true.
The work is in the library and choice of good libraries and training on how to use them well on your project can take you a very long way very quickly.
Stats is really about small data and understanding the domain (descriptive models). Machine learning, at least in common practice, is leaning towards automation with larger datasets and making predictions (predictive modeling) at the expense of model interpretation/understandability. Prediction performance trumps traditional goals of stats.
Because of the automation, the focus shifts more toward data quality, problem framing, feature engineering, automatic algorithm tuning and ensemble methods (combining predictive models), with the algorithms themselves taking more of a backseat role.
Does that make sense?
It does make sense.
You mentioned ‘data quality’. That’s currently my field of work. I’ve been doing this statistically until now, and I'm very keen to try a different approach. As a practical example, how would you use ML to spot an error/outlier instead of stats?
Let’s say I have a large dataset containing trees: each tree record contains a species, height, location, crown size, age, etc. (ah! suspiciously similar to the iris flowers dataset 🙂). Is ML a viable method for finding incorrect data and replacing it with an “estimated” value? The answer I guess is yes. For species I could use an almost identical method to what you presented here; BUT what about continuous values such as tree height?
Hi Andy,
Maybe “outliers” are instances that cannot be easily predicted or are assigned ambiguous predicted probabilities.
Instance values can be “fixed” by estimating new values, but whole instances can also be pulled out if data is cheap.
Awesome work Jason. This was very helpful; I expect more tutorials in the future.
Thanks.
I’m glad you found it useful Shailendra.
Thank you for the good work you doing over here.
I want to know how the electricity appliance consumption dataset was captured.
Thanks, I’m glad it helped.
If you are referring to the time series examples, you can learn more about the dataset here:
https://archive.ics.uci.edu/ml/datasets/individual+household+electric+power+consumption
Awesome work. Students need to know what the end results will look like. They need to get motivated to learn, and one of the effective means of getting motivated is to be able to see and experience the wonderful end results. Honestly, if I were made to study algorithms and understand them, I would get bored. But now, since I know what amazing results they give, they will serve as driving forces for me to get into the details and do more research. This is where I dislike the orthodox college way of teaching: first get the theory right, then apply. No way. I need to see things first to get motivated.
Thanks Shuvam,
I’m glad my results-first approach gels with you. It’s great to have you here.
Thanks Jason,
I am facing an issue while trying to complete this.
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
It shows the below error:
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
^
IndentationError: expected an indented block
Hi Puneet, looks like a copy-paste error.
Check for any extra new lines or white space around that line that is reporting the error.
https://stackoverflow.com/questions/4446366/why-am-i-getting-indentationerror-expected-an-indented-block
This solved it for me. Copy code to notepad, replace all tabs with 4 spaces.
Nice work.
Putting in an extra space, or leaving one out where it is needed, will surely generate an error message. Some common causes of this error include:
Forgetting to indent the statements within a compound statement.
Forgetting to indent the statements of a user-defined function.
The error message IndentationError: expected an indented block would seem to indicate that you have an indentation error. It is probably caused by a mix of tabs and spaces. The indentation can be any consistent white space. It is recommended to use 4 spaces for indentation in Python; tabulation or a different number of spaces may work, but it is also known to cause trouble at times. Tabs are a bad idea because they may create different amounts of spacing in different editors.
http://net-informations.com/python/err/indentation.htm
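For example, a tiny runnable illustration of consistent four-space indentation:

items = [('a', 1), ('b', 2)]
for name, value in items:
    doubled = value * 2   # indented: part of the loop body
    print(name, doubled)  # same indentation level
print('done')             # not indented: runs once, after the loop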
Great advice
Here’s help for copy-pasting code:
https://machinelearningmastery.com/faq/single-faq/how-do-i-copy-code-from-a-tutorial
Thanks Jason,
I am new to ML and need your help so I can run this.
I have followed the steps, but when trying to build and evaluate the 5 models using this:
----------------------------------------
(the same spot-check and evaluation code as in the comment above)
----------------------------------------
I am facing the below-mentioned issue:
File "", line 13
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
^
IndentationError: expected an indented block
----------------------------------------
Kindly help.
Puneet, you need to indent the block (a tab or four spaces to the right). That is the way of building a block in Python.
I am also having this problem. I have indented the code as instructed but nothing executes. It seems to be waiting for more input. I have googled different script endings but nothing happens. Is there something I am missing to execute this script?
>>> for name, model in models:
… kfold = model_selection.KFold(n_splits=10, random_state=seed)
… cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
… results.append(cv_results)
… names.append(name)
… msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
… print(msg)
…
Save the code to a file and run it from the command line. I show how here:
https://machinelearningmastery.com/faq/single-faq/how-do-i-run-a-script-from-the-command-line
Just another Python noob here, sending many regards and thanks to Jason :):)
Thanks george, stick with it!
Does this tutorial work with other datasets? I’m trying to work on a small assignment and I want to use Python.
It should provide a great template for new projects sergio.
I tried to use another dataset. I am not sure what I imported, but even after changing the names, I still get the petal output. All of it. I commented out that part of the code and even then it gives me those old outputs.
A very awesome step-by-step for me! Even though I am a beginner in Python, this taught me many things about machine learning and supervised ML. I appreciate your sharing!!
I’m glad to hear that Albert.
Thank you for the step-by-step instructions. This will go a long way for newbies like me getting started with machine learning.
You’re welcome, I’m glad you found the post useful Umar.
Hello Jason,
from __future__ import division
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC(gamma='auto')))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
I am getting an error of "ZeroDivisionError: float division by zero".
Sorry to hear that, I have some suggestions here:
https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
Hi Jason,
Really nice tutorial. I had one question which has had me confused. Once you choose your best model (in this instance KNN), you then train a new model to be used to make predictions against the validation set. Should one not perform k-fold cross-validation on this model to ensure we don’t overfit?
If this is correct, how would you implement it? From my understanding, cross_val_score will not allow one to generate a confusion matrix.
I think this is the only thing that I have struggled with in using scikit-learn; if you could help me it would be much appreciated.
Hi Mike. No.
Cross-validation is just a method to estimate the skill of a model on new data. Once you have the estimate you can get on with things, like confirming you have not fooled yourself (hold out validation dataset) or make predictions on new data.
The skill you report is the cross val skill with the mean and stdev to give some idea of confidence or spread.
Does that make sense?
Hi Jason,
Thanks for the quick response. So, to make sure I understand: one would use cross-validation to get an estimate of the skill of a model (the mean of the cross-validation scores), or to choose the correct hyperparameters for a particular model.
Once you have this information, you can just go ahead and train the chosen model with the full training set and test it against the validation set or new data?
Hi Mike. Correct.
Additionally, if the validation result confirms your expectations, you can go ahead and train the model on all data you have including the validation dataset and then start using it in production.
This is a very important topic. I think I’ll write a post about it.
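In the meantime, a minimal sketch of that final step (iris used here as a stand-in for all of your data):

from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)               # stand-in for "all the data you have"
final_model = KNeighborsClassifier().fit(X, y)  # refit on everything, no hold-out
print(final_model.predict([[5.1, 3.5, 1.4, 0.2]]))  # ready for new data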
This is amazing 🙂 You boosted my morale
I’m so glad to hear that Sahana.
Hi
While doing data visualization and running the command dataset.plot(……..), I get the following error. Kindly tell me how to fix it:
array([[<matplotlib AxesSubplot object>, <matplotlib AxesSubplot object>],
[<matplotlib AxesSubplot object>, <matplotlib AxesSubplot object>]], dtype=object)
Looks like no data Jhon. It also looks like it’s printing out an object.
Are you running in a notebook or on the command line? The code was intended to be run directly (e.g. command line).
Hi Jason,
Great tutorial. I am a developer with a computer science degree and a heavy interest in machine learning and mathematics, although I don’t quite have the academic background for the latter except for what was required in college. So, this website has really sparked my interest as it has allowed me to learn the field in sort of the “opposite direction”.
I did notice when executing your code that there was a deprecation warning for the sklearn.cross_validation module. They recommend switching to sklearn.model_selection.
When switching the modules I adjusted the following line…
kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
to…
kfold = model_selection.KFold(n_folds=num_folds, random_state=seed)
… and it appears to be working okay. Of course, I had switched all other instances of cross_validation as well, but it seemed to be that the KFold() method dropped the n (number of instances) parameter, which caused a runtime error. Also, I dropped the num_instances variable.
I could have missed something here, so please let me know if this is not a valid replacement, but thought I’d share!
Once again, great website!
Thanks for the support and the kind words Brendon. I really appreciate it (you made my day!)
Yes, the API has changed/is changing and your updates to the tutorial look good to me, except I think n_folds has become n_splits.
I will update this example for the new API very soon.
🙂 Now on to more tutorials for me!
You can access more here Brendon:
https://machinelearningmastery.com/start-here/
Jason, is everything on your website on that page? or is there another site map?
thanks!
P.S. Your code ran flawlessly on my Jupyter Notebook, FWIW. Although I did get a different result, with SVM coming out on top with 99.1667. So I ran the validation set with SVM and came out with 94 93 93 30, FWIW.
No, not everything, just a small and useful sample.
Yes, machine learning algorithms are stochastic, learn more here:
https://machinelearningmastery.com/randomness-in-machine-learning/
Thanks. I actually just read that article. Very helpful.
I’m still having a little trouble understanding step 5.1. I’m trying to apply this tutorial to a new dataset, but when I try to evaluate the models from 5.3 I don’t get a result.
What is the problem exactly Sergio?
Step 5.1 should create a validation dataset. You can confirm the dataset by printing it out.
Step 5.3 should print the result of each algorithm as it is trained and evaluated.
Perhaps check for a copy-paste error or something?
Does this tutorial work the exact same way for other datasets? Because I’m not using the Hello World dataset.
The project template is quite transferable.
You will need to adapt it for your data and for the types of algorithms you want to test.
Hi Sir,
Thank you for the information.
I am currently a student, in Engineering school in France.
I am working on a data mining project; indeed, I have a lot of data (40 GB) about the prices of the stocks of many companies in the CAC40.
My goal is to predict the evolution of the yields, and I think that a neural network could be useful.
My idea is: I take for X the yields from t=0 to t=n, and for Y the yields from t=1 to t=n, and the program should find a relation between the data.
Is that possible? Is it a good way to predict the evolution of the yield?
Thank you for your time
Hubert
Jean-Baptiste
Hi Jean-Baptiste, I’m not an expert in finance. I don’t know if this is reasonable, sorry.
This post might help with phrasing your time series problem for supervised learning:
https://machinelearningmastery.com/time-series-forecasting-supervised-learning/
Hi Jason,
If I include a new item in the models array, as in:
models.append(('LNR - Linear Regression', LinearRegression()))
with the library:
from sklearn.linear_model import LinearRegression
I got an error in the \sklearn\utils\validation.py", line 529, in check_X_y:
y = y.astype(np.float64)
as:
ValueError: could not convert string to float: 'Iris-setosa'
Let me know the best way to fix that! As you can see from my code, I would like to include the Linear Regression algorithm in my model array too!
Thank you for your help,
Ernest
Hi Ernest, this is a classification problem; we cannot use LinearRegression.
Try adding another classification algorithm to the list instead.
Hi Jason,
I am new to ML. need your help so i can run this.
>>> from matplotlib import pyplot
Traceback (most recent call last):
File “”, line 1, in
File “c:\python27\lib\site-packages\matplotlib\pyplot.py”, line 29, in
import matplotlib.colorbar
File “c:\python27\lib\site-packages\matplotlib\colorbar.py”, line 32, in
import matplotlib.artist as martist
File “c:\python27\lib\site-packages\matplotlib\artist.py”, line 16, in
from .path import Path
File “c:\python27\lib\site-packages\matplotlib\path.py”, line 25, in
from . import _path, rcParams
'ImportError: DLL load failed: %1 n'est pas une application Win32 valide.' (i.e., "%1 is not a valid Win32 application")
Sorry, I have not seen that error before. Perhaps this post will help you setup your environment:
https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
Hello Oumaima,
I am also facing the same error. Were you able to solve it? How? Please help!
Great tutorial! Quick question: when we create the models, we do models.append((name of algorithm, algorithm function)). Is models an array? Because it seems like a dictionary, since we have a key-value mapping (algorithm name and algorithm function). Thank you!
It is a list of tuples where each tuple contains a string name and a model object.
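For example, a tiny illustration of the structure with a shortened, hypothetical model list:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# a plain Python list of (name, model) tuples, not a dictionary
models = [('KNN', KNeighborsClassifier()), ('SVM', SVC(gamma='auto'))]
for name, model in models:  # the for-loop unpacks each tuple in turn
    print(name, type(model).__name__)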
Hi Jason / any gurus,
Good post and I will follow it, but my question may be a little off track.
I’m asking this question as I am a data modeller / aspiring data architect.
I feel that as a guru/gurus you can clarify my doubt. The question is at the end.
In the current data management environment:
1. Data architecture / physical implementation and choosing appropriate tools, back end, storage, NoSQL, SQL, MPP, sharding, columnar, scale up/out, distributed processing etc.
2. In addition to DB-based procedural languages, proficiency in at least one of the following, i.e. Java/Python/Scala etc.
3. Then comes AI, machine learning, neural networks etc.
My question is regarding point 3.
I believe those are algorithms which need deep functional knowledge and years of experience to add any value to a business.
Those are independent of data models and their physical implementation, and are part of the business user domain, not the data architecture domain.
If I take your above example, say 10k users are now trying to do a similar kind of thing: then points 1 and 2 will be the data architect’s domain and point 3 will be the business analyst’s domain. Maybe point 2 can overlap between them to some extent.
A data architect need not be hands-on/proficient in algorithms, i.e. should have just some basic idea, as the data architect’s job is not to invent business logic but to implement the business logic physically to satisfy business users/analysts.
Am I correct in my assumption, as I find that certain things are nearly mutually exclusive, and expectations/benchmarks should be set accordingly?
Regards
sasanka ghosh
Hi Sasanka, sorry, I don’t really follow.
Are you able to simplify your question?
Hi Jason,
Many thanks that you bothered to reply.
I tried to rephrase and be concise, but it is still verbose; apologies for that.
Is it expected of a data architect to be an algorithm expert as well as a data model/database expert?
Algorithms are business-centric, and most of the time specific to a particular business domain.
Giving you an example, i.e. SHORTEST PATH (take it as just an example for making my point):
An organization is providing an app to provide that service.
CAVEAT: someone from a computer science dept may say that it is a basic thing you learn, but I feel it is still an algorithm, not a data structure.
If we take the above scenario, in simplistic terms the requirement is as follows:
1. There will be, say, a million registered users.
2. One can say at least 10% are using the app at the same time.
3. At any time they can change their direction as per a contingency, like a military op, so dumping the partial weighted graph to their device is not an option, i.e. users will be connected to the main server/server cluster.
4. The challenge is storing the spatial data in the DB in the correct data model, with scale out and fault tolerance.
5. Implement the shortest path algo and display it dynamically using Python/Java/Cypher/Oracle Spatial/Titan etc.
My question is: can a data architect who does not know the shortest path algorithm, but has sufficient knowledge in other areas, work on this project if the algorithm is provided to him/her in verbose terms to implement?
I am asking this question as nowadays people are offering ready-made courses, i.e. machine learning, NLP, data scientist etc., and the scenario is confusing.
I feel it is misleading, as no one can become an expert in a science overnight, and vice versa.
I feel algorithms are pure science, which is a separate discipline.
But to implement them at large scale, scientists/programmers/architects need to work in tandem with minimal overlap but continuous discussion.
Last but not least, if I make some sense, what learning curve should I follow to try to become a data architect for unstructured data in general?
Regards
sasanka ghosh
Really this depends on the industry and the job. I cannot give you good advice for the general case.
You can get valuable results without being an expert, this applies to most fields.
Algorithms are a tool, use them as such. They can also be a science, but we practitioners don’t have the time.
I hope that helps.
Thanks Jason.
I appreciate your time and response.
I just wanted to validate this with a real techie/guru like you, as the confusion, or lack of a perfect answer, is being exploited by management/HR to their own advantage, practicing a use-and-throw policy or making people sycophants/redundant without following basic management principles.
The tech guys, except a few geniuses, are always toiling while management is popping the cork and enjoying at the same time.
Regards
sasanka ghosh
Hello Jason,
Thank you very much for these tutorials. I am new to ML and I find it very encouraging to do an end-to-end project to get started with, rather than reading and reading without seeing an end. This really helped me.
One question: when I tried this I got the highest accuracy for SVM.
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.983333 (0.033333)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)
So I decided to try that out too:
svm = SVC()
svm.fit(X_train, Y_train)
prediction = svm.predict(X_validation)
These were my results using SVM:
0.933333333333
[[ 7 0 0]
[ 0 10 2]
[ 0 0 11]]
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 7
Iris-versicolor 1.00 0.83 0.91 12
Iris-virginica 0.85 1.00 0.92 11
avg / total 0.94 0.93 0.93 30
I am still learning to read these results, but can you tell me why this happened? Why did I get higher accuracy for SVM instead of KNN? Have I done anything wrong? Or is this possible?
The results reported are a mean estimated score with some variance (spread).
It is an estimate on the performance on new data.
When you apply the method on new data, the performance may be in that range. It may be lower if the method has overfit the training data.
Overfitting is a challenge and developing a robust test harness to ensure we don’t fool/mislead ourselves during model development is important work.
I hope that helps as a start.
I want to buy your book.
I tried this tutorial and the result is very awesome.
I want to learn from you.
Thanks…
Thanks inzar.
You can see all of my books and bundles here:
https://machinelearningmastery.com/products
Why the leading comma in X = array[:,0:4]?
This is Python array notation for [rows,columns]
Learn more about slicing arrays in Python here:
http://structure.usc.edu/numarray/node26.html
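For example, a small illustration of [rows, columns] slicing on a toy NumPy array:

import numpy as np

array = np.arange(20).reshape(4, 5)  # 4 rows, 5 columns
X = array[:, 0:4]  # all rows, columns 0 through 3 (the inputs)
y = array[:, 4]    # all rows, column 4 only (the output)
print(X.shape, y.shape)  # (4, 4) (4,)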
In 1.2, you should warn readers to install scikit-learn.
Thanks for the note.
Please see section 1.1 Install SciPy Libraries where it says:
Best ML tutorial for Python. Thank you, Jason.
Thanks!
When I tried to run it, I got the error message "TypeError: Empty 'DataFrame': no numeric data to plot". Help me!
Sorry to hear that.
Perhaps check that you have loaded the data as you expect and that the loaded values are numeric and not strings. Perhaps print the first few rows: print(df.head(5))
thanks very much Jason for your time
It worked. This tutorial is very helpful for me. I'm new to machine learning, but could you explain your simple project above? Because I did not see X_test and a target.
Regards in advance.
Glad to hear it baso!
Thank you for sharing this. I bumped into some installation problems.
Eventually, to get all dependencies installed on MacOS 10.11.6 I had to run this:
brew install python
pip install --user numpy scipy matplotlib ipython jupyter pandas sympy nose scikit-learn
export PATH=$PATH:~/Library/Python/2.7/bin
Thanks for sharing Andrea.
I’m a MacPorts guy myself; here’s my recipe:
Hi Jason,
I am following this page as a beginner and have installed Anaconda as recommended.
As I am on win 10, I installed Anaconda 4.2.0 For Windows Python 2.7 version (x64) and
I am using Anaconda’s Spyder (python 2.7) IDE.
I checked all the versions of libraries (as shown in 1.2 Start Python and Check Versions) and got results like below:
Python: 2.7.12 |Anaconda 4.2.0 (64-bit)| (default, Jun 29 2016, 11:07:13) [MSC v.1500 64 bit (AMD64)]
scipy: 0.18.1
numpy: 1.11.1
matplotlib: 1.5.3
pandas: 0.18.1
sklearn: 0.17.1
At the 2.1 Import libraries section, I imported all of them and tried to load data as shown in
2.2 Load Dataset. But when I run it, it doesn’t show an output, instead, there is an error:
Traceback (most recent call last):
File “C:\Users\gachon\.spyder\temp.py”, line 4, in
from sklearn import model_selection
ImportError: cannot import name model_selection
Below is my code snippet:
import pandas
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
print(dataset.shape)
When I delete the "from sklearn import model_selection" line, I get the expected result (150, 5).
Am I missing something here?
Thank you for your time and endurance!
Hi Sohib,
You must have scikit-learn version 0.18 or higher installed.
Perhaps Anaconda has documentation on how to update sklearn?
Thank you for reply.
I updated scikit-learn version to 0.18.1 and it helped.
The error disappeared, the result is shown, but one statement
'import sitecustomize' failed; use -v for traceback
is executed above the result.
I tried to find out why, but apparently I might not find the reason.
Is it going to be a problem in my further steps?
How to solve this?
Thank you in advance!
I’m glad to hear it fixed your problem.
Sorry, I don’t know what “import sitecustomize” is or why you need it.
Can I get the same tutorial with Java?
Hi Jason,
Nice tutorial.
In the univariate plots, you mentioned the Gaussian distribution.
According to the univariate plots, sepal-width has a Gaussian distribution. You said there are 2 variables having a Gaussian distribution. Please tell me the other.
Thanks
The distribution of the others may be multi-modal. Perhaps a double Gaussian.
Hi, Jason. Could you please tell me the reason why you chose KNN in the example above?
Hi Thinh,
No reason, other than it is an easy algorithm to run and understand, and a good algorithm for a first tutorial.
Hi Jason,
I’m trying to use this code with the KDD Cup ’99 dataset, and I am having trouble label-encoding my dataset into numerical values.
#Modules
import pandas
import numpy
from pandas.tools.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn import cross_validation
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import LabelEncoder
#new
from collections import defaultdict
#
#Load KDD dataset
data_set = "NSL-KDD/KDDTrain+.txt"
names = ['duration','protocol_type','service','flag','src_bytes','dst_bytes','land','wrong_fragment','urgent','hot','num_failed_logins','logged_in','num_compromised','su_attempted','num_root','num_file_creations',
'num_shells','num_access_files','num_outbound_cmds','is_host_login','is_guest_login','count','srv_count','serror_rate','srv_serror_rate','rerror_rate','srv_rerror_rate','same_srv_rate','diff_srv_rate','srv_diff_host_rate',
'dst_host_count','dst_host_srv_count','dst_host_same_srv_rate','dst_host_diff_srv_rate','dst_host_same_src_port_rate','dst_host_srv_diff_host_rate','dst_host_serror_rate','dst_host_srv_serror_rate','dst_host_rerror_rate',
'dst_host_srv_rerror_rate','class']
#Diabetes Dataset
#data_set = "Datasets/pima-indians-diabetes.data"
#names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
#data_set = "Datasets/iris.data"
#names = ['sepal_length','sepal_width','petal_length','petal_width','class']
dataset = pandas.read_csv(data_set, names=names)
array = dataset.values
X = array[:,0:40]
Y = array[:,40]
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Y)
label_encoded_y = label_encoder.transform(Y)
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = cross_validation.train_test_split(X, label_encoded_y, test_size=validation_size, random_state=seed)
# Test options and evaluation metric
num_folds = 7
num_instances = len(X_train)
seed = 7
scoring = 'accuracy'
# Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = cross_validation.KFold(n=num_instances, n_folds=num_folds, random_state=seed)
    cv_results = cross_validation.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean()*100, cv_results.std()*100)  # multiplying by 100 to show percentage
    print(msg)
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)  # label the boxes with the algorithm names, not Y
plt.show()
Am I doing something wrong with the LabelEncoding process?
Hi. Change all curly quote characters (“ ” ‘ ’) to straight quotes (" '). LabelEncoder will then work correctly, but not the whole network. I am trying to create a neural network for NSL-KDD too. Do you have any good examples?
What is “NSL-KDD”?
Hello Jason,
Please see https://github.com/defcom17/NSL_KDD
I’m not familiar with this, sorry.
How come it is concluded that the KNN algorithm is the most accurate model, when the mean value for the SVM algorithm is closer to 1 in comparison to KNN?
Either algorithm would be effective on the dataset.
Hi, I’m running a bit of a different setup than yours.
The modules and version of python I’m using are more recent releases:
Python: 3.5.2 |Anaconda 4.2.0 (32-bit)| (default, Jul 5 2016, 11:45:57) [MSC v.1900 32 bit (Intel)]
scipy: 0.18.1
numpy: 1.11.3
matplotlib: 1.5.3
pandas: 0.19.2
sklearn: 0.18.1
And I’ve gotten SVM as the best algorithm in terms of accuracy at 0.991667 (0.025000).
Would you happen to know why this is, considering more recent versions?
I also happened to get a rather different boxplot but I’ll leave it at what I’ve said thus far.
Hi Dan,
You may get differing results for a variety of reasons. Small changes in the code will affect the result. This is why we often report mean and stdev of algorithm performance rather than one number, to give a range of expected performance.
This post on randomness in ml algorithms might also help:
https://machinelearningmastery.com/randomness-in-machine-learning/
Hi Jason
I can’t tell you how grateful I am … I have been trawling through lots of ML stuff to try to get started with a “toy” example. Finally I have found the tutorial I was looking for. Anaconda had old sklearn: 0.17.1 for Windows – which caused an error “ImportError: cannot import name ‘model_selection'”. That was fixed by running “pip install -U scikit-learn” from the Anaconda command-line prompt. Now upgraded to 0.18. Now everything in your imports was fine.
All other tutorials were either too simple or too complicated. Usually the latter!
Thank you again 🙂
Glad to hear it Duncan.
Thanks for the tip for Anaconda uses.
I’m here to help if you have questions!
Hi Jason,
Wonderful service. All of your tutorials are very helpful
to me. Easy to understand.
Expecting more tutorials on deep neural networks.
Malathi
You’re very welcome Malathi, glad to hear it.
Hi Jason
I managed to get it all working – I am chuffed to bits.
I get exactly the same numbers in the classification report as you do … however, when I changed both seeds to 8 (from 7), then ALL of the numbers end up being 1. Is this good, or bad? I am a bit confused.
Thanks again.
Well done Duncan!
What do you mean all the numbers end up being one?
Hi Jason
I’ve output the “accuracy_score”, “confusion_matrix” & “classification_report” for seeds 7, 9 & 10. Why am I getting a perfect score with seed=9? Many thanks.
(seed=7)
0.9
[[10 0 0]
[ 0 8 1]
[ 0 2 9]]
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 10
Iris-versicolor 0.80 0.89 0.84 9
Iris-virginica 0.90 0.82 0.86 11
avg / total 0.90 0.90 0.90 30
(seed=9)
1.0
[[13 0 0]
[ 0 9 0]
[ 0 0 8]]
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 13
Iris-versicolor 1.00 1.00 1.00 9
Iris-virginica 1.00 1.00 1.00 8
avg / total 1.00 1.00 1.00 30
(seed=10)
0.9666666666666667
[[10 0 0]
[ 0 12 1]
[ 0 0 7]]
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 10
Iris-versicolor 1.00 0.92 0.96 13
Iris-virginica 0.88 1.00 0.93 7
avg / total 0.97 0.97 0.97 30
Random chance. This is why it is a good idea to use cross-validation with many repeats and report mean and standard deviation scores.
More on randomness in machine learning here:
https://machinelearningmastery.com/randomness-in-machine-learning/
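For example, a minimal sketch of cross-validation with repeats (RepeatedStratifiedKFold is available in scikit-learn 0.19+; iris used as a stand-in dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
# 10 folds, repeated 3 times with different shuffles: 30 scores in total
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=7)
scores = cross_val_score(KNeighborsClassifier(), X, y, cv=cv, scoring='accuracy')
print('%.3f (%.3f)' % (scores.mean(), scores.std()))  # mean and standard deviation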
from sklearn import model_selection
This shows "ImportError: cannot import name model_selection".
You need to update your version of sklearn to 0.18 or higher.
Jason
Excellent Tutorial. New to Python and set a New Years Resolution to try to understand ML. This tutorial was a great start.
I hit the issue of the sklearn version. I am using Ubuntu 16.04 LTS, which comes with python-sklearn version 0.17. To update to the latest I used this site:
http://neuro.debian.net/install_pkg.html?p=python-sklearn
Which gives the commands to add the neuro repository and pull down the 0.18 version.
Also, I would like to note there is an error in section 3.1, Dimensions of the Dataset: your text states 120 instances when in fact 150 are returned, which is what you have in the printout box.
Keep up the good work.
Jim
I’m glad to hear you worked around the version issue Jim, nice work!
Thanks for the note on the typo, fixed!
Hi Jason, nice work here. I’m new to your blog. What does the y-axis in the box plots represent?
Hi Raphael,
The y-axis in the box-and-whisker plots is the scale or distribution of each variable.
Thank you for this wonderful tutorial.
You’re welcome Kayode.
hi Jason,
In this line
dataset.groupby('class').size()
What variables other than size could I use? I replaced size with count and got something similar, but not quite. I got key errors for the other things I tried. Is size just a standard command?
Great question Raphael.
You can learn more about Pandas groupby() here:
http://pandas.pydata.org/pandas-docs/stable/groupby.html
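For example, a few common GroupBy aggregations, shown on a toy DataFrame rather than the tutorial's dataset:

import pandas as pd

df = pd.DataFrame({'class': ['a', 'a', 'b'], 'value': [1.0, 2.0, 3.0]})
print(df.groupby('class').size())   # number of rows per group
print(df.groupby('class').count())  # non-null values per column, per group
print(df.groupby('class').mean())   # per-group mean of the numeric columns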
Jason,
I’m trying to use a different dataset (KDD Cup ’99) with the above code, but when I try to run the code after modifying "names" and the array to account for the new features, it will not run, as it gives me an error of "cannot convert string to a float".
In my dataset there are 3 columns that are text and the rest are integers and floats. I have tried LabelEncoding but it gives me the same error. Do you know how I can resolve this?
Hi Scott,
If the values are indeed strings, perhaps you can use a method that supports strings instead of numbers, perhaps like a decision tree.
If there are only a few string values for the column, a label encoding as integers may be useful.
Alternatively, perhaps you could try removing those string features from the dataset.
I hope that helps, let me know how you go.
I would like a chart to see the grand scope of everything for data science that Python can do.
You list 6 basic steps. For example, in the visualizing step, I would like to know what all the charts are, what they are used for, and what Python library each comes from.
I am extremely new to all this, and understand that some steps have to happen, for example:
1. Get Data
2. Validate Data
3. Missing Data
4. Machine Learning
5. Display Findings
So for missing data, there are techniques to restore the data; what are they and what libraries are used?
You can handle missing data in a few ways such as:
1. Remove rows with missing data.
2. Impute missing data (e.g. use the Imputer class in sklearn; see the sketch after this list).
3. Use methods that support missing data (e.g. decision trees)
I hope that helps.
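A minimal sketch of options 1 and 2 (SimpleImputer is the replacement for the older Imputer class in scikit-learn 0.20+):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
dropped = df.dropna()                     # option 1: remove rows with missing values
imputer = SimpleImputer(strategy='mean')  # option 2: fill gaps with the column mean
filled = imputer.fit_transform(df)
print(dropped)
print(filled)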
Hi Jason,
I am a non-tech data analyst and have used SPSS extensively on academic/business data over the last 6 years.
I understand the above example very easily.
I want to work on Search – Language Translation and develop apps.
What’s the best way forward?
Do you also provide Skype training / project mentoring?
Thanks in advance.
Thanks Mohammed.
Sorry, I don’t have good advice for language translation applications.
I don’t have any development/coding background.
However, following your guidelines I downloaded SciPy and tested the code.
Everything worked perfectly fine.
Looking forward to going all in…
I’m glad to hear that Mohammed
Hi Jason,
I am new to Machine learning and am trying out the tutorial. I have following environment :
>>> import sys
>>> print('Python: {}'.format(sys.version))
Python: 2.7.10 (default, Jul 13 2015, 12:05:58)
[GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)]
>>> import scipy
>>> print('scipy: {}'.format(scipy.__version__))
scipy: 0.18.1
>>> import numpy
>>> print('numpy: {}'.format(numpy.__version__))
numpy: 1.12.0
>>> import matplotlib
>>> print('matplotlib: {}'.format(matplotlib.__version__))
matplotlib: 2.0.0
>>> import pandas
>>> print('pandas: {}'.format(pandas.__version__))
pandas: 0.19.2
>>> import sklearn
>>> print('sklearn: {}'.format(sklearn.__version__))
sklearn: 0.18.1
When I try to load the iris dataset, it loads fine and prints dataset.shape, but then my Python interpreter hangs. I tried it 3-4 times and every time it hangs after I run a couple of commands on the dataset.
>>> url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
>>> names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
>>> dataset = pandas.read_csv(url, names=names)
>>> print(dataset.shape)
(150, 5)
>>> print(dataset.head(20))
sepal-length sepal-width petal-length petal-width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
>>> print(datase
It does not let me type anything further.
I would appreciate your help.
Thanks,
Purvi
Hi Purvi, sorry to hear that.
Perhaps you’re able to comment out the first parts of the tutorial and see if you can progress?
Hi Jason
I am planning to use Python to predict customer attrition. I have a current list of attrited customers with their attributes. I would like to use them as test data and use them to predict any new customers. Can you please help me approach the problem in Python?
My test data:
customer1 attribute1 attribute2 attribute3 … attrited
My new data:
customer N, attribute 1, …… ?
Thanks for your help in advance.
Hi Sam, as a start, this process will help you clearly define and work through your predictive modeling problem:
https://machinelearningmastery.com/start-here/#process
I’m happy to answer questions as you work through the process.
Hello Sir, I want to check how accurate my data is, as a percentage. In my data I have 4 columns:
Taluka, Total_yield, Rain(mm), types_of_soil
Nasik, 12555, 63.0, dark black
Igatpuri, 1560, 75.0, shallow
and so on.
First, I have to check whether the data is accurate or not, and the next step is to find the predicted yield using a regression model.
Here is my model: Total_yield = Rain + types_of_soil
I use 0 and 1 binary variables for types_of_soil.
Can you please help me: how do I calculate whether the data is accurate, and to what percentage? And how do I find the predicted yield?
I’m not sure I understand Kiran.
This process will help you describe and work through your predictive modeling project:
https://machinelearningmastery.com/start-here/#process
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
The dataset should load without incident.
If you do have network problems, you can download the iris.data file into your working directory and load it using the same method, changing url to the local file name.
I am a very new Python learner (trying to learn ML as well). I tried to load data from my local file but could not succeed. Will you help me out with how exactly the code should be written to open the data from a local file?
Sure.
Download the file as iris.data into your current working directory (where your python file is located and where you are running the code from).
Then load it as:
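A minimal sketch of that load, assuming iris.data is saved in the working directory:

import pandas

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv('iris.data', names=names)  # local file name instead of the URL
print(dataset.shape)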
Hi, Jason, first of all, thanks so much for this amazing lesson.
Just out of curiosity, I computed all the values obtained with dataset.describe() in Excel, and for the 25% value of petal-length I get 1.57500 instead of 1.60000. I have googled for formatting describe() output unsuccessfully. Is there an explanation? Thanks.
Not sure, perhaps you could look into the Pandas source code?
OK, I will do.
Hi Jason
I don’t quite follow the KFold section.
We started off with 150 data entries (rows).
We then use an 80/20 split for training/validation, which leaves us with 120.
The split of 10 boggles me.
Does it take 10 items from each class and train with 9? What does the 1 left over do then?
Hi jacques,
The 120 records are split into 10 folds. The model is trained on the first 9 folds and evaluated on the records in the 10th. This is repeated so that each fold is given a chance to be the hold out set. 10 models are trained, 10 scores collected and we report the mean of those scores as an estimate of the performance of the model on unseen data.
Does that help?
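A minimal sketch that prints the fold sizes (a stand-in array of 120 rows instead of the real training data):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(120)  # stand-in for the 120 training rows
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=10).split(X)):
    # each iteration trains on 108 rows and evaluates on the held-out 12
    print('fold %d: train=%d, test=%d' % (i + 1, len(train_idx), len(test_idx)))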
I am trying to integrate machine learning into a PHP website I have created. Is there any way I can do that using the guidelines you provided above?
I have not done this Alhassan.
Generally, I would advise developing a separate service that could be called using REST calls or similar.
If you are working on a prototype, you may be able to call out to a program or script from cgi-bin, but this would require careful engineering to be secure in a production environment.
Hi Jason! This tutorial was a great help; I’m truly grateful, so thank you.
I have one question about the tutorial though: in the scatterplot matrix, I can’t understand how we make the dots in the graphs whose variables have no relationship between them (like sepal-length with petal-width).
Could you or someone explain that please? How do you make a dot that represents the relationship between a certain sepal-length and a certain petal-width?
Hi Simão,
The x-axis is taken for the values of the first variable (e.g. sepal_length) and the y-axis is taken for the second variable (e.g. petal_width).
Does that help?
You match each iris instance’s length and width with each other. For example, iris instance number one is represented by a dot, and the dot’s values are that iris’s length and width! So when you take all these values and put them on a graph, you are basically checking to see if there is a relation. As you can see, in some of these plots the dots are scattered all around, but when we look at the petal-width vs petal-length graph it seems to be linear! This means that those two properties are clearly related. Hope this helped!
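A tiny sketch of the idea, with made-up values for illustration:

from matplotlib import pyplot

sepal_length = [5.1, 4.9, 6.3, 6.5, 5.0]  # one value per flower
petal_width = [0.2, 0.2, 1.8, 2.0, 0.3]   # same flowers, second attribute
pyplot.scatter(sepal_length, petal_width)  # dot i = (sepal_length[i], petal_width[i])
pyplot.xlabel('sepal-length')
pyplot.ylabel('petal-width')
pyplot.show()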
Hi Jason,
from France and just to say you “Thank you for this very clear tutorial!”
Sébastien
I’m glad you found it useful Sébastien.
Hi Jason,
I am new to ML & Python. Your post is encouraging and straight to the point of execution. Anyhow, I am facing the below error when running:
>>> validataion_size = 0.20
>>> X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state = seed)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘validation_size’ is not defined
What could I have missed? I didn’t get any errors in the previous steps.
My Environment details:
OS: Windows 10
Python : 3.5.2
scipy : 0.18.1
numpy : 1.11.1
sklearn : 0.18.1
matplotlib : 0.18.1
Hi Raj,
Double check you have the code from section “5.1 Create a Validation Dataset” where validation_size is defined.
I hope that helps.
Hey Jason,
Can you please explain what precision, recall, f1-score and support actually refer to?
Also, what do the numbers in a confusion matrix refer to?
[[ 7 0 0]
[ 0 11 1]
[ 0 2 9]]
Thanks.
Hi Roy,
You can learn all about the confusion matrix in this post:
https://machinelearningmastery.com/confusion-matrix-machine-learning/
You can learn all about precision and recall in this article:
https://en.wikipedia.org/wiki/Precision_and_recall
Hi Jason,
Thank you very much for your tutorial.
I am a little bit confused about the confusion matrix, because you are using a 3×3 matrix while it should be a 2×2 matrix.
Learn more about the confusion matrix here:
https://machinelearningmastery.com/confusion-matrix-machine-learning/
Hi Jason,
Now I unserstand the meaning of your confusion matrix, so I don’t need any explanation.
Thank you and best regards.
You’re welcome.
What code should I use to load data from my working directory?
This post will help you out Santosh:
https://machinelearningmastery.com/load-machine-learning-data-python/
Hi Jason,
I have a ValueError and I don’t know how to solve this problem.
My problem is like this:
ValueError: could not convert string to float: '2013-06-27 11:30:00.0000000'
Can you give me some information about fixing this problem?
Thank you
It looks like you are trying to load a date-time. You might need to write a custom function to parse the date-time when loading or try removing this column from your dataset.
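A hedged sketch of the parsing approach with pandas (the column name timestamp and the data are made up for illustration):

from io import StringIO
import pandas as pd

csv_data = StringIO('timestamp,value\n2013-06-27 11:30:00,1.5\n')
df = pd.read_csv(csv_data, parse_dates=['timestamp'])  # parse the column as datetime
print(df.dtypes)  # timestamp becomes datetime64, not a string
df = df.drop(columns=['timestamp'])  # or simply remove the column instead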
>>> for name, model in models:
… kfold=model_selection.Kfold(n_splits=10, random_state=seed)
… cv_results =model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
… results.append(cv_results)
… names.append(name)
… msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
… print(msg)
…
After typing this piece of code, it gives me this error. Can you please help me out, Jason? Since I am new to ML, I don’t have much idea about the error.
Traceback (most recent call last):
File “”, line 2, in
AttributeError: module ‘sklearn.model_selection’ has no attribute ‘Kfold’
The KFold function is case-sensitive. It is "model_selection.KFold(…)", not "model_selection.Kfold(…)".
update this line:
kfold=model_selection.KFold(n_splits=10, random_state=seed)
Thank you!
Hello Jason ,
Thanks for writing such a nice and explanatory article for beginners like me, but I have one concern. I tried finding an answer on other websites as well but could not come up with any solution.
Whatever I write inside the code editor (Jupyter QtConsole in my case), can this not be saved as a .py file and shared with my other team members over GitHub, maybe? I found some hacks, but I think there must be some proper way of sharing the code written in the editor, like without the outputs or plots in between.
You can write Python code in a text editor and save it as a myfile.py file. You can then run it on the command line as follows:
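For example, assuming the file was saved as myfile.py:
python myfile.py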
Consider picking up a book on Python.
Hello Jason,
Nice tutorial, I did this today.
I didn’t really understand everything { I will follow your advice, do it again, write all the questions down, and use the help function }.
The tutorial just works. It took me around 2 hours, typing every single line, installing all the dependencies, and running each block of code to check.
Thanks, I’ll be visiting your blog from time to time.
Regards,
Well done, and thanks for your support.
Post any questions you have as comments or email me using the “contact” page.
I am just a beginner too; I am using Visual Studio Code.
Looks good.
What exactly is a confusion matrix?
Great question, see this post:
https://machinelearningmastery.com/confusion-matrix-machine-learning/
Can I ask what the reason for this problem is? Thanks for your answer 🙂
(In my code it is just the section where I import all the needed libraries.)
I have all libraries up to date, but it still gives me this error:
File “C:\Users\64dri\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py”, line 32, in
from ..utils.fixes import rankdata
ImportError: cannot import name ‘rankdata’
( scipy: 0.18.1
numpy: 1.11.1
matplotlib: 1.5.3
pandas: 0.18.1
sklearn: 0.17.1)
Sorry, I have not seen this issue Dan, consider searching or posting to StackOverflow.
Jason,
You’re a rockstar. Thank you so much for this tutorial and for your books! It’s been hugely helpful in getting me started on machine learning. I was curious: is it possible to add a non-number property column, or will the algorithms only accept numbers?
For example, if there were a “COLOR” column in the iris dataset, and all Iris-setosa were blue, how could I get this program to accept and process that COLOR column? I’ve tried a few things and they all seem to fail.
Great question Cameron!
sklearn requires all input data to be numbers.
You can encode labels like colors as integers and model that.
Further, you can convert the integers to a binary encoding/one-hot encoding, which may be more suitable if there is no ordinal relationship between the labels.
Jason, thanks so much for replying! That makes a lot of sense. When you say binary/one-hot encoding, I assume you mean (continuing to use the colors example) adding a column for each color (R,O,Y,G,B,V) and, for each flower, putting a 1 in the column of its color and a 0 for all of the other colors?
That’s feasible for 6 colors (adding six columns), but how would I manage if I wanted to choose between 100 colors or 1000 colors? Are there other libraries that could help deal with that?
Yes you are correct.
Yes, sklearn offers LabelEncoder and OneHotEncoder classes.
Here is a tutorial to get you started:
https://machinelearningmastery.com/data-preparation-gradient-boosting-xgboost-python/
Awesome! thanks so much Jason!
You’re welcome, let me know how you go.
for name, model in models:
… kfold = cross_vaalidation.KFold(n=num_instances,n_folds=num_folds,random_state=seed)
… cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
File “”, line 3
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
^
SyntaxError: invalid syntax
>>> cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘model’ is not defined
>>> cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv=kfold, scoring=scoring)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘kfold’ is not defined
>>> cv_results = model_selection.cross_val_score(models, X_train, Y_train, cv =
kfold, scoring = scoring)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘kfold’ is not defined
>>> names.append(name)
Traceback (most recent call last):
File “”, line 1, in
NameError: name ‘name’ is not defined
I am new to Python and getting these errors after running the 5.3 models code. Please help me.
It looks like you might not have copied all of the code required for the example.
Hi, I went through your tutorial. It is super great!
I wonder whether you can recommend a dataset that is similar to the iris classification for me to practice on?
Thanks Mier,
I recommend some datasets here:
https://machinelearningmastery.com/practice-machine-learning-with-small-in-memory-datasets-from-the-uci-machine-learning-repository/
Hi Jason,
That’s an amazing tutorial, quite clear and useful.
Thanks a bunch!
Thanks Medine.
Hi Jason,
Can you let me know how can I start with Fraud Detection algorithms for a retail website ?
Thanks,
Sean
Hi Sean, this process will help you work through your problem:
https://machinelearningmastery.com/start-here/#process
You are doing great with your work.
I need your suggestion, i am working on my thesis here i need to work on machine learning.
Training : positive ,negative, others
Test : unknown data
Want to train machine with training and test with unknown data using SVM,Naive,KNN
How can i make the format of training and test data ?
And how to use those algorithms in it
Using which i can get the TP,TN,FP,FN
Thanking you..
This article might help:
https://en.wikipedia.org/wiki/Precision_and_recall
I’m new to machine learning and this was a really helpful tutorial. I have maybe a stupid question: I wanted to plot the predictions and the validation values and make a visual comparison, and it doesn’t seem like I really understood how I can plot it.
Can you please send me the piece of code with some explanations of how to do it?
thank you very much
You can use matplotlib, for example:
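A minimal sketch, with dummy stand-ins for the tutorial's Y_validation and predictions arrays:

from matplotlib import pyplot

expected = [0, 1, 2, 1, 0, 2, 1]   # stand-in for label-encoded Y_validation
predicted = [0, 1, 2, 2, 0, 2, 1]  # stand-in for the model's predictions
pyplot.plot(expected, marker='o', label='expected')
pyplot.plot(predicted, marker='x', label='predicted')
pyplot.legend()
pyplot.show()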
Thanks a lot. It was very helpful.
You’re welcome Kamol, I’m glad to hear it.
Hi
Sorry for a dumb question.
Can you briefly describe what the end result means (i.e., what the program has predicted)?
Given an input description of flower measurements, what species of flower is it?
We are predicting the iris flower species as one of 3 known species.
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)
Why am I getting the highest accuracy for SVM?
I’m a beginner; there was a similar query above but I couldn’t quite understand your reply.
Could you please help me out? Have I made a mistake somewhere?
“Why” is a very hard question to answer.
Our role is to find what works, ensure the results are robust, then figure out how we can use the model operationally.
Okay. Thanks a lot for the prompt response!
The tutorial was very helpful.
Glad to hear it Anusha.
Great tutorial Jason!
My question is, if I want some new data from a user, how do I do that? If in future I develop my own machine learning algorithm, how do I use it to get some new data?
What steps are taken to develop it?
And thanks for this tutorial.
Not sure I understand. Collect new data from your domain and store it in a CSV or write code to collect it.
Hi Jason,
I have a question regarding the step after training: once we know the better algorithm for our case, how could we know the rule formula that the algorithm produced, for future use?
And thanks for the tutorial, it's really helpful.
You can extract the weights if you like. Not sure I understand why you want the formula for the network. It would be complex and generally unreadable.
You can finalize the model, save the weights and topology for later use if you like.
The best algorithm for my use case was Classification and Regression Trees (CART), so how could I know the rules that the algorithm created for my use case?
How could I extract the weights and use them to evaluate new data?
Thanks for your prompt response
See this post on how to finalize your model:
https://machinelearningmastery.com/train-final-machine-learning-model/
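For the CART case specifically, here is a minimal sketch, assuming the X_train and Y_train arrays from the tutorial; it writes the learned split rules to a .dot file that you can read as text or render as a diagram:

from sklearn.tree import DecisionTreeClassifier, export_graphviz
# fit the tree, then export its decision rules for inspection
cart = DecisionTreeClassifier()
cart.fit(X_train, Y_train)
export_graphviz(cart, out_file='cart_rules.dot')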
Thank you so much. This document really helped me a lot; I had been searching for such a document for a long time. It gave me an actual view of how machine learning is implemented in Python. Books and courses are really difficult to understand completely and to begin project development from on such a vast topic; books and videos gave me lots of snippets, but I was not understanding how they all fit together.
I’m glad to hear that.
Can I get more such tutorials for a more detailed understanding? It would be really helpful.
Sure, see here:
https://machinelearningmastery.com/start-here/#python
I can't load the iris dataset, either through the URL or copied to the working folder, without getting NameError: name 'pandas' is not defined.
You need to install the Pandas library.
See this tutorial:
https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
I've already installed Anaconda with Python 3.6, and the pandas library is listed when I run versions.py. Everything had been fine up till trying to load the iris dataset. Do I need to use a different terminal within Anaconda?
You may need to close and re-open the terminal window, or maybe restart your system after installation.
Add the line "import pandas" at the top of your script.
Thanks Sunil!
Hi Jason,
Your tutorial is fantastic!
I'm trying to follow it but got stuck on 5.3 Build Models.
When I copy your code for this section, I get a few errors:
IndentationError: expected an indented block
NameError: name ‘model’ is not defined
NameError: name ‘cv_results’ is not defined
NameError: name ‘name’ is not defined
Could you please help me find what I’m doing wrong?
Thanks!
See the code and my "results" below:
>>> # Spot Check Algorithms
… models = []
>>> models.append(('LR', LogisticRegression()))
>>> models.append(('LDA', LinearDiscriminantAnalysis()))
>>> models.append(('KNN', KNeighborsClassifier()))
>>> models.append(('CART', DecisionTreeClassifier()))
>>> models.append(('NB', GaussianNB()))
>>> models.append(('SVM', SVC()))
>>> # evaluate each model in turn
… results = []
>>> names = []
>>> for name, model in models:
… kfold = model_selection.KFold(n_splits=10, random_state=seed)
File "<stdin>", line 2
kfold = model_selection.KFold(n_splits=10, random_state=seed)
^
IndentationError: expected an indented block
>>> cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'model' is not defined
>>> results.append(cv_results)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'cv_results' is not defined
>>> names.append(name)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'name' is not defined
>>> msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'name' is not defined
>>> print(msg)
Make sure you have the same tab indenting as in the example. Maybe re-add the tabs yourself after you copy-paste the code.
I'm having this same problem. How would I add the indentations after I paste the code? Whenever I paste the code, it executes automatically.
How to copy code from the tutorial:
1. Click the copy button on the code example (top right of code box, second from the end). This will select all code in the box.
2. Copy the code to the clipboard (Ctrl-C on Windows, Cmd-C on Mac, or right-click and click Copy).
3. Open your text editor.
4. Paste the code from the clip board.
This will preserve all white space.
Does that help?
Hi, one beginner question. What do we get after training is completed in supervised learning, for a classification problem? Do we get weights? How do I then use the trained model in the field, say for a real classification application? I didn't get what happens once training is completed. I tried this example: https://github.com/fchollet/keras/blob/master/examples/mnist_mlp.py and it printed the accuracy and loss on the test data. Then what now?
See this post on how to train a final model:
https://machinelearningmastery.com/train-final-machine-learning-model/
Wow… It’s really great stuff man…. Thanks you….
I’m glad to hear that.
As a complete beginner, it sounds so cool to predict the future. Then I saw all these model and complicated stuff, how do I even begin. Thank you for this. It is really great!
You’re very welcome.
Hello Jason,
I just started following your step-by-step tutorial for machine learning. In the importing-libraries step I followed each and every step you specified and installed all the libraries via conda, but I'm still getting the following error.
Traceback (most recent call last):
File “C:/Users/dell/PycharmProjects/machine-learning/load_data.py”, line 13, in
from sklearn.linear_model import LogisticRegression
File “C:\Users\dell\Anaconda2\lib\site-packages\sklearn\linear_model\__init__.py”, line 15, in
from .least_angle import (Lars, LassoLars, lars_path, LarsCV, LassoLarsCV,
File “C:\Users\dell\Anaconda2\lib\site-packages\sklearn\linear_model\least_angle.py”, line 24, in
from ..utils import arrayfuncs, as_float_array, check_X_y
ImportError: DLL load failed: Access is denied.
Can you please help me with this?
Thank You!
I have not seen this error and I don’t know about windows sorry.
It looks like you might not have admin permissions on your workstation.
Tutorial DEAP Version 2.1
https://www.youtube.com/watch?v=drd11htJJC0
A Data Envelopment Analysis (Computer) Program. This page describes the computer program Tutorial DEAP Version 2.1 which was written by Tim Coelli.
Thanks for sharing the link.
Good afternoon Dr. Jason, could you help me with the following problem? How could you modify the KNN algorithm to detect the most relevant variables?
You can use feature importance scores from bagged trees or gradient boosting.
Consider using sklearn to calculate and plot feature importance.
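Here is a minimal sketch, assuming the X_train and Y_train arrays from the tutorial; it fits an ensemble of trees and prints one importance score per input feature:

from sklearn.ensemble import RandomForestClassifier
# fit a forest of trees, then report the relative importance of each input feature
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, Y_train)
print(model.feature_importances_)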
Thank u…
I’m glad the post helped.
Hi Jason
Thanks for the great tutorial you provided.
I'm also new to ML and Python. I tried to use my own CSV file the way you used the iris dataset. It loads successfully, but the dataset then gives the following error:
could not convert string to float: LipCornerDepressor
LipCornerDepressor is a normal value such as 0.32145 in the Excel sheet, taken from SQL Server.
Here is the code without the library imports.
# Load dataset
url = "F:\FINAL YEAR PROJECT\Amila\FTdata.csv"
names = ['JawLower', 'BrowLower', 'BrowRaiser', 'LipCornerDepressor', 'LipRaiser', 'LipStretcher', 'Emotion_Id']
dataset = pandas.read_csv(url, names=names)
# shape
print(dataset.shape)
# class distribution
print(dataset.groupby('Emotion_Id').size())
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
This error might be specific to your data.
Consider double checking that your data is loaded as you expect. Maybe print some raw data or plots to confirm.
Thank you very much for the easy to follow tutorial.
I’m glad you found it useful.
Hi, Jason
Your posts are really good.
I'm very new to Python and machine learning.
Can you please suggest good reads to get the basics clear for machine learning?
Thanks.
A good place to start for python machine learning is here:
https://machinelearningmastery.com/start-here/#python
I hope that helps.
Outstanding work on this. I am curious how to port out results that show which records were matched to what in the predictor; when I print(predictions) it does not show which records they are paired with. Thanks!
Thanks!
The index can be used to align predictions with inputs. For example, the first prediction is for the first input, and so on.
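A minimal sketch, assuming the X_validation and predictions arrays from the tutorial; because predictions come back in the same order as the input rows, the two can simply be zipped together:

# print each input row next to the label predicted for it
for row, prediction in zip(X_validation, predictions):
    print(row, '=>', prediction)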
When I apply all the models and print the message, it shows the error that it cannot convert string to float. How do I resolve this error? My dataset is related to fake news: title, text, label.
Ensure you have converted your text data to numerical values.
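For free text like article titles and bodies, one common approach is a bag-of-words style encoding. A minimal sketch, where the texts list is a hypothetical stand-in for your documents:

from sklearn.feature_extraction.text import TfidfVectorizer
# turn raw strings into a numeric TF-IDF matrix that models can consume
texts = ['some fake headline', 'some real headline']
X = TfidfVectorizer().fit_transform(texts)
print(X.shape)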
Awesome tutorial on basics of machine learning using Python. Thank you Jason!
Thanks Shravan.
I'm using Anaconda Python and I was writing all the commands/programs at the 'python' command line. I'm trying to find a way to save this program to a file. I have tried '%save', but it errored out. Any thoughts?
You can write your programs in a text file then run them on the command line as follows:
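For example (assuming your script is saved as your_script.py, a hypothetical filename):

python your_script.py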
Thank you for the help and insight you provide. When I run the actual validation data through the algorithms, I get a different feel for which one may be the best fit.
Validation Test Accuracy:
LR…….0.80
LDA…..0.97
KNN….0.90
CART..0.87
NB…….0.83
SVM….0.93
My question is, should this influence my choice of algorithm?
Thank you again for providing such a wealth of information on your blog.
Yes, it should.
ML algorithms are stochastic and you need to evaluate them in a way that takes this into account.
This post might clarify what I mean:
https://machinelearningmastery.com/randomness-in-machine-learning/
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
From my dataset, when I give Y = array[:,1] it works, but if I give 2, 3, or 4 instead of 1 it gives the following error.
But all columns have a similar kind of data.
Traceback (most recent call last):
File “/alok/c-analyze/analyze.py”, line 390, in
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
File “/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py”, line 140, in cross_val_score
for train, test in cv_iter)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 758, in __call__
while self.dispatch_one_batch(iterator):
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 608, in dispatch_one_batch
self._dispatch(tasks)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 109, in apply_async
result = ImmediateResult(func)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 326, in __init__
self.results = batch()
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File “/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py”, line 238, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File “/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py”, line 468, in fit
self._solve_svd(X, y)
File “/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py”, line 378, in _solve_svd
fac = 1. / (n_samples - n_classes)
ZeroDivisionError: float division by zero
Perhaps take a closer look at your data.
But the data is very similar in all the columns.
I meant there is not much difference in the data from each column, but still it works only for the first column. It gives the above error for any other column I choose.
Have a look at the data :
index, column1, column2, column3, …, column8
0,238,240,1103,409,1038,4,67,0
1,41,359,995,467,1317,8,71,0
2,102,616,1168,480,1206,7,59,0
3,0,34,994,181,1115,4,68,0
4,88,1419,1175,413,1060,8,71,0
5,826,10886,1316,6885,2086,263,119,0
6,88,472,1200,652,1047,7,64,0
7,0,322,957,533,1062,11,73,0
8,0,200,1170,421,1038,5,63,0
9,103,1439,1085,1638,1151,29,66,0
10,0,1422,1074,4832,1084,27,74,0
11,1828,754,11030,263845,1209,10,79,0
12,340,1644,11181,175099,4127,13,136,0
13,71,1018,1029,2480,1276,18,66,1
14,0,3077,1116,1696,1129,6,62,0
…
Total 105 data records
But the above error does not occur for column 1, that is, when Y = column 1.
The same error happens when I choose any other column: 2, 3, or 4.
How do I plot a graph of the actual values against the predicted values here?
And how do I save these plots and view them again later from the terminal itself?
It would make for a dull graph as this is a classification problem.
You might be better off reviewing the confusion matrix of a set of predictions.
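A minimal sketch, assuming the Y_validation and predictions arrays from the final step of the tutorial:

from sklearn.metrics import confusion_matrix
# rows are the actual classes, columns are the predicted classes
print(confusion_matrix(Y_validation, predictions))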
How can this be applied to predict a value when a statistical dataset is given?
Say I am given the past 10 years of house prices and now I want to predict the value of a house in the next one or two years.
Can you help me out with this?
I'm an amateur in ML.
Thanks for this tutorial.
It gave me a good kickstart into ML.
I'm waiting for your reply.
This is called a time series forecasting problem.
You can learn more about how to work through time series forecasting problems here:
https://machinelearningmastery.com/start-here/#timeseries
I'm having trouble doing that; please help me out with a simple example.
Say I have a dataset of plumber work, where
the attributes are
experience_level, date, rating, and price/hour.
I want to predict the price/hour for the next date based on experience level and average rating. Can you please help me with this?
Sorry, I cannot write an example for you.
Great job with the tutorial, it was really helpful.
I want to ask, how can I use the techniques above with a dataset that is not just one line with a few values, but an N×3 matrix with multiple values (measurements from an accelerometer)? Is there a tutorial I can look up?
Each feature would be a different input variable as in the example above.
Hey Jason,
I have built a linear regression model. The y-intercept is abnormally high (0.3 million) and adjusted R² = 0.94. I would like to know: what does a high intercept mean?
Think of the intercept as the bias term.
Many books have been written on linear regression and much is known about how to analyze these models effectively. I would recommend diving into the statistics literature.
Excellent tutorial. I am moving from PHP to Python and taking baby steps. I used the Thonny IDE (http://thonny.org/), which is also very useful for Python beginners.
Thanks for sharing.
Thank you so much, Jason! I’m new to machine learning and python but found your tutorial extremely helpful and easy to follow – thank you for posting!
Thanks Tmoe, I’m really glad to hear that!
Thanks for all,now I am starting use ML!!!
I’m glad to hear that!
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
When I print models, this is the output:
[('LR', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)), ('LDA', LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
solver='svd', store_covariance=False, tol=0.0001)), ('KNN', KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform'))]
What are these extra values inside LogisticRegression(…) and all the other algorithms?
How did they get appended?
You can learn about them in the sklearn API:
http://scikit-learn.org/stable/modules/classes.html
When I print kfold:
KFold(n_splits=7, random_state=7, shuffle=False)
What is shuffle? How did this value get added, as we had only done this:
kfold = model_selection.KFold(n_splits=10, random_state=seed)
Whether or not to shuffle the dataset prior to splitting into folds.
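It defaults to False, which is why it shows up when the object is printed. A minimal sketch of turning it on, assuming the tutorial's model_selection import and seed variable:

# randomize row order before splitting into folds
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)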
Now I understand. Thanks, Jason, for the amazing tutorials. Just one suggestion: along with the code, give a link to a detailed reference on the topic!
Great suggestion, thanks pasha.
Hello jason
This is an amazing blog , Thank you for all the posts .
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
What's scoring here? Can you explain the model_selection.cross_val_score line in detail, please?
Thanks sita.
Learn more here:
http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score
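In short, cross_val_score runs the cross-validation loop for you: it fits the model on each training split and returns one score per fold, where the scoring argument names the metric. A minimal sketch, assuming the model, X_train, and Y_train variables from the tutorial:

# ten accuracy scores, one per fold of the training data
scores = model_selection.cross_val_score(model, X_train, Y_train, cv=10, scoring='accuracy')
print(scores.mean(), scores.std())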
Please help me with this error, Jason.
ERROR:
Traceback (most recent call last):
File “/rahman/c-analyze/analyze.py”, line 390, in
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
File “/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py”, line 140, in cross_val_score
for train, test in cv_iter)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 758, in __call__
while self.dispatch_one_batch(iterator):
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 608, in dispatch_one_batch
self._dispatch(tasks)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 109, in apply_async
result = ImmediateResult(func)
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 326, in __init__
self.results = batch()
File “/usr/lib64/python2.7/site-packages/sklearn/externals/joblib/parallel.py”, line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File “/usr/lib64/python2.7/site-packages/sklearn/model_selection/_validation.py”, line 238, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File “/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py”, line 468, in fit
self._solve_svd(X, y)
File “/usr/lib64/python2.7/site-packages/sklearn/discriminant_analysis.py”, line 378, in _solve_svd
fac = 1. / (n_samples - n_classes)
ZeroDivisionError: float division by zero
# Split-out validation dataset
My code:
array = dataset.values
X = array[:,0:4]
if field == "rh":    # No error if I select this column
    Y = array[:,0]
elif field == "rm":  # gives the above error
    Y = array[:,1]
elif field == "wh":  # gives the above error
    Y = array[:,2]
elif field == "wm":  # gives the above error
    Y = array[:,3]
Have a look at the data :
index, column1, column2, column3, …, column8
0,238,240,1103,409,1038,4,67,0
1,41,359,995,467,1317,8,71,0
2,102,616,1168,480,1206,7,59,0
3,0,34,994,181,1115,4,68,0
4,88,1419,1175,413,1060,8,71,0
5,826,10886,1316,6885,2086,263,119,0
6,88,472,1200,652,1047,7,64,0
7,0,322,957,533,1062,11,73,0
8,0,200,1170,421,1038,5,63,0
9,103,1439,1085,1638,1151,29,66,0
10,0,1422,1074,4832,1084,27,74,0
11,1828,754,11030,263845,1209,10,79,0
12,340,1644,11181,175099,4127,13,136,0
13,71,1018,1029,2480,1276,18,66,1
14,0,3077,1116,1696,1129,6,62,0
…
Total 105 data records
But the above error does not occur for column 1, that is, when Y = column 1.
The same error happens when I choose any other column: 2, 3, or 4.
Perhaps try scaling your data?
Perhaps try another algorithm?
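On the scaling suggestion, a minimal sketch, assuming the X array from the split step:

from sklearn.preprocessing import StandardScaler
# rescale each column to zero mean and unit variance before modeling
X_scaled = StandardScaler().fit_transform(X)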
fac = 1. / (n_samples - n_classes)
ZeroDivisionError: float division by zero
What is this error: fac = 1. / (n_samples - n_classes)?
Where are n_samples and n_classes used?
What may be the possible reason for this error?
Thank you Dr Jason, it is really very helpful. 🙂
You’re welcome bob, I’m glad to hear that!
Hi Jason
Great starting tutorial to get the whole picture. Thank you:)
I am a newbie to machine learning. Could you please tell me why you have specifically chosen these 6 models?
No specific reason, just a demonstration of spot checking a suite of methods on the problem.
Hi Jason, I am new to Python, but found this blog really helpful. I tried executing the code and it returned all the results you mention above, except a few graphs.
The scatter matrix graph and the evaluation of the 6 algorithms did not open on my machine, but they show results on my colleague's machine. I checked all the versions and they are the same as or higher than you mentioned in the blog.
Can you help me resolve this issue on my machine?
Perhaps check the configuration of matplotlib and ensure you can create simple graphs on your machine?
Great tutorial.
How do I approach it when the dataset is not a classification type and there are just 2 attributes: 1 is the input and the other is the output?
Say I have the number of processes as input and CPU usage as output.
The dataset looks like [10, 5], [15, 7], etc.
If the output is real-valued, it would be a regression problem. You would need to use a loss function like MSE.
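A minimal sketch of that regression case, with hypothetical process-count/CPU-usage pairs standing in for real data:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# hypothetical data: number of processes in, CPU usage out
X = [[10], [15], [20], [25]]
y = [5, 7, 9, 12]

model = LinearRegression().fit(X, y)
print(mean_squared_error(y, model.predict(X)))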
Many thanks for this — I already got a lot out of this. I feel like a monkey though because I was neither familiar enough with python nor had any clue of ML back alleys yesterday. Today I can see plots on my screen and even if I have no clue what I’m looking at, this is where I wanted to be, so thanks!
A few minor suggestions to make this perhaps even more dummy-proof:
- I'm on Mac and I used python3 because python2 is weirdly set up out of the box and you can't easily update the libraries needed. I understand you link, rightfully, to external installation instructions, so just to say: this stuff works in python3, if you needed further testimony.
- When drawing plots, I started freaking out because the terminal became unresponsive. So if you just made an (unessential) suggestion to run plt.ion() first, linking to, for example, https://matplotlib.org/faq/usage_faq.html#what-is-interactive-mode, it might help dummies like me to not give up too easily. (BTW I find your use-the-command-line, don't-let-toolsets-get-in-the-way philosophy a great one indeed!)
- There seems to be some 'hack' involved when defining the dataset: suppose there are no headers and so on… how do you get to load your dataset with an insightful name vector in the first place (you don't…). So just a hint of clarification would help here, so we feel we can trust that we are doing the right thing in this case because the data is well understood (I mean, this is not really a big deal, it's all par for the course, but if I didn't have similar experience in R I'd feel completely lost, I think).
I was a bit puzzled by the following sentence in 3.3:
“We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.”
Well, just looking at the table, I actually can’t see any of this. There is in fact really nothing telling this to us in the snippet, right? The sentence is a comment based on prior understanding of the dataset. Maybe this could be clarified so clueless readers don’t agonise over whether they are missing some magical power of insight.
- Overall, I could run this and to some extent adapt it quickly to a different dataset, until it became relevant what the data was like. I'm stumbling on the data manipulation for 5.1. I suppose it is both because I don't know Python structures and also because I have no clue what is being done in the selection step.
I think in answer to a previous comment you link to doc for the relevant selection function, perhaps it would still be useful to have an extra, ‘for dummies’, detailed explanation of
X = array[:,0:4]
Y = array[:,4]
in the context of the iris dataset. This is what I have to figure out, I think, in order to apply it to, say, an 11-column dataset, and it would be useful to know what I'm trying to do.
The rest of the difficulties I have are with regard to interpretation of the output, and it is fair to say this is outside the scope of your tutorial, which puts dummies like me in a very good position to try to understand while being able to fiddle with a bit of code. All the above comments are extremely minor and really about polishing the readability for ultimate noobs; they are not really important, and your tutorial is a great and efficient resource.
Thanks again!
Pierre
Wonderful feedback pierre, thank you so much!
I'm not able to figure out what errors the confusion matrix represents, and what each column (precision, recall, f1-score, support) in the classification report signifies.
And last but not least, thanks a lot, Sir, for this easy-to-use and wonderful tutorial. Words are not enough to express my gratitude; you have made a daunting task a whole lot easier for every ML enthusiast!
You can learn more about the confusion matrix here:
https://machinelearningmastery.com/confusion-matrix-machine-learning/
Thanks a lot, Sir. Please suggest some datasets from the UCI repository on which I can practice some small projects.
See here:
https://machinelearningmastery.com/practice-machine-learning-with-small-in-memory-datasets-from-the-uci-machine-learning-repository/
How do you classify problems into different categories? For example, the iris dataset was a multi-class classification problem and pima-indians-diabetes a binary problem. How can we figure out which problem belongs to which category, and which model to apply to it?
By careful evaluation of the output variable.
Is this machine learning? What does the machine learn in this example? This is just plain statistics, used in a weird way…
Yes, it is.
Nominally, statistics is about understanding the data, machine learning about making predictions at the cost of understanding.
Your question can be answered like this…
Consider the formula for the area of a triangle: 1/2 x base x height. When you learn this formula, you understand it and apply it many times for different triangles. BUT you did not learn anything ABOUT the formula itself. For instance, how many people care that the formula has 2 variables (base and height), that there is no CONSTANT (like PI) in the formula, and many such things about the formula itself? Applying the formula does not teach anything about the nature of the formula itself.
A lot of program execution in computers happens much the same way: data is a thing to be modified, applied, or used, but not necessarily understood. When you introduce some techniques to understand data, then the computer or the 'machine' 'learns' that there are characteristics of that data, and that, at the least, there exists some relationship amongst the data in the dataset. This learning is not explicitly programmed but inferred, although, confusingly, the algorithms themselves are explicitly programmed to infer the meaning of the dataset. The learning is then carried through to the end task of making predictions based on the gained understanding of the data.
But like you pointed out, it is still statistics and all its domain techniques. Yet as a statistician, do you not 'learn' more about data than merely use it, unlike your counterparts who see data more as a commodity to be consumed? Because most computer systems do the latter (consumption) rather than the former (data understanding), a system that understands data (with prediction used as proof of learning) can be called 'machine learning'.
Thanks for good tutorial Jason.
The only issue I encountered is the following error during the cross-validation score calculation for the KNeighborsClassifier() model:
AttributeError: 'NoneType' object has no attribute 'issparse'
Has somebody else gotten the same error? How can it be solved?
I have installed the following versions of tools:
Python: 2.7.13 |Anaconda custom (64-bit)| (default, Dec 19 2016, 13:29:36) [MSC v.1500 64 bit (AMD64)]
scipy: 0.19.0
numpy: 1.12.1
matplotlib: 2.0.0
pandas: 0.19.2
sklearn: 0.18.1
Thanks,
Alex
Ouch, sorry I have not seen this issue. Perhaps search on stackoverflow?
HI, Jason!
How can I get the xgboost algorithm in pseudocode or in code?
You can read the code here:
https://github.com/dmlc/xgboost
I expect it is deeply confusing to read.
For an overview of gradient boosting, see this post:
https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/
Sir, I've been working on the banknote authentication dataset, and after applying the above procedure carefully, the results were 100% accuracy (on both the training and validation datasets) using the SVM and KNN models. Is 100% accuracy possible, or have I done something wrong?
That sounds great.
If I were to get surprising results, I would be skeptical of my code/models.
Work hard to ensure your system is not fooling you. Challenge surprising results.
Sir, I've considered various other aspects like f1-score, recall, and support, but in each case the result is the same 100%. How can I make sure that my system is not fooling me? What other procedure can I apply to check the accuracy on my dataset?
Get more data and see if the model can make accurate predictions.
Hi, Jason!
I am new to Python as well as ML, so I am getting the below error while running your code. Please help me get the code up and running.
File “sample1.py”, line 73, in
predictions = knn.predict(X_validation)
File “/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/classification.py”, line 143, in predict
X = check_array(X, accept_sparse=’csr’)
File “/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py”, line 407, in check_array
_assert_all_finite(array)
File “/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py”, line 58, in _assert_all_finite
” or a value too large for %r.” % X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype(‘float64’).
and my config
Python: 2.7.6 (default, Oct 26 2016, 20:30:19)
[GCC 4.8.4]
scipy: 0.13.3
numpy: 1.8.2
matplotlib: 1.3.1
pandas: 0.13.1
sklearn: 0.18.1
running in Ubuntu Terminal.
You may have a NaN value in your dataset. Check your data file.
Hello. This is really an amazing tutorial. I got down to everything but when selecting the best model i hit a snag. Can you help out?
Traceback (most recent call last):
File “/Users/sahityasehgal/Desktop/py/machinetest.py”, line 77, in
cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=’accuracy’)
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/model_selection/_validation.py”, line 140, in cross_val_score
for train, test in cv_iter)
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py”, line 758, in __call__
while self.dispatch_one_batch(iterator):
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py”, line 608, in dispatch_one_batch
self._dispatch(tasks)
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py”, line 571, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 109, in apply_async
result = ImmediateResult(func)
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/_parallel_backends.py”, line 326, in __init__
self.results = batch()
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/externals/joblib/parallel.py”, line 131, in __call__
return [func(*args, **kwargs) for func, args, kwargs in self.items]
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/model_selection/_validation.py”, line 238, in _fit_and_score
estimator.fit(X_train, y_train, **fit_params)
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/linear_model/logistic.py”, line 1173, in fit
order=”C”)
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/utils/validation.py”, line 526, in check_X_y
y = column_or_1d(y, warn=True)
File “/Users/sahityasehgal/Library/Python/2.7/lib/python/site-packages/sklearn/utils/validation.py”, line 562, in column_or_1d
raise ValueError(“bad input shape {0}”.format(shape))
ValueError: bad input shape (94, 4)
Ouch. Are you able to confirm that you copied all of the code exactly?
Also, are you able to confirm that your sklearn is up to date?
Yes, I copied the code exactly as on the site. sklearn: 0.18.1.
Thoughts?
I’m not sure but I expect it has something to do with your environment.
This tutorial may help with your environment:
https://machinelearningmastery.com/setup-python-environment-machine-learning-deep-learning-anaconda/
Very insightful Jason, thank you for the post!
I was wondering if the models can be saved to/loaded from file, to avoid re-training a model each time we wish to make a prediction.
Thanks,
Rene
Yes, see this post:
https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
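A minimal sketch of the idea, assuming a fitted model named model and the X_validation array from the tutorial:

from pickle import dump, load
# save the fitted model to disk, then load it back later for predictions
dump(model, open('model.pkl', 'wb'))
loaded_model = load(open('model.pkl', 'rb'))
print(loaded_model.predict(X_validation))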
Mr. Brownlee,
This is, by far, the most effective applied technology tutorial I have utilized.
You get right to the point and still have readers actually working with python, python libraries, IDE options, and of course machine learning. I am an electromechanical engineer with embedded C experience. Until now, I have been bogged down trying to traipse through python wizards’ idiosyncratic coding styles and verbose machine learning theory knowing there exists a friendlier path.
Thank you for showing me the way!
Rich
Thanks Rich, you made my day! I’m glad it helped.
This was very informative….Thank You !
Actually I was working on a Twitter analysis project in Python where I am extracting user interests from their tweets. I was thinking of using the naive bayes classifier in the TextBlob Python library, training the classifier with different types of pre-labeled tweets for different categories like politics, sports, etc.
My only concern is whether it will be accurate, as I tried passing about 10 tweets in the training set and based on that I tried classifying my test set. I am getting some false cases and accuracy is around 85%.
Good question, I’d suggest try it and see.
Hi Jason,
This was great example. I was looking for something similar on internet all this time,glad I found this link. I wanted to compile a ML code end-to-end and see my basic infra is ready to start with the actual course work. As you said, from here we can learn more about each algorithm in detail. It would be great if you can start a Youtube channel and upload some easy to learn videos as well related to ML, Deep learning and Neural Networks.
Regards,
Kush Singh
Thanks.
Take a look at the rest of my blog and my books. I am dedicated to this mission.
I've been working on a dataset which contains [Male, Female, Infant] as the entries in the first column; all the other columns are integers. How can I replace [Male, Female, Infant] with a numeric notation like [0, 1, 2] or something similar? What is the most efficient way to do it?
Excellent question.
Use a LabelEncoder:
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
I’m sure I have tutorials on this on my blog, try the blog search.
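A minimal sketch of the LabelEncoder approach, with a hypothetical column standing in for your data:

from sklearn.preprocessing import LabelEncoder
# map string categories to integers (alphabetical order: Female=0, Infant=1, Male=2)
column = ['Male', 'Female', 'Infant', 'Male']
print(LabelEncoder().fit_transform(column))  # [2 0 1 2]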
Sir, while loading the dataset we have given the URL, but what if we already have the data and want to load it from disk?
Change the URL to a filename and path.
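A minimal sketch, assuming a local copy saved as iris.csv (a hypothetical filename) in the working folder:

import pandas
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv('iris.csv', names=names)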
Hi,
Nice tutorial, thanks!
Just a small note in case someone encounters the same issue as me:
if you get the error "This application failed to start because it could not find or load the Qt platform plugin 'windows'"
when you are trying to see your data visualizations, it's maybe (like in my case) because you are using PySide rather than PyQt.
In that case, add these lines before the "import matplotlib.pyplot as plt":
import matplotlib
matplotlib.use('Qt4Agg')
matplotlib.rcParams['backend.qt4'] = 'PySide'
Hope this will help
Thanks for the tip Vincent.
Fantastic tutorial! Running it today I noticed two changes from the tutorial above (undoubtedly because time has passed since it was created). New users might find the following observations useful:
#1 – Future Warning
Ran on OS X, Python 3.6.1, in a jupyter notebook, anaconda 4.4.0 installed:
scipy: 0.19.0
numpy: 1.12.1
matplotlib: 2.0.2
pandas: 0.20.1
sklearn: 0.18.1
I replaced this line in the #Load libraries code block:
from pandas.tools.plotting import scatter_matrix
With this:
from pandas.plotting import scatter_matrix
…because a FutureWarning popped up:
/Users/xxx/anaconda/lib/python3.6/site-packages/ipykernel_launcher.py:2: FutureWarning: 'pandas.tools.plotting.scatter_matrix' is deprecated, import 'pandas.plotting.scatter_matrix' instead.
Note: it does run perfectly even without this fix; this may become more of an issue in the future.
#2 – SVM wins!
In the build models section, the results were:
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.966667 (0.040825)
NB: 0.975000 (0.053359)
SVM: 0.991667 (0.025000)
… which means SVM was better here. I added the following code block based on the KNN one:
# Make predictions on validation dataset
svm = SVC()
svm.fit(X_train, Y_train)
predictions = svm.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
which gets these results:
0.933333333333
[[ 7 0 0]
[ 0 10 2]
[ 0 0 11]]
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 7
Iris-versicolor 1.00 0.83 0.91 12
Iris-virginica 0.85 1.00 0.92 11
avg / total 0.94 0.93 0.93 30
I did also run the unmodified KNN block – # Make predictions on validation dataset – and got the exact results that were in the tutorial.
Excellent tutorial, very clear, and easy to modify 🙂
Thanks for sharing Danielle.
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 7
Iris-versicolor 1.00 0.83 0.91 12
Iris-virginica 0.85 1.00 0.92 11
How do I relate this result to the input? I mean, can I interactively provide values for sepal-length, sepal-width, petal-length, and petal-width and get back which class the result is?
Great question.
You can use a LabelEncoder to map the string class labels to integers, and keep the object to reverse the conversion back to strings for predictions.
https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html
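A minimal sketch, assuming the Y array of string class labels from the tutorial:

from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(Y)             # strings -> integers
y_labels = encoder.inverse_transform(y_encoded)  # integers -> back to strings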
So this intro shows how to set everything up, but not the actual interesting bit: how to use it?
What do you mean exactly? Putting the model into production? See here:
https://machinelearningmastery.com/deploy-machine-learning-model-to-production/
Excellent tutorial, sir. I love your tutorials and I am starting with deep learning with Keras.
I would love it if you could provide a tutorial for a sequence-to-sequence model using Keras and a relevant dataset.
Also, I would be obliged if you could point me in some direction towards named entity recognition using seq2seq.
I have one here:
https://machinelearningmastery.com/learn-add-numbers-seq2seq-recurrent-neural-networks/
Hi Jason,
Awesome tutorial. I am working on the PIMA dataset, and while using the following command
# head
print(dataset.head(20))
I am getting NaN. Please help me.
Confirm you downloaded the dataset and that the file contains CSV data with nothing extra or corrupted.
Hi Jason,
I downloaded the dataset from UCI, which is a CSV file, but I still get NaN.
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data"
Thanks..
Sorry, I do not see how this could be. Perhaps there is an issue with your environment?
Hello Jason,
Thank you for a great tutorial.
I have noticed something , which I would like to share with you.
I have tried with random_state = 4:
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=0.2, random_state=4)
and surprisingly now LDA has the best accuracy.
LR: 0.966667 (0.040825)
LDA: 0.991667 (0.025000)
KNN: 0.975000 (0.038188)
CART: 0.958333 (0.055902)
NB: 0.950000 (0.055277)
SVM: 0.983333 (0.033333)
Any thoughts on this?
Machine learning algorithms are stochastic:
https://machinelearningmastery.com/randomness-in-machine-learning/
Hi Jason,
Thanks for your great example; this is really helpful. This end-to-end project is the best way to learn ML, much better than textbooks, which only focus on the separate concepts, not the whole forest. Will you please do more examples like this and explain them in detail next time?
Thanks,
Rui
Thanks.
__init__() got an unexpected keyword argument 'n_splites'
I am getting this error while running the code up to the print(msg) command.
Can you please help me remove it?
Update your version of sklearn to 0.18 or higher.
This is a beautiful tutorial for starters.
I am a lover of machine learning and want to do some projects and research on it.
I would really need your help and guidance from time to time.
Regards,
Fahad
Thanks.
Hi Jason,
Love the article; it gave me a good start on understanding machine learning. One thing I would like to ask: what is the predicted outcome? Is it which type or "class" of flower will happen next? I assume, switching things up, I could use this same outline as a way of getting a prediction for the other columns involved?
Yes, the prediction is a number that maps to a specific class of flower (string).
Correct, from the class and other measures you could predict width or something.
Hi again Jason,
Diving deeper into this tutorial and analyzing more, I found something that piqued my interest; maybe you can shed light on it. Based on the seed of 7, you get a higher accuracy percentage for the KNN algorithm after using k-fold, but when showing the information for the LDA algorithm, it has a higher accuracy_score after predicting with it. What could this mean?
Machine learning algorithms are stochastic.
It is important to develop a robust estimate of the performance of machine learning models on unseen data using repeats. See this post:
https://machinelearningmastery.com/evaluate-skill-deep-learning-models/
Another great read Jason. This whole site is full of great pieces and it gives me a good answer on my question. I want to thank you for your time and effort into making such a great place for all this knowledge.
Thanks, I’m glad it helps Neal. Stick with it!
Hello Jason,
At the beginning of your tutorial you write: “If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.”
No offense, but in this regard your tutorial is not doing a very good job.
You don't really go into detail so that we can understand what is being done and why. The explanations are rather weak.
Wrong expectations set, I believe.
Cheers,
Thomas
It is a starting point, not a panacea.
Sorry that it’s not a good fit for you.
Hi Jason! I am trying to adapt this for a purely binary dataset; however, I'm running into this problem:
# evaluate each model in turn
results = []
name = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
I get the error:
raise ValueError("Unknown label type: %r" % y_type)
ValueError: Unknown label type: 'unknown'
Am I missing something, any help would be great!
All necessary indentations are correct, it just pasted incorrectly
You can wrap pasted code in pre tags.
Sorry, the fault is not obvious to me.
Hello Mariah,
Did you ever get a solution to this problem?
Jason..great guide here..THANKS!
Hi. What should I do to make predictions based on my own test set? Say I need to predict the category of a flower with the data [5.2, 1.8, 1.6, 0.2], i.e., I want to change my X_test to that array, and the prediction should be something like "setosa".
What changes should I make? I tried giving that value directly to predict(), but it crashes.