How to Make Predictions with scikit-learn

By Jason Brownlee on January 10, 2020 in Python Machine Learning

How to predict classification or regression outcomes
with scikit-learn models in Python.

Once you choose and fit a final machine learning model in scikit-learn, you can use it to make predictions on new data instances.

There is some confusion amongst beginners about how exactly to do this. I often see questions such as:

How do I make predictions with my model in scikit-learn?

In this tutorial, you will discover exactly how you can make classification and regression predictions with a finalized machine learning model in the scikit-learn Python library.

After completing this tutorial, you will know:

How to finalize a model in order to make it ready for making predictions.
How to make class and probability predictions in scikit-learn.
How to make regression predictions in scikit-learn.

Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.

Let’s get started.

Updated Jan/2020: Updated for changes in scikit-learn v0.22 API.

Tutorial Overview

This tutorial is divided into 3 parts; they are:

First Finalize Your Model
How to Predict With Classification Models
How to Predict With Regression Models

1. First Finalize Your Model

Before you can make predictions, you must train a final model.

You may have trained models using k-fold cross-validation or train/test splits of your data. This was done to give you an estimate of the skill of the model on out-of-sample data, i.e. new data.

These models have served their purpose and can now be discarded.

You now must train a final model on all of your available data.

You can learn more about how to train a final model here:

How to Train a Final Machine Learning Model
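
As a rough sketch of the workflow described above, the snippet below first estimates model skill with cross-validation and then fits a fresh model on all of the data. The synthetic make_blobs dataset, the 5-fold setting, and the accuracy metric are assumptions chosen for illustration, not recommendations.

```python
# sketch: estimate skill with cross-validation, then fit a final model on all data
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic dataset standing in for "all of your available data" (an assumption)
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)

# step 1: estimate out-of-sample skill; the models fit inside this call are discarded
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print("Estimated accuracy: %.3f" % scores.mean())

# step 2: fit the final model on every available row
model = LogisticRegression()
model.fit(X, y)
```

The cross-validation scores describe how well the modeling procedure is expected to perform; the final model itself is the one fit in step 2.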

2. How to Predict With Classification Models

Classification problems are those where the model learns a mapping between input features and an output feature that is a label, such as “spam” and “not spam.”

Below is sample code of a finalized LogisticRegression model for a simple binary classification problem.

Although we are using LogisticRegression in this tutorial, the same functions are available on practically all classification algorithms in scikit-learn.

# example of training a final classification model
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)

After finalizing your model, you may want to save the model to file, e.g. via pickle. Once saved, you can load the model any time and use it to make predictions. For an example of this, see the post:

Save and Load Machine Learning Models in Python with scikit-learn
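
A minimal sketch of saving and loading with pickle might look like the following. The filename final_model.pkl and the use of a temporary directory are hypothetical choices made for this illustration.

```python
# sketch: save a finalized model with pickle, load it, and make a prediction
import os
import pickle
import tempfile
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# fit the final model on all available data
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
model = LogisticRegression()
model.fit(X, y)

# "final_model.pkl" is a hypothetical filename; a temporary directory keeps the sketch tidy
path = os.path.join(tempfile.mkdtemp(), "final_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# later (possibly in another script): load the model and use it
with open(path, "rb") as f:
    loaded = pickle.load(f)

Xnew = [[-0.79415228, 2.10495117]]
print("Original=%s, Loaded=%s" % (model.predict(Xnew)[0], loaded.predict(Xnew)[0]))
```

Only load pickle files from sources you trust, and ideally load them with the same scikit-learn version that was used to save them.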

For simplicity, we will skip this step for the examples in this tutorial.

There are two types of classification predictions we may wish to make with our finalized model; they are class predictions and probability predictions.

Class Predictions

A class prediction is: given the finalized model and one or more data instances, predict the class for the data instances.

We do not know the outcome classes for the new data. That is why we need the model in the first place.

We can predict the class for new data instances by calling the predict() function on our finalized classification model in scikit-learn.

For example, we have one or more data instances in an array called Xnew. This can be passed to the predict() function on our model in order to predict the class values for each instance in the array.

Xnew = [[...], [...]]
ynew = model.predict(Xnew)

Multiple Class Predictions

Let’s make this concrete with an example of predicting multiple data instances at once.

# example of making multiple class predictions
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)
# new instances where we do not know the answer
Xnew, _ = make_blobs(n_samples=3, centers=2, n_features=2, random_state=1)
# make a prediction
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
for i in range(len(Xnew)):
	print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))

Running the example predicts the class for the three new data instances, then prints the data and the predictions together.

X=[-0.79415228  2.10495117], Predicted=0
X=[-8.25290074 -4.71455545], Predicted=1
X=[-2.18773166  3.33352125], Predicted=0

Single Class Prediction

If you have just one new data instance, you can provide it to the predict() function wrapped in an array; for example:

# example of making a single class prediction
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)
# define one new instance
Xnew = [[-0.79415228, 2.10495117]]
# make a prediction
ynew = model.predict(Xnew)
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))

Running the example prints the single instance and the predicted class.

X=[-0.79415228, 2.10495117], Predicted=0

A Note on Class Labels

When you prepared your data, you may have mapped the class values from your domain (such as strings) to integer values, perhaps using a LabelEncoder.

This LabelEncoder can be used to convert the integers back into string values via the inverse_transform() function.

For this reason, you may want to save (pickle) the LabelEncoder used to encode your y values when fitting your final model.
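
The round trip from string labels to integers and back can be sketched as follows. The "spam" and "not spam" strings attached to the synthetic dataset are made up for illustration.

```python
# sketch: encode string labels, fit a model, and decode a prediction back to a string
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# synthetic data; the string labels below are hypothetical stand-ins for domain values
X, y_int = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
names = ["not spam", "spam"]
y_str = [names[i] for i in y_int]

# map strings to integers and keep the fitted encoder for later
encoder = LabelEncoder()
y = encoder.fit_transform(y_str)

# fit the final model on the integer-encoded labels
model = LogisticRegression()
model.fit(X, y)

# predict an integer class, then convert it back to the original string
Xnew = [[-0.79415228, 2.10495117]]
ynew = model.predict(Xnew)
labels = encoder.inverse_transform(ynew)
print("X=%s, Predicted=%s (%s)" % (Xnew[0], ynew[0], labels[0]))
```

The fitted encoder object can be pickled alongside the model so that the same mapping is available at prediction time.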

Probability Predictions

Another type of prediction you may wish to make is the probability of the data instance belonging to each class.

This is called a probability prediction: given a new instance, the model returns the probability for each outcome class as a value between 0 and 1.

You can make these types of predictions in scikit-learn by calling the predict_proba() function, for example:

Xnew = [[...], [...]]
ynew = model.predict_proba(Xnew)

This function is only available on those classification models capable of making a probability prediction, which is most, but not all, models.
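
One way to check whether a given model offers probability predictions is to test for the method before calling it. The sketch below compares LogisticRegression with LinearSVC, picked here only as an example of a classifier that does not define predict_proba().

```python
# sketch: check whether a classifier exposes predict_proba() before relying on it
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

for candidate in (LogisticRegression(), LinearSVC()):
    name = type(candidate).__name__
    supported = hasattr(candidate, "predict_proba")
    print("%s supports predict_proba: %s" % (name, supported))
```

Models without predict_proba() often provide decision_function() instead, which returns scores rather than probabilities.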

The example below makes a probability prediction for each instance in the Xnew array of data instances.

# example of making multiple probability predictions
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)
# new instances where we do not know the answer
Xnew, _ = make_blobs(n_samples=3, centers=2, n_features=2, random_state=1)
# make a prediction
ynew = model.predict_proba(Xnew)
# show the inputs and predicted probabilities
for i in range(len(Xnew)):
	print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))

Running the example makes the probability predictions and then prints each input data instance along with the probability of it belonging to the first and second classes (0 and 1).

X=[-0.79415228 2.10495117], Predicted=[0.94556472 0.05443528]
X=[-8.25290074 -4.71455545], Predicted=[3.60980873e-04 9.99639019e-01]
X=[-2.18773166 3.33352125], Predicted=[0.98437415 0.01562585]

This can be helpful in your application if you want to present the probabilities to the user for expert interpretation.
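
To relate the probabilities back to class labels, it may help to know that the columns returned by predict_proba() follow the order of the model's classes_ attribute. The sketch below, which reuses the synthetic dataset from the examples above, checks that each row sums to 1 and that the most probable class matches the output of predict() for this LogisticRegression model.

```python
# sketch: map predicted probabilities back to class labels
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# fit the final model
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
model = LogisticRegression()
model.fit(X, y)

# new instances where we do not know the answer
Xnew, _ = make_blobs(n_samples=3, centers=2, n_features=2, random_state=1)
probs = model.predict_proba(Xnew)

# columns of probs are ordered the same way as model.classes_
print("Class order:", model.classes_)

# each row holds one probability per class and sums to 1
print("Rows sum to 1:", np.allclose(probs.sum(axis=1), 1.0))

# the most probable class per row, looked up via classes_
top = model.classes_[np.argmax(probs, axis=1)]
print("Matches predict():", (top == model.predict(Xnew)).all())
```

With more than two classes, np.argsort() on each row could be used in the same way to rank the classes from most to least probable.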

3. How to Predict With Regression Models

Regression is a supervised learning problem where, given input examples, the model learns a mapping to suitable output quantities, such as “0.1” and “0.2”, etc.

Below is an example of a finalized LinearRegression model. Again, the functions demonstrated for making regression predictions apply to all of the regression models available in scikit-learn.

# example of training a final regression model
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# generate regression dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=1)
# fit final model
model = LinearRegression()
model.fit(X, y)

We can predict quantities with the finalized regression model by calling the predict() function on the finalized model.

As with classification, the predict() function takes a list or array of one or more data instances.
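
In practice this means the input must be two-dimensional, with one row per instance. The sketch below shows a flat, one-dimensional row being rejected and then reshaped into a single-row array; the specific row values are just an example.

```python
# sketch: predict() expects 2D input with shape (n_instances, n_features)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

# fit the final model
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=1)
model = LinearRegression()
model.fit(X, y)

# a single instance stored as a flat array with shape (2,)
row = np.array([-1.07296862, -0.52817175])

# passing the flat array directly raises an error
rejected = False
try:
    model.predict(row)
except ValueError:
    rejected = True
print("1D input rejected:", rejected)

# reshape to one row and two columns, shape (1, 2), before predicting
ynew = model.predict(row.reshape(1, -1))
print("Prediction shape:", ynew.shape)
```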

Multiple Regression Predictions

The example below demonstrates how to make regression predictions on multiple data instances with an unknown expected outcome.

# example of making multiple regression predictions
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# generate regression dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# fit final model
model = LinearRegression()
model.fit(X, y)
# new instances where we do not know the answer
Xnew, _ = make_regression(n_samples=3, n_features=2, noise=0.1, random_state=1)
# make a prediction
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
for i in range(len(Xnew)):
	print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))

Running the example makes multiple predictions, then prints the inputs and predictions side-by-side for review.

X=[-1.07296862 -0.52817175], Predicted=-61.32459258381131
X=[-0.61175641 1.62434536], Predicted=-30.922508147981667
X=[-2.3015387 0.86540763], Predicted=-127.34448527071137

Single Regression Prediction

The same function can be used to make a prediction for a single data instance as long as it is suitably wrapped in a surrounding list or array.

For example:

# example of making a single regression prediction
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# generate regression dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# fit final model
model = LinearRegression()
model.fit(X, y)
# define one new data instance
Xnew = [[-1.07296862, -0.52817175]]
# make a prediction
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))

Running the example makes a single prediction and prints the data instance and prediction for review.

X=[-1.07296862, -0.52817175], Predicted=-77.17947088762787
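
Note that the prediction examples above generate the regression training data without a fixed random_state, which is likely why the same input, [-1.07296862, -0.52817175], receives different predictions in the two listings and why your own results may differ from those shown. The sketch below assumes a seed of 1 (an arbitrary choice) and shows that fixing it makes the prediction repeatable.

```python
# sketch: fixing random_state makes the training data, and so the prediction, repeatable
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

def fit_and_predict(seed):
    # generate the training data with a fixed seed and fit the final model
    X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=seed)
    model = LinearRegression()
    model.fit(X, y)
    # predict for the same single instance each time
    return model.predict([[-1.07296862, -0.52817175]])[0]

first = fit_and_predict(1)
second = fit_and_predict(1)
print("Same prediction on both runs:", np.isclose(first, second))
```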

Mitch Sanders April 6, 2018 at 5:42 am #

Once again, Jason… you’re answering all the questions that need answering.

I was just working through yesterday how to actually use these highly developed models (which I’ve learned to do expediently from your book by the way) to predict my new input variables. And here in my inbox, you’ve delivered this great article on it!

Thank you for making us all better at Machine Learning. Your work here is stupendous and appreciated!

- Jason Brownlee April 6, 2018 at 6:37 am #
  
  Thanks Mitch, I’m glad it helps!
  
  Shoot/post me questions any time.
  
  - Suresh Kumar April 25, 2018 at 12:10 am #
    
    sureshkumar0707@gmail.com
    
    Can segmentation be performed using Python? Can machine learning be applied to segmentation?
    
    - Jason Brownlee April 25, 2018 at 6:36 am #
      
      Yes, I recommend looking into OpenCV.
      
  - Ashish Pandey July 15, 2019 at 10:21 pm #
    
    after getting my trained model i am not sure what exactly we are doing when making
    prediction on test like y_pred_m4 = lr_4.predict(X_test_m4)
    
    - Jason Brownlee July 16, 2019 at 8:18 am #
      
      Yes, that is correct. Call the predict function to make a prediction.
      
      - Gowtham May 8, 2020 at 6:45 am #
        
        While lr_4.predict(X_test_m4) would give me the class with the maximum probability, I would like to see the second and third most probable classes.
      - Jason Brownlee May 8, 2020 at 7:59 am #
        
        You can use predict_proba() to get class membership probabilities.
        
        This is described in the above tutorial.
  - Rohan December 25, 2019 at 10:43 pm #
    
    If I want to predict the time in which a certain action will be completed.. What would you suggest which algorithm/module should I use?
    
    - Jason Brownlee December 26, 2019 at 7:39 am #
      
      Probably model it as a time series classification task. E.g. “is the event expected to occur in the next interval or not?”
      
    - Asif Rehman April 21, 2020 at 6:17 am #
      
      Dear Rohan/Jason…
      
      Can you give me working example of it …
      
      See my question of the same nature in the comments… I have created a car puncture dataset in which a car got a puncture on odd days of the month, and I want to predict whether the car would get a puncture on a given day. I am having problems and would be glad if you can guide me.
      
      - Jason Brownlee April 21, 2020 at 7:42 am #
        
        This process will help you work through your project:
        https://machinelearningmastery.com/start-here/#process
  - babar November 1, 2020 at 1:54 am #
    
    I have dataset
    
    I am calling the Fit function to learn the model,
    
    I am predicting using the Predict function of sklearn,
    
    But now i do not know how to predict the particular column.
    
    Like how to predict next ten days or next few years, what parameters should I pass to predict function to achieve that..
    
    - Jason Brownlee November 1, 2020 at 7:34 am #
      
      Sounds like a time series problem, this will help:
      https://machinelearningmastery.com/multi-step-time-series-forecasting/
      
      And perhaps this:
      https://machinelearningmastery.com/multi-output-regression-models-with-python/
      
  - Matthew February 11, 2023 at 8:59 pm #
    
    Hi Dr J, Brownlee,
    
    I am Matthew, a research student and I have been following your work for some time here.
    
    I have a question please. My research efforts to understand my challenge has been futile and I wish to grasp the understanding but I required further clarity into the issue to understand what is really happening.
    
    **The Issue That I am Trying To Understand**
    1. I have my video classification model of two classes 80 videos split Train80%: Val:10% Test:10% and it works ok
    
    1.a But, the challenge is that I am trying to see the predictions for every piece of test data with its video label being used in the operations. The amount of files for testing is 16 according to the split() function.
    
    2. I am not sure what this operations is called to achieve my objective and that increases the challenge
    2a. I would like to assume that it is around the train_test_split(), model.predict() or model.evaluate() function.
    
    3. To start, I did some digging into them (the functions) and thus far my efforts to understand has been futile, so I decided to follow along with your work again here, as I have been doing for the past yr. I must admit your explanations has been clearer than most communities.
    
    4. When I apply your methods using two classes, run and walk via this code:
    
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.8, test_size=0.2, random_state=1, stratify=Y, shuffle=True)
    history = model.fit(X_train, Y_train, validation_split=0.20, batch_size=args.batch, epochs=args.epoch, verbose=1, shuffle=True)
    predict_labels = model.predict(X_test, batch_size=args.batch, verbose=1)
    print('This is prediction labels', predict_labels) #THIS IS THE INDIVIDUAL LABELS HERE
    
    I get 16 outputs As per the amount of files in the split()==16:
    
    X=[0.6466933 0.35330668], Predicted=[0.6466933 0.35330668]
    X=[0.6468445 0.35315546], Predicted=[0.6468445 0.35315546]
    X=[0.6429217 0.35707828], Predicted=[0.6429217 0.35707828]
    X=[0.6474083 0.3525917], Predicted=[0.6474083 0.3525917]
    X=[0.6620246 0.3379754], Predicted=[0.6620246 0.3379754]
    X=[0.6466003 0.35339966], Predicted=[0.6466003 0.35339966]
    X=[0.6505527 0.34944734], Predicted=[0.6505527 0.34944734]
    X=[0.6593476 0.34065235], Predicted=[0.6593476 0.34065235]
    X=[0.6477448 0.3522552], Predicted=[0.6477448 0.3522552]
    X=[0.6466933 0.35330668], Predicted=[0.6466933 0.35330668]
    X=[0.6430916 0.35690835], Predicted=[0.6430916 0.35690835]
    X=[0.6512237 0.34877628], Predicted=[0.6512237 0.34877628]
    X=[0.6471297 0.3528703], Predicted=[0.6471297 0.3528703]
    X=[0.64262843 0.35737163], Predicted=[0.64262843 0.35737163]
    X=[0.6477448 0.3522552], Predicted=[0.6477448 0.3522552]
    X=[0.6566174 0.34338257], Predicted=[0.6566174 0.34338257]
    
    **What I Really Really Wish For is, Can you help me understand:=>:)**
    
    1. Why am I not seeing the labels of the video in the test data?
    I am not sure how to pass this in into the operations as I was somehow hoping that it was therein but it is not.
    
    2. Of the two classes, why is Walking scores higher or closer to 100% and Running is closer to zero%?
    3. Shouldn’t they all be in the ascending range of the prediction?
    for example walking reading 81% and running 90%
    
    3a. Can I make the output reflect this format specified in #3
    I know the split() shuffles the data but I cannot tell what data is being used for the test and it is this that I am really trying to find out.
    
    3b. What is the data being used? so that I can apply further analysis to fortify the model by removing certain videos due to its processing complexity.
    
    I tried the model.evaluate() function but that gave me only one overall score of the operation
    for example: acc = 56% loss 1.220233. (this is expected as I am testing the model on this procedure for now)
    
    4. My final output that I am trying to achieve is:
    This would allow me to see the data that is problematic in the test/ evaluate/ inference stages and I can re-process those for better performance.
    
    X=[0.6466933 0.35330668], Predicted=[0.6466933 0.35330668] vid46.mp4
    X=[0.6468445 0.35315546], Predicted=[0.6468445 0.35315546] vid27.mp4
    X=[0.6429217 0.35707828], Predicted=[0.6429217 0.35707828] vid09.mp4
    X=[0.6474083 0.3525917], Predicted=[0.6474083 0.3525917] vid41.mp4
    X=[0.6620246 0.3379754], Predicted=[0.6620246 0.3379754] vid21.mp4
    X=[0.6466003 0.35339966], Predicted=[0.6466003 0.35339966] vid52.mp4
    X=[0.6505527 0.34944734], Predicted=[0.6505527 0.34944734] vid45.mp4
    X=[0.6593476 0.34065235], Predicted=[0.6593476 0.34065235] vid18.mp4
    X=[0.6477448 0.3522552], Predicted=[0.6477448 0.3522552] vid26.mp4
    X=[0.6466933 0.35330668], Predicted=[0.6466933 0.35330668] vid36.mp4
    X=[0.6430916 0.35690835], Predicted=[0.6430916 0.35690835] vid22.mp4
    X=[0.6512237 0.34877628], Predicted=[0.6512237 0.34877628] vid35.mp4
    X=[0.6471297 0.3528703], Predicted=[0.6471297 0.3528703] vid10.mp4
    X=[0.64262843 0.35737163], Predicted=[0.64262843 0.35737163] vid12.mp4
    X=[0.6477448 0.3522552], Predicted=[0.6477448 0.3522552] vid43.mp4
    X=[0.6566174 0.34338257], Predicted=[0.6566174 0.34338257] vid23.mp4
    
    Please assist me to understand these bits, this is really a daunting task trying to understand this!
    I have been searching all over the place and then I remember that I can ask you a question here..
    really hoping for a response to my challenge.
    
    and thanks in advance for acknowledging my digital presence..
    
    have a great day..
    
    cheers Matt.
    
    - James Carmichael February 12, 2023 at 9:24 am #
      
      Hi Matt…Please narrow your query to a single question related to our content so that we may be able to assist you.
      
charles milam April 6, 2018 at 7:36 am #

How does one turn predictions into actions? Say I am predicting user fraud, how would you go about taking any given prediction point and determine the customer for that particular prediction.

- Jason Brownlee April 6, 2018 at 3:46 pm #
  
  Great question.
  
  The use of the predictive model would be embedded within an application that is aware of the current customer for which a prediction is being made.
  
Jon April 6, 2018 at 11:06 am #

Great post. Love to see an example of the same in R.

- Jason Brownlee April 6, 2018 at 3:49 pm #
  
  Thanks for the suggestion.
  
Jurek April 6, 2018 at 5:26 pm #

I like your explanation but I am missing one thing.
How do you encode and scale features for Xnew so they match trained data?

- Jason Brownlee April 7, 2018 at 6:10 am #
  
  You must scale new data using the same procedure you used to scale the training data.
  
  This might mean keeping objects or coefficients used to prepare training data to then apply on new data in the future, such as min/max, a vocab, etc. depending on the problem type.
  
- Udayan October 7, 2019 at 3:59 pm #
  
  Hello,
  
  Could you explain how “predict_proba” calculate the probabilities for each class..assuming my model has 3 categorical variable as dependent variable. I tried with coefficient and exponential form plus adding with intercept, however it gives me a total of more than 1 and not matched with the output produce by predict_proba function as well.
  
  - Jason Brownlee October 8, 2019 at 7:51 am #
    
    Each algorithm type will implement probability predictions differently.
    
    Sorry, I don’t understand your fault?
    
    - Alberto October 30, 2019 at 5:45 am #
      
      I have the same problem.
      
      I train my logistic regressions with three classes, and everything works perfect. After i do log_proba with a train data and i get the probabilities of each class, perfect.
      
      But i need to implement this out of Python and Sklearn so what i do is:
      
      1- Take the coefficients of the logisticregression for each class
      2- Take one data from train
      3- multiply each feature of this data by his coefficients (Beta) and sum all for each class.
      4- After i plus the intercept for each class
      5- After i apply the sigmoid function to each class to obtain the probability of being class 0, 1 or 2. But i don’t obtain the right probabilities and the right class…
      
      Can you help me please Jason or Udayan? I don’t know what is bad…
      
      Thanks in advance!
      
      - Jason Brownlee October 30, 2019 at 6:10 am #
        
        Good question.
        
        How different are the results? If the difference is small, it could just be rounding error.
        
        If it was me, I would trace the open source code and ensure my custom code was doing all the same operations. Look in the code – perhaps sklearn is doing other things?
Hazem April 19, 2018 at 8:38 am #

Thank you very much for the explanation
But my question is how to use the Source Code as an .exe Application to use it later without a script engine

- Jason Brownlee April 19, 2018 at 2:47 pm #
  
  You can use code in your application as you would any other software engineering project.
  
  I’m sorry, I am not an expert at creating executable files on Windows. I have not used the platform in nearly 2 decades.
  
Nimish Bhandare May 3, 2018 at 1:39 am #

how to save a label encoder and reuse it again across different python files.

i have encoded my data in training phase and while i am trying to predict labels in testing phase i am not able to get same label encoders so i am getting wrong prediction
please help..

Nimish Bhandare May 3, 2018 at 2:21 am #

How to save (pickle) the LabelEncoder used to encode your y values when fitting your final model.

- Jason Brownlee May 3, 2018 at 6:36 am #
  
  You can use the pickle dump/load functions directly.
  
Kevin Burke May 5, 2018 at 4:23 am #

Hi Jason, (relatively new to ML)

I have a data frame with,
1 ID column
6 feature columns
1 target column

when I train/test split the feature and target columns and do predictions etc, that is where I need to map back to the ID.

I want to be able to view something like this after my predictions:

A data frame with,
1 ID column
6 feature columns
1 target column
1 predicted column

Would you be able to help me with this? Really appreciate it,

kevin

- Jason Brownlee May 5, 2018 at 6:26 am #
  
  The predictions are made in the order of the inputs.
  
  You can take the array of predictions and align them with your inputs directly and start using them.
  
  Does that help? If not, what is the specific problem you are having?
  
  - Kevin Burke May 5, 2018 at 5:45 pm #
    
    Thanks Jason, I suppose I have reached a point where I can get my final model and I cannot seem to find any information as to what next, i.e. real specifics regarding making predictions with new datasets.
    There are a trillion examples of how to work with train/test split and refining models, but my end goal is taking a ‘complete’ dataset, plugging it into my model prediction and producing back my initial ‘complete’ dataset PLUS my predicted column(s).
    I work for a credit union in DC and I have a list of member data, such things like member number, name, phone number, address, account balances and various other features I would use for prediction. I would like to feed this ‘complete’ dataset into my prediction model and have it spit out my initial ‘complete’ dataset PLUS my predicted column(s) that someone could then use to reach out marketing related messages to the member, depending on the prediction off course.
    
    Hope that makes sense..
    
    thanks again Jason, appreciate your time (how do you find the time?!!)
    
    - Jason Brownlee May 6, 2018 at 6:25 am #
      
      Yes, that makes sense.
      
      You will need to write code to take the input to the model (X) and pass it to the model to make predictions model.predict(X) to get the prediction column (yhat).
      
      You then have the dataset X and the predictions yhat and the rows in one correspond to rows in the other. You can hstack() the arrays to have one large matrix of inputs and predictions.
      
      What problem specifically are you having in achieving this?
      
      - Kevin Burke May 8, 2018 at 3:03 am #
        
        Thank you Jason, not so much a problem as a lack of experience trying to tie it all together, but with your help we’ll get there!!
        
        thanks again.
      - Jason Brownlee May 8, 2018 at 6:16 am #
        
        Hang in there Kevin!
- Laks April 3, 2019 at 6:23 pm #
  
  Exactly stuck with the same issue.
  
  If I dont supply ID column initially, since i have to make the machine learn ONLY the “Features”, there is no way i’m able to map the IDs back once i get the predicted result set, say y_pred from the test dataset.
  
  - Jason Brownlee April 4, 2019 at 7:42 am #
    
    The input rows will be ordered the same as the output predictions. The mapping can be based on numpy array index.
    
    - Laks April 5, 2019 at 4:59 pm #
      
      Thank you Jason. Will try.
      
      Need to mention, have learnt a lot from your articles. Thank you so much !
      
      - Jason Brownlee April 6, 2019 at 6:42 am #
        
        Thanks.
    - Laks April 5, 2019 at 9:40 pm #
      
      Wanted to update that i was able to crack this at last. During the test / train split, on the left hand side, i had to include x_index, y_index and inside the train_test_split, i had to add dataset.index. Thats it. It works
      
      Reply
      - Jason Brownlee April 6, 2019 at 6:47 am #
        
        Glad to hear it.
      - crw July 18, 2019 at 5:06 am #
        
        Thanks Laks. I didn’t know that this would work.
Jorge June 4, 2018 at 7:05 am #

Hello Jason, I’ve got started working with scikit-learn models to predict further values but there is something I don’t clearly understand: Let’s suppose I do have a Stock Exchange price datasets with Date, Open Price, Close Price, and the variation rate from the previous date, for a single asset or position. In order to ‘fit’ a good prediction, I decided to use a Multiple Linear Regression and a Polynomial Feature also: I can obtain a formula even used a support vector machine (SVR) but I don’t know how to predict a NEW dataset, since the previous one has more than one variable (Open Price, Variation Rate, Date). How can I simulate further values?
Thanks for your response.

Reply
- Jason Brownlee June 4, 2018 at 2:36 pm #
  
  The tutorial above shows how to make a prediction with new data.
  
  What problem are you having exactly?
  
  Reply
Krushna Borkar June 5, 2018 at 11:08 pm #

Thank you so much, Jason, for this Great post! can you please tell me, I have used LabelEncoder for three Cities. So now I have to take input from a user as a string and convert them into int using LabelEncoder and provide it to trained model for prediction. Is it correct?

Reply
- Jason Brownlee June 6, 2018 at 6:41 am #
  
  I believe so.
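A small sketch of that workflow, assuming three hypothetical city names. The important detail is that the same fitted encoder used on the training data is reused on the user input:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training values for the city column.
cities = ["London", "Paris", "Tokyo", "Paris", "London"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(cities)  # fit once, on the training data

# Later, encode a string typed by the user with the SAME fitted encoder.
user_input = "Paris"
user_code = encoder.transform([user_input])[0]
print(user_code)
```

The encoder raises an error for a city it did not see during fitting, so unseen input needs to be handled separately.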
  
  Reply
Nanna June 28, 2018 at 10:45 pm #

Hi Jason, always a pleasure seeing your blogs.

I’m thinking of a few things in regard to measuring the “accuracy” of a regression model and making use of such a model, would love to hear your thoughts.

I have problem that can either be framed as a classification problem (discrete labels) but also as a regression problem (a similar example could be price range or exact price ). After trying out a few models, I liked the use of a (random forest) regression model.

Besides evaluating the model on things like R^2 and RMSE I’m doing a sort of pseudo accuracy evaluation.

Say I have a prediction and a true value of

[19.8, 20]

So by true accuracy as in a classification problem the above is wrong, but if I define a new measure that tolerates answers within something fitting to the problem like +/- 2 or something like +/- 10% of the predicted value then the prediction is correct and the model will have greater accuracy. And then the prediction of a given sample would read something like x +/- y .

Or how would you display/interpret the predictions made by a regression model? Is it “correct” to measure the success as a pseudo accuracy as above? Or is it more correct and robust to express a prediction using e.g. RMSE as pred = x +/- RMSE ? Should I avoid this line of thinking when it comes to regression problems completely? And if such, how would I display my prediction of a given sample with a fitting confidence since the regression model typically is close but not always spot on the true value?

Reply
- Jason Brownlee June 29, 2018 at 6:11 am #
  
  It sounds like evaluating the model as a regression model would be better.
  
  You would use MAE or RMSE to describe the expected error on average in the same units as the output variable.
  
  E.g. The RMSE is xx.x on average +/- x.
  
  You can use a confidence interval, I explain more here:
  https://machinelearningmastery.com/confidence-intervals-for-machine-learning/
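As a rough illustration of the error measures mentioned here, the sketch below computes MAE and RMSE alongside the "within a tolerance" score Nanna describes. The values and the tolerance of 2.0 are made up for the example:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true values and predictions, for illustration only.
y_true = np.array([20.0, 15.0, 30.0, 25.0])
y_pred = np.array([19.8, 18.0, 29.0, 25.5])

# Both errors are in the same units as the output variable.
mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))

# Fraction of predictions within +/- 2.0 of the true value (the "pseudo accuracy").
within_tol = np.mean(np.abs(y_true - y_pred) <= 2.0)

print(mae, rmse, within_tol)
```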
  
  Reply
Harsha July 11, 2018 at 9:23 pm #

Hi Jason,

when I am assigning the X_Test to y_pred, it is returning the below shown error, can you please explain why?

y_pred = classifier.predict(X_Test)

Error:

NotFittedError Traceback (most recent call last)
in ()
----> 1 y_pred = classifier.predict(X_Test)

C:\Anaconda\lib\site-packages\sklearn\neighbors\classification.py in predict(self, X)
143 X = check_array(X, accept_sparse='csr')
144
--> 145 neigh_dist, neigh_ind = self.kneighbors(X)
146
147 classes_ = self.classes_

C:\Anaconda\lib\site-packages\sklearn\neighbors\base.py in kneighbors(self, X, n_neighbors, return_distance)
325 """
326 if self._fit_method is None:
--> 327 raise NotFittedError("Must fit neighbors before querying.")
328
329 if n_neighbors is None:

NotFittedError: Must fit neighbors before querying.

Reply
- Jason Brownlee July 12, 2018 at 6:24 am #
  
  It suggests that perhaps your model has not been fit on the data.
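A short sketch that reproduces the situation, assuming a KNeighborsClassifier and synthetic data. Calling predict() before fit() raises NotFittedError, and the same call works after fitting:

```python
from sklearn.datasets import make_blobs
from sklearn.exceptions import NotFittedError
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for X_Train / X_Test (assumption).
X, y = make_blobs(n_samples=30, centers=2, random_state=1)

classifier = KNeighborsClassifier()

# Predicting with a model that has not been fit raises NotFittedError.
try:
    classifier.predict(X)
    raised = False
except NotFittedError:
    raised = True

# After fit(), the same call succeeds.
classifier.fit(X, y)
y_pred = classifier.predict(X)
print(raised, y_pred.shape)
```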
  
  Reply
Black Manga August 2, 2018 at 8:49 pm #

Any suggestions on how to reject an input when it is far from the data the model was trained on? For example, I’m using an SVM with label 1: 4,4,3,4,4,3 and label 2: 5,6,7,5,6,5. If I predict for the value 20, I want the result to be “not valid” rather than label 1 or 2.

Reply
- Jason Brownlee August 3, 2018 at 6:01 am #
  
  Sorry, I don’t follow. Are you able to give more context?
  
  Reply
  - Black Manga August 9, 2018 at 9:59 pm #
    
    Sorry if it wasn’t clear. Let’s say I want to predict whether something is an apple or an orange. If I pass grape data to the model I created (apple or orange), the prediction will still be apple or orange (I’m using SVM). So how can I tell that my input data (grape) is very different from the training data (apple and orange)? Thanks
    
    Reply
    - Jason Brownlee August 10, 2018 at 6:13 am #
      
      You might want to predict probabilities instead of classes and choose to not use a prediction if the predicted probabilities are below a threshold.
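One possible way to sketch this idea, assuming a LogisticRegression model on synthetic data. The helper name `predict_or_reject` and the threshold of 0.8 are hypothetical choices:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (assumption for illustration).
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
model = LogisticRegression().fit(X, y)

def predict_or_reject(model, X_new, threshold=0.8):
    """Return the predicted label, or None when the top probability is below threshold."""
    proba = model.predict_proba(X_new)
    best = np.argmax(proba, axis=1)          # column of the most likely class
    labels = model.classes_[best]            # map columns back to class labels
    confident = proba[np.arange(len(best)), best] >= threshold
    return [label if ok else None for label, ok in zip(labels, confident)]

print(predict_or_reject(model, X[:5]))
```

Note that a point far from all training data can still receive a confident probability from some models, so a probability threshold is only a partial answer to the apple, orange, and grape case.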
      
      Reply
      - Black Manga August 10, 2018 at 1:14 pm #
        
        i will try, thank you very much
Kuler Can August 2, 2018 at 11:35 pm #

Hello, I used scikit learn to predict google stock prices with MLPRegressor. How can I predict new values beyond dataset specially test data?

Reply
- Jason Brownlee August 3, 2018 at 6:03 am #
  
  The above post will help. What problem are you having exactly?
  
  Reply
Manas August 12, 2018 at 3:31 pm #

Hi Jason,

Can I fit a model by multiple K-Fold iteration for very unbalance class as shown below??

Could You kindly help on this!

for val in range(0, 1000):  # total sample is 20k majority class and 20 minority class
    balanced_copy_idx = balanced_subsample(labels, 40)  # creating each time randomly 20 minority class and 20 majority class
    X1 = X[balanced_copy_idx]
    y1 = y[balanced_copy_idx]

    kf = KFold(y1.shape[0], n_folds=10, shuffle=True, random_state=3)
    for train_index, test_index in kf:

        X_train, y_train = X1[train_index], y1[train_index]
        X_test, y_test = X1[test_index], y1[test_index]

        vectorizer = TfidfVectorizer(max_features=15000, lowercase=True, min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True, stop_words='english')

        train_corpus_tf_idf = vectorizer.fit_transform(X_train)
        test_corpus_tf_idf = vectorizer.transform(X_test)

        model1 = LogisticRegression()
        model1.fit(train_corpus_tf_idf, y_train)

Reply
- Jason Brownlee August 13, 2018 at 6:15 am #
  
  Yes you can.
  
  Sorry, I cannot review and debug your code, perhaps post on stackoverflow?
  
  Reply
- Andrew January 5, 2022 at 4:06 pm #
  
  Hi Jason, I have a database of music that I need to shuffle by genre, I also have a trained machine learning model, how can I bring the machine’s predictions into a genre if it only gives me arrays?
  
  Reply
Gabriel Joshua Migue September 6, 2018 at 1:46 am #

What is the purpose of random state? when i try to run my prediction the accuracy is not stable but when i input the random state = 0 it gives stable prediction but low accuracy when i change the random state to 100 it give me higher accuracy

Reply
- Jason Brownlee September 6, 2018 at 5:40 am #
  
  To fix the pseudo random number generator.
  
  You can learn more about randomness in machine learning here:
  https://machinelearningmastery.com/faq/single-faq/what-value-should-i-set-for-the-random-number-seed
  
  Reply
  - Gabriel Joshua Migue September 6, 2018 at 7:05 pm #
    
    Thanks for fast reply more power god bless 🙂
    
    Reply
Ana September 6, 2018 at 1:51 pm #

Hi Jason, thank you for your always useful and insightful posts!
One question that seems to be a recurrent issue regarding predict_proba(). For which sklearn models can it be used? Eg. can it be used for logistic regression, SVM, naive Bayes and random forest? I was playing with it recently for both binary and multiclass classification and it seemed to be producing the following paradox: probability vectors for each sample, in which the smallest probability was assigned to the class that was actually being predicted. Is it possible that predict_proba() generates (1-prob) results?

Reply
- Jason Brownlee September 6, 2018 at 2:15 pm #
  
  Not all. Some models don’t support predicting probabilities natively, and some of those provide a decision_function() instead.
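A brief sketch contrasting the two cases, assuming LogisticRegression (which provides predict_proba) and LinearSVC (which provides only decision_function) on synthetic data. It also shows that the probability columns follow the order of `classes_`, which is worth checking when the probabilities appear to be assigned to the wrong class:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic three-class data (assumption for illustration).
X, y = make_blobs(n_samples=60, centers=3, n_features=2, random_state=1)

# LogisticRegression supports predict_proba; columns are ordered as in classes_.
logreg = LogisticRegression().fit(X, y)
proba = logreg.predict_proba(X[:3])
pred_from_proba = logreg.classes_[np.argmax(proba, axis=1)]

# LinearSVC has no predict_proba, but exposes decision_function scores instead.
svc = LinearSVC().fit(X, y)
has_proba = hasattr(svc, "predict_proba")
scores = svc.decision_function(X[:3])

print(proba.shape, has_proba, scores.shape)
```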
  
  Reply
Scriptkidd September 11, 2018 at 12:19 am #

How can I make the prediction more detailed? Like say I input a hibiscus flower into this model, instead of probabilities, I want to get something like “input not a Iris, it was off by blahblahblah”, and probably take a decision. I think that’s what Black Manga meant in his comments above

Reply
- Jason Brownlee September 11, 2018 at 6:30 am #
  
  You would have to write this “interpretation” yourself.
  
  Reply
Sintyadi Thong September 17, 2018 at 2:18 pm #

Hi, Jason… It is a great article indeed.
I have a question,
I have trained my models and have saved the model.

The next part that I would like to put it into production to build a .py file which function is only to predict the given sets of parameters.

How should I code my predict.py file so using command line, I can just input some variables and get the output.

Should I also add the import functions inside my predict.py?

Thanks in advance!

Reply
- Jason Brownlee September 18, 2018 at 6:09 am #
  
  This sounds like a software engineering question, rather than an ML question.
  
  You can use Python functions to read input and write output from your script. Perhaps check the Python standard API or a good reference text?
  
  Reply
MadTech October 10, 2018 at 5:13 pm #

Thanks for the attempt, but unfortunately, I did not find this post very helpful because it failed to present *simple* examples (IMO). E.g. if the distinction between Classification Models and Regression Models is paramount, please include a link that sheds light on it. Missing here are useful demonstrations on how to perform the *simplest* possible prediction. E.g. a data set as straightforward as world population/year or home price/bathrooms: show how to load the data, then “ask” the algorithm for a prediction for a specific value, e.g. what will the world population be in 2020? What is the predicted home value of a house with 3 bathrooms? Something simpler than unexplained complex variable of Xnew = [[-1.07296862, -0.52817175]] — sorry, I don’t have any idea what that is.

I know I’m new to ML, but I feel this post could be far more useful if it tackled the simplest possible example and then transitioned up to what is here. Examples examples examples: those are the only things that really matter.

Reply
- Jason Brownlee October 11, 2018 at 7:48 am #
  
  Great feedback, thanks.
  
  I guess this post is not for absolute beginners, but rather those that are using the sklearn API and looking to better understand the predict() functions.
  
  This post explains the difference between classification and regression clearly:
  https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/
  
  This post provides a gentle introduction to working through a project end to end:
  https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
  
  Does that help?
  
  Reply
Abigail November 20, 2018 at 12:08 am #

Dear Jason, Thank you very much for the great posts. I finalize my model and now I want to train the model with the X_validation data. I want to export my results in a csv file. How can I change my code to have a csv file with 3 columns. The ID of the value(1st column), the real label of the value (2nd column), and the prediction in the last column?

predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
prediction_df = pd.DataFrame(predictions)

prediction_df.to_csv('result.csv')

Reply
- Jason Brownlee November 20, 2018 at 6:36 am #
  
  You can create the datastructure you require in memory as numpy arrays or dataframes, then save it to CSV.
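A possible sketch of building that three-column structure with pandas, assuming the DataFrame index serves as the ID and that synthetic data stands in for the real dataset. The column names are hypothetical:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data; the DataFrame index plays the role of the ID (assumption).
X, y = make_classification(n_samples=40, n_features=4, random_state=1)
data = pd.DataFrame(X)

# Splitting a DataFrame keeps the original index on each row.
X_train, X_validation, Y_train, Y_validation = train_test_split(
    data, y, test_size=0.25, random_state=1
)
model = LogisticRegression().fit(X_train, Y_train)
predictions = model.predict(X_validation)

# ID, real label, and prediction side by side.
result = pd.DataFrame({
    "id": X_validation.index,
    "actual": Y_validation,
    "prediction": predictions,
})

# Pass a path such as "result.csv" to to_csv() to write a file instead of a string.
csv_text = result.to_csv(index=False)
print(csv_text.splitlines()[0])
```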
  
  Reply
Sujan January 17, 2019 at 10:55 pm #

Hello Jason,

Can we predict answers ( not classes), i see using scikit-learn we are classifying the data into new classes and we are predicting the class like SPAM, HAM etc ..

How we goona approach when we have to predict or fetch an answer ( from existing repository) , do i need to train the model with answers ( long answers not only just classification )

Thanks
Sujan

Reply
- Jason Brownlee January 18, 2019 at 5:37 am #
  
  If you need to output text, you may need a text generation algorithm. You can get started with this here:
  https://machinelearningmastery.com/start-here/#nlp
  
  Reply
manu February 8, 2019 at 2:53 pm #

Sir,

In my dataset, i have 25000 records which contains one input value and three output values.
How can I make prediction of 3 output variables using this only one input variable?

Reply
- Jason Brownlee February 9, 2019 at 5:53 am #
  
  Yes, perhaps try a neural network given that a vector prediction is required.
  
  Reply
  - manu February 11, 2019 at 2:32 pm #
    
    Sir,
    
    Is there any sample code or web links for that concept?
    
    Reply
    - Jason Brownlee February 12, 2019 at 7:50 am #
      
      Yes, I’m sure I have a few examples on the blog.
      
      You can achieve the result by setting the number of nodes in the output layer to the size of the vector required.
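Within scikit-learn itself, one way to sketch a vector output is MLPRegressor, which sizes its output layer from the number of columns in y. This is an assumption about what fits the question, using synthetic data with one input and three outputs:

```python
from sklearn.datasets import make_regression
from sklearn.neural_network import MLPRegressor

# Synthetic data: one input variable, three output variables (assumption).
X, y = make_regression(n_samples=200, n_features=1, n_targets=3,
                       noise=0.1, random_state=1)

# y has shape (200, 3), so the network gets three output nodes automatically.
model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=2000, random_state=1)
model.fit(X, y)

# Each prediction is a vector of three values.
yhat = model.predict(X[:2])
print(yhat.shape)
```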
      
      Reply
      - manu February 12, 2019 at 7:35 pm #
        
        Sir,
        I have searched in your blog. But I couldnt found the examples. Sir, can you please send me the link?
      - Jason Brownlee February 13, 2019 at 7:56 am #
        
        Sure, see the “vector output” example in this post:
        https://machinelearningmastery.com/how-to-develop-multilayer-perceptron-models-for-time-series-forecasting/
      - manu February 13, 2019 at 6:44 pm #
        
        Sir,
        Thank you for your reply.
        
        Sir, my data is not a time series data.
        The data set contains three input variables and a corresponding output variable(output value is determined by a software by changing the input variables).
        
        So, my problem is: if a user gives a target output value, the model should predict the corresponding input variables. It is not a time series data,Sir,
      - Jason Brownlee February 14, 2019 at 8:40 am #
        
        Understood, but you can use the vector output example to get started.
      - manu February 13, 2019 at 7:00 pm #
        
        Dataset looks like this Sir,
        
        input 1 | input 2 | input 3| output
        
        150 | 2.356 | 10000 | 4.56
        
        (output is predicted using a software which takes input 1, input 2 and input 3 as inputs)
        
        The problem is: If a user gives a target output, my model should predict the corresponding input1, input2 and input3 values.
Mathan February 11, 2019 at 8:02 pm #

Hi Jason, Do different algorithms take different time while doing batch prediction (let’s assume for 1000 rows)? If yes, I would like to know your view on why do they take different time.Could you please help me to clear this doubt? Thanks in advance.

Reply
- Jason Brownlee February 12, 2019 at 7:56 am #
  
  Yes, different algorithms may have different computational complexity when making a prediction.
  
  Reply
raghuram February 20, 2019 at 6:36 pm #

Hi Jason

I have scaled X_Train and fit model with X_Train, Y_train. When i predict a single value then Output (Predicted value ) is according to the Scaled X_Train Input. I am not able to map the scaled Input and Original Output and not able to map the predicted output to which original(before scale) it belongs to.

How can i refer back to the Original value and its target value and confirm they in sync with Scaled Input and its output ?

Reply
- Jason Brownlee February 21, 2019 at 7:53 am #
  
  You can save the instance of the object used to perform the transform, then call the inverse_transform function.
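A minimal sketch of that idea, assuming a StandardScaler and a LinearRegression model on a tiny made-up dataset. The fitted scaler is kept and reused for new inputs, and inverse_transform maps scaled inputs back to their original values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Tiny made-up dataset where y = 10 * x.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0])

# Keep the fitted scaler object; it is needed again at prediction time.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
model = LinearRegression().fit(X_scaled, y_train)

# Scale new input with the SAME scaler, then predict.
X_new = np.array([[2.5]])
X_new_scaled = scaler.transform(X_new)
yhat = model.predict(X_new_scaled)

# Map the scaled input back to its original units.
X_recovered = scaler.inverse_transform(X_new_scaled)
print(X_recovered, yhat)
```

Only X was scaled here, so the prediction is already in the original units of y.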
  
  Reply
rags March 5, 2019 at 5:48 pm #

Hi all, I have a survey data and using it for the purpose of historic data , I want to implement ML for predicting the answer to next question by the respondent.

what approach should i Follow?

Reply
- Jason Brownlee March 6, 2019 at 7:44 am #
  
  Good question, perhaps this will help:
  https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use
  
  Reply
Astha March 10, 2019 at 6:26 pm #

Thank you so much.This post was of great help!

Reply
- Jason Brownlee March 11, 2019 at 6:49 am #
  
  I’m happy it helped.
  
  Reply
esraa March 21, 2019 at 3:27 am #

thank you, how I can count accuracy for prediction

Reply
- Jason Brownlee March 21, 2019 at 8:21 am #
  
  You don’t, instead you calculate error.
  
  See this post:
  https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/
  
  Reply
Priyanka Dave March 28, 2019 at 10:04 pm #

Hi Jason,

In Random Forest Regression model, is it possible to give specific numeric range for each sample and we need output within that range only. Range can be different for each inputs.

Thanks.

Reply
- Jason Brownlee March 29, 2019 at 8:34 am #
  
  Perhaps you could achieve that with post-processing of the prediction from the model? Like re-scaling?
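One simple form of post-processing is clipping each prediction to its own allowed range. This swaps in clipping rather than re-scaling, and the predictions and bounds below are made up:

```python
import numpy as np

# Made-up model outputs and per-sample allowed ranges.
yhat = np.array([12.0, 48.0, 7.5])
lower = np.array([10.0, 20.0, 8.0])
upper = np.array([15.0, 40.0, 9.0])

# np.clip accepts array bounds, so each sample gets its own range.
clipped = np.clip(yhat, lower, upper)
print(clipped)
```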
  
  Reply
Priyanka Dave March 29, 2019 at 5:21 pm #

Thanks Jason for your reply and you are doing great job.

Post processing or re-scaling would be the proper solution?
I have multiple inputs & multiple outputs. Can I tell model that these particular inputs will have more impact on these particular output ?

Reply
- Jason Brownlee March 30, 2019 at 6:23 am #
  
  You could use feature selection and/or feature importance methods to give a rough idea of relative feature impact on model skill.
  
  Reply
  - Priyanka Dave April 1, 2019 at 8:37 pm #
    
    Thanks Jason.
    
    Reply
Fahim April 5, 2019 at 10:57 pm #

Hi Jason, I hope you would see this quickly since it’s urgent.
After training a logistic regression model from sklearn on some training data using train_test split,fitting the model by model.fit(), I can get the logistic regression coefficients by the attribute model.coef_ , right? Now, my question is, how can I use this coefficients to predict a separate, single test data?

Reply
- Jason Brownlee April 6, 2019 at 6:49 am #
  
  You can use the model directly via model.predict()
  
  There are more examples here:
  https://machinelearningmastery.com/make-predictions-scikit-learn/
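For a binary LogisticRegression, the coefficients can also be applied by hand, and the result should match what the model returns. A sketch on synthetic data, for illustration only:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption).
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
model = LogisticRegression().fit(X, y)

# A single new sample, kept 2D with shape (1, n_features).
x_new = X[:1]

# Manual prediction: linear combination of inputs, then the logistic function.
z = x_new @ model.coef_.T + model.intercept_
p_manual = 1.0 / (1.0 + np.exp(-z))

# The same probability of class 1 straight from the model.
p_model = model.predict_proba(x_new)[:, 1]
print(p_manual.ravel(), p_model)
```

In practice, calling model.predict() or model.predict_proba() directly is simpler and less error-prone.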
  
  Reply
ramakant kumar April 22, 2019 at 3:12 pm #

"""
Random Forest implementation with CART decision trees
This version is for continuous dataset (feature values)

Author: Jamie Deng
Date: 09/10/2018
"""

import numpy as np
import pandas as pd
from collections import Counter
from sklearn.utils import resample
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from multiprocessing import Pool
from random import sample
import time
from sklearn.ensemble import RandomForestRegressor

np.seterr(divide='ignore', invalid='ignore')  # ignore Runtime Warning about divide

class TreeNode:
    def __init__(self, n_features):
        self.n_features = n_features
        self.left_child = None
        self.right_child = None
        self.split_feature = None
        self.split_value = None
        self.split_gini = 1
        self.label = None

    def is_leaf(self):
        return self.label is not None

    """ use 2d array (matrix) to compute gini index. Numerical feature values only """
    def gini(self, f, y, target):
        trans = f.reshape(len(f), -1)  # transpose 1d np array
        a = np.concatenate((trans, target), axis=1)  # vertical concatenation
        a = a[a[:, 0].argsort()]  # sort by column 0, feature values
        sort = a[:, 0]
        split = (sort[0:-1] + sort[1:]) / 2  # compute possible split values

        left, right = np.array([split]), np.array([split])
        classes, counts = np.unique(y, return_counts=True)
        n_classes = len(classes)
        # count occurrence of labels for each possible split value
        for i in range(n_classes):
            temp = a[:, -n_classes + i].cumsum()[:-1]
            left = np.vstack((left, temp))  # horizontal concatenation
            right = np.vstack((right, counts[i] - temp))

        sum_1 = left[1:, :].sum(axis=0)  # sum occurrence of labels
        sum_2 = right[1:, :].sum(axis=0)
        n = len(split)
        gini_t1, gini_t2 = [1] * n, [1] * n
        # calculate left and right gini
        for i in range(n_classes):
            gini_t1 -= (left[i + 1, :] / sum_1) ** 2
            gini_t2 -= (right[i + 1, :] / sum_2) ** 2
        s = sum(counts)
        g = gini_t1 * sum_1 / s + gini_t2 * sum_2 / s
        g = list(g)
        min_g = min(g)
        split_value = split[g.index(min_g)]
        print("hello0")
        return split_value, min_g

    def split_feature_value(self, x, y, target):
        # compute gini index of every column
        n = x.shape[1]  # number of x columns
        sub_features = sample(range(n), self.n_features)  # feature sub-space
        # list of (split_value, split_gini) tuples
        value_g = [self.gini(x[:, i], y, target) for i in sub_features]
        result = min(value_g, key=lambda t: t[1])  # (value, gini) tuple with min gini
        feature = sub_features[value_g.index(result)]  # feature with min gini
        return feature, result[0], result[1]  # split feature, value, gini

    # recursively grow the tree
    def attempt_split(self, x, y, target):
        c = Counter(y)
        majority = c.most_common()[0]  # majority class and count
        label, count = majority[0], majority[1]
        if len(y) 0.9: # stop criterion
            self.label = label  # set leaf
            return
        # split feature, value, gini
        feature, value, split_gini = self.split_feature_value(x, y, target)
        # stop split when gini decrease smaller than some threshold
        if self.split_gini - split_gini < 0.01:  # stop criterion
            self.label = label  # set leaf
            return
        index1 = x[:, feature] <= value
        index2 = x[:, feature] > value
        x1, y1, x2, y2 = x[index1], y[index1], x[index2], y[index2]
        target1, target2 = target[index1], target[index2]
        if len(y2) == 0 or len(y1) == 0:  # stop split
            self.label = label  # set leaf
            return
        # splitting procedure
        self.split_feature = feature
        self.split_value = value
        self.split_gini = split_gini
        self.left_child, self.right_child = TreeNode(self.n_features), TreeNode(self.n_features)
        self.left_child.split_gini, self.right_child.split_gini = split_gini, split_gini
        self.left_child.attempt_split(x1, y1, target1)
        self.right_child.attempt_split(x2, y2, target2)

    # trace down the tree for each data instance, for prediction
    def sort(self, x):  # x is 1d array
        if self.label is not None:
            return self.label
        if x[self.split_feature] <= self.split_value:
            return self.left_child.sort(x)
        else:
            return self.right_child.sort(x)

class ClassifierTree:
    def __init__(self, n_features):
        self.root = TreeNode(n_features)

    def train(self, x, y):
        # one hot encoded target is for gini index calculation
        # categories='auto' silence future warning
        encoder = OneHotEncoder(categories='auto')
        labels = y.reshape(len(y), -1)  # transpose 1d np array
        target = encoder.fit_transform(labels).toarray()
        self.root.attempt_split(x, y, target)

    def classify(self, x):  # x is 2d array
        return [self.root.sort(x[i]) for i in range(x.shape[0])]

class RandomForest:
    def __init__(self, n_classifiers=30):
        self.n_classifiers = n_classifiers
        self.classifiers = []
        self.x = None
        self.y = None

    def build_tree(self, tree):
        n = len(self.y)  # n for bootstrap sampling size
        # n = int(n * 0.5)
        x, y = resample(self.x, self.y, n_samples=n)  # bootstrap sampling
        tree.train(x, y)
        return tree  # return tree for multiprocessing pool

    def fit(self, x, y):
        self.x, self.y = x, y
        n_select_features = int(np.sqrt(x.shape[1]))  # number of features
        for i in range(self.n_classifiers):
            tree = ClassifierTree(n_select_features)
            self.classifiers.append(tree)
        # multiprocessing pool
        pool = Pool()
        self.classifiers = pool.map(self.build_tree, self.classifiers)
        pool.close()
        pool.join()

    def predict(self, x_test):  # ensemble
        pred = []

        for tree in self.classifiers:
            print(self.classifiers)
            y_pred = tree.classify(x_test)
            pred.append(y_pred)
        pred = np.array(pred)
        result = [Counter(pred[:, i]).most_common()[0][0] for i in range(pred.shape[1])]
        return result

def test():
    start_time = time.time()
    # It's a continuous dataset, only numerical feature values
    df = pd.read_csv('waveform.data', header=None, sep=',')
    print("hello0")
    data = df.values
    x = data[:, :-1]
    y = data[:, -1]
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
    rf = RandomForest(n_classifiers=30)
    print("hello1")
    rf.fit(x_train, y_train)
    print("hello2")
    y_pred = rf.predict(x_test)
    print("hello")
    acc = accuracy_score(y_test, y_pred)
    print('RF:', acc)
    print("--- Running time: %.6f seconds ---" % (time.time() - start_time))

if __name__ == "__main__":
    test()

gives error .thank adv

File "C:/Users/Rama/rf.py", line 168, in predict
y_pred = tree.classify(x_test)

AttributeError: 'str' object has no attribute 'classify'

Reply
- Jason Brownlee April 23, 2019 at 7:50 am #
  
  Sorry, I don’t have the capacity to debug your code. I explain more here:
  https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
  
  Reply
Sandeep April 26, 2019 at 10:47 pm #

Hi Jason,
I need to predict the order cancellation probability based upon some 6 independent variables. Now, through .coef_ function I can find out the coefficient values for each independent variables, but that will be for the whole dataset. I wan’t to find out that, if my model is predicting that order will get cancel than what is the reason behind it, I mean which parameter is the most prominent reason for the cancellation of that particular order.

Reply
- Jason Brownlee April 27, 2019 at 6:31 am #
  
  Perhaps use a decision tree model that can explain the predictions it makes?
  
  Reply
naveen April 30, 2019 at 4:13 pm #

Hi Jason,

I have to print top three probabilities and map to class labels and compare with actual for multi class classification can u pls provide me the code.

Thanks in adv.

Regards,
NaVeen

Reply
- Jason Brownlee May 1, 2019 at 6:59 am #
  
  What problem are you having exactly?
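As a starting point for the top-three question, here is a hedged sketch using predict_proba and NumPy sorting. The five-class synthetic dataset and the LogisticRegression model are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic five-class problem (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=5, n_clusters_per_class=1,
                           random_state=1)
model = LogisticRegression(max_iter=1000).fit(X, y)

proba = model.predict_proba(X[:4])

# Sort each row by descending probability and keep the first three columns.
top3_idx = np.argsort(proba, axis=1)[:, ::-1][:, :3]
top3_labels = model.classes_[top3_idx]                     # column index -> class label
top3_proba = np.take_along_axis(proba, top3_idx, axis=1)   # matching probabilities

# Compare with the actual labels for the same rows.
for actual, labels, probs in zip(y[:4], top3_labels, top3_proba):
    print(actual, labels, np.round(probs, 3))
```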
  
  Reply
Nassim May 20, 2019 at 7:35 pm #

Hi Jason,

I have a project how is making a prediction of availability places of parking, I have in my database a geolocaloisation point, please can you help me to say how I can use this data to predict and how I begin, what kind of algorithm I will use.

Thanks in adv

Regards,
Nassim.

Reply
- Jason Brownlee May 21, 2019 at 6:33 am #
  
  I recommend this process:
  https://machinelearningmastery.com/how-to-define-your-machine-learning-problem/
  
  Reply

test11 May 22, 2019 at 4:49 pm #

Hi Jason,

I am trying to test my model on unseen data from another file, so far I have my main Python file below;

file1.py

def model_test(x, y):
    start2 = timer()


    pipeline = Pipeline([

        ('vec', HashingVectorizer()),
        ('clf', LogisticRegression()),
    ])


    param_grid = {
        'vec__strip_accents': ['ascii'], 'vec__ngram_range': [(1, 1)],
        'vec__encoding': ['latin-1'], 'vec__stop_words': ['english'], 'vec__lowercase': ['False'],
        'clf__random_state': [1], 'clf__solver': ['lbfgs'], 'clf__max_iter': [200],

    }

    grid = GridSearchCV(pipeline, cv=2, param_grid=param_grid, n_jobs=-1, iid=True, return_train_score=True, refit=True, verbose=1)
    grid.fit(x, y)


model_test(X_train, y_train)

And on my test file for unseen data I have;

from file1 import model_test

y_new = model_test.predict(X_new)
model_test(X_new, y_new)

It is saying that it cannot find reference predict in function, would you know how to get around this issue? Thank you.

Jason Brownlee May 23, 2019 at 5:55 am #

You do not need grid search to make a prediction, grid search is only used to find a configuration for your final model.

See this post:
https://machinelearningmastery.com/train-final-machine-learning-model/
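On the specific error: model_test in the code above is a function that does not return the fitted object, so model_test.predict has nothing to refer to. A sketch of one way around this, assuming a hypothetical `fit_model` function that returns the refit best pipeline, with made-up text data and a simplified parameter grid:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

def fit_model(x, y):
    """Run the grid search and return the pipeline refit on all of x, y."""
    pipeline = Pipeline([
        ("vec", HashingVectorizer()),
        ("clf", LogisticRegression()),
    ])
    param_grid = {"clf__C": [0.1, 1.0]}  # simplified grid (assumption)
    grid = GridSearchCV(pipeline, param_grid=param_grid, cv=2, refit=True)
    grid.fit(x, y)
    return grid.best_estimator_  # a fitted Pipeline with a predict() method

# Made-up training documents and labels.
X_train = ["free money now", "meeting at noon", "win cash prize",
           "lunch tomorrow", "cheap loans free", "project review today"]
y_train = [1, 0, 1, 0, 1, 0]

model = fit_model(X_train, y_train)

# Predict on unseen text using the returned object, not the function.
y_new = model.predict(["free cash"])
print(y_new)
```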

Reply
- test11 May 24, 2019 at 4:25 am #
  
  This makes sense to me now. Thanks for getting back to me Jason!
  
  Reply
  - Jason Brownlee May 24, 2019 at 7:59 am #
    
    No problem.
    
    Reply

Soumen Sarkar June 18, 2019 at 3:46 pm #

hi Jason,

thanks for excellent post. i was wondering after predicting value on already known y, how to use regression model to predict unknown y for given x set. in your coding, Make_regression – does it really use my actual data point? or it is a random regression based on random data points depending upon number of sample, features etc.
Suppose i have built a model and predicted number of immigrants basis different x features. now i want to predict number of immigrants given new set of feature value. Do you think Make-regression will be the right option? The reason i am asking this is the function doesn’t ask about my X and y – am i wrong here? Your early reply will be really appreciated

Reply
- Jason Brownlee June 19, 2019 at 7:49 am #
  
  If you change the features – the inputs – then you must fit a new model to map those inputs to the desired outputs.
  
  Does that help?
  
  Reply
Ibrahim H. July 8, 2019 at 9:14 am #

Hi,
Thanks for this great tutorial, the new version show a warning, so it requires to set solver param:
model = LogisticRegression(solver="lbfgs")

Reply
- Jason Brownlee July 8, 2019 at 1:50 pm #
  
  Thanks.
  
  Reply
Dio July 13, 2019 at 6:14 am #

Hi Jason,

I want my prediction to generate the name of the class instead of numbers. How can I do so? For example, in the case of the iris flowers, how can I input the data and make the model predict the number after I have trained and validated the model.

Reply
- Jason Brownlee July 13, 2019 at 7:01 am #
  
  Each class is mapped to an integer using a LabelEncoder.
  
  You can use inverse_transform to map integers back to class names.
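A short sketch with the iris data, assuming the class names are encoded with a LabelEncoder before training and a LogisticRegression is used as the model:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

iris = load_iris()

# Build string labels, then encode them as integers for training.
names = iris.target_names[iris.target]
encoder = LabelEncoder()
y = encoder.fit_transform(names)

model = LogisticRegression(max_iter=1000).fit(iris.data, y)

# The model predicts an integer; the encoder maps it back to a class name.
yhat = model.predict(iris.data[:1])
label = encoder.inverse_transform(yhat)
print(yhat, label)
```

scikit-learn classifiers can also be fit directly on string labels, in which case predict() returns the names without a separate decoding step.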
  
  Reply
Elshrif July 14, 2019 at 2:18 pm #

Hi Jason,

How to solve this Error

# All needed imports
import pandas as pd
import numpy as np
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import seaborn as sns
from surprise import *
from surprise.accuracy import rmse, mae, fcp
from UserDefinedAlgorithm import PredictMean
from sklearn.grid_search import ParameterGrid
import pickle
from os import listdir

---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
in ()
7 from surprise import *
8 from surprise.accuracy import rmse, mae, fcp
----> 9 from UserDefinedAlgorithm import PredictMean
10 from sklearn.grid_search import ParameterGrid
11 import pickle

ImportError: No module named UserDefinedAlgorithm

Reply
- Jason Brownlee July 15, 2019 at 8:16 am #
  
  I have not seen the error or the surprise library before, perhaps try posting to stackoverflow?
  
  Reply
Aditya August 2, 2019 at 7:43 am #

What if the X_train and X_new are of different sizes

I’m getting the following error after using Tfidf with similar parameters on X_new.
X has 265 features per sample; expecting 73

Reply
- Jason Brownlee August 2, 2019 at 2:34 pm #
  
  Input samples to the model must always have the same size.
  
  This may require that you preserve and reuse the same data preparation procedures/objects.
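
As a rough illustration of reusing the same data preparation object, the sketch below fits a TfidfVectorizer once on the training text and then only calls transform() on new text. The tiny corpus and labels are made up for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# tiny illustrative corpus (made up for this sketch)
train_text = ["good movie", "great film", "bad movie", "terrible film"]
y_train = [1, 1, 0, 0]
new_text = ["great movie with an unseen word"]

# fit the vectorizer once, on the training text only
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_text)
model = LogisticRegression()
model.fit(X_train, y_train)

# reuse the fitted vectorizer: transform(), not fit_transform()
X_new = vectorizer.transform(new_text)
print(X_train.shape[1], X_new.shape[1])

# the new sample now has the same number of features as the training data
ynew = model.predict(X_new)
print(ynew)
```

Calling fit_transform() again on the new text would learn a new vocabulary and is a likely way to end up with a different number of features.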
  
  Reply
Pramod Kumar August 11, 2019 at 6:00 pm #

Hi Jason,

I have created a random forest model and evaluated the model using confusion matrix.
Now i want to predict the output by supplying some specific independent variable values

Ex: Age, Estimated Salary, Gender_Male, Gender_Female and ‘Purchased’ are my columns.

now i want to predict the output (Purchased) when the input variable are Age==35, Gender_Male=1 and Estimated salary= 40000

Can you please help me with the code.

Thanks in advance!

Regards,
Pramod Kumar

Reply
- Jason Brownlee August 12, 2019 at 6:34 am #
  
  The examples in the above tutorial will provide a useful starting point.
  
  What problem are you having precisely?
  
  Reply
Moahmmad August 14, 2019 at 7:48 am #

Hi Jason

I have stock market data with some features let’s say date,open,high,low,close,F1,F2,F3

my x_train is the data without ‘close’ column, and my y_train is the ‘close’ column
same for the x_test, and y_test

now when I do LinearRegression.fit(x_train,y_train) then predicted_y = LinearRegression.predict(x_test)

I get good results and near my y_test.

my problem is if I want to make a prediction for tomorrow, I don’t have any features, all the columns are unknown, so how can I make a prediction?

thanks

Reply
- Jason Brownlee August 14, 2019 at 2:10 pm #
  
  You can make a prediction as follows:
  
  yhat = model.predict(newX)
  
  Where newX is the input data required by the model as you designed it to make a one-step prediction.
  
  Reply
Bob August 15, 2019 at 3:22 pm #

I engineered a bunch of features in my X_train and fit a model on X_train and y_train. Now when I want to predict on X_test, why does X_test have to have the same columns as X_train? I don’t understand this. If the model learns from the past, why does it have to have those columns in the testing set?

Reply
- Jason Brownlee August 16, 2019 at 7:45 am #
  
  Yes, you are defining a model to take specific input features (columns) that must be consistent during training and testing.
  
  Perhaps this will help:
  https://machinelearningmastery.com/how-machine-learning-algorithms-work/
  
  Reply
Bibin Antony September 25, 2019 at 1:13 am #

how to apply for a text data?

Reply
- Jason Brownlee September 25, 2019 at 5:59 am #
  
  You can see the examples here:
  https://machinelearningmastery.com/start-here/#nlp
  
  Reply
shadia October 21, 2019 at 4:42 am #

hi Jason
i’m confused , what is the final step of my project, saving the model or finalize the model by fitting on all data

Reply
- Jason Brownlee October 21, 2019 at 6:23 am #
  
  Fit the model on all available data, save it. Then in the future load it to make predictions on new data.
  
  More here:
  https://machinelearningmastery.com/train-final-machine-learning-model/
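
The fit, save, load and predict sequence can be sketched as follows. This assumes pickle for persistence and a synthetic dataset; the file name is a placeholder written to the system temp directory.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# fit the final model on all available data (synthetic here)
X, y = make_classification(n_samples=100, n_features=4, random_state=1)
model = LogisticRegression()
model.fit(X, y)

# save the fitted model to disk (the path is a placeholder)
path = os.path.join(tempfile.gettempdir(), "final_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# later: load the model and make predictions on new data
with open(path, "rb") as f:
    loaded = pickle.load(f)
Xnew = X[:3]
ynew = loaded.predict(Xnew)
print(ynew)
```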
  
  Reply
Jose G Rosas November 5, 2019 at 3:03 pm #

Awesome!
Thanks!

Reply
- Jason Brownlee November 6, 2019 at 6:29 am #
  
  You’re welcome.
  
  Reply
Daniele December 5, 2019 at 3:34 am #

Ciao Jason
what is the best way to export the scores of the Xnew classification to csv or xlsx?

also: can I have Xnew values stored initially in xlsx (or csv)?

I am referring to this section of yours..

…(single classification)

Xnew = [[-0.79415228, 2.10495117]]
# make a prediction
ynew = model.predict(Xnew)
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))

…

instead of printing the predicted class for the array Xnew I need to write the results automatically into a file (csv or xlsx) … so that an application can then consume such scores…

any advice?

Daniele

thanks for your great work!

Reply
- Jason Brownlee December 5, 2019 at 6:43 am #
  
  I believe you can load a CSV directly in excel.
  
  Reply
Daniele December 5, 2019 at 7:39 am #

actually I got this situation:

……
model = LinearRegression()
model.fit(X_train, Y_train)
result_train =model.score(X_train, Y_train)

result_train.to_excel("new.xlsx", sheet_name="second")  # and here I get this error:

AttributeError: 'numpy.float64' object has no attribute 'to_excel'

not sure how to solve it though… maybe it is related to the yhat use… stuck.

please : )

Reply
- Jason Brownlee December 5, 2019 at 1:19 pm #
  
  Perhaps try saving your data to a CSV file first?
  
  Reply
  - Daniele December 5, 2019 at 11:31 pm #
    
    Hi Jason!
    
    the CSV solution did not work for me, but reviewing some of your previous threads and other material I found a way to send the prediction results directly into an xlsx or csv file like this:
    
    >>>model.fit(X_train, Y_train)
    
    >>>result_train =model.score(X_train, Y_train)
    
    # generate accuracy prediction as dataframe object
    >>>Accuracy = pd.DataFrame({'ACCURACY' : [result_train]})
    
    Hope that the last line of code could be a solution working for others as well
    
    Bye, great Jason, and thanks again.
    
    Reply
    - Jason Brownlee December 6, 2019 at 5:18 am #
      
      Well done!
      
      Reply
babar ali shah February 18, 2020 at 6:13 am #

Hi @Jason Sir
I started working on prediction with neural networks a month ago. I have done a lot of research and study but am still unable to understand the functions used in neural networks, e.g. Sequential(), Dense(), etc.
Waiting for your kind response please…

Reply
- Jason Brownlee February 18, 2020 at 6:24 am #
  
  Well done, start here:
  https://machinelearningmastery.com/start-here/#deeplearning
  
  Reply
Ehsan February 18, 2020 at 10:29 pm #

Hi Jason,

Do you know how to send below result to a CSV or excel?
for i in range(len(Xnew)):
    print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))

Reply
- Jason Brownlee February 19, 2020 at 8:04 am #
  
  Yes, see this:
  https://machinelearningmastery.com/how-to-save-a-numpy-array-to-file-for-machine-learning/
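
One possible way to write the inputs and predictions to a CSV file is sketched below, assuming pandas is available. The column names x1, x2 and predicted, and the output path, are placeholders chosen for this example.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# fit a model and predict a few new rows (synthetic data)
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
model = LogisticRegression()
model.fit(X, y)
Xnew, _ = make_blobs(n_samples=3, centers=2, n_features=2, random_state=1)
ynew = model.predict(Xnew)

# collect inputs and predictions in one table
results = pd.DataFrame(Xnew, columns=["x1", "x2"])
results["predicted"] = ynew

# write to CSV (placeholder path); Excel can open this file directly
path = os.path.join(tempfile.gettempdir(), "predictions.csv")
results.to_csv(path, index=False)
print(results)
```

For an xlsx file, results.to_excel() can be used in the same way, provided an Excel writer such as openpyxl is installed.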
  
  Reply
Shreya February 25, 2020 at 5:48 pm #

My dataset is huge, so training takes too long when I use SVC with kernel = ‘linear’.
If I use LinearSVC, training does not take that much time, but the problem is that I cannot calculate the prediction probabilities using the predict_proba() function when I am ensembling models such as Random Forest, XGBoost and LinearSVC.
What could be the solution if I want to use the SVM classifier in an ensemble?

Reply
- Jason Brownlee February 26, 2020 at 8:15 am #
  
  Use less training data.
  Use a faster machine.
  Use a simpler model.
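
On the predict_proba() part of the question, one option that may be worth trying is to wrap LinearSVC in scikit-learn's CalibratedClassifierCV, which fits a probability calibration on top of the SVM's decision scores. This is a sketch on synthetic data, not a tuned setup; the cv=3 setting is an arbitrary choice.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# synthetic binary classification data
X, y = make_classification(n_samples=200, n_features=5, random_state=1)

# LinearSVC alone has no predict_proba(); the wrapper adds calibrated probabilities
svm = CalibratedClassifierCV(LinearSVC(), cv=3)
svm.fit(X, y)

# one row per sample, one column per class
proba = svm.predict_proba(X[:3])
print(proba)
```

The wrapped model can then be passed to an ensemble that needs probabilities, such as a VotingClassifier with voting="soft".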
  
  Reply
kiki March 13, 2020 at 11:52 am #

Hi. Do you have any idea how to make predictions when the outcome is a categorical type? I have already applied a label encoder and it produces integer values. The outcome I want to predict takes values from 0 to 6. Is the code the same as when the outcome is 0 or 1? Sorry for asking, I’m a beginner student in machine learning. Thanks.

Reply
- kiki March 13, 2020 at 12:30 pm #
  
  I think I already found the notes from your web. It helps me to understand.
  https://machinelearningmastery.com/imbalanced-multiclass-classification-with-the-glass-identification-dataset/
  Thanks Jason.
  
  Reply
- Jason Brownlee March 13, 2020 at 1:49 pm #
  
  It sounds like you are describing a multi-class classification task.
  
  You most likely will need to one hot encode your target variable first:
  https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
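
For scikit-learn classifiers specifically, a label-encoded target with integer values 0 to 6 can generally be passed to fit() as-is; one hot encoding the target is mostly needed for neural network libraries such as Keras. A minimal sketch, assuming a synthetic 7-class dataset and a RandomForestClassifier picked only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic 7-class problem; y holds the integers 0..6
X, y = make_classification(n_samples=700, n_features=10, n_informative=6,
                           n_classes=7, n_clusters_per_class=1, random_state=1)

# fit directly on the integer class labels
model = RandomForestClassifier(random_state=1)
model.fit(X, y)

# class predictions are integers; probabilities have one column per class
ynew = model.predict(X[:5])
proba = model.predict_proba(X[:5])
print(ynew)
print(proba.shape)
```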
  
  Reply
Sousa March 20, 2020 at 1:55 am #

Hello Sir. We do all the steps for building a model (data prepare, feature selection, model selection, hyper parametrization, …), to improve and discover the best model for our problem. In the final, after we discover that model, we fit all the data on that model to making predictions. Am I right?

My doubt is: we will need to prepare the new data (to make a prediction) for the model like we prepare for the training/test? Because for example in the model training/test we may use getDummies, so if we collect the same features as the original CSV and then we want to make a prediction, the number of features will be different. The same happens with normalization/standardization. This new data, will not be normalized/standardized.

Thanks in advance!

Reply
- Jason Brownlee March 20, 2020 at 8:47 am #
  
  Yes.
  
  Correct. See this:
  https://machinelearningmastery.com/how-to-save-and-load-models-and-data-preparation-in-scikit-learn-for-later-use/
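
One way to keep the data preparation and the model together is a scikit-learn Pipeline, so the scaling and one hot encoding fitted on the training data are applied unchanged to new rows. The sketch below uses a small made-up dataset; the column names and the choice of RandomForestClassifier are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# small made-up dataset; column names are hypothetical
train = pd.DataFrame({
    "Age": [22, 35, 47, 51, 29, 40],
    "EstimatedSalary": [20000, 40000, 90000, 75000, 30000, 60000],
    "Gender": ["Male", "Female", "Female", "Male", "Male", "Female"],
    "Purchased": [0, 0, 1, 1, 0, 1],
})
X = train.drop(columns="Purchased")
y = train["Purchased"]

# data preparation lives inside the pipeline, so it is fit once and reused
prep = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "EstimatedSalary"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Gender"]),
])
pipeline = Pipeline([("prep", prep),
                     ("model", RandomForestClassifier(random_state=1))])
pipeline.fit(X, y)

# a new raw row with the same columns as the training inputs
Xnew = pd.DataFrame({"Age": [35], "EstimatedSalary": [40000], "Gender": ["Male"]})
ynew = pipeline.predict(Xnew)
print("Predicted=%s" % ynew[0])
```

Because OneHotEncoder is fitted on the training categories, a single new row still produces the full set of dummy columns, which pd.get_dummies on the new row alone would not.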
  
  Reply
d April 10, 2020 at 11:03 am #

Hi Jason, your article really helps me. From your article, we can load and save the file with scikit-learn, JSON and YAML. So if we want to do the prediction using the model, is it the same way as we do with scikit-learn?

Reply
- Jason Brownlee April 10, 2020 at 1:25 pm #
  
  Yes, this tutorial shows you how to load a file:
  https://machinelearningmastery.com/load-machine-learning-data-python/
  
  Reply
  - d April 10, 2020 at 2:23 pm #
    
    What I mean is saving the model with scikit-learn, JSON and YAML. So if we want to do the prediction using the model that has been saved, is it the same way as we do with scikit-learn? Sorry for asking. 🙁
    
    Reply
    - Jason Brownlee April 10, 2020 at 3:34 pm #
      
      Perhaps, if that is possible, I don’t have an example.
      
      Reply
Asif Rehman April 21, 2020 at 5:46 am #

Respected Sir…Its great article that gives the start up of multiple ML model.

Sir I want to apply “Binary Classification” to a car that gets punctured on odd days of the month but remains safe on even days of the month. The features of the data set are as below:

CarID LocationID Day Month Year isPunctured
1 100 1 1 2020 1
1 100 2 1 2020 0

I made a data set of just one month and I have used two models, “GaussianNB” and “LogisticRegression”, but both give me an accuracy of just 18%.

df=pd.read_csv('CarTrain.csv')
label=df["Punctured"]
features=df[['CarID','LocID','Day','Month','Year']]

train, test, train_labels, test_labels = train_test_split(features,
label,
test_size=0.33,
random_state=42)
# Initialize our classifier

#gnb = GaussianNB()
gnb=LogisticRegression()
model = gnb.fit(train, train_labels)

preds = gnb.predict(test)

# Evaluate accuracy
print(accuracy_score(test_labels, preds))

Sir plz guide me where I am making mistake and how can I increase the accuracy of the classifier?

Thanks

Reply
- Jason Brownlee April 21, 2020 at 6:08 am #
  
  Thanks!
  
  This is a common question that I answer here:
  https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code
  
  Reply
  - Asif Rehman April 21, 2020 at 7:55 am #
    
    Sir, did you want to point me to the following:
    
    How do I calculate accuracy for Regression
    How do I improve model skill
    ??
    
    Reply
    - Jason Brownlee April 21, 2020 at 8:33 am #
      
      This is a common question that I answer here:
      https://machinelearningmastery.com/faq/single-faq/how-do-i-calculate-accuracy-for-regression
      
      This will help you lift performance of a model:
      https://machinelearningmastery.com/machine-learning-performance-improvement-cheat-sheet/
      
      Reply
Dina April 29, 2020 at 5:11 pm #

Hi Jason, I have 16 features from which I want to predict the score. I already wrote this code to predict a single value

X = [113,1546,56,54,156,5,57,54,486,648,489,846,489,87,648,652]
ynew = model1.predict(Xnew)
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))

Unfortunately, I’ve got this error
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 1 array(s), but instead got the following list of 16 arrays: [array([[-1.5662]]), array([[683.4411]]), array([[15.1]]), array([[60.5264]]), array([[16.1]]), array([[15.2999]]), array([[4945.5514]]), array([[6213.1985]]), array([[5895.772]]), array([[482.4389]])…

Do you have an idea on how I should handle the data?

Reply
- Jason Brownlee April 30, 2020 at 6:37 am #
  
  The error suggests the data you are passing to the model does not match the expectations of the model.
  
  Perhaps change the data to match the model or change the model to match the data.
  
  Reply
Nora May 18, 2020 at 1:18 pm #

Hi Jason. May I know how to make a single prediction with an LSTM model from input keyed in by the user?

Reply
- Jason Brownlee May 18, 2020 at 1:27 pm #
  
  Yes, see this:
  https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/
  
  Reply
  - Nora May 18, 2020 at 2:19 pm #
    
    I already tried on this code:
    #Get the models predicted price values
    x_new = [-1.5662,683.4411,15.1,60.5264,16.1,15.2999,4945.5514,6213.1985,5895.772,482.4389,6.0705,6.0601,13.5,40.4294,13.5,13.2999]
    y = model.predict(x_new)
    print(y)
    
    but I always get an error with the input that I keyed in. Do you know why?
    
    ValueError: Error when checking input: expected lstm_7_input to have 3 dimensions, but got array with shape (16, 1)
    
    Reply
    - Jason Brownlee May 19, 2020 at 5:53 am #
      
      Input shape must be 3d, e.g. 1 sample, 16 time steps and 1 feature.
      
      Try reshaping your sample to be 3d.
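
A small sketch of that reshape with NumPy, assuming the 16 values are 16 time steps of a single feature. If the model was instead defined with 1 time step and 16 features, the target shape would be (1, 1, 16).

```python
import numpy as np

# one sample with 16 values, as a flat list
x_new = [-1.5662, 683.4411, 15.1, 60.5264, 16.1, 15.2999, 4945.5514, 6213.1985,
         5895.772, 482.4389, 6.0705, 6.0601, 13.5, 40.4294, 13.5, 13.2999]

# reshape to [samples, time steps, features] = (1, 16, 1)
x_new = np.array(x_new).reshape((1, 16, 1))
print(x_new.shape)  # (1, 16, 1)
```

The reshaped array is what would then be passed to model.predict().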
      
      Reply
Rahil May 29, 2020 at 8:01 pm #

Your tutorial are always awesome and helpful. Many thanks Jason!

I have about 400 cancer data that are categorized in 3 groups (features are gene expression values). I want to do probability prediction for new data.

Could you please name some of classification models which are able to make probability prediction?
Do you have any tutorial in R as well for this topic?

Reply
- Jason Brownlee May 30, 2020 at 5:57 am #
  
  Thanks.
  
  Perhaps this will help:
  https://machinelearningmastery.com/non-linear-classification-in-r/
  
  Reply
chirag agrawal August 9, 2020 at 11:50 pm #

i need an answer that why my sklearn model is predicting the same output no matter through which model i predict or whatever batch of inputs i give, I get same output for model 1 i.e. 0.96 and with model 2 i.e. 0.81-0.82, any solution??

Reply
- Jason Brownlee August 10, 2020 at 5:48 am #
  
  It might suggest the model does not have skill.
  
  Reply
Howard August 13, 2020 at 11:13 pm #

Hi Jason, I built a model X using random forest with 500 samples(row), which have four x columns, [x1, x2, x3, x4], and a label y. The score of model is about 0.61.

Then, I have an unseen data Xhat about 5,000,000 samples(row), which also have four columns, [x1, x2, x3, x4] but without label y, and I would like to use the model made by X to predict(Xhat) for the new label y. (The Xhat may contain the dataset X, may be not, but the four columns have similar data types)

Although I did get labels this way in the end, I am really curious whether the predict() function works well at this stage. Is this a correct or reasonable approach? Is a score below 0.7 reasonable? And are the predicted labels of Xhat correct?

Thanks!!

Reply
- Jason Brownlee August 14, 2020 at 6:06 am #
  
  Yes, this is the purpose of a predictive model, to make predictions on new data.
  
  Reply
  - Howard August 14, 2020 at 2:03 pm #
    
    Thank you for your prompt reply!
    
    I have another question, that is I wonder why we could use 500 samples to predict the huge dataframe, even if we don’t know the distribution of the 500 samples among (or not in) the huge dataframe?
    
    Thanks!!
    
    Reply
    - Jason Brownlee August 15, 2020 at 6:14 am #
      
      The model assumes the distribution of any new data matches the distribution of the training dataset.
      
      Perhaps I don’t understand your question, if so, perhaps you could elaborate or rephrase it?
      
      Reply
      - Howard August 15, 2020 at 5:29 pm #
        
        Thanks a lot, I got it!
        
        When we choose the trained model to predict unseen data, it assumes that there is some relationship between them.
        
        Does it mean that even if the actual classification trend of the new data does not match the trained model, it could still work, but would produce some wrong predicted labels?
        
        If so, should I test the accuracy of the predicted labels in any way?
        
        That’s because in my trained model, columns “x” is Symmetrized Segment-Path Distance (SSPD) of different modes, and label “y” is the classification of traffic modes. I would like to know whether our trained model samples’ trend or distribution match the whole unseen data, because the trained label “y” is only a survey on 500 people, I wonder in the limited situation, it could be used as a model to predict 5,000,000 people(unseen data) as well, especially the model score is only 0.61.
      - Jason Brownlee August 16, 2020 at 5:49 am #
        
        Correct.
        
        Yes, always train and evaluate your model on data that has the same general distribution as the data for which you intend to use the model later to make predictions.
fatma zohra October 11, 2020 at 1:44 pm #

Hello Jason,
first, thank you for the precious content you are sharing, much appreciated.
Now my question is: I have built my model and saved it, now I would like to evaluate it, and I want to get the y_pred, but I don’t know how.
Thanks in advance,

Reply
- Jason Brownlee October 12, 2020 at 6:39 am #
  
  You’re welcome.
  
  No need to evaluate it, you already have previously. Models are evaluated using cross validation:
  https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/
  
  This post is about after you have evaluated your models and chosen one.
  
  Reply
Valdemar Sousa October 15, 2020 at 3:27 am #

Hello Jason, first of all congratulations on all your posts here on the site, they are simply very good. I am new to the ML world, and I have read many of your articles. Right now I’m doing a project where I have a database of network infrastructure alarms over a period of 3 months, in this case a time series problem. What advice can you give me regarding the problem, and which ML algorithms are most recommended for this case?
Thank you very much and keep up the tutorials.

Reply
- Valdemar Sousa October 15, 2020 at 3:50 am #
  
  My goal is to take the various features and predict the next x alarms. We have the date and time of the alarm, its specific problem and other features, where the alarm occurs and the resources associated with the alarm.
  Any tips for this problem, and which algorithms and methods should I use?
  
  Reply
  - Jason Brownlee October 15, 2020 at 6:17 am #
    
    Sounds like time series classification / anomaly detection, perhaps this will help:
    https://machinelearningmastery.com/faq/single-faq/how-do-i-model-anomaly-detection
    
    Reply
- Jason Brownlee October 15, 2020 at 6:15 am #
  
  Thanks!
  
  Good question, I recommend this framework:
  https://machinelearningmastery.com/how-to-develop-a-skilful-time-series-forecasting-model/
  
  Reply
HSA November 3, 2020 at 9:31 pm #

I do not understand why predict_proba does not give me the probability of each class

for example in examples:
    outputs = model.predict_proba(example)

the result is:
outputs [[0.00224989]]
and I want both probabilities

Reply
- Jason Brownlee November 4, 2020 at 6:39 am #
  
  This is P(class=1), you can get P(class=0) via:
  
  P(class=0) = 1 - P(class=1)
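
If both probabilities are needed as columns, the complement can be computed for a whole batch at once with NumPy. A minimal sketch, assuming the model returns P(class=1) with shape [n_samples, 1]; the values other than the first are made up for illustration.

```python
import numpy as np

# P(class=1) for three examples, shaped [n_samples, 1] (illustrative values)
p1 = np.array([[0.00224989], [0.75], [0.5]])

# build [P(class=0), P(class=1)] for each row
both = np.hstack([1.0 - p1, p1])
print(both)
```

Each row of the result sums to 1, with column 0 holding P(class=0) and column 1 holding P(class=1).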
  
  Reply
HSA November 4, 2020 at 8:20 pm #

is there another way to do that? I mean I am sure there is a function retrieve both classes probability, I need them for interpretation and LIME requires both probabilities at once of multiple examples, I worked on the Deep model, so there must be a way to do that as LR does

Reply
- Jason Brownlee November 5, 2020 at 6:34 am #
  
  You can call predict_proba() to get probabilities and call argmax() on the probabilities to get the class values.
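
For a scikit-learn classifier, that combination might look like the sketch below, which uses a synthetic three-class dataset and LogisticRegression as stand-ins. The column order of predict_proba() follows model.classes_, so the argmax index is mapped through it.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# synthetic three-class data
X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=1)
model = LogisticRegression()
model.fit(X, y)

# one row of probabilities per sample, one column per class
proba = model.predict_proba(X[:5])

# the column index with the highest probability, mapped to the class label
idx = np.argmax(proba, axis=1)
labels = model.classes_[idx]
print(proba)
print(labels)
```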
  
  Reply
Israel February 1, 2021 at 1:07 am #

Hello, thanks a lot for this informative article. I have a question please. If you have datasets you have used to train your model and you want the model, when it receives a dataset that does not fall among the stored datasets, to say: “Target not recognized”. Please how do you write the algorithm? Thank you.

Reply
- Jason Brownlee February 1, 2021 at 6:27 am #
  
  Perhaps you can add a new class to the prediction problem which is “none of the other classes”, as long as you have examples of this “class”.
  
  Reply
  - Israel February 2, 2021 at 11:35 pm #
    
    Okay, thanks a lot for the suggestion. I will give it a try.
    
    Reply
    - Jason Brownlee February 3, 2021 at 6:19 am #
      
      You’re welcome.
      
      Reply
Paridh March 9, 2021 at 4:51 am #

This was extremely helpful for me as I’m completely new to coding but I had a question if you’d mind answering:
Q: Similar to ‘predict_proba’ is there any other way to give the quantity of all X variables as an output that satisfies the input conditions?

e.g.
Search criteria for predicting whether a kid will have a lollipop or not.
age=10
gender=F

IN: model.predict(X)
OUT: [1]

IN: model.predict_proba(X)
OUT: [0.09, 0.91]

I want something like
IN: model.predict_####(X)
OUT: [27,273]

is this possible??

Reply
- Jason Brownlee March 9, 2021 at 5:24 am #
  
  Sorry, I don’t understand what you’re asking.
  
  Perhaps you can rephrase your question or elaborate?
  
  Reply
Tony Escelante April 25, 2021 at 9:59 am #

Hello Jason, thank you for your post; it has been very helpful to me.
So, my question is about the reverse of the whole process.
Right now, I am working on a CSV file that has a prediction column that has 60000 rows. I need to find which algorithm used for this prediction column, and more importantly, which values are set for this algorithm. I am 90 % sure that the model is Linear Regression Model.

What have I tried so far are these:

Dropping categorical columns and implementing Multiple Linear Regression steps you used to the dataset, and then compared the prediction column & the outcome of this trial => Outcome is wildly different ( e.g. prediction column’s sum() value = -75, my trials sum value() = 10942

Encoding categorical columns with sklearn’s DictVectorizer and implementing Multiple Linear Regression steps you used to the dataset, and then compared the prediction column & the outcome of this trial => Outcome is wildly different ( e.g. prediction column’s sum() value = -75, my trials sum value() = 16942

P.S.:The dataset that I’ve worked on has also an actual values column.

How can I reverse the whole process and figure out which model is used & which parameters are set to create a prediction column?

Reply
- Jason Brownlee April 26, 2021 at 5:33 am #
  
  Good question, these tips may help:
  https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use
  
  Reply
Nova Silvia April 28, 2021 at 5:40 pm #

hello can i ask you
if you predict new data again, will the predictions be the same?
because I’ve tried the same thing but the prediction remains the same. as if the prediction were memory

Reply
- Jason Brownlee April 29, 2021 at 6:24 am #
  
  Good question, this will help:
  https://machinelearningmastery.com/faq/single-faq/why-do-i-get-different-results-each-time-i-run-the-code
  
  Reply
cc June 8, 2021 at 1:56 am #

Thank you Dr. Brownlee. You are a career saver.

Reply
- Jason Brownlee June 8, 2021 at 7:17 am #
  
  You’re welcome!
  
  Reply
İpek Sırma June 10, 2021 at 5:09 am #

Thank you! You saved me from a lot of searching. That’s all I need. Thanks again.

Reply
- Jason Brownlee June 10, 2021 at 5:26 am #
  
  You’re welcome!
  
  Reply
Lawrence Madziwa June 24, 2021 at 11:14 pm #

Dear Dr Jason
I am running this piece of code but I am struggling with sorting out the error message below. Please help.

pipe = Pipeline([("scaler", StandardScaler()), ("regressor", SVR(kernel='rbf', C=1e3, gamma=0.1))])
pipe.fit(X_train, y_train)

y_pred = pipeline.predict(X_test)
y_pred[-5:]

NotFittedError Traceback (most recent call last)
in
2 #y_pred = pipe.predict(X_test)
3 #from sklearn.metrics import predict
----> 4 y_pred = pipeline.predict(X_train)
5 #y_pred = clf.predict(X_test)
6 y_pred[-5:]

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\metaestimators.py in (*args, **kwargs)
116
117 # lambda, but not partial, allows help() to work with update_wrapper
--> 118 out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
119 # update the docstring of the returned function
120 update_wrapper(out, self.fn)

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\pipeline.py in predict(self, X, **predict_params)
329 for name, transform in self.steps[:-1]:
330 if transform is not None:
--> 331 Xt = transform.transform(Xt)
332 return self.steps[-1][-1].predict(Xt, **predict_params)
333

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\data.py in transform(self, X, y, copy)
745 DeprecationWarning)
746
--> 747 check_is_fitted(self, 'scale_')
748
749 copy = copy if copy is not None else self.copy

~\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_is_fitted(estimator, attributes, msg, all_or_any)
949
950 if not all_or_any([hasattr(estimator, attr) for attr in attributes]):
--> 951 raise NotFittedError(msg % {'name': type(estimator).__name__})
952
953

NotFittedError: This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this method.

Reply
- Jason Brownlee June 25, 2021 at 6:17 am #
  
  Sorry to hear that, this may help:
  https://machinelearningmastery.com/faq/single-faq/why-does-the-code-in-the-tutorial-not-work-for-me
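
Going only by the snippet in the comment, the fitted object is named pipe while predict() is called on a different name, pipeline, which may be a separate, unfitted object defined elsewhere. That is a guess from the posted code, not a confirmed diagnosis. A sketch of the same pipeline on synthetic data, fitting and predicting with one object:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# synthetic regression data split into train and test sets
X, y = make_regression(n_samples=100, n_features=3, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# fit the pipeline, then predict with the same fitted object
pipe = Pipeline([("scaler", StandardScaler()),
                 ("regressor", SVR(kernel="rbf", C=1e3, gamma=0.1))])
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(y_pred[-5:])
```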
  
  Reply
Fatima July 16, 2021 at 9:56 pm #

Hello Jason, I applied the Random Forest model in my data set for classification purposes, when I finished the model I generated the results which is the performance metrics for the model, then I saved the model through this code
///
import joblib
# save the model to disk
filename = 'RFten_model.sav'
joblib.dump(model, filename)
////////
,, Later I loaded the model because
I want to apply a prediction for one instance
I wrote this code
# define one new data instance
print("The new prediction is: ", Y_prediction.predict(np.array([1,0,0.08,0.46,0.15,0.08,0,0,0,0,0.08,0,0,0,0,0.15,0.84,0,0,0.15,0,0,1,0,0,0,0,1,0,0,0])))

I get this error. How can I solve it?
AttributeError: 'numpy.ndarray' object has no attribute 'predict'

Thanks!

Reply
- Jason Brownlee July 17, 2021 at 5:23 am #
  
  Perhaps this will help you save your model correctly:
  https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
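
The error message suggests that Y_prediction holds an array of earlier predictions rather than the model itself, so predict() has to be called on the loaded model object instead. A sketch under that assumption, with synthetic data, a placeholder file path and a made-up new instance:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# fit a model on synthetic data with 5 features
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
model = RandomForestClassifier(random_state=1)
model.fit(X, y)

# save the fitted model (placeholder path)
filename = os.path.join(tempfile.gettempdir(), "RFten_model.sav")
joblib.dump(model, filename)

# load it back; predict() is called on the loaded model, not on earlier predictions
loaded_model = joblib.load(filename)

# one new instance, shaped as a 2D array: [1 sample, 5 features]
row = np.array([0.1, -0.2, 0.3, 0.0, 1.5]).reshape(1, -1)
print("The new prediction is:", loaded_model.predict(row))
```

Note the reshape(1, -1): scikit-learn expects a 2D array even for a single instance.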
  
  Reply
Rayan July 22, 2021 at 4:50 am #

Hi Jason
Suppose we have a competition with 1000 tennis players. We have a dataset where each row includes player 1 and player 2 and a skill rating for each of them. The higher the skill, the higher the probability of winning.
each row: player i, player j, skills of player i, skills of player j, winner: player 2 (example).

How can I predict the winner based on the skills for each match (what model can help me with the predictions)?
How can I write a function?
Is there any example?

Reply
- Jason Brownlee July 22, 2021 at 5:38 am #
  
  This may help:
  https://machinelearningmastery.com/faq/single-faq/what-methods-can-be-used-for-modeling-sports
  
  Reply
Andrew January 5, 2022 at 4:17 pm #

Hi Jason, I have a music database in which I have nothing but a file and a genre attribute, and I need to sort it by genre. I also have a machine learning model trained on a database for my project. How can I turn the model’s predictions into a genre if it gives me only arrays?

Reply
- James Carmichael January 6, 2022 at 10:56 am #
  
  Hi Andrew…You may want to investigate the softmax function:
  
  https://machinelearningmastery.com/softmax-activation-function-with-python/
  
  Regards,
  
  Reply
Andrew January 6, 2022 at 1:16 am #

Hi Jason, I have a trained machine learning model (genre classifier), how can I bring machine predictions into a genre if it only gives me arrays?

Reply
Andrew January 7, 2022 at 8:39 pm #

Thank you, this really made the project easier for me, but another question, still, how can I turn arrays into genres?
[[0.0026979 0.00919354 0.00297058 0.00202884 0.00236537 0.00417573
0.00427148 0.02581489 0.02394866 0.00905851 0.01048179 0.00299542
0.00631784 0.01860214 0.00377617 0.00204791 0.00208164 0.00215192
0.00219006 0.00226448 0.00250438 0.00199827 0.00558735 0.00240505
0.00497674 0.00677883]
this is one of the arrays

Reply
- James Carmichael January 8, 2022 at 11:03 am #
  
  Hi Andrew…You may wish to consider multi-label classification.
  
  https://machinelearningmastery.com/multi-label-classification-with-deep-learning/
  
  Regards,
  
  Reply
Long September 26, 2022 at 3:13 pm #

Hello James, could you help me get this to somewhat work to what I am looking for?

So I think this is the one that fits my needs:

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# generate regression dataset
X, y = make_regression(n_samples=1000, n_features=86, noise=0.1)
# fit final model
model = LinearRegression()
model.fit(X, y)
# define one new data instance
Xnew = [[2, 1, 1, 1, 2, 3, 2, 2, 3, 3, 1, 3, 3, 1, 3, 3, 1, 2, 3, 3, 2, 3, 3, 3, 3, 1, 2, 3, 3, 1, 3, 3, 2, 2, 3, 2, 1, 2, 3, 1, 3, 3, 1, 2, 2, 2, 3, 1, 3, 2, 3, 1, 1, 2, 3, 3, 3, 3, 2, 3, 2, 3, 1, 1, 3, 2, 2, 3, 1, 3, 3, 3, 1, 1, 3, 3, 3, 1, 3, 3, 1, 3, 1, 3, 1, 1]]
# make a prediction
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))

But the prediction is wrong.

As you can see in the pattern, it “only” contains 1, 2 or 3. Each number represents the actual result for each “round”, if you can think of it that way.

What I am looking for in the prediction is for it to give me two possibilities of what the next value could be.

So the actual next value is “2”. Even if it guesses wrong, I will push the actual result to the data set so it will always have more data to look at to make the next guess.

Could you assist me with that?

Reply
- James Carmichael September 27, 2022 at 6:06 am #
  
  Hi Long,
  
  Thanks for asking.
  
  I’m eager to help, but I just don’t have the capacity to debug code for you.
  
  I am happy to make some suggestions:
  
  Consider aggressively cutting the code back to the minimum required. This will help you isolate the problem and focus on it.
  Consider cutting the problem back to just one or a few simple examples.
  Consider finding other similar code examples that do work and slowly modify them to meet your needs. This might expose your misstep.
  Consider posting your question and code to StackOverflow.
  
  Reply
Long September 27, 2022 at 1:41 am #

Hello Jason, how can I modify this example to only predict 2/3 numbers in the pattern?

(Note: Each number in the data is the actual result, so I would like it to predict the next round with two possible results)

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# generate regression dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# fit final model
model = LinearRegression()
model.fit(X, y)
# define one new data instance
Xnew = [[2, 1, 1, 1, 2, 3, 2, 2, 3, 3, 1, 3, 3, 1, 3, 3, 1, 2, 3, 3, 2, 3, 3, 3, 3, 1, 2, 3, 3, 1, 3, 3, 2, 2, 3, 2, 1, 2, 3, 1, 3, 3, 1, 2, 2, 2, 3, 1, 3, 2, 3, 1, 1, 2, 3, 3, 3, 3, 2, 3, 2, 3, 1, 1, 3, 2, 2, 3, 1, 3, 3, 3, 1, 1, 3, 3, 3, 1, 3, 3, 1, 3, 1, 3, 1, 1]]
# make a prediction
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))

Any help would be greatly appreciated!

Reply
- James Carmichael September 27, 2022 at 6:05 am #
  
  Hi Long,
  
  I’m eager to help, but I don’t have the capacity to customize the code for your specific needs.
  
  I get a lot of requests like this. I’m sure you can understand my rationale.
  
  I do have some ideas that might help:
  
  Perhaps I already have a tutorial with the change you’re asking for? Search the blog.
  Perhaps you can try to make the change yourself?
  Perhaps you can add a comment below the post with the change you need and I or another reader can make a suggestion?
  Perhaps you can hire a contractor or programmer to make the change?
  Perhaps you can post a description of the code needed on stackoverflow.com?
  
  Reply
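
For readers with a similar question, one possible way to frame it is sketched below. This is not the tutorial author's method, only a minimal illustration under stated assumptions: the results are treated as class labels (1, 2, or 3), the last few results are used as input features, and the two classes with the highest predicted probability are reported. The window size of 3, the choice of LogisticRegression, and the helper names make_windows and top_two are all hypothetical. The sequence is the first 30 values from the comment above, and the appended value 2 is only an example of a newly observed result. Feeding the raw numbers in as features treats them as ordered quantities, which may not suit every dataset, and nothing here implies that the sequence is actually predictable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# first 30 observed results from the comment above (each is 1, 2, or 3)
sequence = [2, 1, 1, 1, 2, 3, 2, 2, 3, 3, 1, 3, 3, 1, 3, 3, 1, 2, 3, 3,
            2, 3, 3, 3, 3, 1, 2, 3, 3, 1]

WINDOW = 3  # assumption: number of past results used as input features

def make_windows(seq, window):
    """Frame a sequence as supervised data: `window` past values -> next value."""
    X = [seq[i:i + window] for i in range(len(seq) - window)]
    y = [seq[i + window] for i in range(len(seq) - window)]
    return np.array(X), np.array(y)

def top_two(seq, window=WINDOW):
    """Fit on every window in `seq` and return the two most probable next values."""
    X, y = make_windows(seq, window)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    # the new instance is the most recent `window` results, as a 2D array
    Xnew = np.array([seq[-window:]])
    proba = model.predict_proba(Xnew)[0]
    # indices of the two largest probabilities, most likely first
    order = np.argsort(proba)[::-1][:2]
    return [int(model.classes_[i]) for i in order]

guess = top_two(sequence)
print("Two most likely next values: %s" % guess)

# once the actual result is known, add it to the data and refit for the next round
sequence.append(2)  # illustrative value only
guess = top_two(sequence)
print("Two most likely next values after update: %s" % guess)
```

Refitting from scratch each round keeps the sketch simple; with a long history, a train/test evaluation of the window size and model would be a sensible first step before trusting any of the guesses.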

