#### How to predict classification or regression outcomes with scikit-learn models in Python.

Once you choose and fit a final machine learning model in scikit-learn, you can use it to make predictions on new data instances.

There is some confusion amongst beginners about how exactly to do this. I often see questions such as:

How do I make predictions with my model in scikit-learn?

In this tutorial, you will discover exactly how you can make classification and regression predictions with a finalized machine learning model in the scikit-learn Python library.

After completing this tutorial, you will know:

- How to finalize a model in order to make it ready for making predictions.
- How to make class and probability predictions in scikit-learn.
- How to make regression predictions in scikit-learn.

Let’s get started.

## Tutorial Overview

This tutorial is divided into 3 parts; they are:

- First Finalize Your Model
- How to Predict With Classification Models
- How to Predict With Regression Models

## 1. First Finalize Your Model

Before you can make predictions, you must train a final model.

You may have trained models using k-fold cross validation or train/test splits of your data. This was done in order to give you an estimate of the skill of the model on out-of-sample data, e.g. new data.

These models have served their purpose and can now be discarded.

You now must train a final model on all of your available data.
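The two steps above can be sketched roughly as follows. This is a minimal sketch, assuming a synthetic dataset from *make_blobs()* stands in for your own data and that 5-fold cross-validation was the evaluation procedure; neither choice comes from a specific project.

```python
# sketch: estimate skill with cross-validation, then fit a final model on all data
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# synthetic data standing in for "all of your available data" (an assumption)
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)

# step 1: estimate out-of-sample skill; the models fit on each fold are discarded
scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print("Estimated accuracy: %.3f" % scores.mean())

# step 2: fit the final model on every available row
final_model = LogisticRegression()
final_model.fit(X, y)
```

The cross-validation scores describe how a model like this is expected to perform; the final model itself is never evaluated, because no data was held back from it.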

You can learn more about how to train a final model in the post *How to Train a Final Machine Learning Model*, listed under Further Reading.

## 2. How to Predict With Classification Models

Classification problems are those where the model learns a mapping between input features and an output feature that is a label, such as “*spam*” and “*not spam*.”

Below is sample code of a finalized LogisticRegression model for a simple binary classification problem.

Although we are using *LogisticRegression* in this tutorial, the same functions are available on practically all classification algorithms in scikit-learn.

```python
# example of training a final classification model
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)
```

After finalizing your model, you may want to save the model to file, e.g. via pickle. Once saved, you can load the model any time and use it to make predictions. For an example of this, see the post *Save and Load Machine Learning Models in Python with scikit-learn*, listed under Further Reading.

For simplicity, we will skip this step for the examples in this tutorial.
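For reference only, the save and load step might look like the sketch below. It assumes the standard library *pickle* module and a hypothetical file name, *final_model.pkl*, written to a temporary directory.

```python
# sketch: save a finalized model with pickle, then load it and predict
import os
import pickle
import tempfile
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# fit a final model on synthetic data
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
model = LogisticRegression()
model.fit(X, y)

with tempfile.TemporaryDirectory() as tmpdir:
    # hypothetical file name; use any path your application can read later
    path = os.path.join(tmpdir, "final_model.pkl")
    # save the fitted model to file
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # some time later: load the model back from file
    with open(path, "rb") as f:
        loaded_model = pickle.load(f)

# the loaded model makes the same predictions as the original
print(loaded_model.predict(X[:3]))
```

Only unpickle files you trust, and load them with the same scikit-learn version that saved them where possible.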

There are two types of classification predictions we may wish to make with our finalized model; they are class predictions and probability predictions.

### Class Predictions

A class prediction is: given the finalized model and one or more data instances, predict the class for the data instances.

We do not know the outcome classes for the new data. That is why we need the model in the first place.

We can predict the class for new data instances by calling the *predict()* function on our finalized scikit-learn classification model.

For example, we have one or more data instances in an array called *Xnew*. This can be passed to the *predict()* function on our model in order to predict the class values for each instance in the array.

```python
Xnew = [[...], [...]]
ynew = model.predict(Xnew)
```

### Multiple Class Predictions

Let’s make this concrete with an example of predicting multiple data instances at once.

```python
# example of making multiple class predictions
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)
# new instances where we do not know the answer
Xnew, _ = make_blobs(n_samples=3, centers=2, n_features=2, random_state=1)
# make a prediction
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
for i in range(len(Xnew)):
    print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))
```

Running the example predicts the class for the three new data instances, then prints the data and the predictions together.

```
X=[-0.79415228 2.10495117], Predicted=0
X=[-8.25290074 -4.71455545], Predicted=1
X=[-2.18773166 3.33352125], Predicted=0
```

### Single Class Prediction

If you have just one new data instance, you can provide it as a single instance wrapped in an array to the *predict()* function; for example:

```python
# example of making a single class prediction
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)
# define one new instance
Xnew = [[-0.79415228, 2.10495117]]
# make a prediction
ynew = model.predict(Xnew)
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))
```

Running the example prints the single instance and the predicted class.

```
X=[-0.79415228, 2.10495117], Predicted=0
```

### A Note on Class Labels

If the class values in your domain are strings, you will have mapped them to integer values when you prepared your data. You may have used a *LabelEncoder*.

This *LabelEncoder* can be used to convert the integers back into string values via the *inverse_transform()* function.

For this reason, you may want to save (pickle) the *LabelEncoder* used to encode your y values when fitting your final model.
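As an illustration, the round trip from string labels to integers and back might look like this. It is a minimal sketch: the string labels “spam” and “not spam” are attached to a synthetic dataset purely for the example.

```python
# sketch: encode string labels, predict integers, map back to strings
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# synthetic inputs with hypothetical string class labels
X, y_int = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
y_str = np.array(["not spam", "spam"])[y_int]

# map the string labels to integers before fitting
encoder = LabelEncoder()
y = encoder.fit_transform(y_str)

# fit final model on the integer-encoded labels
model = LogisticRegression()
model.fit(X, y)

# predict integer classes, then convert them back to the original strings
ynew = model.predict(X[:3])
labels = encoder.inverse_transform(ynew)
for i in range(len(ynew)):
    print("Predicted=%s (%s)" % (ynew[i], labels[i]))
```

The same fitted *encoder* object must be used at prediction time, which is why it is worth saving alongside the model.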

### Probability Predictions

Another type of prediction you may wish to make is the probability of the data instance belonging to each class.

This is called a probability prediction: given a new instance, the model returns the probability of each outcome class as a value between 0 and 1.

You can make these types of predictions in scikit-learn by calling the *predict_proba()* function, for example:

```python
Xnew = [[...], [...]]
ynew = model.predict_proba(Xnew)
```

This function is only available on those classification models capable of making a probability prediction, which is most, but not all, models.
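One way to check whether a given model supports probability predictions is to test for the function before calling it. The sketch below assumes *LinearSVC* as an example of a classifier without *predict_proba()*; it offers *decision_function()* scores instead, which are not probabilities.

```python
# sketch: check whether a classifier can predict probabilities
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)

results = {}
for model in (LogisticRegression(), LinearSVC()):
    model.fit(X, y)
    name = type(model).__name__
    if hasattr(model, "predict_proba"):
        # probabilities, one column per class
        output = model.predict_proba(X[:3])
        results[name] = "predict_proba"
    else:
        # signed distances to the decision boundary, not probabilities
        output = model.decision_function(X[:3])
        results[name] = "decision_function"
    print(name, results[name], output.shape)
```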

The example below makes a probability prediction for each example in the *Xnew* array of data instances.

```python
# example of making multiple probability predictions
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# fit final model
model = LogisticRegression()
model.fit(X, y)
# new instances where we do not know the answer
Xnew, _ = make_blobs(n_samples=3, centers=2, n_features=2, random_state=1)
# make a prediction
ynew = model.predict_proba(Xnew)
# show the inputs and predicted probabilities
for i in range(len(Xnew)):
    print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))
```

Running the example makes the probability predictions and then prints the input data instance and the probability of each instance belonging to the first and second classes (0 and 1).

```
X=[-0.79415228 2.10495117], Predicted=[0.94556472 0.05443528]
X=[-8.25290074 -4.71455545], Predicted=[3.60980873e-04 9.99639019e-01]
X=[-2.18773166 3.33352125], Predicted=[0.98437415 0.01562585]
```

This can be helpful in your application if you want to present the probabilities to the user for expert interpretation.
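It can also help to see how the probability columns line up with the class labels. In the sketch below, the columns returned by *predict_proba()* follow the order of the fitted model’s *classes_* attribute, so taking the most probable column recovers the class prediction. The variable names are illustrative.

```python
# sketch: relate predict_proba() columns to class predictions
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
model = LogisticRegression()
model.fit(X, y)

Xnew, _ = make_blobs(n_samples=3, centers=2, n_features=2, random_state=1)
proba = model.predict_proba(Xnew)

# column j of proba is the probability of class model.classes_[j]
print("Class order:", model.classes_)

# the most probable class for each row
from_proba = model.classes_[np.argmax(proba, axis=1)]
print("From probabilities:", from_proba)
print("From predict():    ", model.predict(Xnew))
```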

## 3. How to Predict With Regression Models

Regression is a supervised learning problem where, given input examples, the model learns a mapping to suitable output quantities, such as “0.1” and “0.2”, etc.

Below is an example of a finalized *LinearRegression* model. Again, the functions demonstrated for making regression predictions apply to all of the regression models available in scikit-learn.

```python
# example of training a final regression model
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# generate regression dataset
X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=1)
# fit final model
model = LinearRegression()
model.fit(X, y)
```

We can predict quantities with the finalized regression model by calling the *predict()* function on the finalized model.

As with classification, the *predict()* function takes a list or array of one or more data instances.

### Multiple Regression Predictions

The example below demonstrates how to make regression predictions on multiple data instances with an unknown expected outcome.

```python
# example of making multiple regression predictions
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# generate regression dataset (no random_state is set, so it differs on each run)
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# fit final model
model = LinearRegression()
model.fit(X, y)
# new instances where we do not know the answer
Xnew, _ = make_regression(n_samples=3, n_features=2, noise=0.1, random_state=1)
# make a prediction
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
for i in range(len(Xnew)):
    print("X=%s, Predicted=%s" % (Xnew[i], ynew[i]))
```

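Running the example makes multiple predictions, then prints the inputs and predictions side-by-side for review. Because the training data is generated without a fixed *random_state*, the predicted values will differ from run to run.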

```
X=[-1.07296862 -0.52817175], Predicted=-61.32459258381131
X=[-0.61175641 1.62434536], Predicted=-30.922508147981667
X=[-2.3015387 0.86540763], Predicted=-127.34448527071137
```

### Single Regression Prediction

The same function can be used to make a prediction for a single data instance as long as it is suitably wrapped in a surrounding list or array.

For example:

```python
# example of making a single regression prediction
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# generate regression dataset (no random_state is set, so it differs on each run)
X, y = make_regression(n_samples=100, n_features=2, noise=0.1)
# fit final model
model = LinearRegression()
model.fit(X, y)
# define one new data instance
Xnew = [[-1.07296862, -0.52817175]]
# make a prediction
ynew = model.predict(Xnew)
# show the inputs and predicted outputs
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))
```

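Running the example makes a single prediction and prints the data instance and prediction for review. As the training data is generated without a fixed *random_state*, the predicted value will vary between runs.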

```
X=[-1.07296862, -0.52817175], Predicted=-77.17947088762787
```

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

- How to Train a Final Machine Learning Model
- Save and Load Machine Learning Models in Python with scikit-learn
- scikit-learn API Reference

## Summary

In this tutorial, you discovered how you can make classification and regression predictions with a finalized machine learning model in the scikit-learn Python library.

Specifically, you learned:

- How to finalize a model in order to make it ready for making predictions.
- How to make class and probability predictions in scikit-learn.
- How to make regression predictions in scikit-learn.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Once again, Jason… you’re answering all the questions that need answering.

I was just working through yesterday how to actually use these highly developed models (which I’ve learned to do expediently from your book by the way) to predict my new input variables. And here in my inbox, you’ve delivered this great article on it!

Thank you for making us all better at Machine Learning. Your work here is stupendous and appreciated!

Thanks Mitch, I’m glad it helps!

Shoot/post me questions any time.


Can segmentation be performed using Python? Can machine learning be applied to segmentation?

Yes, I recommend looking into OpenCV.

After getting my trained model, I am not sure what exactly we are doing when making a prediction on the test data, like `y_pred_m4 = lr_4.predict(X_test_m4)`.

Yes, that is correct. Call the predict function to make a prediction.

How does one turn predictions into actions? Say I am predicting user fraud, how would you go about taking any given prediction point and determine the customer for that particular prediction.

Great question.

The use of the predictive model would be embedded within an application that is aware of the current customer for which a prediction is being made.

Great post. Love to see an example of the same in R.

Thanks for the suggestion.

I like your explanation but I am missing one thing.

How do you encode and scale features for Xnew so they match trained data?

You must scale new data using the procedure you used to scale training data.

This might mean keeping objects or coefficients used to prepare training data to then apply on new data in the future, such as min/max, a vocab, etc. depending on the problem type.
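One possible way to keep the preparation and the model together is a scikit-learn *Pipeline*, sketched below. This is an assumption about how you might organize it, not the only approach; *StandardScaler* stands in for whatever scaling you used.

```python
# sketch: bundle scaling and the model so new data is scaled the same way
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)

# the scaler learns its mean and standard deviation from the training data only
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipeline.fit(X, y)

# new, unscaled data: the pipeline applies the stored scaling before predicting
Xnew = [[-0.79415228, 2.10495117]]
ynew = pipeline.predict(Xnew)
print("X=%s, Predicted=%s" % (Xnew[0], ynew[0]))
```

Saving the fitted pipeline as a single object keeps the scaling coefficients and the model in sync.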

Thank you very much for the explanation

But my question is how to use the Source Code as an .exe Application to use it later without a script engine

You can use code in your application as you would any other software engineering project.

I’m sorry, I am not an expert at creating executable files on Windows. I have not used the platform in nearly 2 decades.

how to save a label encoder and reuse it again across different python files.

i have encoded my data in training phase and while i am trying to predict labels in testing phase i am not able to get same label encoders so i am getting wrong prediction

please help..

How to save (pickle) the LabelEncoder used to encode your y values when fitting your final model.

You can use the pickle dump/load functions directly.

Hi Jason, (relatively new to ML)

I have a data frame with,

1 ID column

6 feature columns

1 target column

when I train/test split the feature and target columns and do predictions etc, that is where I need to map back to the ID.

I want to be able to view something like this after my predictions:

A data frame with,

1 ID column

6 feature columns

1 target column

1 predicted column

Would you be able to help me with this? Really appreciate it,

kevin

The predictions are made in the order of the inputs.

You can take the array of predictions and align them with your inputs directly and start using them.

Does that help? If not, what is the specific problem you are having?

Thanks Jason, I suppose I have reached a point where I can get my final model and I cannot seem to find any information as to what next, i.e. real specifics regarding making predictions with new datasets.

There are a trillion examples of how to work with train/test split and refining models, but my end goal is taking a ‘complete’ dataset, plugging it into my model prediction and producing back my initial ‘complete’ dataset PLUS my predicted column(s).

I work for a credit union in DC and I have a list of member data, such things like member number, name, phone number, address, account balances and various other features I would use for prediction. I would like to feed this ‘complete’ dataset into my prediction model and have it spit out my initial ‘complete’ dataset PLUS my predicted column(s) that someone could then use to reach out marketing related messages to the member, depending on the prediction of course.

Hope that makes sense..

thanks again Jason, appreciate your time (how do you find the time?!!)

Yes, that makes sense.

You will need to write code to take the input to the model (X) and pass it to the model to make predictions model.predict(X) to get the prediction column (yhat).

You then have the dataset X and the predictions yhat and the rows in one correspond to rows in the other. You can hstack() the arrays to have one large matrix of inputs and predictions.

What problem specifically are you having in achieving this?
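A rough sketch of that workflow is below. It uses a pandas DataFrame rather than *hstack()* so that the ID column is carried along; the column names (*member_id*, *f1*, *f2*, *target*) and the data are hypothetical.

```python
# sketch: keep an ID column out of the model, then attach predictions to the full dataset
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical 'complete' dataset: an ID, two features and a target
rng = np.random.RandomState(1)
df = pd.DataFrame({
    "member_id": np.arange(1000, 1010),
    "f1": rng.rand(10),
    "f2": rng.rand(10),
    "target": [0, 1] * 5,
})

# the model only ever sees the feature columns
feature_cols = ["f1", "f2"]
model = LogisticRegression()
model.fit(df[feature_cols], df["target"])

# predictions come back in the same row order as the inputs,
# so they can be assigned straight onto a copy of the original frame
result = df.copy()
result["predicted"] = model.predict(df[feature_cols])
print(result)
```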

Thank you Jason, not so much a problem as a lack of experience trying to tie it all together, but with your help we’ll get there!!

thanks again.

Hang in there Kevin!

Exactly stuck with the same issue.

If I dont supply ID column initially, since i have to make the machine learn ONLY the “Features”, there is no way i’m able to map the IDs back once i get the predicted result set, say y_pred from the test dataset.

The input rows will be ordered the same as the output predictions. The mapping can be based on numpy array index.

Thank you Jason. Will try.

Need to mention, have learnt a lot from your articles. Thank you so much !

Thanks.

Wanted to update that i was able to crack this at last. During the test / train split, on the left hand side, i had to include x_index, y_index and inside the train_test_split, i had to add dataset.index. Thats it. It works
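The approach described in this comment might look something like the sketch below. The dataset and the names *idx_train* and *idx_test* are hypothetical, and the index is passed as a NumPy array via *dataset.index.values*.

```python
# sketch: carry row IDs through train_test_split alongside X and y
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# hypothetical dataset whose index holds the row IDs
dataset = pd.DataFrame(
    {"f1": np.arange(10.0), "f2": np.arange(10.0)[::-1], "target": [0, 1] * 5},
    index=np.arange(101, 111),
)
X = dataset[["f1", "f2"]].values
y = dataset["target"].values

# any extra array passed in is split with the same shuffling as X and y
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, dataset.index.values, test_size=0.3, random_state=1
)

# idx_test[i] is the ID of the row stored in X_test[i]
print(idx_test)
```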

Glad to hear it.

Thanks Laks. I didn’t know that this would work.

Hello Jason, I’ve got started working with scikit-learn models to predict further values but there is something I don’t clearly understand: Let’s suppose I do have a Stock Exchange price datasets with Date, Open Price, Close Price, and the variation rate from the previous date, for a single asset or position. In order to ‘fit’ a good prediction, I decided to use a Multiple Linear Regression and a Polynomial Feature also: I can obtain a formula even used a support vector machine (SVR) but I don’t know how to predict a NEW dataset, since the previous one has more than one variable (Open Price, Variation Rate, Date). How can I simulate further values?

Thanks for your response.

The tutorial above shows how to make a prediction with new data.

What problem are you having exactly?

Thank you so much, Jason, for this Great post! can you please tell me, I have used LabelEncoder for three Cities. So now I have to take input from a user as a string and convert them into int using LabelEncoder and provide it to trained model for prediction. Is it correct?

I believe so.

Hi Jason, always a pleasure seeing your blogs.

I’m thinking of a few things in regard to measuring the “accuracy” of a regression model and making use of such a model, would love to hear your thoughts.

I have problem that can either be framed as a classification problem (discrete labels) but also as a regression problem (a similar example could be price range or exact price ). After trying out a few models, I liked the use of a (random forest) regression model.

Besides evaluating the model on things like R^2 and RMSE I’m doing a sort of pseudo accuracy evaluation.

Say I have a prediction and a true value of

[19.8, 20]

So by true accuracy as in a classification problem the above is wrong, but if I define a new measure that tolerates answers within something fitting to the problem like +/- 2 or something like +/- 10% of the predicted value then the prediction is correct and the model will have greater accuracy. And then the prediction of a given sample would read something like x +/- y .

Or how would you display/interpret the predictions made by a regression model? Is it “correct” to measure the success as a pseudo accuracy as above? Or is it more correct and robust to express a prediction using e.g. RMSE as pred = x +/- RMSE ? Should I avoid this line of thinking when it comes to regression problems completely? And if such, how would I display my prediction of a given sample with a fitting confidence since the regression model typically is close but not always spot on the true value?

It sounds like evaluating the model as a regression model would be better.

You would use MAE or RMSE to describe the expected error on average in the same units as the output variable.

E.g. The RMSE is xx.x on average +/- x.
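As a minimal sketch, both error scores can be calculated on a held-out test set as follows. A synthetic dataset and a *LinearRegression* model are assumed here in place of the random forest discussed above.

```python
# sketch: report MAE and RMSE in the units of the output variable
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# fit on the training portion and predict the held-out portion
model = LinearRegression()
model.fit(X_train, y_train)
yhat = model.predict(X_test)

# mean absolute error and root mean squared error
mae = mean_absolute_error(y_test, yhat)
rmse = np.sqrt(mean_squared_error(y_test, yhat))
print("MAE: %.3f, RMSE: %.3f" % (mae, rmse))
```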

You can use a confidence interval, I explain more here:

https://machinelearningmastery.com/confidence-intervals-for-machine-learning/

Hi Jason,

when I am assigning the X_Test to y_pred, it is returning the below shown error, can you please explain why?

`y_pred = classifier.predict(X_Test)`

Error:

```
NotFittedError Traceback (most recent call last)
in ()
----> 1 y_pred = classifier.predict(X_Test)

C:\Anaconda\lib\site-packages\sklearn\neighbors\classification.py in predict(self, X)
    143 X = check_array(X, accept_sparse='csr')
    144
--> 145 neigh_dist, neigh_ind = self.kneighbors(X)
    146
    147 classes_ = self.classes_

C:\Anaconda\lib\site-packages\sklearn\neighbors\base.py in kneighbors(self, X, n_neighbors, return_distance)
    325 """
    326 if self._fit_method is None:
--> 327 raise NotFittedError("Must fit neighbors before querying.")
    328
    329 if n_neighbors is None:

NotFittedError: Must fit neighbors before querying.
```

It suggests that perhaps your model has not been fit on the data.
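The error can be reproduced, and resolved, with a small sketch like the one below. It assumes a *KNeighborsClassifier*, which matches the module names in the traceback, and a synthetic dataset.

```python
# sketch: predict() before fit() raises NotFittedError; fitting first resolves it
from sklearn.datasets import make_blobs
from sklearn.exceptions import NotFittedError
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
classifier = KNeighborsClassifier()

# calling predict() on a model that has not been fit
try:
    classifier.predict(X[:1])
    raised = False
except NotFittedError as e:
    raised = True
    print("Error:", e)

# fit the model first, then predict
classifier.fit(X, y)
y_pred = classifier.predict(X[:1])
print("Predicted=%s" % y_pred[0])
```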

Any suggestions on how to reject a prediction when the input is far from the data the model was trained on? For example, I am using an SVM with label 1: 4,4,3,4,4,3 and label 2: 5,6,7,5,6,5. If I predict for the value 20, I want the result to be “not valid” rather than label 1 or 2.

Sorry, I don’t follow. Are you able to give more context?

Sorry if it wasn’t clear. Let’s say I want to predict whether something is an apple or an orange. Suddenly I pass grape data to the model I created (apple or orange). In my case the predicted result will still be apple or orange (I’m using an SVM). So how can I tell that my input data (grape data) is very different from the training data (apple and orange data)? Thanks

You might want to predict probabilities instead of classes and choose to not use a prediction if the predicted probabilities are below a threshold.
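A minimal sketch of this idea is below. The helper name *predict_or_reject()* and the threshold of 0.9 are hypothetical, and *LogisticRegression* is used because it predicts probabilities natively. Note that a threshold mainly catches ambiguous inputs near the decision boundary; an input far from all of the training data can still receive a confident probability.

```python
# sketch: return None instead of a class when the model is not confident enough
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
model = LogisticRegression()
model.fit(X, y)

def predict_or_reject(model, Xnew, threshold):
    # probability of each class for each row
    proba = model.predict_proba(Xnew)
    # index and probability of the most likely class per row
    best = np.argmax(proba, axis=1)
    best_proba = proba[np.arange(len(proba)), best]
    # keep the class only when its probability meets the threshold
    return [model.classes_[b] if p >= threshold else None
            for b, p in zip(best, best_proba)]

# hypothetical threshold; choose one that suits your application
print(predict_or_reject(model, X[:5], threshold=0.9))
```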

i will try, thank you very much

Hello, I used scikit learn to predict google stock prices with MLPRegressor. How can I predict new values beyond dataset specially test data?

The above post will help. What problem are you having exactly?

Hi Jason,

Can I fit a model by multiple K-Fold iteration for very unbalance class as shown below??

Could You kindly help on this!

```python
for val in range(0, 1000):  # total sample is 20k majority class and 20 minority class
    balanced_copy_idx = balanced_subsample(labels, 40)  # creating each time randomly 20 minority class and 20 majority class
    X1 = X[balanced_copy_idx]
    y1 = y[balanced_copy_idx]
    kf = KFold(y1.shape[0], n_folds=10, shuffle=True, random_state=3)
    for train_index, test_index in kf:
        X_train, y_train = X1[train_index], y1[train_index]
        X_test, y_test = X1[test_index], y1[test_index]
        vectorizer = TfidfVectorizer(max_features=15000, lowercase=True, min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True, stop_words='english')
        train_corpus_tf_idf = vectorizer.fit_transform(X_train)
        test_corpus_tf_idf = vectorizer.transform(X_test)
        model1 = LogisticRegression()
        model1.fit(train_corpus_tf_idf, y_train)
```

Yes you can.

Sorry, I cannot review and debug your code, perhaps post on stackoverflow?

What is the purpose of random state? When I run my prediction, the accuracy is not stable. When I set random_state=0 it gives a stable prediction but low accuracy; when I change the random state to 100 it gives me higher accuracy.

To fix the pseudo random number generator.
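A small sketch of the effect, using *make_blobs()* as an example of a function that takes a *random_state* argument:

```python
# sketch: the same random_state reproduces the same pseudo-random data
import numpy as np
from sklearn.datasets import make_blobs

# two calls with the same seed
Xa, ya = make_blobs(n_samples=5, centers=2, n_features=2, random_state=1)
Xb, yb = make_blobs(n_samples=5, centers=2, n_features=2, random_state=1)
# one call with a different seed
Xc, yc = make_blobs(n_samples=5, centers=2, n_features=2, random_state=2)

same = np.array_equal(Xa, Xb)
different = not np.array_equal(Xa, Xc)
print("Same seed gives same data: %s" % same)
print("Different seed gives different data: %s" % different)
```

A different seed gives a different sample of data or a different split, so the score changes; one seed producing a higher score than another does not make it a better model.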

You can learn more about randomness in machine learning here:

https://machinelearningmastery.com/faq/single-faq/what-value-should-i-set-for-the-random-number-seed

Thanks for fast reply more power god bless 🙂

Hi Jason, thank you for your always useful and insightful posts!

One question that seems to be a recurrent issue regarding predict_proba(). For which sklearn models can it be used? Eg. can it be used for logistic regression, SVM, naive Bayes and random forest? I was playing with it recently for both binary and multiclass classification and it seemed to be producing the following paradox: probability vectors for each sample, in which the smallest probability was assigned to the class that was actually being predicted. Is it possible that predict_proba() generates (1-prob) results?

Not all. Some models don’t support predicting probabilities natively, and some of those provide a decision_function() instead.

How can I make the prediction more detailed? Like say I input a hibiscus flower into this model, instead of probabilities, I want to get something like “input not a Iris, it was off by blahblahblah”, and probably take a decision. I think that’s what Black Manga meant in his comments above

You would have to write this “interpretation” yourself.

Hi, Jason… It is a great article indeed.

I have a question,

I have trained my models and have saved the model.

The next part that I would like to put it into production to build a .py file which function is only to predict the given sets of parameters.

How should I code my predict.py file so using command line, I can just input some variables and get the output.

Should I also add the import functions inside my predict.py?

Thanks in advance!

This sounds like a software engineering question, rather than an ML question.

You can use Python functions to read input and write output from your script. Perhaps check the Python standard API or a good reference text?

Thanks for the attempt, but unfortunately, I did not find this post very helpful because it failed to present *simple* examples (IMO). E.g. if the distinction between Classification Models and Regression Models is paramount, please include a link that sheds light on it. Missing here are useful demonstrations on how to perform the *simplest* possible prediction. E.g. a data set as straightforward as world population/year or home price/bathrooms: show how to load the data, then “ask” the algorithm for a prediction for a specific value, e.g. what will the world population be in 2020? What is the predicted home value of a house with 3 bathrooms? Something simpler than unexplained complex variable of

`Xnew = [[-1.07296862, -0.52817175]]`

Sorry, I don’t have any idea what that is. I know I’m new to ML, but I feel this post could be far more useful if it tackled the simplest possible example and then transitioned up to what is here. Examples examples examples: those are the only things that really matter.

Great feedback, thanks.

I guess this post is not for absolute beginners, but rather those that are using the sklearn API and looking to better understand the predict() functions.

This post explains the difference between classification and regression clearly:

https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/

This post provides a gentle introduction to working through a project end to end:

https://machinelearningmastery.com/machine-learning-in-python-step-by-step/

Does that help?

Dear Jason, Thank you very much for the great posts. I finalize my model and now I want to train the model with the X_validation data. I want to export my results in a csv file. How can I change my code to have a csv file with 3 columns. The ID of the value(1st column), the real label of the value (2nd column), and the prediction in the last column?

```python
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
prediction_df = pd.DataFrame(predictions)
prediction_df.to_csv('result.csv')
```

You can create the data structure you require in memory as numpy arrays or dataframes, then save it to CSV.
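For example, the three columns could be assembled along these lines. This is a sketch under assumptions: the IDs are made up with *np.arange()*, the column names *id*, *actual* and *predicted* are hypothetical, and the CSV is produced as text rather than written to disk.

```python
# sketch: build an (id, actual, predicted) table and render it as CSV
import numpy as np
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
ids = np.arange(len(X))  # hypothetical row IDs

# split the IDs together with the data so they stay aligned
X_train, X_validation, Y_train, Y_validation, id_train, id_validation = train_test_split(
    X, y, ids, test_size=0.2, random_state=1
)

model = LogisticRegression()
model.fit(X_train, Y_train)
predictions = model.predict(X_validation)

# one row per validation instance
prediction_df = pd.DataFrame({
    "id": id_validation,
    "actual": Y_validation,
    "predicted": predictions,
})

# to_csv() returns the CSV text when no path is given;
# pass a path such as 'result.csv' to write a file instead
csv_text = prediction_df.to_csv(index=False)
print(csv_text)
```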

Hello Jason,

Can we predict answers ( not classes), i see using scikit-learn we are classifying the data into new classes and we are predicting the class like SPAM, HAM etc ..

How do we approach this when we have to predict or fetch an answer (from an existing repository)? Do I need to train the model with answers (long answers, not just class labels)?

Thanks

Sujan

If you need to output text, you may need a text generation algorithm. You can get started with this here:

https://machinelearningmastery.com/start-here/#nlp

Sir,

In my dataset, i have 25000 records which contains one input value and three output values.

How can I make prediction of 3 output variables using this only one input variable?

Yes, perhaps try a neural network given that a vector prediction is required.

Sir,

Is there any sample code or web links for that concept?

Yes, I’m sure I have a few examples on the blog.

You can achieve the result by setting the number of nodes in the output layer to the size of the vector required.
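As a rough illustration within scikit-learn, *MLPRegressor* sizes its output layer from the shape of the target array, so passing a target with three columns yields three outputs per prediction. This is a stand-in for whichever neural network library you use, and the data below is synthetic.

```python
# sketch: one input variable, three output variables, using a small neural network
import numpy as np
from sklearn.neural_network import MLPRegressor

# synthetic data: 200 rows, 1 input column, 3 output columns
rng = np.random.RandomState(1)
X = rng.rand(200, 1)
y = np.hstack([2 * X, X ** 2, np.sin(X)])

# the output layer gets one node per column of y
model = MLPRegressor(hidden_layer_sizes=(20,), max_iter=500, random_state=1)
model.fit(X, y)

# one new instance gives a vector of three predicted values
ynew = model.predict([[0.5]])
print("Outputs per prediction: %d" % model.n_outputs_)
print(ynew)
```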

Sir,

I have searched in your blog. But I couldnt found the examples. Sir, can you please send me the link?

Sure, see the “vector output” example in this post:

https://machinelearningmastery.com/how-to-develop-multilayer-perceptron-models-for-time-series-forecasting/

Sir,

Thank you for your reply.

Sir, my data is not a time series data.

The data set contains three input variables and a corresponding output variable(output value is determined by a software by changing the input variables).

So, my problem is: if a user gives a target output value, the model should predict the corresponding input variables. It is not a time series data,Sir,

Understood, but you can use the vector output example to get started.

Dataset looks like this Sir,

| input 1 | input 2 | input 3 | output |
|---------|---------|---------|--------|
| 150 | 2.356 | 10000 | 4.56 |

(output is predicted using a software which takes input 1, input 2 and input 3 as inputs)

The problem is: If a user gives a target output, my model should predict the corresponding input1, input2 and input3 values.

Hi Jason, Do different algorithms take different time while doing batch prediction (let’s assume for 1000 rows)? If yes, I would like to know your view on why do they take different time. Could you please help me to clear this doubt? Thanks in advance.

Yes, different algorithms may have different computational complexity when making a prediction.

Hi Jason

I have scaled X_Train and fit model with X_Train, Y_train. When i predict a single value then Output (Predicted value ) is according to the Scaled X_Train Input. I am not able to map the scaled Input and Original Output and not able to map the predicted output to which original(before scale) it belongs to.

How can i refer back to the Original value and its target value and confirm they in sync with Scaled Input and its output ?

You can save the instance of the object used to perform the transform, then call the inverse_transform function.
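A minimal sketch of that round trip, assuming a *StandardScaler* and a few made-up rows of training data:

```python
# sketch: recover original values from scaled values with inverse_transform()
import numpy as np
from sklearn.preprocessing import StandardScaler

# hypothetical unscaled training inputs
X_train = np.array([[150.0, 2.356], [200.0, 3.1], [120.0, 1.9]])

# keep the fitted scaler object; it stores the mean and scale per column
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# map the scaled rows back to the original units
X_restored = scaler.inverse_transform(X_scaled)
print(X_restored)
```

The rows keep their order through scaling, so row *i* of the scaled array always corresponds to row *i* of the original data and to target value *i*.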

Hi all, I have a survey data and using it for the purpose of historic data , I want to implement ML for predicting the answer to next question by the respondent.

what approach should i Follow?

Good question, perhaps this will help:

https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use

Thank you so much.This post was of great help!

I’m happy it helped.

thank you, how I can count accuracy for prediction

You don’t, instead you calculate error.

See this post:

https://machinelearningmastery.com/classification-versus-regression-in-machine-learning/

Hi Jason,

In Random Forest Regression model, is it possible to give specific numeric range for each sample and we need output within that range only. Range can be different for each inputs.

Thanks.

Perhaps you could achieve that with post-processing of the prediction from the model? Like re-scaling?
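One simple form of post-processing is to clip each prediction to its allowed range, sketched below. Clipping is only one option and is an assumption here; the per-sample *lower* and *upper* bounds are made up for the example.

```python
# sketch: clip each prediction into its own allowed range
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=100, n_features=2, noise=0.1, random_state=1)
model = RandomForestRegressor(n_estimators=10, random_state=1)
model.fit(X, y)

# three new instances, each with a hypothetical allowed output range
Xnew = X[:3]
lower = np.array([-10.0, 0.0, -50.0])
upper = np.array([10.0, 5.0, 50.0])

# predict as normal, then force each value into its range
yhat = model.predict(Xnew)
yhat_clipped = np.clip(yhat, lower, upper)
for i in range(len(Xnew)):
    print("Raw=%.3f, Clipped=%.3f" % (yhat[i], yhat_clipped[i]))
```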

Thanks Jason for your reply and you are doing great job.

Post processing or re-scaling would be the proper solution?

I have multiple inputs & multiple outputs. Can I tell model that these particular inputs will have more impact on these particular output ?

You could use feature selection and/or feature importance methods to give a rough idea of relative feature impact on model skill.

Thanks Jason.

Hi Jason, I hope you would see this quickly since it’s urgent.

After training a logistic regression model from sklearn on some training data using train_test split,fitting the model by model.fit(), I can get the logistic regression coefficients by the attribute model.coef_ , right? Now, my question is, how can I use this coefficients to predict a separate, single test data?

You can use the model directly via model.predict()

There are more examples here:

https://machinelearningmastery.com/make-predictions-scikit-learn/
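A short sketch of predicting a single row, assuming a binary LogisticRegression model. The data is synthetic and the new row is a made-up example; the manual calculation is included only to show how coef_ and intercept_ relate to predict():

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# synthetic binary classification problem
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)

# a single new row (hypothetical values); predict() expects a 2D array
row = np.array([-0.79, 10.21])
yhat = model.predict(row.reshape(1, -1))

# the same decision computed by hand from the learned coefficients
score = row @ model.coef_[0] + model.intercept_[0]
manual = model.classes_[int(score > 0)]
print(yhat[0], manual)
```

In practice calling model.predict() is all you need; the coefficients do not have to be applied by hand.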

    """
    Random Forest implementation with CART decision trees
    This version is for continuous dataset (feature values)
    Author: Jamie Deng
    Date: 09/10/2018
    """
    import numpy as np
    import pandas as pd
    from collections import Counter
    from sklearn.utils import resample
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder
    from multiprocessing import Pool
    from random import sample
    import time
    from sklearn.ensemble import RandomForestRegressor

    np.seterr(divide='ignore', invalid='ignore')  # ignore Runtime Warning about divide


    class TreeNode:
        def __init__(self, n_features):
            self.n_features = n_features
            self.left_child = None
            self.right_child = None
            self.split_feature = None
            self.split_value = None
            self.split_gini = 1
            self.label = None

        def is_leaf(self):
            return self.label is not None

        """ use 2d array (matrix) to compute gini index. Numerical feature values only """
        def gini(self, f, y, target):
            trans = f.reshape(len(f), -1)  # transpose 1d np array
            a = np.concatenate((trans, target), axis=1)  # vertical concatenation
            a = a[a[:, 0].argsort()]  # sort by column 0, feature values
            sort = a[:, 0]
            split = (sort[0:-1] + sort[1:]) / 2  # compute possible split values
            left, right = np.array([split]), np.array([split])
            classes, counts = np.unique(y, return_counts=True)
            n_classes = len(classes)
            # count occurrence of labels for each possible split value
            for i in range(n_classes):
                temp = a[:, -n_classes + i].cumsum()[:-1]
                left = np.vstack((left, temp))  # horizontal concatenation
                right = np.vstack((right, counts[i] - temp))
            sum_1 = left[1:, :].sum(axis=0)  # sum occurrence of labels
            sum_2 = right[1:, :].sum(axis=0)
            n = len(split)
            gini_t1, gini_t2 = [1] * n, [1] * n
            # calculate left and right gini
            for i in range(n_classes):
                gini_t1 -= (left[i + 1, :] / sum_1) ** 2
                gini_t2 -= (right[i + 1, :] / sum_2) ** 2
            s = sum(counts)
            g = gini_t1 * sum_1 / s + gini_t2 * sum_2 / s
            g = list(g)
            min_g = min(g)
            split_value = split[g.index(min_g)]
            print("hello0")
            return split_value, min_g

        def split_feature_value(self, x, y, target):
            # compute gini index of every column
            n = x.shape[1]  # number of x columns
            sub_features = sample(range(n), self.n_features)  # feature sub-space
            # list of (split_value, split_gini) tuples
            value_g = [self.gini(x[:, i], y, target) for i in sub_features]
            result = min(value_g, key=lambda t: t[1])  # (value, gini) tuple with min gini
            feature = sub_features[value_g.index(result)]  # feature with min gini
            return feature, result[0], result[1]  # split feature, value, gini

        # recursively grow the tree
        def attempt_split(self, x, y, target):
            c = Counter(y)
            majority = c.most_common()[0]  # majority class and count
            label, count = majority[0], majority[1]
            if len(y) 0.9:  # stop criterion
                self.label = label  # set leaf
                return
            # split feature, value, gini
            feature, value, split_gini = self.split_feature_value(x, y, target)
            # stop split when gini decrease smaller than some threshold
            if self.split_gini - split_gini < 0.01:  # stop criterion
                self.label = label  # set leaf
                return
            index1 = x[:, feature] <= value
            index2 = x[:, feature] > value
            x1, y1, x2, y2 = x[index1], y[index1], x[index2], y[index2]
            target1, target2 = target[index1], target[index2]
            if len(y2) == 0 or len(y1) == 0:  # stop split
                self.label = label  # set leaf
                return
            # splitting procedure
            self.split_feature = feature
            self.split_value = value
            self.split_gini = split_gini
            self.left_child, self.right_child = TreeNode(self.n_features), TreeNode(self.n_features)
            self.left_child.split_gini, self.right_child.split_gini = split_gini, split_gini
            self.left_child.attempt_split(x1, y1, target1)
            self.right_child.attempt_split(x2, y2, target2)

        # trace down the tree for each data instance, for prediction
        def sort(self, x):  # x is 1d array
            if self.label is not None:
                return self.label
            if x[self.split_feature] <= self.split_value:
                return self.left_child.sort(x)
            else:
                return self.right_child.sort(x)


    class ClassifierTree:
        def __init__(self, n_features):
            self.root = TreeNode(n_features)

        def train(self, x, y):
            # one hot encoded target is for gini index calculation
            # categories='auto' silence future warning
            encoder = OneHotEncoder(categories='auto')
            labels = y.reshape(len(y), -1)  # transpose 1d np array
            target = encoder.fit_transform(labels).toarray()
            self.root.attempt_split(x, y, target)

        def classify(self, x):  # x is 2d array
            return [self.root.sort(x[i]) for i in range(x.shape[0])]


    class RandomForest:
        def __init__(self, n_classifiers=30):
            self.n_classifiers = n_classifiers
            self.classifiers = []
            self.x = None
            self.y = None

        def build_tree(self, tree):
            n = len(self.y)  # n for bootstrap sampling size
            # n = int(n * 0.5)
            x, y = resample(self.x, self.y, n_samples=n)  # bootstrap sampling
            tree.train(x, y)
            return tree  # return tree for multiprocessing pool

        def fit(self, x, y):
            self.x, self.y = x, y
            n_select_features = int(np.sqrt(x.shape[1]))  # number of features
            for i in range(self.n_classifiers):
                tree = ClassifierTree(n_select_features)
                self.classifiers.append(tree)
            # multiprocessing pool
            pool = Pool()
            self.classifiers = pool.map(self.build_tree, self.classifiers)
            pool.close()
            pool.join()

        def predict(self, x_test):  # ensemble
            pred = []
            for tree in self.classifiers:
                print(self.classifiers)
                y_pred = tree.classify(x_test)
                pred.append(y_pred)
            pred = np.array(pred)
            result = [Counter(pred[:, i]).most_common()[0][0] for i in range(pred.shape[1])]
            return result


    def test():
        start_time = time.time()
        # It's a continuous dataset, only numerical feature values
        df = pd.read_csv('waveform.data', header=None, sep=',')
        print("hello0")
        data = df.values
        x = data[:, :-1]
        y = data[:, -1]
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
        rf = RandomForest(n_classifiers=30)
        print("hello1")
        rf.fit(x_train, y_train)
        print("hello2")
        y_pred = rf.predict(x_test)
        print("hello")
        acc = accuracy_score(y_test, y_pred)
        print('RF:', acc)
        print("--- Running time: %.6f seconds ---" % (time.time() - start_time))


    if __name__ == "__main__":
        test()

This gives the following error. Thanks in advance.

File "C:/Users/Rama/rf.py", line 168, in predict

y_pred = tree.classify(x_test)

AttributeError: 'str' object has no attribute 'classify'

Sorry, I don’t have the capacity to debug your code. I explain more here:

https://machinelearningmastery.com/faq/single-faq/can-you-read-review-or-debug-my-code

Hi Jason,

I need to predict the order cancellation probability based on 6 independent variables. Through the .coef_ attribute I can find the coefficient values for each independent variable, but those are for the whole dataset. If my model predicts that an order will be cancelled, I want to find out the reason behind it, that is, which parameter is the most prominent reason for the cancellation of that particular order.

Perhaps use a decision tree model that can explain the predictions it makes?
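A minimal sketch of what that could look like, assuming a shallow DecisionTreeClassifier. The synthetic data and the feature names (var0 to var5) are placeholders, not the variables from the question:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# synthetic stand-in for a dataset with 6 input variables
X, y = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=1)
names = ['var%d' % i for i in range(X.shape[1])]  # hypothetical feature names

model = DecisionTreeClassifier(max_depth=3, random_state=1)
model.fit(X, y)

# print the learned rules for the whole tree
rules = export_text(model, feature_names=names)
print(rules)

# list the tests that one specific row passed through on its way to a leaf
row = X[0]
node_ids = model.decision_path(row.reshape(1, -1)).indices
for node in node_ids:
    feature = model.tree_.feature[node]
    if feature < 0:
        continue  # leaf nodes have no split feature
    threshold = model.tree_.threshold[node]
    op = '<=' if row[feature] <= threshold else '>'
    print('%s = %.3f %s %.3f' % (names[feature], row[feature], op, threshold))
print('predicted class:', model.predict(row.reshape(1, -1))[0])
```

The printed tests are the conditions that led to the prediction for that one row, which is a per-prediction explanation rather than a global one.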

Hi Jason,

I have to print the top three probabilities, map them to class labels and compare them with the actual label for multi-class classification. Can you please provide the code?

Thanks in advance.

Regards,

NaVeen

What problem are you having exactly?
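For reference, one way to get the top three classes by probability is sketched below. It assumes a classifier that supports predict_proba(); the synthetic five-class dataset is a placeholder for your own data:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# synthetic multi-class problem with 5 classes
X, y = make_blobs(n_samples=300, centers=5, n_features=4, random_state=1)
model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X, y)

# probabilities for a few rows; columns follow the order of model.classes_
probs = model.predict_proba(X[:3])

# column indexes of the 3 largest probabilities per row, largest first
top3 = np.argsort(probs, axis=1)[:, ::-1][:, :3]
for i, cols in enumerate(top3):
    labels = model.classes_[cols]
    print('actual=%s top3=%s probs=%s' % (y[i], labels, np.round(probs[i, cols], 3)))
```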

Hi Jason,

I have a project that predicts the availability of parking places. I have geolocation points in my database. Can you please help me understand how I can use this data to make predictions, how to begin, and what kind of algorithm I should use?

Thanks in advance.

Regards,

Nassim.

I recommend this process:

http://machinelearningmastery.com/how-to-define-your-machine-learning-problem/

Hi Jason,

I am trying to test my model on unseen data from another file, so far I have my main Python file below;

file1.py

And on my test file for unseen data I have;

from file1 import model_test

y_new = model_test.predict(X_new)

model_test(X_new, y_new)

It is saying that it cannot find reference predict in function, would you know how to get around this issue? Thank you.

You do not need grid search to make a prediction, grid search is only used to find a configuration for your final model.

See this post:

https://machinelearningmastery.com/train-final-machine-learning-model/

This makes sense to me now. Thanks for getting back to me Jason!

No problem.

hi Jason,

Thanks for the excellent post. I was wondering, after predicting values for already known y, how do I use a regression model to predict an unknown y for a given set of x? In your code, does make_regression really use my actual data points, or is it a random regression based on random data points depending on the number of samples, features, etc.?

Suppose I have built a model and predicted the number of immigrants based on different x features. Now I want to predict the number of immigrants given a new set of feature values. Do you think make_regression is the right option? The reason I am asking is that the function doesn’t ask for my X and y. Am I wrong here? An early reply would be really appreciated.

If you change the features – the inputs – then you must fit a new model to map those inputs to the desired outputs.

Does that help?

Hi,

Thanks for this great tutorial. The new version shows a warning, so it requires setting the solver parameter:

```python
model = LogisticRegression(solver="lbfgs")
```

Thanks.

Hi Jason,

I want my prediction to generate the name of the class instead of numbers. How can I do so? For example, in the case of the iris flowers, how can I input the data and make the model predict the number after I have trained and validated the model.

Each class is mapped to an integer using a LabelEncoder.

You can use inverse_transform to map integers back to class names.
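A small sketch of that mapping using the iris dataset bundled with scikit-learn. The choice of LogisticRegression and the example row are assumptions for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

iris = load_iris()
X = iris.data
names = iris.target_names[iris.target]  # string labels such as 'setosa'

# encode the string labels as integers and keep the fitted encoder
encoder = LabelEncoder()
y = encoder.fit_transform(names)

model = LogisticRegression(solver='lbfgs', max_iter=1000)
model.fit(X, y)

# predict an integer class for one row, then map it back to the class name
row = [[5.1, 3.5, 1.4, 0.2]]
yhat = model.predict(row)
label = encoder.inverse_transform(yhat)
print(yhat, label)
```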

Hi Jason,

How to solve this Error

# All needed imports

import pandas as pd

import numpy as np

import matplotlib.mlab as mlab

import matplotlib.pyplot as plt

import seaborn as sns

from surprise import *

from surprise.accuracy import rmse, mae, fcp

from UserDefinedAlgorithm import PredictMean

from sklearn.grid_search import ParameterGrid

import pickle

from os import listdir

---------------------------------------------------------------------------

ImportError Traceback (most recent call last)

in ()

7 from surprise import *

8 from surprise.accuracy import rmse, mae, fcp

----> 9 from UserDefinedAlgorithm import PredictMean

10 from sklearn.grid_search import ParameterGrid

11 import pickle

ImportError: No module named UserDefinedAlgorithm

I have not seen the error or the surprise library before, perhaps try posting to stackoverflow?

What if X_train and X_new are of different sizes?

I’m getting the following error after using Tfidf with similar parameters on X_new.

X has 265 features per sample; expecting 73

Input samples to the model must always have the same size.

This may require that you preserve and reuse the same data preparation procedures/objects.
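A minimal sketch of reusing the same data preparation object, assuming the features come from a TfidfVectorizer. The documents and labels are made up; the key point is calling transform() and not fit_transform() on the new data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# made-up training documents and labels
train_docs = ['free prize money now', 'meeting at noon today',
              'win free money', 'lunch meeting today']
y_train = [1, 0, 1, 0]

# fit the vectorizer on the training text only and keep it
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

model = LogisticRegression(solver='lbfgs')
model.fit(X_train, y_train)

# reuse the fitted vectorizer: transform(), not fit_transform()
new_docs = ['free lunch today and a surprise prize']
X_new = vectorizer.transform(new_docs)
yhat = model.predict(X_new)
print(X_train.shape, X_new.shape, yhat)
```

Words in the new text that were not seen during training are ignored, so the number of columns stays the same.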

Hi Jason,

I have created a random forest model and evaluated the model using confusion matrix.

Now I want to predict the output by supplying some specific independent variable values.

Ex: Age, Estimated Salary, Gender_Male, Gender_Female and ‘Purchased’ are my columns.

Now I want to predict the output (Purchased) when the input variables are Age = 35, Gender_Male = 1 and Estimated Salary = 40000.

Can you please help me with the code.

Thanks in advance!

Regards,

Pramod Kumar

The examples in the above tutorial will provide a useful starting point.

What problem are you having precisely?
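As a starting point, a sketch along these lines may help. The tiny dataset is invented, the column name EstimatedSalary is an assumed spelling, and Gender_Female is set to 0 on the assumption that it is the complement of Gender_Male:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# tiny made-up dataset with the columns described above
data = pd.DataFrame({
    'Age': [22, 35, 47, 51, 29, 40],
    'EstimatedSalary': [20000, 40000, 90000, 110000, 30000, 75000],
    'Gender_Male': [1, 1, 0, 0, 1, 0],
    'Gender_Female': [0, 0, 1, 1, 0, 1],
    'Purchased': [0, 0, 1, 1, 0, 1],
})
features = ['Age', 'EstimatedSalary', 'Gender_Male', 'Gender_Female']

model = RandomForestClassifier(n_estimators=50, random_state=1)
model.fit(data[features], data['Purchased'])

# one new row with the same columns, in the same order, as the training data
new_row = pd.DataFrame([{'Age': 35, 'EstimatedSalary': 40000,
                         'Gender_Male': 1, 'Gender_Female': 0}])[features]
yhat = model.predict(new_row)
print('Purchased:', yhat[0])
```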

Hi Jason

I have stock market data with some features let’s say date,open,high,low,close,F1,F2,F3

my x_train is the data without ‘close’ column, and my y_train is the ‘close’ column

same for the x_test, and y_test

now when I do LinearRegression.fit(x_train,y_train) then predicted_y = LinearRegression.predict(x_test)

I get good results and near my y_test.

My problem is that if I want to make a prediction for tomorrow, I don’t have any features; all the columns are unknown. So how can I make a prediction?

thanks

You can make a prediction as follows:

yhat = model.predict(newX)

Where newX is the input data required by the model as you designed it to make a one-step prediction.

I engineered a bunch of features in my X_train and fit a model on X_train and y_train. Now when I want to predict on X_test, why does X_test have to have the same columns as X_train? I don’t understand this. If the model learns from the past, why does it have to have those columns in the testing set?

Yes, you are defining a model to take specific input features (columns) that must be consistent during training and testing.

Perhaps this will help:

http://machinelearningmastery.com/how-machine-learning-algorithms-work/