Last Updated on August 19, 2020

Fitting a model to a training dataset is so easy today with libraries like scikit-learn.

A model can be fit and evaluated on a dataset in just a few lines of code. It is so easy that it has become a problem.

The same few lines of code are repeated again and again and it may not be obvious how to actually use the model to make a prediction. Or, if a prediction is made, how to relate the predicted values to the actual input values.

I know that this is the case because I get many emails with the question:

How do I connect the predicted values with the input data?

This is a common problem.

In this tutorial, you will discover how to relate the predicted values with the inputs to a machine learning model.

After completing this tutorial, you will know:

- How to fit and evaluate the model on a training dataset.
- How to use the fit model to make predictions one at a time and in batches.
- How to connect the predicted values with the inputs to the model.

**Kick-start your project** with my new book Machine Learning Mastery With Python, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

**Update Jan/2020**: Updated for changes in scikit-learn v0.22 API.

## Tutorial Overview

This tutorial is divided into three parts; they are:

- Prepare a Training Dataset
- How to Fit a Model on the Training Dataset
- How to Connect Predictions With Inputs to the Model

## Prepare a Training Dataset

Let’s start off by defining a dataset that we can use with our model.

You may have your own dataset in a CSV file or in a NumPy array in memory.

In this case, we will use a simple two-class or binary classification problem with two numerical input variables.

- **Inputs**: Two numerical input variables.
- **Outputs**: A class label, either 0 or 1.

We can use the make_blobs() scikit-learn function to create this dataset with 1,000 examples.

The example below creates the dataset with separate arrays for the input (*X*) and outputs (*y*).

```python
# example of creating a test dataset
from sklearn.datasets import make_blobs
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# summarize the shape of the arrays
print(X.shape, y.shape)
```

Running the example creates the dataset and prints the shape of each of the arrays.

We can see that there are 1,000 rows for the 1,000 samples in the dataset. We can also see that the input data has two columns for the two input variables and that the output array is one long array of class labels for each of the rows in the input data.

```
(1000, 2) (1000,)
```

Next, we will fit a model on this training dataset.

## How to Fit a Model on the Training Dataset

Now that we have a training dataset, we can fit a model on the data.

This means that we will provide all of the training data to a learning algorithm and let it discover the mapping between the inputs and the output class label that minimizes prediction error.

In this case, because it is a two-class problem, we will try the logistic regression classification algorithm.

This can be achieved using the LogisticRegression class from scikit-learn.

First, the model must be defined with any specific configuration we require. In this case, we will use the efficient ‘*lbfgs*‘ solver.

Next, the model is fit on the training dataset by calling the *fit()* function and passing in the training dataset.

Finally, we can evaluate the model by first using it to make predictions on the training dataset by calling *predict()* and then comparing the predictions to the expected class labels and calculating the accuracy.

The complete example is listed below.

```python
# fit a logistic regression on the training dataset
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)
# evaluate predictions
acc = accuracy_score(y, yhat)
print(acc)
```

Running the example fits the model on the training dataset and then prints the classification accuracy.

In this case, we can see that the model has a 100% classification accuracy on the training dataset.

```
1.0
```

Now that we know how to fit and evaluate a model on the training dataset, let’s get to the root of the question.

*How do you connect inputs of the model to the outputs?*

## How to Connect Predictions With Inputs to the Model

A fit machine learning model takes inputs and makes a prediction.

This could be one row of data at a time; for example:

- **Input**: 2.12309797, -1.41131072
- **Output**: 1

This is straightforward with our model.

For example, we can make a prediction with an array input and get one output and we know that the two are directly connected.

The input must be defined as an array of numbers, specifically 1 row with 2 columns. We can achieve this by defining the example as a list of rows with a list of columns for each row; for example:

```python
...
# define input
new_input = [[2.12309797, -1.41131072]]
```

We can then provide this as input to the model and make a prediction.

```python
...
# get prediction for new input
new_output = model.predict(new_input)
```

Tying this together with fitting the model from the previous section, the complete example is listed below.

```python
# make a single prediction with the model
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# fit model
model.fit(X, y)
# define input
new_input = [[2.12309797, -1.41131072]]
# get prediction for new input
new_output = model.predict(new_input)
# summarize input and output
print(new_input, new_output)
```

Running the example defines the new input and makes a prediction, then prints both the input and the output.

We can see that in this case, the model predicts class label 1 for the inputs.

```
[[2.12309797, -1.41131072]] [1]
```

If we were using the model in our own application, this usage of the model would allow us to directly relate the inputs and outputs for each prediction made.

If we needed to replace the labels 0 and 1 with something meaningful like “*spam*” and “*not spam*“, we could do that with a simple if-statement.
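As a sketch, a small dictionary (equivalent to an if-statement) can translate the integer labels into readable names; the label names below are purely illustrative:

```python
# map integer class labels to human-readable names
# (the names "not spam"/"spam" are illustrative, not from the model)
label_names = {0: "not spam", 1: "spam"}

# stand-in for predictions from model.predict(...)
yhat = [1, 0, 1]

# translate each predicted label to its name
readable = [label_names[label] for label in yhat]
print(readable)
```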

So far so good.

**What happens when the model is used to make multiple predictions at once?**

That is, how do we relate the predictions to the inputs when multiple rows or multiple samples are provided to the model at once?

For example, we could make a prediction for each of the 1,000 examples in the training dataset as we did in the previous section when evaluating the model. In this case, the model would make 1,000 distinct predictions and return an array of 1,000 integer values. One prediction for each of the 1,000 input rows of data.

Importantly, the order of the predictions in the output array matches the order of rows provided as input to the model when making a prediction. This means that the input row at index 0 matches the prediction at index 0; the same is true for index 1, index 2, all the way to index 999.

Therefore, we can relate the inputs and outputs directly based on their index, with the knowledge that the order is preserved when making a prediction on many rows of inputs.

Let’s make this concrete with an example.

First, we can make a prediction for each row of input in the training dataset:

```python
...
# make predictions on the entire training dataset
yhat = model.predict(X)
```

We can then step through the indexes and access the input and the predicted output for each.

This shows precisely how to connect the predictions with the input rows. For example, the input at row 0 and the prediction at index 0:

```python
...
print(X[0], yhat[0])
```

In this case, we will just look at the first 10 rows and their predictions.

```python
...
# connect predictions with inputs
for i in range(10):
	print(X[i], yhat[i])
```

Tying this together, the complete example of making a prediction for each row in the training data and connecting the predictions with the inputs is listed below.

```python
# make predictions for each row in the training dataset
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# fit model
model.fit(X, y)
# make predictions on the entire training dataset
yhat = model.predict(X)
# connect predictions with inputs
for i in range(10):
	print(X[i], yhat[i])
```

Running the example, the model makes 1,000 predictions for the 1,000 rows in the training dataset, then connects the inputs to the predicted values for the first 10 examples.

This provides a template that you can use and adapt for your own predictive modeling projects to connect predictions to the input rows via their row index.

```
[ 1.23839154 -2.8475005 ] 1
[-1.25884111 -8.57055785] 0
[ -0.86599821 -10.50446358] 0
[ 0.59831673 -1.06451727] 1
[ 2.12309797 -1.41131072] 1
[-1.53722693 -9.61845366] 0
[ 0.92194131 -0.68709327] 1
[-1.31478732 -8.78528161] 0
[ 1.57989896 -1.462412 ] 1
[ 1.36989667 -1.3964704 ] 1
```

## Further Reading

This section provides more resources on the topic if you are looking to go deeper.

### Posts

- Your First Machine Learning Project in Python Step-By-Step
- How to Make Predictions with scikit-learn

### APIs

- sklearn.datasets.make_blobs API
- sklearn.metrics.accuracy_score API
- sklearn.linear_model.LogisticRegression API

## Summary

In this tutorial, you discovered how to relate the predicted values with the inputs to a machine learning model.

Specifically, you learned:

- How to fit and evaluate the model on a training dataset.
- How to use the fit model to make predictions one at a time and in batches.
- How to connect the predicted values with the inputs to the model.

Do you have any questions?

Ask your questions in the comments below and I will do my best to answer.

Nice Article On Machine Learning

Thanks!

Nice article continue writing

Thanks!

Hi Jason! Have a question:

I’m doing a logistic regression for predicting churn. I’m using the whole sample of our customers with targets “1” for churned and “0” for “lives”; my questions are:

– I am training the model with all these customers and getting probabilities of churn as output. I can only apply this model to new records, right?

– I did a lot of pre-processing and scaling on the training data. I need to apply all these steps to my new input as well before predicting, right?

Thanks!

You can use the model to make predictions on any data you like. It only makes sense to evaluate the model on new examples (a test set), and to use a final model to make predictions where you don’t know the answer.

Yes, all operations performed on the training dataset must be performed on new data.
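For instance, a minimal sketch with a StandardScaler (the scaler here is just one example of such an operation): the transform is fit on the training data, then reused as-is (not re-fit) on every new input.

```python
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# training data (make_blobs stands in for real customer data)
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)

# fit the transform on the training data only
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# ... fit the model on X_scaled here ...

# later, apply the SAME fitted scaler to any new input before predicting
new_input = [[2.12309797, -1.41131072]]
new_input_scaled = scaler.transform(new_input)
print(new_input_scaled.shape)
```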

Hi Jason! I have a different question about connecting input and output.

I have a neural network with 5 nodes on the output layer. The model is supposed to predict a number, which can be (-2, -1, 0, 1, 2).

So for each prediction I get an array like this:

[0.20181388, 0.19936344, 0.19890821, 0.19744289, 0.20247154]

Now how do I know what number each box represents? Because it looks like after one build, train, and prediction, Box 0 represents the number 1, and after another full build, train, and prediction, Box 0 now represents the number -2. What’s the connection here?

Thank you!

If your model predicts one number per sample, then you have 5 samples worth of predictions.

If your model predicts 5 numbers per sample, then you have 1 sample worth of predictions.

Thank you for your reply!

Yes, my model predicts one number per sample, but the array shown above is just for one sample.

I probably should have specified that, but my model outputs, let’s say, 100 test samples after training. So then I get a list containing 100 lists, each of which contains 5 float numbers.

So my problem is that for each sample, I have no idea what these 5 prediction numbers represent.

They represent whatever you trained your model to predict.
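Concretely, if the five output nodes were trained against one-hot encoded labels, each position in the output array corresponds to one class value in the order used during encoding. Assuming that order is known, the predicted number can be recovered with argmax; a sketch (the class ordering below is an assumption and must match your own encoding):

```python
import numpy as np

# the class values, in the same order used to encode the training labels
# (this ordering is assumed -- it must match your own label encoding)
classes = [-2, -1, 0, 1, 2]

# one sample's raw output from the network (one probability per class)
probs = np.array([0.20181388, 0.19936344, 0.19890821, 0.19744289, 0.20247154])

# the predicted class is the value at the index of the highest probability
predicted = classes[int(np.argmax(probs))]
print(predicted)
```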

Hi,

I have prepared a Linear Regression model with X=(x1,x2,x3) and Y=y1, now I want to predict x1. How should I do it?

Call model.predict().

I have prepared a Linear Regression model with input feature (x1). x1 has 300 values of data as a function of time, e.g. for date 1-1-2000 it was 250, for 1-2-2000 it was 247, for 1-3-2000 it was 263, and so on. I want to predict x1 as a function of time in the future. How should I do it?

Call the predict function on your model.

Thanks. really good explanation of one important question.

Thank you!

Hi, I want to make a regression model that predicts on just a single column of data: 1 feature, just the time series of values, and I want to predict that. How would I use that in a neural network?

Call model.predict() with one sample as input:

https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/

I made a model with a random forest classifier and predicted my test set using 3 features from the training data. Now I want to predict new data by giving 3 new/different values for these features, but I’m getting an error:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: ‘habitat’

What am I doing wrong?

Sorry, I have not seen that error before. You could try searching on stackoverflow?

Perhaps this tutorial will help you load your data:

http://machinelearningmastery.com/load-machine-learning-data-python/

Another great article! I have a question, for the following scenario.

Let’s assume that we have data with 20 features and we have already done the following:

1) we have studied the data, 2) made all the necessary transformations 3) scaling 4) extra feature engineering 5) feature importance inspection 6) manual and/or automatic feature selection.

And we have ended up with 5 features, so before training we use PCA to further reduce dimensionality to let’s say two components. And we finally train some models, we evaluate them and we select one for “production”.

So, now comes the time when a client sends us a new sample to predict its class. The client has information based on the initial schema; for every sample he can get us the values for those 20 features. So we can always get a list of the (5) features we used to feed PCA, but I guess our trained model would expect two values.

So, the question is how do we recreate the effects of PCA for new samples that we want to predict their target class?

Yes, you keep all data transforms and use them to prepare all new data prior to modeling.

For more on this see:

https://machinelearningmastery.com/how-to-save-and-load-models-and-data-preparation-in-scikit-learn-for-later-use/
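One way to sketch this (assuming, for illustration, scaling plus PCA as the transforms) is to bundle the fitted transforms and the model in a scikit-learn Pipeline, so every new sample passes through exactly the same steps before prediction:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# stand-in for the 5 selected features
X, y = make_blobs(n_samples=1000, centers=2, n_features=5, random_state=2)

# the fitted scaler and PCA are stored inside the pipeline,
# so they are re-applied (not re-fit) to every new sample
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('model', LogisticRegression(solver='lbfgs')),
])
pipeline.fit(X, y)

# a new sample arrives with the original 5 feature values;
# the pipeline reduces it to 2 components before predicting
new_sample = [X[0].tolist()]
print(pipeline.predict(new_sample))
```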

If I have 3 color signals (blue, red, green) displaying one by one, how can we predict the next color through machine learning? And which algorithm do we need for this?

Please answer.

This sounds like sequence classification, perhaps start with the models listed here:

https://machinelearningmastery.com/start-here/#deep_learning_time_series

Hi Jason, I developed a classification model where the target variable is multi class with a set of group names. I encoded the target variable using pd.get_dummies. Now when I test the model against a new text description, I get the output as a numeric vector. How do I convert this vector back to the group name? Can you please help?

Perhaps use a onehotencoder instead and then call inverse_transform() on the predictions to get the strings, see this:

https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
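A minimal sketch of that round trip (the group names below are illustrative):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# illustrative group names for the target variable
groups = np.array([['red'], ['green'], ['blue'], ['green']])

# encode the groups as one-hot vectors for training
encoder = OneHotEncoder()
encoded = encoder.fit_transform(groups).toarray()

# a model's one-hot style prediction can be mapped back to the name
pred = encoded[1].reshape(1, -1)
print(encoder.inverse_transform(pred))
```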

Hey man, awesome article.

I’m just not so sure on how to use the data after the prediction.

Because I’m not using the customer_id column on my model, as it’s a feature that can lower my model’s accuracy.

So when I use my ‘new_input’ data to the model as a parameter, it also doesn’t have the customer_id.

So how to know which of my customers is related to one prediction or the other?

Thanks!

Remove the customer id, but keep the array. The index (row number) of the id will match the index of an input to the model and the prediction from the model. This index will link all 3 pieces of information.

E.g. the customer id at row 0, the input to the model at row 0, the prediction at row 0 are all linked.
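A sketch of that linkage (the ids and predictions below are illustrative stand-ins):

```python
import numpy as np

# illustrative customer ids, kept aside rather than fed to the model
customer_ids = np.array([101, 102, 103])

# the input rows, in the same order as the ids
X_new = np.array([[2.1, -1.4], [-1.3, -8.6], [0.6, -1.1]])

# stand-in for yhat = model.predict(X_new); order matches X_new's rows
yhat = np.array([1, 0, 1])

# row i links the customer id, the inputs, and the prediction
for cid, row, pred in zip(customer_ids, X_new, yhat):
    print(cid, row, pred)
```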

Hi Jason, amazing site, thank you!

How do you match the customer id if you need to drop it before using pipeline to fit_resample due to imbalanced data? Doesn’t the over/under sampling mean the indexes won’t match?

Good question, I answer it here:

https://machinelearningmastery.com/how-to-connect-model-input-data-with-predictions-for-machine-learning/

Thanks for your reply Jason but the link is for this page, can you please confirm? Thanks again!

Right. Does the above example not help and show you how to connect id’s via row indexes?

Which part is confusing?

Hey, nice article, but how do I predict from my own dataset? I have a dataset in an Excel file, and after applying logistic regression on the dataset I am getting an accuracy of 80%. Now I want to give some inputs and predict the output. How should I write the code?

Yes, see this:

https://machinelearningmastery.com/make-predictions-scikit-learn/

Is it possible to write a single piece of code for all types of datasets? I mean irrespective of the data (classification, regression, categorical, missing values, normalization, etc.), code that is independent of the dataset and works on any data it is given.

Not really, each dataset is different.

The best we could do along these lines is automl:

https://machinelearningmastery.com/automl-libraries-for-python/