How to Connect Model Input Data With Predictions for Machine Learning

By Jason Brownlee on August 19, 2020 in Python Machine Learning

Fitting a model to a training dataset is so easy today with libraries like scikit-learn.

A model can be fit and evaluated on a dataset in just a few lines of code. It is so easy that it has become a problem.

The same few lines of code are repeated again and again and it may not be obvious how to actually use the model to make a prediction. Or, if a prediction is made, how to relate the predicted values to the actual input values.

I know that this is the case because I get many emails with the question:

How do I connect the predicted values with the input data?

This is a common problem.

In this tutorial, you will discover how to relate the predicted values with the inputs to a machine learning model.

After completing this tutorial, you will know:

How to fit and evaluate the model on a training dataset.
How to use the fit model to make predictions one at a time and in batches.
How to connect the predicted values with the inputs to the model.

Let’s get started.

Update Jan/2020: Updated for changes in scikit-learn v0.22 API.

Photo by Ian D. Keating, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Prepare a Training Dataset
How to Fit a Model on the Training Dataset
How to Connect Predictions With Inputs to the Model

Prepare a Training Dataset

Let’s start off by defining a dataset that we can use with our model.

You may have your own dataset in a CSV file or in a NumPy array in memory.

In this case, we will use a simple two-class or binary classification problem with two numerical input variables.

Inputs: Two numerical input variables.
Outputs: A class label as either a 0 or 1.

We can use the make_blobs() scikit-learn function to create this dataset with 1,000 examples.

The example below creates the dataset with separate arrays for the input (X) and outputs (y).

# example of creating a test dataset
from sklearn.datasets import make_blobs
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# summarize the shape of the arrays
print(X.shape, y.shape)

Running the example creates the dataset and prints the shape of each of the arrays.

We can see that there are 1,000 rows for the 1,000 samples in the dataset. We can also see that the input data has two columns for the two input variables and that the output array is one long array of class labels for each of the rows in the input data.

(1000, 2) (1000,)

Next, we will fit a model on this training dataset.

How to Fit a Model on the Training Dataset

Now that we have a training dataset, we can fit a model on the data.

This means that we will provide all of the training data to a learning algorithm and let the algorithm discover the mapping between the inputs and the output class label that minimizes the prediction error.

In this case, because it is a two-class problem, we will try the logistic regression classification algorithm.

This can be achieved using the LogisticRegression class from scikit-learn.

First, the model must be defined with any specific configuration we require. In this case, we will use the efficient 'lbfgs' solver.

Next, the model is fit on the training dataset by calling the fit() function and passing in the training dataset.

Finally, we can evaluate the model by first using it to make predictions on the training dataset by calling predict() and then comparing the predictions to the expected class labels and calculating the accuracy.

The complete example is listed below.

# fit a logistic regression on the training dataset
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# fit model
model.fit(X, y)
# make predictions
yhat = model.predict(X)
# evaluate predictions
acc = accuracy_score(y, yhat)
print(acc)

Running the example fits the model on the training dataset and then prints the classification accuracy.

In this case, we can see that the model has a 100% classification accuracy on the training dataset.

1.0
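
The accuracy score itself relies on the same pairing of predictions and labels by position. As a minimal sketch (the name manual_acc is illustrative, not part of scikit-learn), the score can be reproduced by comparing the two arrays element by element:

```python
# reproduce the accuracy score by comparing predictions and labels position by position
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define and fit the model
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)
# make predictions
yhat = model.predict(X)
# fraction of positions where the prediction equals the expected label
manual_acc = (yhat == y).mean()
print(manual_acc, accuracy_score(y, yhat))
```

Both printed values should agree, because each prediction is compared with the label stored at the same index.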

Now that we know how to fit and evaluate a model on the training dataset, let’s get to the root of the question.

How do you connect inputs of the model to the outputs?

How to Connect Predictions With Inputs to the Model

A fit machine learning model takes inputs and makes a prediction.

This could be one row of data at a time; for example:

Input: 2.12309797 -1.41131072
Output: 1

This is straightforward with our model.

For example, we can make a prediction with an array input and get one output and we know that the two are directly connected.

The input must be defined as an array of numbers, specifically 1 row with 2 columns. We can achieve this by defining the example as a list of rows with a list of columns for each row; for example:

...
# define input
new_input = [[2.12309797, -1.41131072]]
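
If the new example arrives as a flat, one-dimensional NumPy array instead of a nested list, it can be reshaped into one row with two columns. A small sketch, assuming NumPy is available (the variable name row is illustrative):

```python
# reshape a flat array of two values into one row with two columns
from numpy import array
# a single example stored as a 1D array
row = array([2.12309797, -1.41131072])
print(row.shape)
# reshape to 1 row; -1 lets NumPy infer the number of columns
new_input = row.reshape(1, -1)
print(new_input.shape)
```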

We can then provide this as input to the model and make a prediction.

...
# get prediction for new input
new_output = model.predict(new_input)

Tying this together with fitting the model from the previous section, the complete example is listed below.

# make a single prediction with the model
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# fit model
model.fit(X, y)
# define input
new_input = [[2.12309797, -1.41131072]]
# get prediction for new input
new_output = model.predict(new_input)
# summarize input and output
print(new_input, new_output)

Running the example defines the new input and makes a prediction, then prints both the input and the output.

We can see that in this case, the model predicts class label 1 for the inputs.

[[2.12309797, -1.41131072]] [1]

If we were using the model in our own application, this usage of the model would allow us to directly relate the inputs and outputs for each prediction made.

If we needed to replace the labels 0 and 1 with something meaningful like “spam” and “not spam”, we could do that with a simple if-statement.
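
A dictionary lookup is one alternative to the if-statement. The sketch below assumes, purely for illustration, that class 1 means spam and class 0 means not spam; the predictions array is a stand-in for the output of model.predict().

```python
# map integer class labels to readable names
from numpy import array
# assumed mapping, chosen only for illustration
label_names = {0: 'not spam', 1: 'spam'}
# stand-in for the array returned by model.predict()
predictions = array([1, 0, 1])
# look up a name for each predicted label, keeping the original order
named = [label_names[int(label)] for label in predictions]
print(named)
```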

So far so good.

What happens when the model is used to make multiple predictions at once?

That is, how do we relate the predictions to the inputs when multiple rows or multiple samples are provided to the model at once?

For example, we could make a prediction for each of the 1,000 examples in the training dataset as we did in the previous section when evaluating the model. In this case, the model would make 1,000 distinct predictions and return an array of 1,000 integer values, one prediction for each of the 1,000 input rows of data.

Importantly, the order of the predictions in the output array matches the order of rows provided as input to the model when making a prediction. This means that the input row at index 0 matches the prediction at index 0; the same is true for index 1, index 2, all the way to index 999.

Therefore, we can relate the inputs and outputs directly based on their index, with the knowledge that the order is preserved when making a prediction on many rows of inputs.

Let’s make this concrete with an example.

First, we can make a prediction for each row of input in the training dataset:

...
# make predictions on the entire training dataset
yhat = model.predict(X)

We can then step through the indexes and access the input and the predicted output for each.

This shows precisely how to connect the predictions with the input rows. For example, the input at row 0 and the prediction at index 0:

...
print(X[0], yhat[0])

In this case, we will just look at the first 10 rows and their predictions.

...
# connect predictions with inputs
for i in range(10):
    print(X[i], yhat[i])

Tying this together, the complete example of making a prediction for each row in the training data and connecting the predictions with the inputs is listed below.

# make predictions for every row in the training dataset
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define model
model = LogisticRegression(solver='lbfgs')
# fit model
model.fit(X, y)
# make predictions on the entire training dataset
yhat = model.predict(X)
# connect predictions with inputs
for i in range(10):
    print(X[i], yhat[i])

Running the example, the model makes 1,000 predictions for the 1,000 rows in the training dataset, then connects the inputs to the predicted values for the first 10 examples.

This provides a template that you can use and adapt for your own predictive modeling projects to connect predictions to the input rows via their row index.

[ 1.23839154 -2.8475005 ] 1
[-1.25884111 -8.57055785] 0
[ -0.86599821 -10.50446358] 0
[ 0.59831673 -1.06451727] 1
[ 2.12309797 -1.41131072] 1
[-1.53722693 -9.61845366] 0
[ 0.92194131 -0.68709327] 1
[-1.31478732 -8.78528161] 0
[ 1.57989896 -1.462412  ] 1
[ 1.36989667 -1.3964704 ] 1
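
Because the predictions line up with the input rows, the prediction array can also be used to select rows. The following sketch, which assumes the same dataset and model as above (the names mask, rows_class1 and rows_class0 are illustrative), pulls out the input rows for each predicted class:

```python
# select input rows according to their predicted class
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define and fit the model
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)
# make predictions on the entire training dataset
yhat = model.predict(X)
# boolean mask with one True/False entry per input row
mask = yhat == 1
# rows predicted as class 1 and rows predicted as class 0
rows_class1 = X[mask]
rows_class0 = X[~mask]
print(rows_class1.shape, rows_class0.shape)
```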

Summary

In this tutorial, you discovered how to relate the predicted values with the inputs to a machine learning model.

Specifically, you learned:

How to fit and evaluate the model on a training dataset.
How to use the fit model to make predictions one at a time and in batches.
How to connect the predicted values with the inputs to the model.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

88 Responses to How to Connect Model Input Data With Predictions for Machine Learning

kp836453 November 22, 2019 at 6:13 pm #

Nice Article On Machine Learning

- Jason Brownlee November 23, 2019 at 6:48 am #
  
  Thanks!
  
Tal December 5, 2019 at 5:05 am #

Nice article continue writing

- Jason Brownlee December 5, 2019 at 6:43 am #
  
  Thanks!
  
  - Kareem March 20, 2021 at 4:53 pm #
    
    If dataset has columns containing strings. How to fit.
    
    - Jason Brownlee March 21, 2021 at 6:07 am #
      
      You can convert the strings to numbers. e.g. one hot encoding for categories, bag of words for free text.
      
Gabriel February 6, 2020 at 9:01 am #

Hi Jason! Have a question:

I’m doing a logistic regression for predicting churn. I’m using the whole sample of our customers with targets “1” for churned and “0” for “lives”. My questions are:

– I am training the model with all these customers and getting probabilities of churn as output, so I can only apply this model to new records, right?
– I did a lot of pre-processing and scaling on the training data, so I need to apply all these steps to my new input as well and then predict, right?

Thanks!

- Jason Brownlee February 6, 2020 at 1:45 pm #
  
  You can use the model to make predictions on any data you like. It only makes sense to evaluate the model on new examples (a test set), and to use a final model to make predictions where you don’t know the answer.
  
  Yes, all operations performed on the training dataset must be performed on new data.
  
- Tumi Sebela March 20, 2021 at 5:01 pm #
  
  To make things simple, use scikit pipeline to chain together all your pre-processing steps
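
As a rough sketch of the pipeline idea, the example below chains a scaler and a logistic regression so that the same scaling is applied automatically whenever predict() is called. The choice of StandardScaler and the step names 'scale' and 'model' are assumptions made for illustration, not the commenter's actual preprocessing.

```python
# chain preprocessing and model so new data is prepared the same way as training data
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define the pipeline (step names are arbitrary labels)
pipeline = Pipeline([('scale', StandardScaler()), ('model', LogisticRegression(solver='lbfgs'))])
# fit the scaler and the model on the training data
pipeline.fit(X, y)
# a new raw (unscaled) input row
new_input = [[2.12309797, -1.41131072]]
# the pipeline scales the row with the fitted scaler, then predicts
new_output = pipeline.predict(new_input)
# class membership probabilities for the same row
new_proba = pipeline.predict_proba(new_input)
print(new_output, new_proba)
```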
  
Mikkel Andreas Kvande April 24, 2020 at 10:14 pm #

Hi Jason! I have a different question about connecting input and output.

I have a neural network with 5 nodes on the output layer. The model is supposed to predict a number, which can be (-2, -1, 0, 1, 2).

So for each prediction i get an array like this:
[0.20181388, 0.19936344, 0.19890821, 0.19744289, 0.20247154]

Now how do i know what number each box represents? Because it looks like after one build, train and prediction, Box 0 represents the number 1, and after another full build, train and prediction Box 0 now represents the number -2. Whats the connection here?

Thank you!

- Jason Brownlee April 25, 2020 at 6:49 am #
  
  If your model predicts one number per sample, then you have 5 samples worth of predictions.
  
  If your model predicts 5 numbers per sample, then you have 1 sample worth of predictions.
  
  - Mikkel Andreas Kvande April 28, 2020 at 5:30 pm #
    
    Thank you for your reply!
    
    Yes my model predicts one number per sample, but the array shown above is just for one sample.
    
    I probably should have specified that, but my model outputs lets say 100 test samples after training. So then i get a list, containing 100 lists, which each again contains 5 float numbers.
    
    So my problem is that for each sample, i have no idea what these 5 prediction numbers represents.
    
    - Jason Brownlee April 29, 2020 at 6:22 am #
      
      They represent whatever you trained your model to predict.
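
For a model that outputs one score per class, a common way to read the array is to take the position of the largest value and look it up in an ordered list of classes. The sketch below assumes the columns follow the order [-2, -1, 0, 1, 2]. That order is an assumption here and must match however the target was encoded before training; if the encoding is rebuilt differently on each run, the meaning of each position changes too.

```python
# map per-class scores back to a class value via the position of the largest score
from numpy import array, argmax
# assumed column order of the output layer (must match the target encoding used in training)
classes = array([-2, -1, 0, 1, 2])
# scores for one sample, taken from the question above
scores = array([[0.20181388, 0.19936344, 0.19890821, 0.19744289, 0.20247154]])
# index of the largest score in each row
best = argmax(scores, axis=1)
# look up the class value at that index
predicted = classes[best]
print(best, predicted)
```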
      
Pooja May 7, 2020 at 9:32 pm #

Hi,

I have prepared a Linear Regression model with X=(x1,x2,x3) and Y=y1, now I want to predict x1. How should I do it?

- Jason Brownlee May 8, 2020 at 6:31 am #
  
  Call model.predict().
  
  - Aditya Bothra August 19, 2020 at 1:03 am #
    
    I have prepared a Linear Regression model with one input feature (x1). x1 has 300 values recorded over time; for example, on 1-1-2000 it was 250, on 1-2-2000 it was 247, on 1-3-2000 it was 263, and so on. I want to predict x1 as a function of time in the future. How should I do it?
    
    - Jason Brownlee August 19, 2020 at 6:03 am #
      
      Call the predict function on your model.
      
vanitha August 17, 2020 at 5:59 pm #

Thanks. really good explanation of one important question.

- Jason Brownlee August 18, 2020 at 5:59 am #
  
  Thank you!
  
Aditya Bothra August 19, 2020 at 12:58 am #

hi, i want to make a regression model predict on just a single column of data. 1 feature just the time series of values and i want to predict that, how would i use that in a neural network?

- Jason Brownlee August 19, 2020 at 6:02 am #
  
  Call model.predict() with one sample as input:
  https://machinelearningmastery.com/how-to-make-classification-and-regression-predictions-for-deep-learning-models-in-keras/
  
Haroon khan August 19, 2020 at 7:36 pm #

I made a model with a random forest classifier and predicted my test set using 3 features from train. Now I want to predict new data by giving 3 new/different values for these features, but I am getting an error:
pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/index_class_helper.pxi in pandas._libs.index.Int64Engine._check_type()

KeyError: ‘habitat’

What am I doing wrong?

- Jason Brownlee August 20, 2020 at 6:40 am #
  
  Sorry, I have not seen that error before. You could try searching on stackoverflow?
  
  Perhaps this tutorial will help you load your data:
  https://machinelearningmastery.com/load-machine-learning-data-python/
  
Christos Karapapas September 23, 2020 at 4:53 am #

Another great article! I have a question, for the following scenario.

Let’s assume that we have data with 20 features and we have already done the following:
1) we have studied the data, 2) made all the necessary transformations 3) scaling 4) extra feature engineering 5) feature importance inspection 6) manual and/or automatic feature selection.

And we have ended up with 5 features, so before training we use PCA to further reduce dimensionality to let’s say two components. And we finally train some models, we evaluate them and we select one for “production”.

So, now comes the time when a client sends us a new sample to predict its class. The client has information based on the initial schema, for every sample he can get us the value for those 20 features. So we can always build a list of the (5) features we used to feed PCA but I guess that our trained model would expect two values.

So, the question is how do we recreate the effects of PCA for new samples that we want to predict their target class?

- Jason Brownlee September 23, 2020 at 6:44 am #
  
  Yes, you keep all data transforms and use them to prepare all new data prior to modeling.
  
  For more on this see:
  https://machinelearningmastery.com/how-to-save-and-load-models-and-data-preparation-in-scikit-learn-for-later-use/
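
A minimal sketch of keeping a transform and reusing it on new data is given below. It assumes a synthetic five-feature dataset in place of the five selected features, and reuses one training row as a stand-in for the client's new sample. The key point is that the new row goes through pca.transform(), not fit_transform(), so the projection learned from the training data is reused.

```python
# fit PCA on the training data, then reuse the fitted PCA on a new sample
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
# synthetic data with 5 features (stand-in for the 5 selected features)
X, y = make_blobs(n_samples=1000, centers=2, n_features=5, random_state=2)
# learn the projection from the training data and reduce to 2 components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# fit the model on the reduced data
model = LogisticRegression(solver='lbfgs')
model.fit(X_reduced, y)
# a new sample with the original 5 features (here borrowed from the training set)
new_input = X[:1]
# apply the already-fitted projection, then predict
new_reduced = pca.transform(new_input)
new_output = model.predict(new_reduced)
print(new_reduced.shape, new_output)
```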
  
Shivam Kumar September 25, 2020 at 5:02 am #

if I have 3 color signals (blue, red, green) displaying one by one, how can we predict the next outcome color through machine learning? and which algorithm we needed for this?
Please Answer……………………..

- Jason Brownlee September 25, 2020 at 6:40 am #
  
  This sounds like sequence classification, perhaps start with the models listed here:
  https://machinelearningmastery.com/start-here/#deep_learning_time_series
  
Sriram September 26, 2020 at 1:25 am #

Hi Jason, I developed a classification model where the target variable is multi class with a set of group names. I encoded the target variable using pd.get_dummies. Now when I test the model against a new text description, I get the output as a numeric vector. How do I convert this vector back to the group name? Can you please help?

- Jason Brownlee September 26, 2020 at 6:21 am #
  
  Perhaps use a onehotencoder instead and then call inverse_transform() on the predictions to get the strings, see this:
  https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/
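
As an illustration of that suggestion, the sketch below fits a OneHotEncoder on a few made-up group names and then decodes a one-hot vector back to its name. The group names and the predicted vector are hypothetical stand-ins for real targets and model output.

```python
# encode group names as one-hot vectors, then decode a prediction back to a name
from numpy import array
from sklearn.preprocessing import OneHotEncoder
# made-up target values, one per row
groups = array([['network'], ['billing'], ['support'], ['billing']])
# fit the encoder; the default output is sparse, so convert to a dense array
encoder = OneHotEncoder()
encoded = encoder.fit_transform(groups).toarray()
# the column order of the one-hot vectors
print(encoder.categories_[0])
# stand-in for a one-hot prediction from a model
predicted = array([[0.0, 0.0, 1.0]])
# convert the vector back to the original group name
decoded = encoder.inverse_transform(predicted)
print(decoded)
```

The encoder object must be the one fit on the training targets, so that the column order used for decoding matches the order used for encoding.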
  
Thomas October 1, 2020 at 5:30 am #

Hey man, awesome article.
I’m just not so sure on how to use the data after the prediction.

Because I’m not using the customer_id column on my model, as it’s a feature that can lower my model’s accuracy.
So when I use my ‘new_input’ data to the model as a parameter, it also doesn’t have the customer_id.
So how to know which of my customers is related to one prediction or the other?

- Jason Brownlee October 1, 2020 at 6:35 am #
  
  Thanks!
  
  Remove the customer id, but keep the array. The index (row number) of the id will match the index of an input to the model and the prediction from the model. This index will link all 3 pieces of information.
  
  E.g. the customer id at row 0, the input to the model at row 0, the prediction at row 0 are all linked.
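
The row-index link described in this reply can be sketched as follows. The customer IDs are generated here purely for illustration; in practice they would be the ID column set aside before fitting, kept in the same row order as the inputs.

```python
# keep identifiers in a separate array and link them to predictions by row index
from numpy import array
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# hypothetical customer ids, one per row, in the same order as X
customer_ids = array(['C%04d' % i for i in range(len(X))])
# fit the model on the inputs only (the ids are not used as a feature)
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)
# make predictions; the order of yhat matches the order of X
yhat = model.predict(X)
# the id, input and prediction at the same index belong together
for i in range(5):
    print(customer_ids[i], X[i], yhat[i])
```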
  
  - Sam December 16, 2020 at 8:54 am #
    
    Hi Jason, amazing site, thank you!
    
    How do you match the customer id if you need to drop it before using pipeline to fit_resample due to imbalanced data? Doesn’t the over/under sampling mean the indexes won’t match?
    
    - Jason Brownlee December 16, 2020 at 1:39 pm #
      
      Good question, I answer it here:
      https://machinelearningmastery.com/how-to-connect-model-input-data-with-predictions-for-machine-learning/
      
      - Sam December 17, 2020 at 6:43 am #
        
        Thanks for your reply Jason but the link is for this page, can you please confirm? Thanks again!
      - Jason Brownlee December 17, 2020 at 7:45 am #
        
        Right. Does the above example not help and show you how to connect id’s via row indexes?
        
        Which part is confusing?
Aanchal Saraswat October 1, 2020 at 4:14 pm #

Hey Nice article but how to predict from my own dataset? Like I have a dataset in a excel file and after applying logistic regression on the dataset I am getting an accuracy of 80%. Now I want to give some inputs and want to predict the output how should I write the code?

- Jason Brownlee October 2, 2020 at 5:53 am #
  
  Yes, see this:
  https://machinelearningmastery.com/make-predictions-scikit-learn/
  
Sravan November 3, 2020 at 5:24 pm #

Is it possible to write single code for all type of datasets .I mean irrespective of the data(classification, regression, categorical, missing, normalization….etc). The code is independent on dataset work on every data that is given,..

- Jason Brownlee November 4, 2020 at 6:36 am #
  
  Not really, each dataset is different.
  
  The best we could do along these lines is automl:
  https://machinelearningmastery.com/automl-libraries-for-python/
  
Israel March 11, 2021 at 8:45 am #

Hello Dear,
I am working on a sensor device that acquires datasets from sample substances and with machine learning algorithm I predict the substance. I am using Decision Tree of Scikit Learn. But I noticed that if I acquire datasets that have very close values, the sensor finds it difficult to accurately differentiate between the datasets and so it begins to make wrong predictions. For instance if I acquire datasets such as the ones below:

v1,v2,v3,v4,v5,Target
12,17,2,20,105,Paper
11,16,2,21,103,Tissue
13,18,3,21,102,Carton

and I test the sensor with a dataset such as

13,19,3,21,102

instead of predicting Carton, it could predict Tissue or Paper.

In this situation, what should I do to improve on its ability to differentiate between such values? Is there an algorithm I need to employ?
Thank you.

- Jason Brownlee March 11, 2021 at 1:27 pm #
  
  Good question, this might give you some ideas:
  https://machinelearningmastery.com/faq/single-faq/what-algorithm-config-should-i-use
  
  And this checklist:
  https://machinelearningmastery.com/how-to-improve-machine-learning-results/
  
akurt March 17, 2021 at 1:19 am #

Thank you for this great article.
What if I scaled my dataset? Should I provide the input in raw data format or should I scale it first? In my case, my model fails when provide my input in raw data format but when I provide a scaled form it can predict successfully.

- Jason Brownlee March 17, 2021 at 6:09 am #
  
  Evaluate the performance of your model with and without scaling and use the approach that works the best.
  
Priyanka March 20, 2021 at 12:53 am #

Hi Jason,

Thanks for your post.. good read
Any idea how to link model with the GUI created in Tkinter for predicting output basis input shared through GUI

- Jason Brownlee March 20, 2021 at 5:24 am #
  
  Sounds like a software engineering question, perhaps you can post your code and question to stackoverflow.
  
Ganpat Patel March 24, 2021 at 8:34 pm #

Hi Jason
Great article!
I had a doubt, it would be really nice of you if you help me out.

I am predicting the “User Rating” of books. In this model, the key feature can be the “Author” name. Now author name is in a string. So, first i used Label Encoder and it assigned each author name with a unique number. But now, when i take the new input to which i want to predict the value. The problem is that, author name will be a “string” in the input. So now how will i know that which number is assigned to that particular author when i fitted the model?
To get around this problem, I first added my new input features to my existing dataframe and then used the Label Encoder again. But with this approach I get a wrong prediction if the author is a new one who does not have any historical data. The Label Encoder alters the numbers assigned to different names when we use it a second time after adding our input.

What can be the approach to solve such problems?
please help…

- Jason Brownlee March 25, 2021 at 4:43 am #
  
  Some ideas to explore:
  
  Perhaps you can prepare all known authors in your domain?
  Perhaps you can use a bag of words encoding?
  Perhaps you can use a word embedding for authors?
  Perhaps you can add to your encoding as new authors are encountered?
  Perhaps you can remove author from your model or confirm that it is predictive?
  Perhaps you can check the literature for common solutions?
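
One more option to consider, sketched below, is to fit the encoder once on the training authors and reserve a fixed code for any author not seen during training. This assumes scikit-learn 0.24 or later, where OrdinalEncoder accepts handle_unknown='use_encoded_value'. The author names and the choice of -1 as the unknown code are made up for illustration.

```python
# encode author names once, mapping authors not seen in training to a fixed code
from sklearn.preprocessing import OrdinalEncoder
# made-up training authors, one per row
authors = [['Austen'], ['Orwell'], ['Tolkien']]
# unseen authors are encoded as -1 instead of raising an error (an assumed choice)
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(authors)
# new inputs: one known author and one author not present in the training data
new_authors = [['Orwell'], ['Unseen Author']]
# reuse the fitted encoder so known authors keep their original codes
encoded = encoder.transform(new_authors)
print(encoded)
```

Because the encoder is fit only once, the number assigned to each known author does not change when new inputs arrive.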
  
Anass March 26, 2021 at 11:22 am #

hi Jason

thank you very much for your awesome courses.

I kindly ask for some help.

I have trained a machine learning model on multiple embeddings at a time and i want to test on just one embedding. The input shape of my model is (?, 10, 300) 10 embeddings with a dimension of 300. and I want to test on (?, 1, 300).

What can be the approach to solve such problems?
please help me.

- Jason Brownlee March 29, 2021 at 5:49 am #
  
  You’re welcome.
  
  Sorry, I don’t understand why you would have a separate embedding for train and test?
  
Sara May 28, 2021 at 4:28 am #

Hi Jason,
I have a problem fitting a Machine learning model (BoostedTreesRegressor) to input features where the input features contain variables that must be trained (optimized) during training. I do not mean hyper parameters, I mean variables that gradually change similar to weights/bias in a neural network for example.
I think sklearn libraries are like a black box that only train the model parameters and there is no way to train additional variables or parameters . While, in Tensor flow 1, by using tf.Variable and sess.run we can train variables addition to regular model variables.
However, I can not find any way to use tf.Variable with BoostedTreesRegressor in tensor flow.
what do you think about possibility of training extra variables in python libraries? any help would be highly appreciated. Sara

- Jason Brownlee May 28, 2021 at 6:48 am #
  
  Perhaps fit a new model each time the data changes to overcome the drift in concepts.
  
Abdul Munem May 30, 2021 at 5:03 am #

Hello Jason, great explanation but I have a bit of a weird question

I am training a credit card fraud detection model using a dataset from kaggle that is really popular with this kind of project, it has 31 features and only 3 of them are known which are: time elapsed since the first transaction in the dataset, amount of money, and transaction class (whether it’s fraud or legit) and there are 28 other features that the dataset author said are reduced using PCA to preserve user confidentiality. Thing is, the program I aim to connect this model to has fewer input features and I don’t know how to use my model to predict this.

This is for a university project so that’s why our app dataset that contains transaction info isn’t as packed as a real-life transaction would be, it also doesn’t help that the 28 aforementioned features are unknown for confidentiality reasons, any suggestions to handle this?

- Jason Brownlee May 30, 2021 at 5:54 am #
  
  Thanks!
  
  Not sure I completely understand, sorry.
  
  Typically a model must be tailored to the data available. If it was prepared with different data (different number of features), it is probably not appropriate.
  
  Perhaps you can use transfer learning if the model and new data are related.
  Perhaps you can impute missing values.
  
Priyanshu July 15, 2021 at 5:11 pm #

Hi Jason, Great article
I have a question: how can we separate the values that we get for predictions, like [0.23 0.77]? I want [.23] in one column and [.77] in a different column. The reason is that I want to train a new model based on the output predictions.

- Jason Brownlee July 16, 2021 at 5:22 am #
  
  Perhaps you can round your values?
  Perhaps you can use formatting to limit the precision when the values are displayed?
  
Timothy J Fisher July 22, 2021 at 9:10 am #

Is there any way in R, to match the True Positive results back to the original data set in order to further examine attributes about that segment that was predicted correctly? I am using H2O and the H2O frames appear to drop unique identifiers that would enable a join back to the original training data set that contains the attributes I’d like to examine further.

Any suggestions?

- Jason Brownlee July 23, 2021 at 5:42 am #
  
  Yes, you will need to write some custom code for this analysis.
  
Anamika August 10, 2021 at 4:02 pm #

Hi,
Thanks for sharing above information, It’s really helpful . Can you please tell me if there is any way by which after fitting the input to our predicted model. We can calculate accuracy along with output for each input value. Like in above example
[-1.31478732 -8.78528161] 0
[ 1.57989896 -1.462412 ] 1
[ 1.36989667 -1.3964704 ] 1

I want to predict accuracy for each input data ?

- Adrian Tam August 11, 2021 at 6:38 am #
  
  Probably scikit-learn function https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html can help you.
  
mimiya August 23, 2021 at 4:38 am #

perfect work , thank you.

how can i add a column ” prediction” which contain the prediction result to the test set .

- Adrian Tam August 23, 2021 at 5:20 am #
  
  Maybe you can get the prediction first, and then use the numpy.hstack() function to “add” the column
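
A short sketch of the hstack() approach follows, using the tutorial's dataset in place of a real test set. The predictions are reshaped into a single column so they can be stacked next to the inputs; the name combined is illustrative.

```python
# append the predictions as an extra column next to the input columns
from numpy import hstack
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# create the inputs and outputs
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
# define and fit the model
model = LogisticRegression(solver='lbfgs')
model.fit(X, y)
# make predictions, one per input row
yhat = model.predict(X)
# reshape predictions to a column and stack it to the right of the inputs
combined = hstack((X, yhat.reshape(-1, 1)))
print(combined.shape)
print(combined[:3])
```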
  
Ashok September 13, 2021 at 4:57 am #

Hi,
I have encoded features and target values using CountVectorizer and LabelEncoder respectively. After predicting on the validation dataset, how do I convert the predicted target column values back to the original values and concatenate them with the feature column?
pls help.
Thanks in advance.

- Adrian Tam September 14, 2021 at 1:41 pm #
  
  LabelEncoder has the inverse_transform() function that can help you.
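
A small sketch of that call is shown below. The label strings and the predicted integers are made up for illustration; the important part is that the same fitted LabelEncoder is used for encoding and decoding.

```python
# encode string labels as integers, then decode predicted integers back to strings
from sklearn.preprocessing import LabelEncoder
# made-up target labels
labels = ['ham', 'spam', 'ham', 'spam']
# fit the encoder and encode the labels (classes are sorted alphabetically)
encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)
print(encoder.classes_, encoded)
# stand-in for integer predictions from a model
predicted = [1, 0, 1]
# map the integers back to the original strings, in the same order
decoded = encoder.inverse_transform(predicted)
print(decoded)
```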
  
Ashok September 16, 2021 at 3:22 am #

Thank you Adrian.

Reply
Ahmet November 24, 2021 at 9:19 am #

Hello,

I have a dataset with around 40,000 rows, 3 attributes and 1 class attribute that I make predictions on, and the rows are ordered according to that class attribute (0 to 20). I can make predictions one row at a time. My question is:

How can I make predictions after looking at 5 rows at a time? I don’t want a prediction for each row; I want to see predictions after looking at batches of 5 rows. That means 40,000/5 = 8,000 predictions, and I want to calculate accuracy according to these predictions.

Please help…

Reply
- Adrian Tam November 24, 2021 at 1:16 pm #
  
  No, you still calculate accuracy row by row. That’s how accuracy is supposed to be calculated. You can run predictions by batch, however. Each time you run “model.predict(X)” you can pass in a batch as the array X.
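A minimal sketch of predicting in batches of 5 rows, assuming a synthetic make_blobs dataset and a LogisticRegression model. Each call to predict() still returns one prediction per row in the batch:

```python
# Sketch: call predict() on batches of rows, then score all rows together
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# synthetic data standing in for a real dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
model = LogisticRegression()
model.fit(X, y)

batch_size = 5
batches = []
for start in range(0, len(X), batch_size):
    # slice out the next batch of rows; order is preserved
    X_batch = X[start:start + batch_size]
    batches.append(model.predict(X_batch))

# join the per-batch predictions back into one array aligned with X and y
yhat = np.concatenate(batches)
acc = accuracy_score(y, yhat)
print('Batches: %d, Accuracy: %.3f' % (len(batches), acc))
```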
  
  Reply
keyvan December 3, 2021 at 10:04 am #

Hi Adrian,
I have a question. I have a dataset in which one feature of each sample is a string, the other features are numeric, and the target is numeric too. I changed the string feature to a numeric one using OrdinalEncoder and trained my model (Random Forest Regression). Now I want to predict on a new input, but when I give the model the new input (one string feature and the other numeric features), Python gives me an error: ‘could not convert string to float’
Can you help me, please?
Thanks

Reply
- Adrian Tam December 8, 2021 at 6:49 am #
  
  What code gives you that error?
  
  Reply
Georgy June 27, 2022 at 4:11 am #

Hi Adrian,

I am working on a Human Activity Classification model which takes in real-time data and predicts and counts the number of times an activity is performed. While training and testing, we take in the data from a CSV. Since it’s a time series model, I was planning to use a sliding window for segmentation. I am stuck on understanding how this would be implemented in real time when data comes in from accelerometers. Do I have to store the data first in a similar CSV and then take the required samples, or is there another way to do this in Python?

Also, if you could guide me to any sources or materials which implement something similar, I would be really grateful.

Reply
- James Carmichael June 27, 2022 at 10:33 am #
  
  Hi Georgy…I would recommend storing them in a CSV as you suggested, as this would enable you to load the data into dataframes and perform data preparation to ensure your data is as “clean” and consistent as possible prior to fitting a model on it. The following is a great resource that may be of interest to you:
  
  https://machinelearningmastery.com/data-preparation-for-machine-learning-7-day-mini-course/
  
  Reply
Tom Lu June 29, 2022 at 8:54 am #

Hello Jason, great article. I really enjoy it.

A quick question: if I have a very large number of X[i] and yhat[i], say i = 20,000, and I wish to export the results to an Excel file rather than using the print(X[i], yhat[i]) function, how do I code that?

Your help is greatly appreciated.

Reply
- James Carmichael June 29, 2022 at 1:07 pm #
  
  Hi Tom…The following resource may be of interest to you:
  
  https://medium.com/@kasiarachuta/importing-and-exporting-csv-files-in-python-7fa6e4d9f408
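A minimal sketch of one way to do this with pandas, which is an assumption here and not taken from the linked article. It writes the inputs and predictions to a CSV file that Excel can open. The column names and the file name predictions.csv are hypothetical:

```python
# Sketch: save inputs and predictions together in a CSV file
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# synthetic data standing in for a real dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
model = LogisticRegression()
model.fit(X, y)
yhat = model.predict(X)

# one row per input, with the prediction as the last column
results = pd.DataFrame(X, columns=['input_1', 'input_2'])
results['yhat'] = yhat

# hypothetical output file name; index=False leaves out the row numbers
results.to_csv('predictions.csv', index=False)
```

pandas also has a DataFrame.to_excel() method for writing .xlsx files directly, but it needs an additional Excel writer package such as openpyxl to be installed.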
  
  Reply
zandel August 13, 2022 at 1:15 am #

Thank you so much, spent the evening looking for this answer. (I’m a beginner as far as ML). Haven’t tried your code just yet but it looks good to me so I didn’t wanna wait to say thank you. Btw doesn’t anyone else think it is kinda funny how all the tutorials on the net explain ALMOST everything about predict() except this “little” thing which is actually essential when you want to use the predictions.

Reply
- James Carmichael August 13, 2022 at 6:02 am #
  
  You are very welcome zandel! Thank you for your support and feedback! We greatly appreciate it.
  
  Reply
zandel August 14, 2022 at 2:09 pm #

Could you please help me with the following? Not necessarily the whole solution; a pointer to a good article would suffice. Once we have a model that produces good predictions we may want to forget about training and test data and just focus on the data that we want predictions for. I am not even sure what the proper term is, let’s call it “working data” here. So for example tomorrow there will be Monday’s data and I want Monday’s predictions, on Tuesday there will be Tuesday’s data and so on. Of course I would know how to read in the new CSV file every day, but after that step all the examples I found talk about splitting into train and test, and I don’t want to do that every time.
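One common pattern for this situation, sketched here as an assumption and not as the tutorial's method, is to fit the model once, save it, and on later days only load it and call predict() on the new rows. The file name final_model.pkl is hypothetical, and a slice of the synthetic data stands in for a day's new CSV:

```python
# Sketch: fit once, save the model, then reuse it on new data without any split
import pickle
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# one-off step: fit on all available labelled data and save the model
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
model = LogisticRegression()
model.fit(X, y)
with open('final_model.pkl', 'wb') as f:  # hypothetical file name
    pickle.dump(model, f)

# daily step: load the saved model and predict on the new rows only
with open('final_model.pkl', 'rb') as f:
    loaded = pickle.load(f)

# stand-in for the new day's data (same columns, no labels needed)
X_new = X[:5]
yhat_new = loaded.predict(X_new)
for i in range(len(X_new)):
    print(X_new[i], yhat_new[i])
```

Any data preparation applied before training has to be applied in the same way to each day's new data.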

Reply
Brian October 27, 2022 at 2:20 pm #

Thank you for a great article. I used this technique to test one of my ML models. Sadly, when I provide new input (the same as the first row of the test set) as your method suggests, it predicts inaccurately. I am stumped.

Reply
- James Carmichael October 28, 2022 at 8:32 am #
  
  Hi Brian…You may find the following resource helpful to improve accuracy:
  
  https://machinelearningmastery.com/improve-model-accuracy-with-data-pre-processing/
  
  Reply
Brian October 28, 2022 at 5:20 pm #

Thanks James, I am going to look into it more deeply. I actually think I AM getting accurate results, but everything was so jumbled in my mind that I may have been confused. Your help is greatly appreciated. I think I will get it sorted out. Thanks!!

Reply
Tabitha November 14, 2022 at 8:16 pm #

Hi,
Great article. I was trying this with a NumPy array.
x = np.array([121.75784416 ,124.20093101,116.35841785,108.6051847 , 95.61268754,
77.59019338 ,50.31384723, 23.38073107 , 1.46549092 , -8.56279478,
-13.12810052 ,-14.60476827 ,-14.54182944, -12.89880915 ,-10.53529561,
-6.77584712 , -4.21031351 , -1.80700285 , 0.5323695 , 2.84602143,
5.14464137 , 5.2340063 , 5.67851523 , 5.75220116 , 6.42275109,
7.13850903 , 7.78268196 , 8.22157889 , 8.40510482 , 7.69649975,
7.95839569 , 7.23596462 , 7.71076855 , 6.82815448 , 7.11350441,
6.96096235 , 6.35443428 , 6.16953221 , 4.97815714 , 5.26981308,
4.57670701 , 5.54436694])
I am not quite sure how to initialize the output, that is y. I tried to keep y as
y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
But I am getting an error, i.e., ‘tuple’ object has no attribute ‘shape’
Kindly help me resolve this.

Reply
- James Carmichael November 15, 2022 at 7:51 am #
  
  Hi Tabitha…
  
  Thanks for asking.
  
  I’m eager to help, but I just don’t have the capacity to debug code for you.
  
  I am happy to make some suggestions:
  
  Consider aggressively cutting the code back to the minimum required. This will help you isolate the problem and focus on it.
  Consider cutting the problem back to just one or a few simple examples.
  Consider finding other similar code examples that do work and slowly modify them to meet your needs. This might expose your misstep.
  Consider posting your question and code to StackOverflow.
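On the specific error in the question: make_blobs() returns a tuple of (inputs, labels), so assigning its result to a single name gives a tuple, which has no shape attribute. A minimal sketch of unpacking it:

```python
# Sketch: make_blobs returns a tuple (X, y), not a single array
from sklearn.datasets import make_blobs

result = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=2)
print(type(result))

# unpack the tuple into the input array and the label array
X, y = result
print(X.shape, y.shape)
```

When using your own inputs, y needs one label per input row, and scikit-learn expects the inputs as a 2D array, so a 1D array of values would be reshaped with x.reshape(-1, 1) first.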
  
  Reply
Mohammad Aleem April 22, 2023 at 6:40 pm #

Hi Jason B,

I am developing a logistic regression model where input data has a lot of categorical variables as predictors. I used one-hot encoding to get the numerical values for those columns using get_dummies function. The model works fine in predicting outcomes for training and test data, which was created from the dataset using train_test_split() function of sklearn library. Now, I am trying to use a single line/row of input for prediction and cannot run all the steps I ran for the training set of data as get_dummies returns 0 rows for 1 row of data. How do I achieve categorical to numerical conversion for one row of data? Please show an example of doing this. Thanks for the nice article.

Reply
- James Carmichael April 23, 2023 at 10:53 am #
  
  Hi Mohammad…The following resources may be of interest to you regarding logistic regression:
  
  https://machinelearningmastery.com/logistic-regression-tutorial-for-machine-learning/
  
  https://machinelearningmastery.com/implement-logistic-regression-stochastic-gradient-descent-scratch-python/
  
  https://machinelearningmastery.com/multinomial-logistic-regression-with-python/
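For the single-row encoding part of the question, one option is to fit a scikit-learn OneHotEncoder on the training data and reuse the fitted encoder for each new row, so the columns always match. This is a sketch under that assumption, not taken from the linked tutorials, and the column names and values are made up:

```python
# Sketch: reuse a fitted OneHotEncoder to encode a single new row
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# made-up categorical training data
train = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'S', 'L', 'M'],
})
y = [0, 1, 1, 0, 1, 0]

# learn the categories from the training data only;
# unseen categories in new data are encoded as all zeros instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore')
X_train = encoder.fit_transform(train)

model = LogisticRegression()
model.fit(X_train, y)

# a single new row with the same columns as the training data
new_row = pd.DataFrame({'color': ['green'], 'size': ['L']})
X_new = encoder.transform(new_row)
yhat = model.predict(X_new)
print(X_new.shape, yhat)
```

If you prefer to stay with get_dummies(), an alternative is to encode the new row and then align it to the training columns with DataFrame.reindex(columns=..., fill_value=0).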
  
  Reply
Imene August 8, 2023 at 4:12 am #

Hi,

Thank you very much for this interesting topic.

Please, how can we relate the input data with their prediction after a 5-fold cross validation ? For example, if we use the following instruction:

predicted_labels = cross_val_predict(model, texts, labels, cv=5) # texts are our data since it is classification of text using BERT.

Is it possible to put the texts, labels and predicted labels all together by using the following code:

predictions_data = {
    "Text": texts,
    "Gold_Label": labels,
    "Predicted_Label": predicted_labels,
}

predictions_with_data_new_order = pd.DataFrame(predictions_data)

My worry is that texts and labels preserve their original order, and hence it would be incorrect to match them with predicted_labels, which may have a different order due to the shuffling of data in cross validation.
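On the ordering concern: cross_val_predict() returns one prediction per input row, placed at that row's original position, even when the folds are shuffled. A minimal sketch that checks this on synthetic data, assuming a LogisticRegression model in place of BERT and a shuffled KFold:

```python
# Sketch: check that cross_val_predict output lines up with the input order
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# synthetic data standing in for the texts and labels
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=2)

# shuffled 5-fold split with a fixed seed so the folds are repeatable
cv = KFold(n_splits=5, shuffle=True, random_state=1)
yhat_cv = cross_val_predict(LogisticRegression(), X, y, cv=cv)

# rebuild the same predictions by hand, writing each one back to its original row index
yhat_manual = np.empty_like(y)
for train_ix, test_ix in cv.split(X):
    fold_model = LogisticRegression()
    fold_model.fit(X[train_ix], y[train_ix])
    yhat_manual[test_ix] = fold_model.predict(X[test_ix])

# True means row i of yhat_cv is the held-out prediction for row i of X
print(np.array_equal(yhat_cv, yhat_manual))
```

Under this behavior, building a DataFrame from texts, labels and predicted_labels in their original order keeps each row matched.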

Reply
Rohit September 19, 2023 at 11:19 pm #

I am working on the Power consumption and production dataset.

I want to predict consumption for each of the customers(say 10) available in the system for the next time stamp.

How can I get the predictions for each of the customers? Also, add these predicted or actual values of current timestamps back to the model trained on the input dataset.

Reply
- James Carmichael September 20, 2023 at 10:21 am #
  
  Hi Rohit…The following resource may be of interest to you:
  
  https://machinelearningmastery.com/how-to-develop-lstm-models-for-multi-step-time-series-forecasting-of-household-power-consumption/
  
  Reply
Rohit September 19, 2023 at 11:21 pm #

Hi, Thanks for the article.

I am working on the Power consumption and production dataset.

I want to predict consumption for each of the customers(say 10) available in the system for the next time stamp.

How can I get the predictions for each of the customers? Also, add these predicted or actual values of current timestamps back to the model trained on the input dataset.

Reply
Vincent March 22, 2024 at 9:46 pm #

Hello, I am working on a project to predict yields of rice. My train dataset has 40 features and my test dataset has 39 features. I have trained the model and got a good accuracy. The problem comes when I want to deploy it, as I cannot use all 40 features as inputs, so I am facing a challenge with how to specify the input features I want. This is the error I am getting:
ValueError: X has 4 features, but RandomForestRegressor is expecting 40 features as input.
Please help. Thank you.

Reply
- James Carmichael March 23, 2024 at 8:25 am #
  
  Hi Vincent…The error you’re encountering indicates that your model, a RandomForestRegressor, was trained with data that has 40 features, but when you’re trying to make predictions or apply the model, you’re providing input with only 4 features. This discrepancy is causing the error.
  
  To resolve this issue, you need to ensure that the input data for prediction has the same structure (i.e., the same number of features) as the data you used for training. Here’s a step-by-step approach to tackle the problem:
  
  1. Identify Missing Features
  First, you need to identify which features are missing in your deployment dataset. Compare the columns in your training dataset with those in your deployment dataset to find out which ones are missing.
  
  2. Feature Engineering for Deployment
  If there are features that you cannot directly use or calculate at deployment time, you’ll need to modify your model or the way you’re collecting data. Options include:
  
  Removing Features: If the missing features are not crucial for predictions, you could consider retraining your model without these features. This means you would drop the same features from your training dataset and retrain.
  
  Default Values or Imputation: For some features, it might make sense to fill in missing values with defaults or use some form of imputation (mean, median, mode, or even model-based imputation). This should be done with caution and domain knowledge, as it could affect the accuracy of your model.
  
  Feature Engineering: Sometimes, you can derive the missing features from the ones you have at deployment. This requires understanding the domain and the relationship between features.
  
  3. Consistent Data Transformation
  Ensure that any data preprocessing or transformation (scaling, encoding categorical variables, etc.) applied to the training data is also consistently applied to the data used for predictions.
  
  4. Update Your Deployment Script
  Once you’ve addressed the missing features, update your deployment script to ensure it:
  
  Correctly includes all necessary features.
  Applies the same preprocessing steps as your training pipeline.
  Can handle cases where certain data might be missing or unavailable, if applicable.
  
  Example Fix
  If the error arises because you’re only supplying 4 features out of 40, you need to ensure all 40 features are included in the input dataframe for prediction. For instance, if you’re using pandas:
  
  import pandas as pd
  
  # Assume feature_names is a list of your 40 feature names, in training order,
  # and new_rows is a list of dicts holding the values available at deployment
  
  # Create a dataframe with the same columns, in the same order, as the training data
  input_df = pd.DataFrame(new_rows).reindex(columns=feature_names)
  
  # Columns missing from new_rows are NaN at this point;
  # set default or imputed values for them (0 here is only a placeholder)
  input_df = input_df.fillna(0)
  
  # Now you can use input_df for prediction
  predictions = trained_model.predict(input_df)
  
  If specific features are unavailable or inapplicable at the time of deployment, consider the strategies mentioned earlier for handling them.
  
  Reply

