Train-Test Split for Evaluating Machine Learning Algorithms

By Jason Brownlee on August 26, 2020 in Python Machine Learning

The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model.

It is a fast and easy procedure to perform, and its results allow you to compare the performance of machine learning algorithms for your predictive modeling problem. Although it is simple to use and interpret, there are times when the procedure should not be used, such as when you have a small dataset, and situations where additional configuration is required, such as when it is used for classification and the dataset is not balanced.

In this tutorial, you will discover how to evaluate machine learning models using the train-test split.

After completing this tutorial, you will know:

The train-test split procedure is appropriate when you have a very large dataset, a costly model to train, or require a good estimate of model performance quickly.
How to use the scikit-learn machine learning library to perform the train-test split procedure.
How to evaluate machine learning algorithms for classification and regression using the train-test split.

Let’s get started.

Train-Test Split for Evaluating Machine Learning Algorithms
Photo by Paul VanDerWerf, some rights reserved.

Tutorial Overview

This tutorial is divided into three parts; they are:

Train-Test Split Evaluation
1. When to Use the Train-Test Split
2. How to Configure the Train-Test Split
Train-Test Split Procedure in Scikit-Learn
1. Repeatable Train-Test Splits
2. Stratified Train-Test Splits
Train-Test Split to Evaluate Machine Learning Models
1. Train-Test Split for Classification
2. Train-Test Split for Regression

Train-Test Split Evaluation

The train-test split is a technique for evaluating the performance of a machine learning algorithm.

It can be used for classification or regression problems and can be used for any supervised learning algorithm.

The procedure involves taking a dataset and dividing it into two subsets. The first subset is used to fit the model and is referred to as the training dataset. The second subset is not used to train the model; instead, the input element of the dataset is provided to the model, then predictions are made and compared to the expected values. This second dataset is referred to as the test dataset.

Train Dataset: Used to fit the machine learning model.
Test Dataset: Used to evaluate the fit machine learning model.

The objective is to estimate the performance of the machine learning model on new data: data not used to train the model.

This is how we expect to use the model in practice. Namely, to fit it on available data with known inputs and outputs, then make predictions on new examples in the future where we do not have the expected output or target values.

The train-test procedure is appropriate when there is a sufficiently large dataset available.

When to Use the Train-Test Split

The idea of “sufficiently large” is specific to each predictive modeling problem. It means that there is enough data to split the dataset into train and test datasets and that each of the train and test datasets is a suitable representation of the problem domain. This requires that the original dataset is also a suitable representation of the problem domain.

A suitable representation of the problem domain means that there are enough records to cover all common cases and most uncommon cases in the domain. This might mean combinations of input variables observed in practice. It might require thousands, hundreds of thousands, or millions of examples.

Conversely, the train-test procedure is not appropriate when the dataset available is small. The reason is that when the dataset is split into train and test sets, there will not be enough data in the training dataset for the model to learn an effective mapping of inputs to outputs. There will also not be enough data in the test set to effectively evaluate the model performance. The estimated performance could be overly optimistic (good) or overly pessimistic (bad).
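
As a rough illustration of this point, the sketch below repeats the train-test split on a small synthetic dataset with different random seeds and reports how far the accuracy estimate moves. The dataset (60 rows from make_classification()), the logistic regression model, and the 20 repeats are assumptions chosen for illustration; they are not part of the worked examples in this tutorial.

```python
# sketch: how much a train-test estimate can move on a small dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# small synthetic dataset (an assumption for illustration)
X, y = make_classification(n_samples=60, n_features=10, random_state=1)

scores = []
for seed in range(20):
    # a different random split on each iteration
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=seed)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

# summarize the range of estimates across splits
print('min=%.3f max=%.3f spread=%.3f' % (min(scores), max(scores), max(scores) - min(scores)))
```

A wide spread between the minimum and maximum suggests that any single split on a dataset of this size gives an unreliable estimate.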

If you have insufficient data, then a suitable alternative model evaluation procedure would be the k-fold cross-validation procedure.
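
A minimal sketch of k-fold cross-validation is shown below, using the scikit-learn KFold class and cross_val_score() function on a small synthetic dataset. The choice of 10 folds, the logistic regression model, and the dataset itself are assumptions for illustration only.

```python
# sketch: k-fold cross-validation on a small dataset
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# small synthetic dataset (an assumption for illustration)
X, y = make_classification(n_samples=60, n_features=10, random_state=1)

# every row is used for testing exactly once across the 10 folds
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring='accuracy', cv=cv)

# report the mean and standard deviation of the per-fold accuracy
print('Accuracy: %.3f (%.3f)' % (scores.mean(), scores.std()))
```

The cost is that the model is fit 10 times rather than once, which is why the train-test split remains attractive for expensive models.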

In addition to dataset size, another reason to use the train-test split evaluation procedure is computational efficiency.

Some models are very costly to train, and in that case, repeated evaluation used in other procedures is intractable. An example might be deep neural network models. In this case, the train-test procedure is commonly used.

Alternatively, a project may have an efficient model and a vast dataset but still require an estimate of model performance quickly. Again, the train-test split procedure is appropriate in this situation.

Samples from the original dataset are split into the two subsets using random selection. This is to ensure that the train and test datasets are representative of the original dataset.
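
The random selection can be sketched by hand with NumPy: shuffle the row indices, then slice them into two groups. This is only an illustration of the idea, using a made-up 12-row array and a 25 percent test fraction; it is not how scikit-learn implements its own splitting function.

```python
# sketch: a manual random train-test split with NumPy
import numpy as np

# made-up data: 12 rows, 2 input columns, one target value per row
X = np.arange(24).reshape(12, 2)
y = np.arange(12)

# shuffle the row indices with a seeded generator
rng = np.random.default_rng(1)
indices = rng.permutation(len(X))

# first 25 percent of the shuffled indices go to the test set
n_test = int(round(0.25 * len(X)))
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Because the two index groups are disjoint slices of one permutation, no row can appear in both subsets.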

How to Configure the Train-Test Split

The procedure has one main configuration parameter, which is the size of the train and test sets. This is most commonly expressed as a fraction between 0 and 1 for either the train or test dataset. For example, a training set with a size of 0.67 (67 percent) means that the remaining 0.33 (33 percent) is assigned to the test set.

There is no optimal split percentage.

You must choose a split percentage that meets your project’s objectives with considerations that include:

Computational cost in training the model.
Computational cost in evaluating the model.
Training set representativeness.
Test set representativeness.

Nevertheless, common split percentages include:

Train: 80%, Test: 20%
Train: 67%, Test: 33%
Train: 50%, Test: 50%
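
To make these percentages concrete, the sketch below applies each one to a synthetic dataset of 1,000 rows using the scikit-learn train_test_split() function (covered in the next section) and prints the resulting row counts. The dataset and the fixed seed are assumptions for illustration.

```python
# sketch: row counts produced by the common split percentages
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# synthetic dataset with 1,000 rows (an assumption for illustration)
X, y = make_blobs(n_samples=1000, random_state=1)

sizes = {}
for test_size in [0.20, 0.33, 0.50]:
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=1)
    # record and report the number of rows in each subset
    sizes[test_size] = (len(X_train), len(X_test))
    print('test_size=%.2f -> train=%d, test=%d' % (test_size, len(X_train), len(X_test)))
```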

Now that we are familiar with the train-test split model evaluation procedure, let’s look at how we can use this procedure in Python.

Train-Test Split Procedure in Scikit-Learn

The scikit-learn Python machine learning library provides an implementation of the train-test split evaluation procedure via the train_test_split() function.

The function takes a loaded dataset as input and returns the dataset split into two subsets.

...
# split into train test sets
train, test = train_test_split(dataset, ...)

Ideally, you can split your original dataset into input (X) and output (y) columns, then call the function passing both arrays and have them split appropriately into train and test subsets.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, ...)
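
When both arrays are passed in one call, the same rows are selected from each, so every input row stays paired with its target value. The sketch below checks this on a tiny made-up array where each target is derived from its row; the data and the seed are assumptions for illustration.

```python
# sketch: X and y stay aligned when split in a single call
import numpy as np
from sklearn.model_selection import train_test_split

# made-up data: the target is 10 times the first column of each row
X = np.arange(20).reshape(10, 2)
y = X[:, 0] * 10

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# if rows and targets are still paired, the relationship must still hold
train_aligned = np.array_equal(y_train, X_train[:, 0] * 10)
test_aligned = np.array_equal(y_test, X_test[:, 0] * 10)
print(train_aligned, test_aligned)
```

Splitting X and y in two separate calls without the same seed would break this pairing.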

The size of the split can be specified via the “test_size” argument, which takes either a number of rows (integer) or a fraction of the dataset between 0 and 1 (float).

The latter is the most common, with values used such as 0.33 where 33 percent of the dataset will be allocated to the test set and 67 percent will be allocated to the training set.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
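
For completeness, the integer form of “test_size” can be sketched as follows. Here 200 rows are requested for the test set from a synthetic dataset of 1,000 rows; the dataset and the seed are assumptions for illustration.

```python
# sketch: specify the test set size as a number of rows
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# synthetic dataset with 1,000 rows (an assumption for illustration)
X, y = make_blobs(n_samples=1000, random_state=1)

# an integer test_size is interpreted as an absolute row count
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=200, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```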

We can demonstrate this using a synthetic classification dataset with 1,000 examples.

The complete example is listed below.

# split a dataset into train and test sets
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=1000)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

Running the example splits the dataset into train and test sets, then prints the shapes of the resulting arrays.

We can see that 670 examples (67 percent) were allocated to the training set and 330 examples (33 percent) were allocated to the test set, as we specified.

(670, 2) (330, 2) (670,) (330,)

Alternatively, the dataset can be split by specifying the “train_size” argument, which can be either a number of rows (integer) or a fraction of the original dataset between 0 and 1, such as 0.67 for 67 percent.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.67)
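
Both arguments can also be given together, in which case they do not have to add up to the whole dataset. The sketch below uses half of a 1,000-row synthetic dataset for training and a quarter for testing, leaving the remaining quarter unused; the dataset, the seed, and the specific fractions are assumptions for illustration.

```python
# sketch: set train_size and test_size together
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# synthetic dataset with 1,000 rows (an assumption for illustration)
X, y = make_blobs(n_samples=1000, random_state=1)

# 50 percent for training, 25 percent for testing, the rest is left out
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, test_size=0.25, random_state=1)
print(X_train.shape, X_test.shape)
```

This can be handy for a quick experiment on a subset of a very large dataset.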

Repeatable Train-Test Splits

Another important consideration is that rows are assigned to the train and test sets randomly.

This is done to ensure that datasets are a representative sample (e.g. random sample) of the original dataset, which in turn, should be a representative sample of observations from the problem domain.

When comparing machine learning algorithms, it is desirable (perhaps required) that they are fit and evaluated on the same subsets of the dataset.

This can be achieved by fixing the seed for the pseudo-random number generator used when splitting the dataset. If you are new to pseudo-random number generators, see the tutorial:

Introduction to Random Number Generators for Machine Learning in Python

The seed is fixed by setting the “random_state” argument to an integer value. Any value will do; it is not a tunable hyperparameter.

...
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

The example below demonstrates this and shows that two separate splits of the data produce the same result.

# demonstrate that the train-test split procedure is repeatable
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=100)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])

Running the example splits the dataset and prints the first five rows of the training dataset.

The dataset is split again and the first five rows of the training dataset are printed showing identical values, confirming that when we fix the seed for the pseudorandom number generator, we get an identical split of the original dataset.

[[-2.54341511  4.98947608]
 [ 5.65996724 -8.50997751]
 [-2.5072835  10.06155749]
 [ 6.92679558 -5.91095498]
 [ 6.01313957 -7.7749444 ]]

[[-2.54341511  4.98947608]
 [ 5.65996724 -8.50997751]
 [-2.5072835  10.06155749]
 [ 6.92679558 -5.91095498]
 [ 6.01313957 -7.7749444 ]]
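
Rather than comparing printed rows by eye, the same check can be sketched programmatically with numpy.array_equal(). The version below also splits once with a different seed to show that the repeatability comes from the seed and not from the data; the dataset seed and the second seed value are assumptions for illustration.

```python
# sketch: confirm repeatability of the split programmatically
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# create dataset (seeded here so the whole sketch is repeatable)
X, y = make_blobs(n_samples=100, random_state=1)

# two splits with the same seed, one with a different seed
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=1)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=1)
c_train, c_test, _, _ = train_test_split(X, y, test_size=0.33, random_state=2)

# compare the full arrays rather than the first five rows
same_seed = np.array_equal(a_train, b_train) and np.array_equal(a_test, b_test)
diff_seed = np.array_equal(a_train, c_train)
print('same seed gives same split:', same_seed)
print('different seed gives same split:', diff_seed)
```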

Stratified Train-Test Splits

One final consideration is for classification problems only.

Some classification problems do not have a balanced number of examples for each class label. As such, it is desirable to split the dataset into train and test sets in a way that preserves the same proportions of examples in each class as observed in the original dataset.

This is called a stratified train-test split.

We can achieve this by setting the “stratify” argument to the y component of the original dataset. This will be used by the train_test_split() function to ensure that both the train and test sets have the proportion of examples in each class that is present in the provided “y” array.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

We can demonstrate this with an example of a classification dataset with 94 examples in one class and six examples in a second class.

First, we can split the dataset into train and test sets without the “stratify” argument. The complete example is listed below.

# split imbalanced dataset into train and test sets without stratification
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1)
print(Counter(y_train))
print(Counter(y_test))

Running the example first reports the composition of the dataset by class label, showing the expected 94 percent vs. 6 percent.

Then the dataset is split and the composition of the train and test sets is reported. We can see that the train set has 45/5 examples and the test set has 49/1 examples. The composition of the train and test sets differs, and this is not desirable.

Counter({0: 94, 1: 6})
Counter({0: 45, 1: 5})
Counter({0: 49, 1: 1})

Next, we can stratify the train-test split and compare the results.

# split imbalanced dataset into train and test sets with stratification
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
print(Counter(y))
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)
print(Counter(y_train))
print(Counter(y_test))

Given that we have used a 50 percent split for the train and test sets, we would expect both the train and test sets to have 47/3 examples in the majority/minority classes respectively.

Running the example, we can see that in this case, the stratified version of the train-test split has created both the train and test datasets with 47/3 examples in each, as we expected.

Counter({0: 94, 1: 6})
Counter({0: 47, 1: 3})
Counter({0: 47, 1: 3})
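
Raw counts are harder to compare when the train and test sets have different sizes, so it can help to look at proportions instead. The sketch below adds a small helper, class_proportions(), which is a hypothetical name introduced here for illustration and not part of scikit-learn, and applies it to the same stratified split.

```python
# sketch: compare class proportions after a stratified split
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def class_proportions(labels):
    # hypothetical helper: fraction of rows that belong to each class label
    counts = Counter(labels)
    total = len(labels)
    return {int(label): count / total for label, count in sorted(counts.items())}

# create dataset
X, y = make_classification(n_samples=100, weights=[0.94], flip_y=0, random_state=1)
# stratified split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=1, stratify=y)

# proportions should be close to those of the original dataset
p_all, p_train, p_test = class_proportions(y), class_proportions(y_train), class_proportions(y_test)
print(p_all)
print(p_train)
print(p_test)
```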

Now that we are familiar with the train_test_split() function, let’s look at how we can use it to evaluate a machine learning model.

Train-Test Split to Evaluate Machine Learning Models

In this section, we will explore using the train-test split procedure to evaluate machine learning models on standard classification and regression predictive modeling datasets.

Train-Test Split for Classification

We will demonstrate how to use the train-test split to evaluate a random forest algorithm on the sonar dataset.

The sonar dataset is a standard machine learning dataset composed of 208 rows of data with 60 numerical input variables and a target variable with two class values, i.e. binary classification.

The dataset involves predicting whether sonar returns indicate a rock or simulated mine.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads the dataset and summarizes its shape.

# summarize the sonar dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
# split into input and output elements
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Running the example downloads the dataset and splits it into input and output elements. As expected, we can see that there are 208 rows of data with 60 input variables.

(208, 60) (208,)

We can now evaluate a model using a train-test split.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we can split the dataset so that 67 percent is used to train the model and 33 percent is used to evaluate it. This split was chosen arbitrarily.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

We can then define and fit the model on the training dataset.

...
# fit the model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)

Then use the fit model to make predictions and evaluate the predictions using the classification accuracy performance metric.

...
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
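
Classification accuracy is simply the fraction of predictions that match the expected labels. The sketch below computes it by hand and with accuracy_score() on five made-up labels; the label values are hypothetical and are not taken from the sonar dataset.

```python
# sketch: classification accuracy computed by hand and with scikit-learn
import numpy as np
from sklearn.metrics import accuracy_score

# made-up expected and predicted labels (four of five match)
y_test = np.array(['R', 'M', 'M', 'R', 'M'])
yhat = np.array(['R', 'M', 'R', 'R', 'M'])

# fraction of positions where the prediction equals the expected label
manual_acc = np.mean(y_test == yhat)
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f (manual: %.3f)' % (acc, manual_acc))
```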

Tying this together, the complete example is listed below.

# train-test split evaluation random forest on the sonar dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestClassifier(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)

Running the example first loads the dataset and confirms the number of rows in the input and output elements.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

The dataset is split into train and test sets and we can see that there are 139 rows for training and 69 rows for the test set.

Finally, the model is evaluated on the test set and the performance of the model when making predictions on new data has an accuracy of about 78.3 percent.

(208, 60) (208,)
(139, 60) (69, 60) (139,) (69,)
Accuracy: 0.783
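
The suggestion in the note above, to run the example a few times and average the outcome, can be sketched as a loop. To keep the sketch self-contained it uses a synthetic dataset with the same shape as the sonar data (208 rows, 60 inputs) rather than downloading the real file; the dataset, the five repeats, and the 50 trees per forest are assumptions for illustration, so the scores will not match those reported for sonar.

```python
# sketch: average the score over several fits of a stochastic model
from numpy import mean, std
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# synthetic stand-in with the same shape as the sonar dataset (an assumption)
X, y = make_classification(n_samples=208, n_features=60, n_informative=10, random_state=1)
# hold the split fixed so only the model's randomness varies
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

scores = []
for seed in range(5):
    # a different seed for the random forest on each repeat
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

# report the average and the spread across repeats
print('Mean accuracy: %.3f (std %.3f)' % (mean(scores), std(scores)))
```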

Train-Test Split for Regression

We will demonstrate how to use the train-test split to evaluate a random forest algorithm on the housing dataset.

The housing dataset is a standard machine learning dataset composed of 506 rows of data with 13 numerical input variables and a numerical target variable.

The dataset involves predicting the house price given details of the house’s suburb in the American city of Boston.

No need to download the dataset; we will download it automatically as part of our worked examples.

The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset.

# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
# summarize shape
print(dataframe.shape)

Running the example confirms the 506 rows of data, with 13 input variables and a single numeric target variable (14 columns in total).

(506, 14)

We can now evaluate a model using a train-test split.

First, the loaded dataset must be split into input and output components.

...
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)

Next, we can split the dataset so that 67 percent is used to train the model and 33 percent is used to evaluate it. This split was chosen arbitrarily.

...
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

We can then define and fit the model on the training dataset.

...
# fit the model
model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)

Then use the fit model to make predictions and evaluate the predictions using the mean absolute error (MAE) performance metric.

...
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)
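
The mean absolute error is the average of the absolute differences between the expected and predicted values, so it is reported in the same units as the target. The sketch below computes it by hand and with mean_absolute_error() on four made-up values; the numbers are hypothetical and are not taken from the housing dataset.

```python
# sketch: mean absolute error computed by hand and with scikit-learn
import numpy as np
from sklearn.metrics import mean_absolute_error

# made-up expected and predicted values (absolute errors of 1, 1, 2 and 0)
y_test = np.array([24.0, 21.6, 34.7, 33.4])
yhat = np.array([25.0, 20.6, 32.7, 33.4])

# average of the absolute differences
manual_mae = np.mean(np.abs(y_test - yhat))
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f (manual: %.3f)' % (mae, manual_mae))
```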

Tying this together, the complete example is listed below.

# train-test split evaluation random forest on the housing dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
data = dataframe.values
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
# fit the model
model = RandomForestRegressor(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

Running the example first loads the dataset and confirms the number of rows in the input and output elements.

The dataset is split into train and test sets and we can see that there are 339 rows for training and 167 rows for the test set.

Finally, the model is evaluated on the test set and the performance of the model when making predictions on new data is a mean absolute error of about 2.157 (thousands of dollars).

(506, 13) (506,)
(339, 13) (167, 13) (339,) (167,)
MAE: 2.157

Summary

In this tutorial, you discovered how to evaluate machine learning models using the train-test split.

Specifically, you learned:

The train-test split procedure is appropriate when you have a very large dataset, a costly model to train, or require a good estimate of model performance quickly.
How to use the scikit-learn machine learning library to perform the train-test split procedure.
How to evaluate machine learning algorithms for classification and regression using the train-test split.

Do you have any questions?
Ask your questions in the comments below and I will do my best to answer.

79 Responses to Train-Test Split for Evaluating Machine Learning Algorithms

Usman July 24, 2020 at 10:57 am #

Excellent tutorial. Thanks

- Jason Brownlee July 24, 2020 at 10:59 am #
  
  Thanks Usman.
  
Yuthika July 24, 2020 at 2:22 pm #

Well explained tutorial

- Jason Brownlee July 25, 2020 at 6:05 am #
  
  Thanks!
  

bala July 24, 2020 at 6:51 pm #

hi jason,its wonderful explanation about train-test-split function i ever heard.i just made some modification to the code to find the exact point at which the accuracy is maximum and also to find additional insights.

# assumes X_train, X_test, y_train, y_test from the complete sonar example above
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
value_list=np.zeros(100)
accuracy_list=np.zeros(100)
# fit the model
for value in range(1,101):
    value_list[value-1]=value
    model = RandomForestClassifier(n_estimators=value,random_state=1)
    model.fit(X_train, y_train)
    # make predictions
    yhat = model.predict(X_test)
    # evaluate predictions
    acc = accuracy_score(y_test, yhat)
    accuracy_list[value-1]=acc

plt.plot(value_list,accuracy_list)
plt.show()
maximum_Value=np.where(accuracy_list == np.amax(accuracy_list))
print(maximum_Value)

Jason Brownlee July 25, 2020 at 6:15 am #

Thanks.


Prasenjit Mondal July 25, 2020 at 1:51 pm #

Awesome

- Jason Brownlee July 26, 2020 at 6:13 am #
  
  Thanks!
  
hanan Alsaiari July 26, 2020 at 3:53 am #

It is very useful.thank you so much.

i m looking for implementation Stackedautoencoder (high level denoising) in python regression problem please .

- Jason Brownlee July 26, 2020 at 6:24 am #
  
  You’re welcome.
  
  Thanks for the suggestion, I hope to write about the topic in the future.
  
  - hanan Alsaiari July 27, 2020 at 2:07 am #
    
    thank you so much for your quick replay. what is the best way to communicate with I have some question for my projects please.
    
    - Jason Brownlee July 27, 2020 at 5:49 am #
      
      You can contact me any time, I’m happy to answer questions, but I cannot review code/data/papers:
      https://machinelearningmastery.com/contact/
      
S AYISHA August 3, 2020 at 4:15 pm #

I have a doubt. How to split data with date as X varible. Because, svr model doestn’t fit for date variable. What do we do in such case?

- Jason Brownlee August 4, 2020 at 6:35 am #
  
  Typically we remove the date from the data prior to modeling.
  
sukh August 19, 2020 at 12:05 am #

sir if we add softmax function in binary classification for classification layer over sigmoid function?is there any benefits of softmax function over sigmoid?

- Jason Brownlee August 19, 2020 at 6:02 am #
  
  No, not for binary classification. It would probably be slightly less efficient.
  
Dina September 17, 2020 at 1:30 am #

When doing a PhD, do you use random_state with train test split?

- Jason Brownlee September 17, 2020 at 6:49 am #
  
  Sorry, I don’t understand. What does phd have to do with train/test split?
  
  You can design the experiments any way you like, as long as you justify your decisions.
  
  Reply
  - D September 17, 2020 at 8:33 pm #
    
    I meant when splitting the data, if I use random_state, then my results will always be the same.
    
    However, if random_state=None, then every time I will get a different result for the classifier.
    
    I can use random_state=1234 and my results are over 80%.
    Using random_state=None, they can range from 60-80%.
    
    Is it common practice, for PhD students, to set random_state to a number that gives you the best results?
    
    Reply
    - Jason Brownlee September 18, 2020 at 6:45 am #
      
      Correct.
      
      It depends on the specifics of your project, not on what degree you’re doing. As I said, you can choose any methodology you like as long as you justify it.
      
      Reply
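
To see how much a single train-test split depends on the seed, one option is to repeat the split with several values of random_state and look at the spread of scores rather than one number. The sketch below is illustrative only: the synthetic dataset, the LogisticRegression model, and the choice of 30 seeds are assumptions, not part of the tutorial.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; substitute your own X and y).
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

scores = []
for seed in range(30):
    # A different seed gives a different assignment of rows to train and test.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.33, random_state=seed, stratify=y)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    scores.append(accuracy_score(y_test, model.predict(X_test)))

scores = np.array(scores)
print("mean=%.3f std=%.3f min=%.3f max=%.3f" % (
    scores.mean(), scores.std(), scores.min(), scores.max()))
```

Reporting the mean and standard deviation over seeds describes the range of results directly, whereas a single seed reports only one point from that range.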
Juliana October 2, 2020 at 1:33 pm #

Hi Jason, I don’t want to split the data into train and test. I want to train ALL the records against my dataset. How to code it using python as test_size has to be greater than 0? Thanks.

Reply
- Jason Brownlee October 2, 2020 at 2:24 pm #
  
  Good question, see this tutorial:
  https://machinelearningmastery.com/make-predictions-scikit-learn/
  
  Reply
HSA October 27, 2020 at 5:10 am #

How to make sure that training examples are not repeated in testing examples?

Reply
- Jason Brownlee October 27, 2020 at 6:50 am #
  
  The train test split does this for you, as long as each row is unique.
  
  Reply
Sameer RS October 28, 2020 at 7:24 pm #

Hi Jason,

Nice & informative article. Thanks for sharing your thoughts regarding the same & giving more clarity to the topic.

However, with reference to the above topic, I have few doubts as follows:

a) Nowadays there is a trend being observed that dataset is split into 3 parts – Train set, Test Set & Validation Set.

However, a cross question—is this 3-way split necessary or will a 2-way split simply suffice?

i) If the answer is in affirmative, why do you do so and what are the advantages of a 3-way split over a 2-way split?

ii) If your reply is in the negative, what are the reasons for avoiding a 3-way split of the given dataset(s)?

b) Is a 3-way split superior to a 2-way split? Kindly explain.

c) Does a 3-way split result in the following:—

i) Is there loss of the original data?

ii) Does it result in a Bias & Variance Tradeoff ie. over-fitting of the model?

d) What is this k-fold validation procedure? Is there any recommended reference material that you suggest?

Reply
- Jason Brownlee October 29, 2020 at 7:59 am #
  
  The split you perform depends on your project and dataset.
  
  Validation sets are used to tune the model hyperparameters.
  
  You can learn more about validation datasets here:
  https://machinelearningmastery.com/difference-test-validation-datasets/
  
  Reply
Sameer RS October 28, 2020 at 8:48 pm #

One more doubt—

Why is it that in Python, we split the datasets into X_train, X_test, y_train, y_test?

I am asking this specifically because all this while I have worked with R. In R, you simply divide the dataset into a train set and a test set.

Similarly, I believe you can do the same in Python by using & thereafter executing the following code viz.:

train_set,test_set = train_test_split(dataset_name, test_size = 0.3)
print(train_set)

However,why or for what reasons is the one stated by you in the aforesaid tutorial favoured or rather extensively used??? I can see a replica of similar codes being used in other websites also.

What is wrong with the above code or what the limitations involved?

Need to understand the logic and reasons behind this.

Reply
- Jason Brownlee October 29, 2020 at 8:02 am #
  
  We train the model on the training set and evaluate its performance on the test set.
  
  The limitation of the train/test split is that it has a high variance. This can be overcome using k-fold cross-validation.
  
  Reply
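
On the question of splitting the whole dataset versus splitting X and y: train_test_split() accepts any number of arrays with the same number of rows, so both styles are possible. The sketch below uses a small made-up array whose last column is assumed to be the target, and compares the two approaches under the same random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 10 rows, 3 feature columns plus a target in the last column.
dataset = np.arange(40).reshape(10, 4)

# Option 1: split the whole table, then separate features and target.
train_set, test_set = train_test_split(dataset, test_size=0.3, random_state=1)
X_train_a, y_train_a = train_set[:, :-1], train_set[:, -1]
X_test_a, y_test_a = test_set[:, :-1], test_set[:, -1]

# Option 2: separate features and target first, then split both in one call.
X, y = dataset[:, :-1], dataset[:, -1]
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.3, random_state=1)

# With the same number of rows and the same seed, both options select the same rows.
print(np.array_equal(X_train_a, X_train_b), np.array_equal(y_test_a, y_test_b))
```

The X/y form is common in scikit-learn code mainly because model.fit() expects inputs and targets as separate arguments, so separating them up front saves a step later.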
Marios Png November 14, 2020 at 4:20 am #

Is the concept of the random split into train-test samples applicable for the occasions where time step is used in order to give artificially a 3rd dimension to our data set like in Convolutional or Recurrent Neural Networks? Or in this case is more preferable to have a sequential split into train-test samples?

Reply
- Jason Brownlee November 14, 2020 at 6:38 am #
  
  Yes. See this:
  https://machinelearningmastery.com/backtest-machine-learning-models-time-series-forecasting/
  
  Reply
Ran Da January 19, 2021 at 3:54 am #

Hello again, my data contains 63 features and 70 rows. When I use it with linear regression without a train-test split, I get an MAE of 0.3. With a train-test split, I get an MAE of 4.9.
So should I split the dataset? Is the MAE of 0.3 considered incorrect (overfitting)?

Reply
- Jason Brownlee January 19, 2021 at 6:38 am #
  
  No, it may be the split makes the two sets too small to be useful.
  
  Perhaps use k-fold cross-validation instead.
  
  Reply
  - Ran Da January 19, 2021 at 8:18 am #
    
    So the result is considered correct even if I am not using training data?
    
    Reply
    - Jason Brownlee January 19, 2021 at 9:43 am #
      
      Sorry, I don’t understand what you mean by correct? Perhaps you can elaborate?
      
      Reply
      - Ran Da January 19, 2021 at 9:57 am #
        
        I mean, is the MAE of 0.3 not considered overfitting? And should I not use a train-test split (training the dataset)?
      - Jason Brownlee January 19, 2021 at 10:06 am #
        
        Generally, if the model performs better on the training set than the test set, and test set performance is not skillful, the model might be overfitting.
        
        Perhaps this will help:
        https://machinelearningmastery.com/overfitting-machine-learning-models/
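
The k-fold suggestion in this thread can be sketched as follows. The data here is synthetic and only mimics the shape described (70 rows, 63 features); the noise level, the 10 folds, and the seed are assumptions. The example contrasts the MAE measured on the same rows used for fitting with the MAE estimated by cross-validation.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in with the same shape as the dataset described above (an assumption).
X, y = make_regression(n_samples=70, n_features=63, noise=0.5, random_state=1)

model = LinearRegression()

# MAE on the same rows the model was fit on: optimistic, not an estimate for new data.
model.fit(X, y)
train_mae = mean_absolute_error(y, model.predict(X))

# MAE estimated with 10-fold cross-validation: every row is held out exactly once.
cv = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(model, X, y, scoring="neg_mean_absolute_error", cv=cv)
cv_mae = -scores.mean()

print("MAE on training rows: %.3f" % train_mae)
print("MAE from 10-fold CV: %.3f (std %.3f)" % (cv_mae, scores.std()))
```

A large gap between the two numbers is the pattern described in the replies above: good performance on the data used for fitting that does not carry over to held-out rows.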
khansa Rana April 23, 2021 at 4:58 am #

How do I download the train and test datasets after splitting the dataset?

Reply
- Jason Brownlee April 23, 2021 at 5:06 am #
  
  Do you mean save? If so, see this:
  https://machinelearningmastery.com/how-to-save-a-numpy-array-to-file-for-machine-learning/
  
  Reply
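
As a minimal sketch of saving the two sets after splitting, the arrays can be written to CSV files with NumPy. The synthetic data, the file names, and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; substitute your own X and y).
X, y = make_classification(n_samples=100, n_features=5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# Hypothetical output location; use any directory you like.
out_dir = tempfile.mkdtemp()
train_path = os.path.join(out_dir, "train.csv")
test_path = os.path.join(out_dir, "test.csv")

# Store inputs and target side by side, with the target as the last column.
np.savetxt(train_path, np.column_stack((X_train, y_train)), delimiter=",")
np.savetxt(test_path, np.column_stack((X_test, y_test)), delimiter=",")

# Load one file back to confirm that the round trip works.
train_loaded = np.loadtxt(train_path, delimiter=",")
print(train_loaded.shape, train_path)
```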
Kamal Silva June 17, 2021 at 7:08 pm #

Hello Jason,
Great article! I have a small question. At which phase of the data mining process should we do the splitting? Is it after preprocessing or after doing the transformations? Does it have any effect on data leakage?

Reply
- Jason Brownlee June 18, 2021 at 5:38 am #
  
  Before data preparation.
  
  This will help you avoid data leakage:
  https://machinelearningmastery.com/data-preparation-without-data-leakage/
  
  Reply
  - Kamal Silva June 20, 2021 at 2:12 am #
    
    Thank you very much Jason
    
    Reply
Nour June 22, 2021 at 8:10 pm #

If I want to know the indexes of x_test and x_train in the original file, what is the code?
(x_test has elements 6,7,9,1) I want to know these indexes from the dataset file.
Thanks

Reply
- Jason Brownlee June 23, 2021 at 5:36 am #
  
  The train/test split will return arrays of rows indexes directly.
  
  Reply
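
One way to obtain the original row positions is to pass an index array through train_test_split() alongside the data, since the function splits every array it receives in the same way. This is a sketch with made-up data; the names indices, idx_train, and idx_test are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small made-up dataset: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# One index per row of the original data.
indices = np.arange(len(X))

# All three arrays are split consistently, so idx_* record which rows went where.
X_train, X_test, y_train, y_test, idx_train, idx_test = train_test_split(
    X, y, indices, test_size=0.3, random_state=1)

print("test rows come from positions:", idx_test)
```

If the data is held in a pandas DataFrame instead, the returned train and test frames keep the original index labels, so X_test.index gives the same information.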
Dylan June 27, 2021 at 3:26 am #

Hi Jason, I’ve recently applied a non-standard method for model evaluation. The method has a problem of being computationally expensive, but I’m having trouble convincing myself that standard methods like are sufficient. I was hoping you can provide input.

My goal is to prove that the addition of a new feature yields performance improvements. Since data splits influence results, I generate k train/test splits. The “train” split will be split into a training and validation set by the algorithm, using one of the methods that you described in your article. The test set is a hold-out set. The key difference is that I evaluate my model on multiple test sets.

The equivalent splits are performed on the original dataset and the one with the new features. K models are trained with the same parameters to produce the baseline and the model with the new feature. These models are evaluated against their corresponding test set. I then run a t-test on the distribution of evaluation metrics to demonstrate whether or not there is an improvement.

I am not convinced that a method like k-fold cross validation can guarantee that a test split might by chance favor one scenario. Thus, I applied the method described above. I was hoping you can explain why it does or validate my method.

Thank you!

Reply
- Jason Brownlee June 27, 2021 at 4:42 am #
  
  If the method gives you confidence, then go for it.
  
  I’d be wary going against 40+ years of experience in the field, e.g. repeated k-fold cross-validation + modified student’s t-test is the gold standard:
  https://machinelearningmastery.com/statistical-significance-tests-for-comparing-machine-learning-algorithms/
  
  Reply
Orange August 17, 2021 at 12:32 am #

Thank you for another helpful article.
In case of unsupervised approach, would stratify y work to balance both train and test datasets? or is there an alternative approach?

Reply
- Adrian Tam August 17, 2021 at 8:00 am #
  
  Yes, it works. Indeed stratify is also one way to deal with imbalanced datasets.
  
  Reply
Deepti August 29, 2021 at 1:05 am #

Excellent Tutorial.
But while executing:
X = preprocessing.StandardScalar().fit(X).transform(X) #.astype(float))
X[0:5]
Getting Error:
AttributeError: module ‘sklearn.preprocessing’ has no attribute ‘StandardScalar’

Another Can’t convert string to float

Reply
- Adrian Tam August 29, 2021 at 12:27 pm #
  
  You spelled it wrong. StandardScaler with an “e”.
  
  Reply
Prem September 15, 2021 at 11:28 am #

Hi Jason,
Thanks for this tutorial,

1. Do we have to do the split before or after normalisation, that is, fit the normalisation only on the training data and then use the scaler on the test data?
I think fitting only on the training data is correct.

2. If fitting only on the training data, how do I do a stratified split so that all string column values are evenly distributed across both the train and test dataframes?
I found this on Stack Overflow: https://stackoverflow.com/a/51525992/11053801. Is this a good approach?

Thanks

Reply
- Adrian Tam September 16, 2021 at 12:48 am #
  
  Normalization means you applied a scaler. This should be done for all data you ever feed into the model. You can fit the scaler with training data only but this fitted scaler should be reused for all input.
  
  Reply
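
The reply above can be sketched as follows: fit the scaler on the training rows only, then reuse the fitted scaler for the test rows (and for any later input). The synthetic data and the choice of StandardScaler are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=100, n_features=5, random_state=1)

# Split first, so that no statistic from the test rows reaches the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse the same mean/std on test rows

print("train column means:", X_train_scaled.mean(axis=0).round(3))
print("test column means: ", X_test_scaled.mean(axis=0).round(3))
```

The training columns end up with a mean of zero by construction, while the test columns are only close to zero, because they were scaled with statistics they did not contribute to.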
  - Prem September 20, 2021 at 9:11 am #
    
    Thanks for the answer.
    
    How to ensure the test, train split has all possible unique values of string columns in both X_Train and X_test?
    
    Reply
    - Adrian Tam September 20, 2021 at 2:36 pm #
      
      Not sure about that. What are the string columns you’re asking about?
      
      Reply
      - Prem September 21, 2021 at 9:10 am #
        
        Categorical Columns, If a particular column has 10 unique values, we have to ensure train and test data to have all 10 values,
        
        Instead of doing stratify in train_test_split based on Target column, May I know how to do based on entire dataset?
Amnah October 3, 2021 at 7:18 am #

Hi, I have built many deep learning classification models, but I don’t know how to identify the input shape for the models after I split my dataset using train_test_split(). Also, I want to ask whether the input shape differs from one model to another. I tried (x_train.shape[1]), (x_train.shape[1:]), (x_train.shape[0],x_train.shape[1]), (x_train.shape[1],x_train.shape[2]), and some numbers, but all of them caused a problem when I tried to fit the model.

Reply
- Adrian Tam October 6, 2021 at 7:43 am #
  
  Make sure you identify, in the output of train_test_split(), which dimension is the feature and which is the sample. You shouldn’t include samples in the input shape, so in most cases x_train.shape[0] should not be involved. For more detail, please see the long answer in https://stackoverflow.com/questions/44747343/keras-input-explanation-input-shape-units-batch-size-dim-etc
  
  Reply
Gloria November 21, 2021 at 10:26 pm #

Hi Adrian, thanks for this tutorial. I’d like to ask.
Stratified train-test splits require us to randomize the order of the data. What if I don’t want to shuffle it?
In other words, I want to keep dividing the data according to the percentage we want for each label, but keep the data in order.
Thank you.

Reply
- Adrian Tam November 23, 2021 at 1:11 pm #
  
  Look at the doc, there is a “shuffle” parameter that you can set to “False”: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
  
  Reply
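
Note that in scikit-learn, combining shuffle=False with a stratify argument raises a ValueError, so the two cannot be used together directly. One possible workaround, sketched below with made-up data, is to run the stratified split on row indices and then sort each set of indices. The rows are still chosen at random, but each subset keeps the original row order.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up ordered data: 20 rows, imbalanced labels (15 of class 0, 5 of class 1).
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

# Stratified split on the row indices rather than on the data itself.
indices = np.arange(len(y))
idx_train, idx_test = train_test_split(
    indices, test_size=0.2, stratify=y, random_state=1)

# Sorting restores the original ordering inside each subset.
idx_train = np.sort(idx_train)
idx_test = np.sort(idx_test)

X_train, X_test = X[idx_train], X[idx_test]
y_train, y_test = y[idx_train], y[idx_test]

print("test indices (in original order):", idx_test)
print("class counts in test set:", np.bincount(y_test))
```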
Alina Devkota January 22, 2022 at 9:06 pm #

How do we know for how many epochs should we train the model in such a setting?

Reply
- James Carmichael January 23, 2022 at 9:23 am #
  
  Hi Alina…You should set an upper limit to epochs to avoid overtraining. Plotting loss curves versus epochs will be beneficial. The following may be helpful:
  
  https://www.pluralsight.com/guides/data-visualization-deep-learning-model-using-matplotlib
  
  https://machinelearningmastery.com/early-stopping-to-avoid-overtraining-neural-network-models/
  
  Reply
Marco February 22, 2022 at 7:23 am #

Thank you very much for this helpful article.

I have a dataset made of different measurements of 2 signals and all the measurements have the same length, therefore each input sample is a matrix nx2. To each input matrix one scalar output should correspond.
I would please like to ask how to create a dataset that pairs each input matrix to the corresponding real number output? and what would be most suitable machine learning method for this kind of problem?

Thank you again

Reply
- James Carmichael February 23, 2022 at 12:34 pm #
  
  Hi Marco…Please clarify the goals of your machine learning model so that I may better assist you.
  
  Reply
  - Marco February 23, 2022 at 10:51 pm #
    
    Hi James,
    
    thank you for your reply. I am trying to predict the phase shift between the two signals.
    
    Reply
  - Marco February 23, 2022 at 10:54 pm #
    
    Hi James, thank you for your reply.. I am trying to predict the phase shift between the 2 signals
    
    Reply
Brij Bhushan March 14, 2022 at 3:16 pm #

I genuinely enjoy reading your articles. Your blog provided us useful information. You have done an outstanding job.

Reply
Macduff Olusa April 19, 2022 at 11:02 am #

This is the most lucid ML article I have ever read. Thank you for taking your time to write it.

Reply
- James Carmichael April 20, 2022 at 6:56 am #
  
  Thank you for the feedback Macduff!
  
  Reply
Jishan Ahmed May 15, 2022 at 12:10 pm #

Excellent tutorial! However, is it wise to stratify the continuous y (target) variable when you split your training and testing data from the total sample in a regression setting? In a regression setting, we cannot even use the stratify=y argument of the sklearn train_test_split function. I appreciate your time. Thanks!

Reply
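
For a continuous target, one commonly used workaround (not covered in the tutorial) is to bin the target into a few ranges and stratify on the bin labels instead of on y itself. The sketch below uses quartile bins on synthetic data; the number of bins and the names edges and bins are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=1)

# Assign each target value to one of four quartile bins (labels 0..3).
edges = np.quantile(y, [0.25, 0.5, 0.75])
bins = np.digitize(y, edges)

# Stratify on the bin labels so each range of y is represented in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=bins, random_state=1)

print("test rows per quartile bin:", np.bincount(np.digitize(y_test, edges)))
```

Whether this is worthwhile depends on the dataset. It tends to matter most when the sample is small or the target distribution is heavily skewed.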
Kondal May 16, 2022 at 6:28 am #

Hi James,

I have the following doubt: a decision tree supports a single split, i.e. predictor and target, to evaluate accuracy, whereas a random forest does not, as we need to split the data into X_train, X_test, y_train, and y_test. May I know the reason behind these two split methods? Also, please explain whether there is any model-based split technique that we need to follow.

Reply
- James Carmichael May 16, 2022 at 9:06 am #
  
  Hi Kondal…I am aware of potential issues when using small datasets. Could this be the case with your application?
  
  The following may also add clarity:
  
  https://realpython.com/train-test-split-python-data/
  
  https://www.kaggle.com/code/rafjaa/dealing-with-very-small-datasets/notebook
  
  Reply
Ben November 12, 2022 at 10:49 pm #

Hi. Thank you for the wonderful tutorial.

May I know what is the sequence of steps if we want to standardize and resample the dataset before training?
1) Split > Standardize > Resample
2) Split > Resample > Standardize
3) Standardize > Split > Resample

Reply
- James Carmichael November 13, 2022 at 9:23 am #
  
  Hi Ben…The following resource may be of interest:
  
  https://machinelearningmastery.com/training-validation-test-split-and-cross-validation-done-right/
  
  Reply
Vedant Gandhi March 4, 2023 at 9:56 pm #

TypeError: Indexing elements must be in increasing order
I’m getting this error when I’m using the function, could you help me with it?

with h5py.File(path, "r") as hdf:
    X_train, X_test = train_test_split(hdf['X'], test_size=0.20, random_state=33)

Reply
- James Carmichael March 5, 2023 at 9:32 am #
  
  Hi Vedant…You may find the following discussion helpful:
  
  https://github.com/madebyollin/acapellabot/issues/1
  
  Reply
aya July 10, 2023 at 8:29 am #

If we have two data frames, one for training and one for testing, and we want to split them, how do we do that?
Note: we want to use these files separately, so we won’t merge the two files and use train_test_split.
We want to do a split in this case.

Reply
Sibutha February 26, 2024 at 3:36 am #

After reading this, I feel I can take on Machine Learning. I was a bit intimidated initially. Now it’s all clarified. Thanks dude

Reply
- James Carmichael February 26, 2024 at 4:57 am #
  
  Thank you for the feedback Sibutha! We appreciate it!
  
  Reply
Allen June 27, 2024 at 7:27 am #

Hi!

So, I understand that train-test-split should be used during model selection/hyperparameter tuning, but what about when training our best model to be deployed on real, unseen data? Should the model still only be fitted on the training data, or can it use all the available labelled data that we have?

I was told that, since the data we have is only a (non-random) slice of all the existing data in the world, using all of our data might cause overfitting. On the other hand, the dataset I’m dealing with now is very imbalanced (~48 positives and ~5000 negatives), so I want to fit the model on as many of the few positive observations that we have.

Thanks!

(Also, I’ve learned a lot from this blog. I appreciate the dedication and the friendly tutorials!)

Reply
- James Carmichael June 27, 2024 at 9:11 am #
  
  Hi Allen…When training your final model for deployment, the general principle is to use as much relevant data as possible to maximize the performance of your model on unseen data. Here are some key considerations:
  
  ### Model Training for Deployment
  
  1. **Use All Available Data**:
  – **Final Model Training**: Once you have selected the best model and tuned its hyperparameters using techniques like cross-validation or a train-test split, you should train your final model on all available labeled data. This includes both the training and validation sets used during model selection.
  – **Reason**: More data typically allows the model to learn better representations and generalize better to new, unseen data. This is especially important for imbalanced datasets where positive examples are scarce.
  
  2. **Overfitting Concerns**:
  – **Overfitting Risk**: The concern about overfitting arises primarily during the model selection and hyperparameter tuning phase. Overfitting happens when a model learns not only the underlying patterns but also the noise in the training data.
  – **Mitigation**: Using techniques like cross-validation helps mitigate overfitting during model selection. When training the final model on all data, overfitting risk is generally lower since the model has already been validated on separate data during the selection process.
  
  ### Imbalanced Data
  
  1. **Imbalanced Dataset**:
  – **Issue**: Imbalanced datasets, like your example with ~48 positives and ~5000 negatives, pose a challenge because the model may become biased towards the majority class.
  – **Solutions**: Several techniques can help address imbalanced data:
  – **Resampling**: Use oversampling (e.g., SMOTE) to increase the number of positive examples or undersampling to reduce the number of negative examples.
  – **Class Weight Adjustment**: Adjust the class weights in your model to give more importance to the minority class.
  – **Algorithm Selection**: Some algorithms are inherently better at handling imbalanced datasets, such as tree-based methods and ensemble techniques.
  
  ### Practical Steps for Final Model Training
  
  1. **Hyperparameter Tuning and Model Selection**:
  – Use train-test split or cross-validation to select the best model and tune hyperparameters.
  – Evaluate model performance on a held-out validation set to ensure it generalizes well.
  
  2. **Training Final Model**:
  – Once the best model and hyperparameters are determined, retrain the model on the entire dataset (including the previous training and validation sets).
  – This maximizes the amount of data the model learns from, improving its robustness and generalizability.
  
  ### Example Workflow
  
  1. **Data Preparation**:
  from sklearn.model_selection import train_test_split
  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
  
  2. **Model Selection and Hyperparameter Tuning**:
  from sklearn.model_selection import GridSearchCV
  from sklearn.ensemble import RandomForestClassifier
  param_grid = {'n_estimators': [100, 200], 'max_depth': [10, 20]}
  model = RandomForestClassifier(class_weight='balanced')
  grid_search = GridSearchCV(model, param_grid, cv=5, scoring='roc_auc')
  grid_search.fit(X_train, y_train)
  best_model = grid_search.best_estimator_
  
  3. **Final Model Training**:
  import numpy as np
  # Combine training and validation data
  X_final = np.concatenate((X_train, X_val))
  y_final = np.concatenate((y_train, y_val))
  # Train the best model on all available data
  best_model.fit(X_final, y_final)
  
  4. **Model Evaluation**:
  – Evaluate the model on separate test data or through cross-validation to ensure it performs well.
  – Monitor performance metrics like ROC AUC, precision, recall, and F1-score, especially in the context of imbalanced data.
  
  By following these steps, you ensure that your model is well-tuned and trained on the maximum amount of relevant data, which is crucial for achieving the best possible performance on real-world, unseen data.
  
  Reply

Navigation

Train-Test Split for Evaluating Machine Learning Algorithms

Tutorial Overview

Train-Test Split Evaluation

When to Use the Train-Test Split

How to Configure the Train-Test Split

Train-Test Split Procedure in Scikit-Learn

Repeatable Train-Test Splits

Stratified Train-Test Splits

Train-Test Split to Evaluate Machine Learning Models

Train-Test Split for Classification

Train-Test Split for Regression

Further Reading

Summary

Discover Fast Machine Learning in Python!

Develop Your Own Models in Minutes

Finally Bring Machine Learning To
Your Own Projects

More On This Topic

79 Responses to Train-Test Split for Evaluating Machine Learning Algorithms

