The problem of predictive modeling is to create models that have good performance making predictions on new unseen data.

Therefore it is critically important to use robust techniques to train and evaluate your models on your available training data. The more reliable the estimate of the performance on your model, the further you can push the performance and be confident it will translate to the operational use of your model.

In this post you will discover the various different ways that you can estimate the performance of your machine learning models in Weka.

After reading this post you will know:

- How to evaluate your model using the training dataset.
- How to evaluate your model using a random train and test split.
- How to evaluate your model using k-fold cross validation.

Let’s get started.

## Model Evaluation Techniques

There are a number of model evaluation techniques that you can choose from, and the Weka machine learning workbench offers four of them, as follows:

### Training Dataset

Prepare your model on the entire training dataset, then evaluate the model on the same dataset. This is generally problematic not least because a perfect algorithm could game this evaluation technique by simply memorizing (storing) all training patterns and achieve a perfect score, which would be misleading.

### Supplied Test Set

Split your dataset manually using another program. Prepare your model on the entire training dataset and use the separate test set to evaluate the performance of the model. This is a good approach if you have a large dataset (many tens of thousands of instances).

### Percentage Split

Randomly split your dataset into a training and a testing partitions each time you evaluate a model. This can give you a very quick estimate of performance and like using a supplied test set, is preferable only when you have a large dataset.

### Cross Validation

Split the dataset into k-partitions or folds. Train a model on all of the partitions except one that is held out as the test set, then repeat this process creating k-different models and give each fold a chance of being held out as the test set. Then calculate the average performance of all k models. This is the gold standard for evaluating model performance, but has the cost of creating many more models.

You can see these techniques in the Weka Explorer on the “Classify” tab after you have loaded a dataset.

## Which Test Option to Use

Given that there are four different test options to choose from, which one should you use?

Each test option has a time and place, summarized as follows:

**Training Dataset**: Only to be used when you have all of the data and you are interested in creating a descriptive rather than a predictive model. Because you have all of the data, you do not need to make new predictions. You are interested in creating a model to better understand the problem.**Supplied Test Set**: When the data is very large, e.g. millions of records and you do not need all of it to train a model. Also useful when the test set has been defined by a third party.**Percentage Split**: Excellent to use to get a quick idea of the performance of a model. Not to be used to make decisions, unless you have a very large dataset and are confident (e.g. you have tested) that the splits sufficiently describe the problem. A common split value is 66% to 34% for train and test sets respectively.**Cross Validation**: The default. To be used when you are unsure. Generally provides a more accurate estimate of the performance than the other techniques. Not to be used when you have a very large data. Common values for k are 5 and 10, depending on the size of the dataset.

If in doubt, use k-fold cross validation where k is set to 10.

### Need more help with Weka for Machine Learning?

Take my free 14-day email course and discover how to use the platform step-by-step.

Click to sign-up and also get a free PDF Ebook version of the course.

## What About The Final Model

Test options are concerned with estimating the performance of a model on unseen data.

This is an important concept to internalize. The goal of predictive modeling is to create a model that performs best in a situation that we do not fully understand, the future with new unknown data. We must use these powerful statistical techniques to best estimate the performance of the model in this situation.

That being said, once we have chosen a model, it must be finalized. None of these test options are used for this purpose.

The model must be trained on the entire training dataset and saved. This topic of model finalization is beyond the scope of this post.

Just note, that the performance of the finalized model does not need to be calculated, it is estimated using the test option techniques discussed above.

## Performance Summary

The performance summary is provided in Weka when you evaluate a model.

In the “Classify” tab after you have evaluated an algorithm by clicking the “Start” button, the results are presented in the “Classifier output” pane.

This pane includes a lot of information, including:

- The run information such as the algorithm and its configuration, the dataset and its properties as well as the test option
- The details of the constructed model, if any.
- The summary of the performance including a host of different measures.

### Classification Performance Summary

When evaluating a machine learning algorithm on a classification problem, you are given a vast amount of performance information to digest.

This is because classification may be the most studied type of predictive modeling problem and there are so many different ways to think about the performance of classification algorithms.

There are three things to note in the performance summary for classification algorithms:

**Classification accuracy**. This the ratio of the number of correct predictions out of all predictions made, often presented as a percentage where 100% is the best an algorithm can achieve. If your data has unbalanced classes, you may want to look into the Kappa metric which presents the same information taking the class balance into account.**Accuracy by class**. Take note of the true-positive and false-positive rates for the predictions for each class which can be instructive of the class breakdown for the problem is uneven or there are more than two classes. This can help you interpret the results if predicting one class is more important than predicting another.**Confusion matrix**. A table showing the number of predictions for each class compared to the number of instances that actually belong to each class. This is very useful to get an overview of the types of mistakes the algorithm made.

### Regression Performance Summary

When evaluating a machine learning algorithm on a regression problem, you are given a number of different performance measures to review.

Of note the performance summary for regression algorithms are two things:

**Correlation Coefficient**. This is how well the predictions are correlated or change with the actual output value. A value of 0 is the worst and a value of 1 is a perfectly correlated set of predictions.**Root Mean Squared Error**. This is the average amount of error made on the test set in the units of the output variable. This measure helps you get an idea on the amount a given prediction may be wrong on average.

## Summary

In this post you discovered how to estimate the performance of your machine learning models on unseen data in Weka.

Specifically, you learned:

- About the importance of estimating the performance of machine learning models on unseen data as the core problem in predictive modeling.
- About the 4 different test options and when to use each, paying specific attention to train and test splits and k-fold cross validation.
- About the performance summary for classification and regression problems and to which metrics to pay attention.

Do you have any questions about estimating model performance in Weka or about this post? Ask your questions in the comments and I will do my best to answer them.

Hi Jason,

I am a little confused on the using cross-validation for testing. Is it that we split original dataset into training and test subsets. Then we use cross-validation on training subset for model selection and use test subset for cross validation to estimate the performance of chosen model?

Thanks

Generally, when exploring different models, using cross validation to evaluate the performance of your model is sufficient.

For additional robustness, it is a good idea to first split your dataset into training and validation datasets, cross validate on the training dataset and once finalized evaluate the dataset on the standalone validation as a last check for overfitting.

Thank you very much!

You’re welcome.

Hi Jason,

i want to evaluate my proposed algorithm by cross validation by classification.i am dividing the data into two parts. test data is unlabeled data and training data is labelled data.then apply 10 folds cross validation on data by regression test and predict the labeled data from test data by training data.i have calculated MSe(mean of square error) and accuracy rate by confusion matrix. but still i am confused how i can evaluate my proposed algorithm by these results?

Hi Jason,

I have a doubt in calculating prediction accuracy. Generally, for classification, we will get the accuracy in weka but for regression how to calculate accuracy? Kindly help

You cannot calculate accuracy in regression. There are no classes.

You must use another measure like RMSE or MAE or similar.

But still, there is a way to evaluate the algorithm. Not sure but I think same as in clustring.

Yes, by evaluating the error of the predictions (e.g. mean squared error or similar).

Hi Jason,

Say, I have 1000 data, I split it into 80% (training dataset) 20% (test dataset). Then I will use the training dataset to perform 10-fold validations where internally it will further split the training dataset into 10% (8% from 1000) validation data and 90% (72% from 1000) training data and rotate each fold and based on generated accuracy I will select my model (model selection). Once model is selected, I will test it with the held-out 20% test data (20% from 1000).

Is the approach correct, as based on some literature this is what I understand? Hence, the terms training, validation and test data.

Another question:

From what I understand to calculate accuracy of k-fold, the formula is:

K-fold accuracy = Total accuracy of individual k-fold / k = Average of total accuracy of individual k-fold

However, can I apply per-class accuracy to calculate accuracy of each k-fold?

Such that, say if my data having 6 different classes, the formula will be:

Accuracy of individual class = Correct prediction of class / Total data

Accuracy of individual k-fold = Total of individual class accuracy / 6 = Average of total of individual class accuracy (in each fold)

K-fold accuracy = Total accuracy of individual k-fold / k

where accuracy of each class is derived from Confusion Matrix.