In my courses and guides, I teach the preparation of a baseline result before diving into spot checking algorithms.
A student of mine recently asked:
If a baseline is not calculated for a problem, will it make the results of other algorithms questionable?
He went on to ask:
If other algorithms do not give better accuracy than the baseline, what lesson should we take from it? Does it indicate that the data set does not have prediction capability?
These are great questions. They get to the heart of why we create a baseline in the first place and the filtering power it provides.
In this post, you will learn why we create a baseline prediction result, how to create a baseline in general and for specific problem types, and how you can use it to learn about the data you have available and the algorithms you are using.
Finding Data You Can Model
When you are practicing machine learning, each problem is unique. You very likely have not seen it before and you cannot know what algorithms to use, what data attributes will be useful or even whether the problem can be effectively modeled.
I personally find this the most exciting time.
If you are in this situation, you are very likely gathering the data yourself from disparate sources and selecting attributes that you think might be valuable. Feature selection and feature engineering will be required.
During this process, you need some indication that the problem you are iteratively defining and gathering data for provides a useful basis for making predictions.
A Useful Point For Comparison
You need to spot check algorithms on the problem to see if you have a useful basis for modeling your prediction problem. But how do you know the results are any good?
You need a basis for comparison: a meaningful reference point against which to compare your results.
Once you start collecting results from different machine learning algorithms, a baseline result can tell you whether a change is adding value.
It is so simple, yet so powerful. Once you have a baseline, you can add or change the data attributes, the algorithms you are trying or the parameters of the algorithms, and know whether you have improved your approach or solution to the problem.
Calculate a Baseline Result
There are a few common ways to calculate a baseline result.
A baseline result is the simplest possible prediction. For some problems this may be a random result, and for others it may be the most common prediction.
- Classification: If you have a classification problem, you can select the class that has the most observations and use that class as the result for all predictions. In Weka this is called ZeroR. If the number of observations is equal for all classes in your training dataset, you can select a specific class or enumerate each class and see which gives the better result in your test harness.
- Regression: If you are working on a regression problem, you can use a central tendency measure as the result for all predictions, such as the mean or the median.
- Optimization: If you are working on an optimization problem, you can use a fixed number of random samples in the domain.
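The three baselines above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names, the toy data, and the one-dimensional objective are all assumptions made up for the example, and the optimization baseline assumes you are minimizing.

```python
import random
from collections import Counter
from statistics import mean, median


def majority_class_baseline(train_labels):
    """Return the most frequent class in the training labels (ZeroR-style)."""
    return Counter(train_labels).most_common(1)[0][0]


def central_tendency_baseline(train_targets, use_median=False):
    """Return the mean (or median) of the training targets."""
    return median(train_targets) if use_median else mean(train_targets)


def random_sample_baseline(objective, lower, upper, n_samples=100, seed=0):
    """Return the best (lowest) point among a fixed number of random samples."""
    rng = random.Random(seed)
    candidates = [rng.uniform(lower, upper) for _ in range(n_samples)]
    best_x = min(candidates, key=objective)
    return best_x, objective(best_x)


# Toy data, invented for illustration only.
train_labels = ["spam", "ham", "ham", "ham", "spam"]
test_labels = ["ham", "spam", "ham", "ham"]

# Classification: predict the majority class for every test row.
majority = majority_class_baseline(train_labels)
predictions = [majority] * len(test_labels)
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(majority, accuracy)  # ham 0.75

# Regression: predict one central value for every test row.
train_targets = [2.0, 4.0, 4.0, 10.0]
print(central_tendency_baseline(train_targets))        # 5.0
print(central_tendency_baseline(train_targets, True))  # 4.0

# Optimization: best of 100 uniform random samples on a hypothetical objective.
best_x, best_value = random_sample_baseline(lambda x: (x - 3.0) ** 2, -10.0, 10.0)
print(best_x, best_value)
```

Note that each baseline is fit on the training data only and then applied unchanged to the test data, the same way you would treat any other algorithm in your test harness.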
It can be a valuable use of your time to brainstorm all of the simplest possible results that you can test for your problem, and then go ahead and evaluate them. The results can be a very effective filtering method. If more advanced modeling methods cannot outperform simple central tendencies, then you know you have work to do, most likely in better defining or reframing the problem.
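As a rough sketch of evaluating several simple candidates side by side, the snippet below scores a mean baseline and a median baseline on a held-out set under two error measures. The data and the choice of mean absolute error (MAE) and root mean squared error (RMSE) are assumptions for illustration; substitute the score that fits your own problem.

```python
from math import sqrt
from statistics import mean, median


def mae(actual, predicted):
    """Mean absolute error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


# Toy data with one large value, invented for illustration only.
train = [1.0, 2.0, 2.0, 3.0, 12.0]
test = [1.0, 2.0, 3.0, 14.0]

# Candidate baselines are computed from the training data only.
candidates = {"mean": mean(train), "median": median(train)}

scores = {}
for name, value in candidates.items():
    preds = [value] * len(test)
    scores[name] = {"mae": mae(test, preds), "rmse": rmse(test, preds)}

for name, result in scores.items():
    print(name, result)
```

On this toy data the median baseline has the lower MAE (3.5 versus 4.0), while the mean baseline has the lower RMSE, so which candidate counts as "the" baseline depends on the score you choose.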
The accuracy score you use matters. You must select the accuracy score you plan to use before you calculate your baseline. The score must be related to, and inform, the question you set out to answer by working on the problem in the first place.
If you are working on a classification problem, you may want to look at the Kappa statistic, which gives you an accuracy score that is normalized by the agreement expected by chance. A majority-class baseline scores 0, and scores above zero show an improvement over the baseline.
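To make the normalization concrete, here is a minimal sketch of Cohen's Kappa, computed as (observed agreement - expected agreement) / (1 - expected agreement). The function name and the toy labels are assumptions for illustration; in practice you would more likely use the implementation in your machine learning library.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Cohen's Kappa: accuracy corrected for agreement expected by chance."""
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement from the class frequencies of each labeling.
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    if expected == 1.0:
        return 0.0  # Only one class present, so no improvement is possible.
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration only.
actual = ["a", "a", "a", "b", "b", "a", "a", "b"]
majority_preds = ["a"] * len(actual)
model_preds = ["a", "a", "a", "b", "b", "a", "b", "a"]

print(cohen_kappa(actual, majority_preds))  # 0.0
print(cohen_kappa(actual, model_preds))
```

The majority-class predictor has a raw accuracy of 0.625 here but a Kappa of 0, while the hypothetical model's accuracy of 0.75 becomes a Kappa of about 0.47, which is the improvement over chance agreement.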
Compare Results to the Baseline
It is OK if your baseline is a poor result. It may indicate a particular difficulty with the problem or it may mean that your algorithms have a lot of room for improvement.
It does matter if you cannot get an accuracy better than your baseline. It suggests that the problem may be difficult.
You may need to collect more or different data from which to model. You may need to look into using different and perhaps more powerful machine learning algorithms or algorithm configurations. Ultimately, after rounds of these types of changes, you may have a problem that is resistant to prediction and may need to be re-framed.
Your action step for this post is to start investigating your next data problem with a baseline against which you can compare all results.
If you are already working on a problem, include a baseline result and use that to interpret all other results.
Share your results. What is your problem and what baseline are you using?