Last Updated on January 6, 2017
There are key concepts in machine learning that lay the foundation for understanding the field.
In this post, you will learn the nomenclature (standard terms) that is used when describing data and datasets.
You will also learn the concepts and terms used to describe learning and modeling from data that will provide a valuable intuition for your journey through the field of machine learning.
Machine learning methods learn from examples. It is important to have a good grasp of input data and the terminology used when describing it. In this section, you will learn the terminology used in machine learning when referring to data.
When I think of data, I think of rows and columns, like a database table or an Excel spreadsheet. This is a traditional structure for data and is what is common in the field of machine learning. Other data, like images, video, and text (so-called unstructured data), is not considered at this time.
Instance: A single row of data is called an instance. It is an observation from the domain.
Feature: A single column of data is called a feature. It is a component of an observation and is also called an attribute of a data instance. Some features may be inputs to a model (the predictors) and others may be outputs or the features to be predicted.
Data Type: Features have a data type. They may be real or integer-valued or may have a categorical or ordinal value. You can have strings, dates, times, and more complex types, but typically they are reduced to real or categorical values when working with traditional machine learning methods.
Datasets: A collection of instances is a dataset and when working with machine learning methods we typically need a few datasets for different purposes.
Training Dataset: A dataset that we feed into our machine learning algorithm to train our model.
Testing Dataset: A dataset that we use to validate the accuracy of our model but is not used to train the model. It may be called the validation dataset.
We may have to collect instances to form our datasets or we may be given a finite dataset that we must split into sub-datasets.
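The data terms above can be sketched in plain Python; the values and labels below are made up purely for illustration:

```python
import random

# A tiny dataset: each inner list is an instance (row),
# each position within a row is a feature (column).
# Here the last column is the feature to be predicted (the output).
dataset = [
    [5.1, 3.5, "setosa"],
    [7.0, 3.2, "versicolor"],
    [6.3, 3.3, "virginica"],
    [4.9, 3.0, "setosa"],
    [6.4, 3.2, "versicolor"],
    [5.8, 2.7, "virginica"],
]

# Split the instances into a training dataset and a testing dataset.
random.seed(1)
random.shuffle(dataset)
split = int(len(dataset) * 0.67)
train, test = dataset[:split], dataset[split:]

print(len(train), len(test))  # 4 instances for training, 2 held back for testing
```

The first two columns are input features (predictors) and the third is the output feature; the split ratio is an arbitrary choice for the sketch.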
Machine learning is indeed about automated learning with algorithms.
In this section, we will consider a few high-level concepts about learning.
Induction: Machine learning algorithms learn through a process called induction or inductive learning. Induction is a reasoning process that makes generalizations (a model) from specific information (training data).
Generalization: Generalization is required because the model that is prepared by a machine learning algorithm needs to make predictions or decisions based on specific data instances that were not seen during training.
Over-Learning: When a model learns the training data too closely and does not generalize, this is called over-learning. The result is poor performance on data other than the training dataset. This is also called over-fitting.
Under-Learning: When a model has not learned enough structure from the training data because the learning process was terminated early, this is called under-learning. The model generalizes (it does not over-fit), but performance is poor on all data, including the training dataset. This is also called under-fitting.
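Over-learning can be made concrete with a small sketch (made-up numbers and a deliberately noisy instance) that compares a model that memorizes its training data with one that generalizes to a simple rule:

```python
# Inputs are numbers, labels follow a simple rule (x >= 5 -> "big"),
# except for one noisy training instance (4, "big").
train = [(1, "small"), (2, "small"), (3, "small"), (6, "big"), (7, "big"), (4, "big")]
test = [(4, "small"), (5, "big"), (8, "big"), (2, "small")]

def memorizer(x):
    # Over-fit: reproduce the training data exactly, noise and all.
    lookup = dict(train)
    # Fall back to the label of the nearest memorized input.
    return lookup.get(x, lookup[min(lookup, key=lambda k: abs(k - x))])

def rule(x):
    # A generalization: a single threshold learned from the broad trend.
    return "big" if x >= 5 else "small"

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# The memorizer is perfect on the training data but worse on unseen data;
# the general rule loses a little on the noisy training point but generalizes.
print(accuracy(memorizer, train), accuracy(memorizer, test))
print(accuracy(rule, train), accuracy(rule, test))
```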
Online Learning: Online learning is when a method is updated with data instances from the domain as they become available. Online learning requires methods that are robust to noisy data but can produce models that are more in tune with the current state of the domain.
Offline Learning: Offline learning is when a method is created on pre-prepared data and is then used operationally on unobserved data. The training process can be controlled and tuned carefully because the scope of the training data is known. The model is not updated after it has been prepared, and performance may decrease if the domain changes.
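Online learning can be sketched with a perceptron-style update rule, where the model is nudged after every new instance so it tracks the data stream as it arrives. The stream below is made up for illustration:

```python
# Model state: two weights and a bias, updated one instance at a time.
weights, bias, rate = [0.0, 0.0], 0.0, 0.1

def predict(x):
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation >= 0.0 else 0

def update(x, label):
    # One online step: adjust the model using only the latest instance.
    global bias
    error = label - predict(x)
    bias += rate * error
    for i, xi in enumerate(x):
        weights[i] += rate * error * xi

# Instances arriving one at a time from the domain (made-up stream).
stream = [([2.0, 1.0], 1), ([-1.0, -2.0], 0), ([1.5, 0.5], 1), ([-2.0, -1.0], 0)]
for x, label in stream * 5:  # several passes to simulate a longer stream
    update(x, label)

print([predict(x) for x, _ in stream])  # predictions after the stream has been seen
```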
Supervised Learning: This is a learning process for generalizing on problems where a prediction is required. A “teaching process” compares predictions by the model to known answers and makes corrections in the model.
Unsupervised Learning: This is a learning process for generalizing the structure in the data where no prediction is required. Natural structures are identified and exploited for relating instances to each other.
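A minimal sketch of the distinction, on the same made-up 1-D instances: the supervised model learns a threshold from known answers, while the unsupervised grouping (a tiny two-center clustering) uses no answers at all:

```python
# Six instances with one input feature each.
xs = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]  # known answers

# Supervised: the known answers guide the model - here a threshold
# placed midway between the two class means.
low_mean = sum(x for x, y in zip(xs, labels) if y == "low") / 3
high_mean = sum(x for x, y in zip(xs, labels) if y == "high") / 3
threshold = (low_mean + high_mean) / 2

def predict(x):
    return "high" if x > threshold else "low"

# Unsupervised: no answers - just group instances around two centers
# (a tiny 2-means) to expose the natural structure in the data.
centers = [min(xs), max(xs)]
for _ in range(5):
    groups = [[x for x in xs if abs(x - centers[0]) <= abs(x - centers[1])],
              [x for x in xs if abs(x - centers[0]) > abs(x - centers[1])]]
    centers = [sum(g) / len(g) for g in groups]

print(predict(3.0), sorted(centers))
```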
We have covered supervised and unsupervised learning before in the post on machine learning algorithms. These terms can be useful for classifying algorithms by their behavior.
The artefact created by a machine learning process could be considered a program in its own right.
Model Selection: We can think of the process of configuring and training the model as a model selection process. On each iteration we have a new model that we could choose to use or to modify. Even the choice of machine learning algorithm is part of that model selection process. Of all the possible models that exist for a problem, a given algorithm and algorithm configuration on the chosen training dataset yields the finally selected model.
Inductive Bias: Bias is the set of limits imposed on the selected model. All models are biased, which introduces error into the model; by definition, all models have error because they are generalizations from observations. Biases are introduced by the generalizations made in the model, including the configuration of the model and the selection of the algorithm used to generate it. A machine learning method can create a model with a low or a high bias, and tactics can be used to reduce the bias of a highly biased model.
Model Variance: Variance is how sensitive the model is to the data on which it was trained. A machine learning method can have a high or a low variance when creating a model on a dataset. A tactic to reduce the variance of a model is to run it multiple times on a dataset with different initial conditions and take the average accuracy as the model's performance.
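This variance-reduction tactic can be sketched as follows; the per-run scores are simulated stand-ins for real training runs, centered on a notional true skill of 0.80:

```python
import random

# Simulate running the same stochastic learning method several times
# on the same data with different initial conditions (seeds).
def train_and_score(seed):
    random.seed(seed)
    return 0.80 + random.uniform(-0.05, 0.05)  # stand-in for a real training run

scores = [train_and_score(seed) for seed in range(30)]
average = sum(scores) / len(scores)

# Individual runs vary; the average is a more stable estimate of performance.
print(min(scores), max(scores), round(average, 3))
```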
Bias-Variance Tradeoff: Model selection can be thought of as a trade-off between bias and variance. A low-bias model will have a high variance and will need to be trained for a long time, or many times, to get a usable model. A high-bias model will have a low variance and will train quickly, but suffer poor and limited performance.
Below are some resources if you would like to dig deeper.
- Tom Mitchell, The need for biases in learning generalizations, 1980
- Understanding the Bias-Variance Tradeoff
This post provided a useful glossary of terms that you can refer back to anytime for a clear definition.
Are there terms missing? Do you have a clearer description of one of the terms listed? Leave a comment and let us all know.
This is nice. Thanks.
Thanks for saying so Bruce. Let me know if you would like me to go deeper on a particular term.
I would make the distinction between validation and testing datasets. You train your model on your training set, you use a validation set to tune the model parameters, and you use a test set to assess the accuracy of your model. Be careful that your test set does not influence the modelling process in any way.
Dirk, the “tuning” on the validation set is for the model’s hyper-parameters, not for the actual “parameters” of the model! Depending on the type of model there will be different hyper-parameters to tune (the regularization parameter in a cost function for example).
Very good and consistent work. Each of these followed by practical applications would help those of us that are more visual vs textual learners. I'm interested in learning to program, so more examples with these well-thought-out explanations might be a good starting point.
How can we use this to become Quants and pay some bills without working for others. I want to work with others not for others with these skills.
Thanks again wonderful work.
Sorry, I cannot give you advice on finance or becoming a quant – it is not my area of expertise.
Hi Jason – if the dataset has all categorical values, how do we apply algorithms, as most of the algorithms won't handle categorical values directly?
Could you please show a vectorization technique in Python?
Can you please elaborate on the difference between “actual parameters” and “hyper-parameters” of a model?
Machine learning algorithms learn coefficients from data, like coefficients in linear regression to describe a line. These are the model parameters.
To learn the coefficients, we often use a learning algorithm, like stochastic gradient descent. This algorithm can have parameters to control learning, like a learning rate. The learning rate is an example of a hyperparameter.
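For example, here is a minimal sketch in plain Python with made-up data: the slope and intercept are model parameters learned from the data, while the learning rate and number of epochs are hyperparameters that control the learning algorithm.

```python
# Data is made up and follows y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

learning_rate = 0.05  # hyperparameter: controls the learning algorithm
epochs = 500          # hyperparameter: how long to train

slope, intercept = 0.0, 0.0  # model parameters: learned from the data
for _ in range(epochs):
    for x, y in data:
        error = (slope * x + intercept) - y
        # Stochastic gradient descent update, scaled by the learning rate.
        slope -= learning_rate * error * x
        intercept -= learning_rate * error

print(round(slope, 2), round(intercept, 2))  # approaches 2.0 and 1.0
```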
A really nice tutorial,
Is bias error? I’m still not sure what it means in this context. I thought of bias as the average difference between the expected value of the estimate and the average value of the real value.
Bias is error and so is variance. They are different types of error.
You can correlate variance and bias with speed and accuracy. If the variance is high, your model will take a long time to train. But if the bias is high, your model will perform poorly.
I’m working as an SEO analyst.
How does machine learning impact SEO?
What are the benefits of ML in SEO?
I’m not sure, SEO is not my area. Perhaps ML can help for a secondary problem – e.g. analyzing a SERP and SERP movements or modeling impact across a suite of similar web sites.
Short, concise and to the point summary of machine learning fundamentals!
Jason, I believe this post should go as sticky notes for all of your books.
Nice article, Jason.
I’ve got a question for you if you don’t mind: How do you think the volume of data affects some machine learning algorithms? For instance, do you think having more data is always good, or is there a point where the gains become very small because the model’s performance plateaus?
Thanks in advance!
A really good question.
I have some notes here that might help (e.g. the section on sensitivity analysis of dataset size):
Very useful piece. This has further enhanced my knowledge of machine learning, thank you so much. If I may ask, what are some instances of how input data can be defined?
An instance is one row of data, or one observation.
This post will show you how to define your data:
I read in one article that they divide their datasets into online and offline. Can you please explain the difference between online and offline datasets? How can I divide my datasets into online and offline datasets for research purposes? Thank you…
I have not heard of that distinction regarding data, sorry.
In regards to the bias-variance tradeoff, if you were to skew one way or the other, would it make sense to be high on the variance and low on the bias? Sure, this will result in over-fitting, but at least that way you get what you need with some noise thrown in. With under-fitting you don’t get anything useful at all.
For example, neural nets are stochastic and have a higher variance in their final fit than other models, e.g. the same algorithm on the same data will result in a different fit each run.
We can address this by using an ensemble of final models to reduce the variance, at the cost of a small increase in bias.
What is model in machine learning?
What is model in supervised learning context?
What is model in unsupervised learning context?
I explain more here:
I am a little bit confused about the difference between a model and an algorithm. Could you please give a piece of explanation regarding that?
Yes, I explain the difference here:
Can you build me a compound prediction algorithm and database? I follow the patterns in the way drawings come out in the New York Pick 3 and Pick 4. Let me know if you can; I will give you all the explanation and details.
I do not have the capacity to build you an algorithm.
Thanks Jason, but I still have not grasped learning well.
Thank you, it is very informative
I have a question:
I have a dataset of dimension 90×42.
I do not know how to find the dependent and independent features. What is a good solution for finding dependent and independent features?
I also do not know which machine learning algorithm would be good.
This will help:
And then this:
That's a very good and amazing overview of the topic.