Data, Learning and Modeling

There are key concepts in machine learning that lay the foundation for understanding the field.

In this post, you will learn the nomenclature (standard terms) that is used when describing data and datasets.

You will also learn the concepts and terms used to describe learning and modeling from data that will provide a valuable intuition for your journey through the field of machine learning.

Data

Machine learning methods learn from examples. It is important to have a good grasp of the input data and of the various terminology used when describing it. In this section, you will learn the terminology used in machine learning when referring to data.

When I think of data, I think of rows and columns, like a database table or an Excel spreadsheet. This is a traditional structure for data and is what is common in the field of machine learning. Other data, such as images, videos, and text (so-called unstructured data), is not considered at this time.

Table of Data Showing an Instance, Feature, and Train-Test Datasets

Instance: A single row of data is called an instance. It is an observation from the domain.

Feature: A single column of data is called a feature. It is a component of an observation and is also called an attribute of a data instance. Some features may be inputs to a model (the predictors) and others may be outputs or the features to be predicted.

Data Type: Features have a data type. They may be real or integer-valued or may have a categorical or ordinal value. You can have strings, dates, times, and more complex types, but typically they are reduced to real or categorical values when working with traditional machine learning methods.
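As a minimal sketch of that reduction (the feature names here are made up), a categorical string feature can be converted to real values with one-hot encoding:

```python
# Hypothetical instances with one categorical and one real-valued feature.
instances = [
    {"colour": "red", "size": 3.2},
    {"colour": "blue", "size": 1.5},
    {"colour": "red", "size": 2.7},
]

# One-hot encode the categorical feature into 0/1 columns.
categories = sorted({row["colour"] for row in instances})
encoded = [
    [1.0 if row["colour"] == c else 0.0 for c in categories] + [row["size"]]
    for row in instances
]
```

Each instance is now a row of purely real values, which is the form most traditional methods expect.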

Datasets: A collection of instances is a dataset and when working with machine learning methods we typically need a few datasets for different purposes.

Training Dataset: A dataset that we feed into our machine learning algorithm to train our model.

Testing Dataset: A dataset that we use to assess the accuracy of our model, but that is not used to train the model. It may also be called the validation dataset, although in practice the two are often distinguished: a validation dataset is used to tune the model, and a separate test dataset gives the final assessment.

We may have to collect instances to form our datasets or we may be given a finite dataset that we must split into sub-datasets.
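A minimal sketch of splitting one finite dataset into those sub-datasets (an 80/20 split on made-up instances):

```python
import random

# 20 hypothetical instances, each a (features, label) pair.
dataset = [([float(i), float(i) * 2.0], i % 2) for i in range(20)]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(dataset)

split = int(0.8 * len(dataset))  # 80% for training, 20% for testing
train, test = dataset[:split], dataset[split:]
```

Shuffling first matters: if the instances are ordered in any way, a straight slice would give the model a biased view of the domain.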

Learning

Machine learning is about automated learning with algorithms.

In this section, we will consider a few high-level concepts about learning.

Induction: Machine learning algorithms learn through a process called induction or inductive learning. Induction is a reasoning process that makes generalizations (a model) from specific information (training data).

Generalization: Generalization is required because the model that is prepared by a machine learning algorithm needs to make predictions or decisions based on specific data instances that were not seen during training.

Over-Learning: When a model learns the training data too closely and does not generalize, this is called over-learning. The result is poor performance on data other than the training dataset. This is also called over-fitting.

Under-Learning: When a model has not learned enough structure from the data, for example because the learning process was terminated early, this is called under-learning. The result is a model that generalizes but performs poorly on all data, including the training dataset. This is also called under-fitting.
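Both failure modes can be seen in a small sketch (toy data, assuming NumPy is available): a straight line under-fits noisy samples of a sine wave, while a high-degree polynomial can match the training points almost exactly and over-fit:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 12)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy samples

def mse(coeffs, xs, ys):
    """Mean squared error of a fitted polynomial on the given points."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

line = np.polyfit(x, y, 1)    # degree 1: tends to under-fit
wiggly = np.polyfit(x, y, 9)  # degree 9: tends to over-fit

# The flexible model always fits the training points at least as closely,
# but that closeness does not imply good generalization to unseen points.
train_gap = mse(line, x, y) - mse(wiggly, x, y)
```

The degree is doing the work here: too low and the model cannot express the structure, too high and it expresses the noise as well.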

Online Learning: Online learning is when a method is updated with data instances from the domain as they become available. Online learning requires methods that are robust to noisy data but can produce models that are more in tune with the current state of the domain.

Offline Learning: Offline learning is when a method is created on pre-prepared data and is then used operationally on unobserved data. The training process can be controlled and tuned carefully because the scope of the training data is known. The model is not updated after it has been prepared, and performance may decrease if the domain changes.
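The distinction can be illustrated with the simplest possible "model", an estimate of a mean (a toy sketch, not a full learning method):

```python
stream = [2.0, 4.0, 6.0, 8.0]

# Offline: the whole pre-prepared dataset is available up front.
offline_mean = sum(stream) / len(stream)

# Online: the same estimate, updated one instance at a time as the
# data becomes available, without ever storing the whole stream.
online_mean, n = 0.0, 0
for value in stream:
    n += 1
    online_mean += (value - online_mean) / n  # incremental mean update
```

The two estimates agree here, but the online version can keep tracking the domain as new instances arrive.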

Supervised Learning: This is a learning process for generalizing on problems where a prediction is required. A “teaching process” compares predictions by the model to known answers and makes corrections in the model.

Unsupervised Learning: This is a learning process for generalizing the structure in the data where no prediction is required. Natural structures are identified and exploited for relating instances to each other.

We have covered supervised and unsupervised learning before in the post on machine learning algorithms. These terms can be useful for classifying algorithms by their behavior.
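As a small sketch of the two behaviors on the same toy data (assuming scikit-learn is available):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# Two well-separated groups of instances.
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 5.0], [5.1, 5.2]])
y = np.array([0, 0, 0, 1, 1, 1])  # known answers for supervised learning

# Supervised: predictions are corrected against the known answers.
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
preds = clf.predict([[0.1, 0.1], [5.1, 5.1]])

# Unsupervised: only the natural structure of X is used; no labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

The classifier learns a mapping to the given labels; the clustering merely discovers that the instances fall into two groups, without knowing what the groups mean.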

Modeling

The artefact created by a machine learning process could be considered a program in its own right.

Model Selection: We can think of the process of configuring and training the model as a model selection process. In each iteration, we have a new model that we could choose to use or to modify. Even the choice of machine learning algorithm is part of that model selection process. Of all the possible models that exist for a problem, a given algorithm and algorithm configuration on the chosen training dataset will provide a finally selected model.
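A minimal sketch of that selection loop (synthetic data, assuming scikit-learn): try several configurations, keep the model that scores best on held-out data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

# Each configuration (here, n_neighbors) yields a candidate model;
# the best scorer on held-out data becomes the finally selected model.
best_k, best_score = None, -1.0
for k in (1, 3, 5, 7, 9):
    score = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr).score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score
```

The same loop could equally iterate over different algorithms, not just different configurations of one.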

Inductive Bias: Bias refers to the limits imposed on the selected model. All models are biased, which introduces error into the model, and by definition all models have error (they are generalizations from observations). Biases are introduced by the generalizations made in the model, including the configuration of the model and the selection of the algorithm to generate the model. A machine learning method can create a model with a low or a high bias, and tactics can be used to reduce the bias of a highly biased model.

Model Variance: Variance is how sensitive the model is to the data on which it was trained. A machine learning method can have a high or a low variance when creating a model on a dataset. A tactic to reduce the variance of a model is to run it multiple times on a dataset with different initial conditions and take the average accuracy as the model's performance.
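That variance-reduction tactic might be sketched like this (synthetic data, assuming scikit-learn): train the same method several times under different initial conditions and average the accuracy:

```python
import statistics

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

scores = []
for seed in range(10):  # different initial conditions each run
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

# The average over runs is a more stable performance estimate
# than any single run.
mean_accuracy = statistics.mean(scores)
```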

Bias-Variance Tradeoff: Model selection can be thought of as the trade-off between bias and variance. A low-bias model will have a high variance and will need to be trained for a long time, or many times, to get a usable model. A high-bias model will have a low variance and will train quickly, but will suffer poor and limited performance.

Resources

Below are some resources if you would like to dig deeper.

This post provided a useful glossary of terms that you can refer back to anytime for a clear definition.

Are there terms missing? Do you have a clearer description of one of the terms listed? Leave a comment and let us all know.

27 Responses to Data, Learning and Modeling

  1. Bruce December 20, 2013 at 4:51 pm #

    This is nice. Thanks.

    • jasonb December 26, 2013 at 8:33 pm #

      Thanks for saying so Bruce. Let me know if you would like me to go deeper on a particular term.

  2. Dirk July 9, 2014 at 12:09 am #

    I would make the distinction between validation and testing datasets. You train your model on your training set, you use a validation set to tune the model parameters, and you use a test set to assess the accuracy of your model. Being careful your test set does not influence the modelling process in any way.

    • NicoT March 10, 2015 at 3:37 am #

      Dirk, the “tuning” on the validation set is for the model’s hyper-parameters, not for the actual “parameters” of the model! Depending on the type of model there will be different hyper-parameters to tune (the regularization parameter in a cost function for example).

  3. leema April 25, 2016 at 3:52 pm #

    Hi Jason – if the data set has all categorical values,how do we apply algorithms as most of the algorithms wont handle categorical values directly..

    Could you please apply vectorisation technique in python.

    Regards
    Leema Jose

  4. Gunjeet Singh November 22, 2016 at 11:09 pm #

    Can you please elaborate on the difference between “Actual Parameters” and “Hyper Parameters” of a model..?

    • Jason Brownlee November 23, 2016 at 9:00 am #

      Hi Gunjeet,

      Machine learning algorithms learn coefficients from data, like coefficients in linear regression to describe a line. These are the model parameters.

      To learn the coefficients, we often use a learning algorithm, like stochastic gradient descent. This algorithm can have parameters to control learning, like a learning rate. The learning rate is an example of a hyperparameter.
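A tiny sketch of that distinction, fitting y = 2x with stochastic gradient descent (all numbers here are illustrative):

```python
# Learn y = w * x by stochastic gradient descent.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x exactly

learning_rate = 0.05  # hyperparameter: chosen by us, controls learning
w = 0.0               # model parameter: learned from the data

for _ in range(200):
    for x, y in data:
        error = w * x - y
        w -= learning_rate * error * x  # gradient step toward w = 2
```

The learning rate is never learned from the data; it shapes how the parameter w is learned.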

  5. Jacob Smith June 27, 2017 at 7:10 am #

    Is bias error? I’m still not sure what it means in this context. I thought of bias as the average difference between the expected value of the estimate and the average value of the real value.

  6. saravanan August 23, 2017 at 5:11 pm #

    hi Jason

    I’m working as an SEO analyst.

    how machine learning impact seo.?

    what are the benefits of ML in SEO.?

    • Jason Brownlee August 24, 2017 at 6:26 am #

      I’m not sure, SEO is not my area. Perhaps ML can help for a secondary problem – e.g. analyzing a SERP and SERP movements or modeling impact across a suite of similar web sites.

  7. Anurag January 20, 2018 at 5:16 am #

    Short, concise and to the point summary of machine learning fundamentals!

    Jason, I believe this post should go as sticky notes for all of your books.

  8. Jesús Martínez February 6, 2018 at 12:05 am #

    Nice article, Jason.

    I’ve got a question for you if you don’t mind: How do you think the volume of data affects some machine learning algorithms? For instance, do you think having more data is always good or there’s a point where the gains become so small due to the performance of the model plateaus?

    Thanks in advance!

  9. Abiodun Abiodun April 24, 2018 at 8:52 pm #

    Jason,

    Very useful piece. This has further enhance my knowledge in machine learning, thank you so much.. if i may ask, what are the instances on how input data can be defined?

  10. Abiodun Abioye April 24, 2018 at 10:03 pm #

    Hi Jason,

    Very helpful piece. This has further enhanced my knowledge on Machine learning. Thank you so much. Please, can you tell me the instances on how input data can be defined.

    Awaiting your response.

    Thank you in advance.

  11. Anna June 19, 2018 at 3:55 pm #

    Sir,
    I read in one article that they divide their datasets in to online and offline. Can you please explain the difference between online and offline datasets. How can I divide my datasets in to online and offline datasets for research purpose? Thank you…

    • Jason Brownlee June 20, 2018 at 6:21 am #

      I have not heard of that distinction regarding data, sorry.

  12. Paul A. Gureghian August 2, 2018 at 4:38 am #

    In regards to the bias-variance tradeoff, if you were to skew one way or the other, would it make sense to be high on the variance and low on the bias? sure, this will result in over-fitting, but at least that way you get what you need with some noise thrown in. with under-fitting you don’t get anything useful at all.

    • Jason Brownlee August 2, 2018 at 6:04 am #

      Yes.

      For example, neural nets are stochastic and have a higher variance in their final fit than other models – e.g. the same algorithm on the same data will result in a different fit each run.

      We can address this by using an ensemble of final models to reduce the variance.

  13. Ajay September 4, 2018 at 7:56 pm #

    Hello Sir,

    What is model in machine learning?
    What is model in supervised learning context?
    What is model in unsupervised learning context?

    Thank you

  14. adane gebru fkadu November 17, 2018 at 5:01 pm #

    hello sir,
    am a little bit confused about model and algorithm could you please give piece of explanation regarding that if you can?

    thanks
