Machine learning algorithms are complex. To get good at applying a given algorithm you need to study it from multiple perspectives: algorithmic, mathematical and empirical.
It’s this last point I want to stress. You need to build up an intuition for how an algorithm behaves on real data. You need to work on lots of problems.
In this post I want to encourage you to use small in-memory datasets when starting out and when practising machine learning.
Study an Algorithm or a Problem, Not Both
You can’t learn a problem and an algorithm at the same time.
If you try, you will progress on both slowly and inefficiently. Your focus will be divided and neither task will be executed well.
You will know when you’re in this trap because you will oscillate between diving deep into the problem and deep into a specific algorithm. You will be frustrated and overwhelmed. You’re taking on too much.
Split Your Concerns
The best course of action is to study the algorithm and the problem separately.
You study the problem by using algorithms to learn more about it and posit candidate solutions in the form of models. This means you will be experimenting with a lot of models (spot checking) and likely a lot of algorithm configurations (tuning).
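A spot check can be sketched in a few lines. The example below is only an illustration under my own assumptions: the data is a synthetic two-class problem generated in code (not a real dataset), and the three simple algorithms and all function names (`make_two_blobs`, `spot_check` and so on) are hypothetical choices made for the sketch, using only the Python standard library.

```python
import random


def make_two_blobs(n_per_class=50, seed=1):
    """Synthetic 2-feature, 2-class data (an illustrative stand-in, not a real dataset)."""
    rng = random.Random(seed)
    rows = []
    for label, centre in ((0, (0.0, 0.0)), (1, (2.0, 2.0))):
        for _ in range(n_per_class):
            features = [rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0)]
            rows.append((features, label))
    rng.shuffle(rows)
    return rows


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def majority_class(train):
    """Baseline: always predict the most common training label."""
    labels = [y for _, y in train]
    winner = max(set(labels), key=labels.count)
    return lambda x: winner


def nearest_centroid(train):
    """Predict the class whose mean feature vector is closest."""
    sums, counts = {}, {}
    for x, y in train:
        counts[y] = counts.get(y, 0) + 1
        totals = sums.setdefault(y, [0.0] * len(x))
        for i, value in enumerate(x):
            totals[i] += value
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}
    return lambda x: min(centroids, key=lambda y: squared_distance(x, centroids[y]))


def one_nearest_neighbour(train):
    """Predict the label of the single closest training row."""
    return lambda x: min(train, key=lambda row: squared_distance(x, row[0]))[1]


def accuracy(model, test):
    return sum(model(x) == y for x, y in test) / len(test)


def spot_check(rows, algorithms, train_fraction=0.7):
    """Fit every algorithm on the same split and report hold-out accuracy."""
    split = int(len(rows) * train_fraction)
    train, test = rows[:split], rows[split:]
    return {name: accuracy(fit(train), test) for name, fit in algorithms.items()}


ALGORITHMS = {
    "majority class": majority_class,
    "nearest centroid": nearest_centroid,
    "1-nearest neighbour": one_nearest_neighbour,
}

results = spot_check(make_two_blobs(), ALGORITHMS)
for name, score in sorted(results.items(), key=lambda item: -item[1]):
    print(f"{name:>20}: {score:.2f}")
```

Every algorithm sees the same train and test split, so the differences in score come from the algorithms rather than from the data each one happened to receive.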
You study an algorithm by focusing on one problem dataset and using it to learn more about the interactions of the algorithm’s parameters and their effects on the model, such as the final result or behaviour over time.
It is this second type of project where you can use empirical experiments to build an intuition into how machine learning algorithms work. You can pair this intuition with theory of why they work and aim to make informed decisions around which algorithm to use and when for a given problem in the future.
Play the Scientist
You are looking to characterize the behaviours of the algorithm as a system on a controlled problem.
The focus of the study is a question, such as:
- What is the information processing strategy of the algorithm?
- How does the system behave when a given parameter is varied?
Clearly define the specific question you intend to answer with your study before you get started. Be clear on what form the answer will take.
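As a concrete case, the question "how does the system behave when a given parameter is varied?" might be studied with a small sweep like the one below. This is a minimal sketch under my own assumptions: the algorithm is a plain k-nearest neighbours classifier written from scratch, the parameter under study is `k`, the data is synthetic and generated in code, and every name here (`knn_predict`, `sweep_k`, `K_VALUES`) is hypothetical. The form of the answer is decided up front: a table of `k` against hold-out accuracy.

```python
import random
from collections import Counter


def make_two_blobs(n_per_class=50, seed=1):
    """Synthetic 2-feature, 2-class data (an illustrative stand-in, not a real dataset)."""
    rng = random.Random(seed)
    rows = []
    for label, centre in ((0, (0.0, 0.0)), (1, (2.0, 2.0))):
        for _ in range(n_per_class):
            features = [rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0)]
            rows.append((features, label))
    rng.shuffle(rows)
    return rows


def squared_distance(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn_predict(train, x, k):
    """Majority vote among the k training rows closest to x."""
    neighbours = sorted(train, key=lambda row: squared_distance(x, row[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def holdout_accuracy(rows, k, train_fraction=0.7):
    """Accuracy on the held-out rows; the split is identical for every k."""
    split = int(len(rows) * train_fraction)
    train, test = rows[:split], rows[split:]
    correct = sum(knn_predict(train, x, k) == y for x, y in test)
    return correct / len(test)


def sweep_k(rows, k_values):
    """Vary only k; the data and the split stay fixed (the controlled part)."""
    return {k: holdout_accuracy(rows, k) for k in k_values}


rows = make_two_blobs()
K_VALUES = [1, 3, 5, 9, 15, 25]  # odd values avoid tied votes with two classes
sweep = sweep_k(rows, K_VALUES)
for k, score in sweep.items():
    print(f"k={k:>2}  accuracy={score:.2f}")
```

Only one thing changes between runs, so any change in accuracy can be attributed to `k`. A single hold-out split is noisy, so treat the numbers as a first look rather than a firm result.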
Studying algorithms has some specific tangible benefits that improve your machine learning skills, such as:
- Algorithm Tuning: You are learning how the algorithm behaves as a complex system and the influence the algorithm parameters have on those behaviors. These are invaluable insights and intuitions needed for tuning the algorithm on specific problem instances.
- Problem-Algorithm fit: You are learning about the classes of algorithms and specific algorithm instances that perform well on classes of problems and problem instances. This is an intuition that can only be built up from experience.
- Project Life-cycle: You are practising the process of applied machine learning, from data preparation through algorithm testing and tuning to the presentation of results.
The key is having standard, well understood datasets that you can use to better understand the algorithm under study.
Use Standard Datasets
You can use one or a small number of model datasets to study a machine learning algorithm.
Sometimes they are called toy datasets or toy problems, because of their size. Nevertheless, they play an important role when you are learning about and practising machine learning algorithms.
Different datasets have different known properties. It is often desirable to select a small set of those properties to expose different behaviours of an algorithm under study.
For example, some properties may include:
- Number of Features
- Class Distribution
- Data Types
- Structured Relationship
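Several of these properties can be read straight off the data before you run any algorithm. The sketch below is one hypothetical way to do it with the standard library: the `summarize` function and the four rows in `ROWS` are made up for illustration (they are not taken from any real dataset), and the class label is assumed to sit in the last column. It covers the number of features, the data types and the class distribution. Structured relationships are not something a simple summary like this can reveal.

```python
from collections import Counter


def summarize(rows, class_index=-1):
    """Report basic, known-in-advance properties of a small tabular dataset."""
    n_columns = len(rows[0])
    class_index = class_index % n_columns  # allow negative indexes such as -1
    feature_indices = [i for i in range(n_columns) if i != class_index]

    def column_type(i):
        # Join the names of every Python type seen in the column, e.g. "float"
        kinds = {type(row[i]).__name__ for row in rows}
        return "/".join(sorted(kinds))

    class_counts = Counter(row[class_index] for row in rows)
    total = len(rows)
    return {
        "n_rows": total,
        "n_features": len(feature_indices),
        "feature_types": [column_type(i) for i in feature_indices],
        "class_distribution": {
            label: count / total for label, count in class_counts.items()
        },
    }


# Made-up rows for illustration only: three features, then the class label.
ROWS = [
    (5.1, 3, "red", "a"),
    (4.9, 2, "blue", "a"),
    (6.3, 3, "red", "b"),
    (5.8, 4, "green", "a"),
]

summary = summarize(ROWS)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Writing these properties down for each dataset you use makes it easier to pick a small set of datasets that stress different behaviours of the algorithm.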
5 Benefits of Model Datasets
Below are 5 benefits you get in using standard machine learning datasets.
- Small: The dataset can fit into memory. This means you can run a lot of experiments quickly and, in turn, learn about the algorithm quickly.
- Understood: The dataset is generally understood. It may have significant literature behind it or be a common point of test and study for algorithms. It has known properties for testing the capability of an algorithm.
- Controlled: A model dataset is constant and provides the basis for controlled experiments. The behavior of the algorithm can be varied to see the effects on the results against the well understood problem.
- Free: Model datasets are available for download. You do not need permission or to pay a license fee. The common data sets are available for you to use whenever you need.
- Simple: The structure or relationships in the data are not complex. They can be easily understood, described with summary statistics and graphs. There are typically few variables.
UCI Machine Learning Repository
Some tools come with sample datasets, but one great source that you can trust to be consistent is the University of California Irvine Machine Learning Repository.
It is a website that hosts hundreds of standard machine learning datasets used in academia for testing, demonstrating and empirically characterizing the behaviours of machine learning algorithms.
You can browse datasets on this site, look at the data, and review papers and articles that have made reference to the dataset.
It is a valuable resource that you can use to find datasets to study a machine learning algorithm.
5 Classic Model Datasets
Below is a list of 5 classic datasets that I like to use when getting familiar with a new algorithm or an old algorithm I’ve forgotten about.
- Iris Flower: Describes iris flowers in terms of the dimensions of the flowers, divided into three species classes.
- Ionosphere: Describes radar return data characterizing energy states in the ionosphere. All attributes are numeric and the class is binary.
- Pima Indians Diabetes: Varied medical record data for Pima Indians with a binary class of whether the patient had an onset of diabetes within 5 years from when the medical data was collected.
- Glass Identification: Identification of glass type based on the chemical composition of samples, with multiple unbalanced classes.
- Wisconsin Breast Cancer: Medical biopsy information from breast cancer patients and a binary class variable of whether the sample was cancerous.
You may find one or more of these datasets useful in your own experiments.
In this post you discovered the difficulties when attempting to learn about a problem dataset and an algorithm at the same time. In fact, they are competing concerns.
You discovered that the answer is to separate those concerns into learning about your problem and learning about an algorithm, and being clear on what your goals are.
You discovered the benefits of small model datasets when learning about an algorithm, where to get standard machine learning datasets and some popular examples you could start with.
If you would like to know more about how to study machine learning algorithms, take a look at my algorithm description template for learning any algorithm and small projects methodology guides for self-study projects including studying algorithms.