Archive | Data Preparation

Line Plot of Accuracy vs. Hill Climb Optimization Iteration for the Diabetes Dataset

How to Hill Climb the Test Set for Machine Learning

Hill climbing the test set is an approach to achieving good or perfect predictions in a machine learning competition without touching the training set or even developing a predictive model. As an approach to machine learning competitions, it is rightly frowned upon, and most competition platforms impose limits to prevent it. Nevertheless, […]
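
As a rough illustration of the idea in the excerpt above, the following is a minimal sketch, not the article's own code. It assumes a binary classification task and a hypothetical `leaderboard_score` function that stands in for a competition platform returning accuracy on hidden test labels. The names `hill_climb_test_set`, `hidden_labels`, and all parameter values are invented for this example.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hill_climb_test_set(score_fn, n_rows, n_iter=2000, seed=1):
    """Improve a vector of predictions using only score feedback (hypothetical helper)."""
    rng = random.Random(seed)
    # Start from random binary predictions; no model and no training data.
    solution = [rng.randint(0, 1) for _ in range(n_rows)]
    best = score_fn(solution)
    history = [best]
    for _ in range(n_iter):
        if best == 1.0:
            break
        # Perturb the current predictions by flipping one label.
        candidate = solution[:]
        i = rng.randrange(n_rows)
        candidate[i] = 1 - candidate[i]
        score = score_fn(candidate)
        # Keep the change only if the reported score improved.
        if score > best:
            solution, best = candidate, score
        history.append(best)
    return solution, best, history


# Simulated hidden test labels (an assumption standing in for a real leaderboard).
_label_rng = random.Random(7)
hidden_labels = [_label_rng.randint(0, 1) for _ in range(50)]


def leaderboard_score(predictions):
    """Hypothetical scoring oracle: the only feedback the climber sees."""
    return accuracy(hidden_labels, predictions)


solution, best, history = hill_climb_test_set(leaderboard_score, len(hidden_labels))
print(f"start={history[0]:.3f} final={best:.3f} queries={len(history)}")
```

The sketch never looks at `hidden_labels` directly. It overfits them purely through repeated scoring, which is why platforms typically restrict how often and how precisely scores are reported.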

Histogram of Each Variable in the Diabetes Classification Dataset

How to Selectively Scale Numerical Input Variables for Machine Learning

Many machine learning models perform better when input variables are carefully transformed or scaled prior to modeling. It is convenient, and therefore common, to apply the same data transforms, such as standardization and normalization, equally to all input variables. This can achieve good results on many problems. Nevertheless, better results may be achieved by carefully […]
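
To make the contrast in the excerpt above concrete, here is a minimal standard-library sketch of applying different transforms to different columns, rather than one transform to all of them. The helper names (`standardize`, `normalize`, `scale_selected`) and the toy data are assumptions made for illustration, not the article's code. In practice, scikit-learn's `ColumnTransformer` is a common way to do the same thing.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(v - mu) / sigma if sigma else 0.0 for v in column]


def normalize(column):
    """Rescale a column to the range [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]


def scale_selected(rows, transforms):
    """Apply a per-column transform; columns not listed are left unchanged."""
    columns = [list(col) for col in zip(*rows)]
    for index, transform in transforms.items():
        columns[index] = transform(columns[index])
    return [list(row) for row in zip(*columns)]


# Toy data: two numeric inputs on different scales and one binary flag.
rows = [
    [1.0, 200.0, 0.0],
    [2.0, 400.0, 1.0],
    [3.0, 600.0, 0.0],
]

# Standardize column 0, normalize column 1, leave column 2 untouched.
scaled = scale_selected(rows, {0: standardize, 1: normalize})
for row in scaled:
    print(row)
```

Which columns receive which transform is a modeling choice that would normally be evaluated empirically. The mapping used here is arbitrary.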


6 Dimensionality Reduction Algorithms With Python

Dimensionality reduction is an unsupervised learning technique. Nevertheless, it can be used as a data-transform preprocessing step for supervised learning algorithms on classification and regression predictive modeling datasets. There are many dimensionality reduction algorithms to choose from and no single best algorithm for all cases. Instead, it is a good […]

Model-Based Outlier Detection and Removal in Python

4 Automatic Outlier Detection Algorithms in Python

The presence of outliers in a classification or regression dataset can result in a poor fit and lower predictive modeling performance. For most machine learning datasets, identifying and removing outliers with simple statistical methods is challenging given the large number of input variables. Instead, automatic outlier detection methods can be used in the modeling pipeline […]
