Hill climbing the test set is an approach to achieving good or perfect predictions in a machine learning competition without touching the training set or even developing a predictive model. As an approach to machine learning competitions, it is rightly frowned upon, and most competition platforms impose limits to prevent it. Nevertheless, […]
Archive | Data Preparation
How to Train to the Test Set in Machine Learning
Training to the test set is a type of overfitting in which a model is deliberately prepared to achieve good performance on a given test set at the expense of increased generalization error. It is common in machine learning competitions where a complete training dataset is provided and where only […]
How to Selectively Scale Numerical Input Variables for Machine Learning
Many machine learning models perform better when input variables are carefully transformed or scaled prior to modeling. It is convenient, and therefore common, to apply the same data transforms, such as standardization and normalization, equally to all input variables. This can achieve good results on many problems. Nevertheless, better results may be achieved by carefully […]
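As a minimal sketch of the selective scaling idea in the excerpt above, scikit-learn's `ColumnTransformer` can apply a different transform to each group of columns. The toy array and the choice of which column gets which scaler are assumptions made purely for illustration; they are not taken from the post.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: three numerical input variables.
X = np.array([
    [1.0, 10.0, 0.0],
    [2.0, 20.0, 1.0],
    [3.0, 30.0, 0.0],
    [4.0, 40.0, 1.0],
])

# Standardize column 0, normalize column 1, and leave column 2 untouched.
selective = ColumnTransformer(
    transformers=[
        ("standardize", StandardScaler(), [0]),
        ("normalize", MinMaxScaler(), [1]),
    ],
    remainder="passthrough",
)

X_scaled = selective.fit_transform(X)
print(X_scaled)
```

In practice the transformer would sit inside a `Pipeline` so that the scaling statistics are learned from the training folds only.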
Add Binary Flags for Missing Values for Machine Learning
Missing values can cause problems when modeling classification and regression prediction problems with machine learning algorithms. A common approach is to replace missing values with a calculated statistic, such as the mean of the column. This allows the dataset to be modeled as normal but gives no indication to the model that the row […]
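One way to sketch the approach described above is scikit-learn's `SimpleImputer` with `add_indicator=True`, which fills missing values with the column mean and appends one binary flag column for each input column that contained missing values. The small array below is a made-up example, not data from the post.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value in each column.
X = np.array([
    [1.0, np.nan],
    [2.0, 4.0],
    [np.nan, 6.0],
    [3.0, 8.0],
])

# Mean imputation plus binary "was missing" flags appended as extra columns.
imputer = SimpleImputer(strategy="mean", add_indicator=True)
X_flagged = imputer.fit_transform(X)

# Columns 0-1 hold the imputed values; columns 2-3 hold the flags.
print(X_flagged)
```

The flag columns let a model learn whether the fact that a value was missing is itself predictive.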
How to Create Custom Data Transforms for Scikit-Learn
The scikit-learn Python library for machine learning offers a suite of data transforms for changing the scale and distribution of input data, as well as removing input features (columns). There are many simple data cleaning operations, such as removing outliers and removing columns with few observations, that are often performed manually on the data, requiring […]
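A custom cleaning step of the kind mentioned above can be wrapped with scikit-learn's `FunctionTransformer`. The sketch below is one possible illustration: the helper `drop_low_variety_columns` and its `min_unique` parameter are hypothetical names, and the rule it applies (drop columns with fewer than `min_unique` distinct values) is an assumed stand-in for "columns with few observations".

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def drop_low_variety_columns(X, min_unique=2):
    """Keep only columns that have at least `min_unique` distinct values."""
    keep = [i for i in range(X.shape[1]) if len(np.unique(X[:, i])) >= min_unique]
    return X[:, keep]

# Hypothetical toy data: the middle column is constant.
X = np.array([
    [1.0, 5.0, 0.0],
    [2.0, 5.0, 1.0],
    [3.0, 5.0, 0.0],
])

# Wrap the plain function so it can be used like any other transform.
cleaner = FunctionTransformer(drop_low_variety_columns, kw_args={"min_unique": 2})
X_clean = cleaner.fit_transform(X)
print(X_clean.shape)
```

Note that `FunctionTransformer` is stateless: the columns to drop are recomputed from whatever data is passed in, so a transform that must remember its choice from the training data would need a custom estimator class instead.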
How to Grid Search Data Preparation Techniques
Machine learning predictive modeling performance is only as good as your data, and your data is only as good as the way you prepare it for modeling. The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithm, then carefully choose the most appropriate data […]
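As a rough sketch of treating data preparation as something to search over, a pipeline step can itself be listed as a hyperparameter in `GridSearchCV`. The synthetic dataset, the logistic regression model, and the three candidate preparations are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# The "prep" step is a placeholder that the grid search swaps out.
pipeline = Pipeline([
    ("prep", "passthrough"),
    ("model", LogisticRegression(max_iter=1000)),
])

# Candidate data preparation techniques: none, standardization, normalization.
param_grid = {"prep": ["passthrough", StandardScaler(), MinMaxScaler()]}

search = GridSearchCV(pipeline, param_grid, scoring="accuracy", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

Because each candidate is fit inside the cross-validation folds, the comparison does not leak test-fold statistics into the preparation step.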
Framework for Data Preparation Techniques in Machine Learning
There are a vast number of different types of data preparation techniques that could be used on a predictive modeling project. In some cases, the distribution of the data or the requirements of a machine learning model may suggest the data preparation needed, although this is rarely the case given the complexity and high-dimensionality of […]
6 Dimensionality Reduction Algorithms With Python
Dimensionality reduction is an unsupervised learning technique. Nevertheless, it can be used as a data transform pre-processing step for supervised learning algorithms on classification and regression predictive modeling datasets. There are many dimensionality reduction algorithms to choose from and no single best algorithm for all cases. Instead, it is a good […]
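Using dimensionality reduction as a pre-processing step, as the excerpt above describes, might look like the following sketch. PCA is used here only as one example algorithm; the synthetic dataset, the number of components, and the downstream model are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data with many redundant input variables (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_redundant=10, random_state=1
)

# Project to 5 components, then fit a supervised model on the projection.
pipeline = Pipeline([
    ("pca", PCA(n_components=5)),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipeline, X, y, scoring="accuracy", cv=5)
print(scores.mean())
```

Swapping the `"pca"` step for another reducer is enough to compare several algorithms on the same dataset.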
4 Automatic Outlier Detection Algorithms in Python
The presence of outliers in a classification or regression dataset can result in a poor fit and lower predictive modeling performance. Identifying and removing outliers is challenging with simple statistical methods for most machine learning datasets given the large number of input variables. Instead, automatic outlier detection methods can be used in the modeling pipeline […]
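A minimal sketch of automatic outlier detection, assuming `IsolationForest` as the detector (one of several options in scikit-learn) and a made-up dataset with a few injected extreme rows. The `contamination` value is an assumed guess at the proportion of outliers, not a recommended setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical training data: Gaussian noise with five shifted rows.
rng = np.random.RandomState(1)
X_train = rng.normal(size=(100, 2))
X_train[:5] += 8.0

# fit_predict returns 1 for inliers and -1 for outliers.
detector = IsolationForest(contamination=0.05, random_state=1)
labels = detector.fit_predict(X_train)

# Keep only the rows marked as inliers before fitting a model.
X_clean = X_train[labels == 1]
print(X_train.shape, X_clean.shape)
```

Outlier removal of this kind is normally applied to the training data only, so that evaluation still reflects the data the model will see.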
How to Use Feature Extraction on Tabular Data for Machine Learning
Machine learning predictive modeling performance is only as good as your data, and your data is only as good as the way you prepare it for modeling. The most common approach to data preparation is to study a dataset and review the expectations of a machine learning algorithm, then carefully choose the most appropriate data […]