Using machine learning to solve real-world problems is exciting. But most eager beginners jump straight to model building—overlooking the fundamentals—resulting in models that aren’t very helpful. From understanding the data to choosing the best machine learning model for the problem, there are some common mistakes that beginners often tend to make.
But before we go over them, there is step zero, if you will: understanding the problem you are trying to solve. Ask yourself enough questions to learn about the problem and the domain. Also consider whether machine learning is necessary at all; it often helps to start without machine learning before mapping out how to solve the problem with it.
This article focuses on five common mistakes—across different steps—in machine learning and how to avoid them. We will not work with a specific dataset but will whip up simple generic code snippets as needed to demonstrate how to avoid these common pitfalls. Let’s get started.
1. Not Understanding the Data
Understanding the data is a fundamental—and should be the first—step in any machine learning project. Without a good understanding of the data you’re working with, you risk making incorrect decisions on preprocessing techniques, feature engineering and selection, and model building.
Insufficient understanding of the data can be due to many reasons. Here are some of them:
- Lack of domain and contextual knowledge can make understanding the relevance of the various features in the dataset difficult.
- Not analyzing the distribution of the data and the presence of outliers can lead to ineffective preprocessing and model training.
- Without understanding how features relate to each other (again stemming from lack of context), you might miss out on important relationships that can improve your model’s performance.
This can result in models that do not perform well and, consequently, are not very helpful in solving the problem.
How to Avoid
Use summary statistics to get an overview of the numerical features in your dataset. This includes metrics like mean, median, standard deviation, and more. To get summary statistics, you can call the describe method on the pandas dataframe containing the data:
```python
import pandas as pd

# Load your dataset
df = pd.read_csv('your_dataset.csv')

# Display summary statistics
print(df.describe())
```
Also use visualizations to understand distributions of numerical features and categorical variables to identify patterns and outliers. Here’s the code to plot the distribution and count plots of numerical and categorical features in the dataset:
```python
import seaborn as sns
import matplotlib.pyplot as plt

# Plot distributions of numeric features
numeric_features = df.select_dtypes(include=['int64', 'float64']).columns
for feature in numeric_features:
    sns.histplot(df[feature], kde=True)
    plt.title(f'{feature} Distribution')
    plt.show()

# Plot counts of categorical features
categorical_features = df.select_dtypes(include=['object', 'category']).columns
for feature in categorical_features:
    sns.countplot(x=feature, data=df)
    plt.title(f'{feature} Distribution')
    plt.show()
```
Understanding your data through a thorough exploratory data analysis will help you make more informed decisions during the preprocessing and feature engineering steps.
2. Insufficient Data Preprocessing
Real-world datasets are rarely usable in their native form and often require extensive cleaning and preprocessing to make them suitable for training a machine learning model on.
Common data preprocessing mistakes include:
- Ignoring or improperly handling missing values, which can introduce bias and make the model less useful.
- Not handling outliers can skew the results, particularly in models sensitive to the range and distribution of the data. Machine learning algorithms that use distance metrics, such as K-Nearest Neighbors, are especially sensitive to outliers.
- Using incorrect encoding methods for categorical variables can result in a loss of information or create misleading patterns.
Avoiding these data preprocessing pitfalls is, therefore, essential for preparing the data for modeling.
How to Avoid
First, let’s split the data into train and test sets as shown:
```python
from sklearn.model_selection import train_test_split

# Assuming 'Target' is the target variable
X = df.drop('Target', axis=1)
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Handle missing values: Impute missing values appropriately, for example with mean or median imputation for numerical features and mode (most frequent) imputation for categorical features.
Let’s impute the missing values in numerical and categorical columns with the mean and most frequently occurring values, respectively.
First, you fit and apply the imputers on the training data:
```python
from sklearn.impute import SimpleImputer

# Define and fit imputer for numerical features on training data
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
numeric_imputer = SimpleImputer(strategy='mean')
X_train[numeric_features] = numeric_imputer.fit_transform(X_train[numeric_features])

# Define and fit imputer for categorical features on training data
categorical_features = X.select_dtypes(include=['object', 'category']).columns
categorical_imputer = SimpleImputer(strategy='most_frequent')
X_train[categorical_features] = categorical_imputer.fit_transform(X_train[categorical_features])
```
Then, you transform the test dataset using the imputers fit on the training data like so:
```python
# Transform the test data using the numeric imputer
X_test[numeric_features] = numeric_imputer.transform(X_test[numeric_features])

# Transform the test data using the categorical imputer
X_test[categorical_features] = categorical_imputer.transform(X_test[categorical_features])
```
Note: Notice how the imputers are fit only on the training data (via fit_transform()) and then used to transform() the test data. If we fit them on the full dataset, information would leak from the test set into the data used to train the model. Data leakage is more common than you might think, and we'll talk about it later in this guide.
Scale numeric features: Many algorithms, especially those based on distances or gradient descent, work best when features are on a similar scale. Standardize or normalize features as required; you can use MinMaxScaler and StandardScaler from scikit-learn's preprocessing module.
Here’s how you can standardize numerical features such that they follow a distribution with zero mean and unit variance:
```python
from sklearn.preprocessing import StandardScaler

# Define and fit scaler for numerical features on training data
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
scaler = StandardScaler()
X_train[numeric_features] = scaler.fit_transform(X_train[numeric_features])

# Transform the test data using the fitted scaler
X_test[numeric_features] = scaler.transform(X_test[numeric_features])
```
Encode categorical variables: You should encode categorical variables, converting them to numerical representations, before you feed them to the machine learning model (a short sketch follows this list). You can use:
- One-hot encoding for simple categorical variables.
- Ordinal encoding if there’s an inherent ordering among the values of the variables.
- Label encoding to encode target labels.
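Here's a minimal sketch of all three options with scikit-learn, assuming the train/test split from earlier. The column names ('Color', 'Size') are hypothetical and only serve to illustrate the API:

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder

# One-hot encode a nominal feature (hypothetical 'Color' column)
onehot = OneHotEncoder(handle_unknown='ignore')
color_train = onehot.fit_transform(X_train[['Color']])
color_test = onehot.transform(X_test[['Color']])

# Ordinal-encode a feature with an inherent order (hypothetical 'Size' column)
ordinal = OrdinalEncoder(categories=[['small', 'medium', 'large']])
size_train = ordinal.fit_transform(X_train[['Size']])
size_test = ordinal.transform(X_test[['Size']])

# Label-encode the target if it is categorical
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)
```

As with the imputers and scaler, each encoder is fit on the training data only and then used to transform the test data.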
To learn more about encoding, read Ordinal and One-Hot Encodings for Categorical Data.
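Handle outliers: Outliers were listed as a preprocessing pitfall above but not demonstrated. As one option (a sketch, not the only approach), you can clip extreme values using the IQR rule, with the bounds computed on the training data only:

```python
# Clip extreme values in the numeric features using the IQR rule;
# compute the bounds on the training data only to avoid leakage
for feature in numeric_features:
    q1 = X_train[feature].quantile(0.25)
    q3 = X_train[feature].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    X_train[feature] = X_train[feature].clip(lower, upper)
    X_test[feature] = X_test[feature].clip(lower, upper)
```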
This is not an exhaustive list of preprocessing steps, but handling missing values, scaling numeric features, and encoding categorical variables should all be in place before you proceed to feature engineering.
3. Lack of Feature Engineering
Feature engineering is the process of understanding and manipulating existing features and creating new representative features that better capture the underlying relationships between features in the data. But most beginners overlook this super important step.
Without effective feature engineering, the model might not capture the essential relationships in the data, leading to suboptimal performance:
- Not using domain knowledge to create meaningful features can limit the model’s effectiveness.
- Ignoring the creation of interaction features—based on meaningful relationships between features—can mean missing out on significant relationships between variables.
Feature engineering, therefore, is much more than handling missing values and outliers, scaling features, and encoding categorical variables.
How to Avoid
Here are some tips for feature engineering.
Create new features: Use domain-specific insights to create new features that capture important aspects of the data.
Here’s a simple example:
```python
# Create a new feature as a ratio of two existing features
df['New_Feature'] = df['Feature1'] / df['Feature2']
```
Create interaction features: Create features that represent interactions between existing features. Here’s an example that generates and adds interaction features—products of pairs of numeric features—to the dataframe using the PolynomialFeatures class:
```python
from sklearn.preprocessing import PolynomialFeatures

# Generate pairwise interaction terms for the numeric features
interaction = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interaction_features = interaction.fit_transform(df[numeric_features])

# Add the interaction features back to the dataframe
interaction_df = pd.DataFrame(
    interaction_features,
    columns=interaction.get_feature_names_out(numeric_features),
    index=df.index)
df = pd.concat([df, interaction_df], axis=1)
```
Create aggregated features: It can sometimes be helpful to create aggregated features such as ratios, differences, or rolling statistics. The following code calculates the moving average of the ‘Feature’ column over three consecutive data points:
```python
# Calculate the moving average of 'Feature' over three consecutive data points
df['Rolling_Mean'] = df['Feature'].rolling(window=3).mean()
```
For a more detailed overview of feature engineering, read Discover Feature Engineering, How to Engineer Features and How to Get Good at It.
4. Data Leakage
Data leakage is a subtle (but surprisingly common) problem in machine learning that occurs when information from outside the training dataset, typically from the test set, makes its way into model training. If you recall, we touched on this when we preprocessed the dataset.
Data leakage results in models with overly optimistic performance estimates and models that perform poorly on (truly) unseen data. This occurs due to reasons such as:
- Using test data or information from the test data during training or validation
- Applying preprocessing steps before splitting the data
This problem is relatively easy to avoid if you're careful during the preprocessing steps.
How to Avoid
Let’s now discuss how to avoid data leakage.
Avoid preprocessing the full dataset: Always split the data into training and test sets before applying any preprocessing. Here’s how you can split the data into train and test sets:
```python
from sklearn.model_selection import train_test_split

# Split into train and test sets before any preprocessing
X = df.drop('Target', axis=1)
y = df['Target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```
Use pipelines: Use pipelines to ensure that preprocessing steps are fitted only on the training data and then applied consistently to the test data. You can use scikit-learn pipelines for this.
Here’s an example pipeline to handle missing values and encode categorical variables:
```python
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

# Preprocessing for numeric features: impute, then scale
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Preprocessing for categorical features: impute, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Apply the appropriate transformer to each column type
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

# Chain preprocessing and the model so both are fit only on the training data
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RandomForestRegressor(random_state=42))
])

pipeline.fit(X_train, y_train)
```
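Once the pipeline is fitted, evaluating it on the held-out test set is a one-liner; the preprocessing learned from the training data is reapplied automatically (a sketch, assuming the regression setup above):

```python
# Evaluate the fitted pipeline on the held-out test set;
# the preprocessing fitted on the training data is applied automatically
print(pipeline.score(X_test, y_test))
```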
Preventing data leakage by properly splitting the data and using pipelines ensures that your model’s performance metrics are accurate and reliable. Read Modeling Pipeline Optimization With scikit-learn to learn more about improving your workflow with pipelines.
5. Underfitting and Overfitting
Underfitting and overfitting are both common problems you should avoid to build robust machine learning models.
Underfitting occurs when your model is too simple to capture the relationship between the input features and the output in the data. As a result, your model performs poorly on both the training and the test datasets.
Overfitting occurs when a model is too complex and captures noise in the training data instead of the actual patterns. An overfit model performs extremely well on the training data but generalizes poorly to new data it hasn’t seen before.
How to Avoid
Now let’s go over the solutions to overfitting and underfitting.
To avoid underfitting:
- Try increasing the model complexity. Even if you start with a simple model, gradually switch to a more complex model that can better capture the patterns in the data (see the sketch after this list).
- Use feature engineering and add more relevant features to the model.
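Here's a minimal sketch of the first tip, assuming a regression problem and fully numeric, preprocessed features (the specific models are just examples):

```python
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Start with a simple baseline model
simple_model = LinearRegression()
simple_model.fit(X_train, y_train)

# If the baseline underfits, switch to a more flexible model
flexible_model = RandomForestRegressor(n_estimators=200, random_state=42)
flexible_model.fit(X_train, y_train)

# Compare training scores; a low score for both suggests underfitting
print(simple_model.score(X_train, y_train), flexible_model.score(X_train, y_train))
```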
To avoid overfitting:
- Use cross-validation during model evaluation to ensure that the model generalizes well to unseen data.
- Try using a simpler model with fewer parameters.
- If you can, add more training data as it’ll help the model generalize better.
- Apply regularization techniques like L1 and L2 regularization to penalize large parameter values (see the sketch after this list).
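Here's a rough sketch of cross-validation combined with L2 regularization, again assuming a regression problem with fully numeric features (Ridge and the alpha value are just examples):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Ridge applies L2 regularization; alpha controls the penalty strength
model = Ridge(alpha=1.0)

# 5-fold cross-validation gives a more reliable estimate of generalization
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
print(f'Mean CV R^2: {scores.mean():.3f} (+/- {scores.std():.3f})')
```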
Experimenting with models of varying complexity and using regularization techniques are generally helpful in building robust models. Check out Tips for Choosing the Right Machine Learning Model for Your Data for practical advice on model selection in machine learning.
Summary
In this guide, we focused on common pitfalls that are problem agnostic and apply to machine learning tasks in general.
As discussed, when you use machine learning to solve business problems, be sure to keep the following in mind:
- Spend enough time understanding the dataset: the different features, their significance, and the most relevant subset of features for the problem.
- Apply the correct data cleaning and preprocessing techniques to handle missing values, outliers, and categorical variables. Scale numeric features as needed depending on the algorithm you’re using.
- In addition to preprocessing the existing features, you can also create new representative features that are more useful in making predictions.
- To avoid data leakage, make sure that you are not using any information from the test data in your model.
- It’s important to pick the model with the right complexity as models that are too simple or too complex are not very helpful.
Happy machine learning!