Visualize Machine Learning Data in Python With Pandas

You must understand your data in order to get the best results from machine learning algorithms.

The fastest way to learn more about your data is to use data visualization.

In this post you will discover exactly how you can visualize your machine learning data in Python using Pandas.

Let’s get started.

About The Recipes

Each recipe in this post is complete and standalone so that you can copy-and-paste it into your own project and use it immediately.

The Pima Indians dataset is used to demonstrate each plot. This dataset describes the medical records for Pima Indians and whether or not each patient will have an onset of diabetes within five years. As such it is a classification problem.

It is a good dataset for demonstration because all of the input attributes are numeric and the output variable to be predicted is binary (0 or 1).

The data is freely available from the UCI Machine Learning Repository and is downloaded directly as part of each recipe.
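As a rough sketch of the loading step that each recipe relies on, the snippet below reads comma-separated rows into a pandas DataFrame with named columns. Most of the short column names are the ones used in the discussion in this post (plas, pres, skin, test, mass, pedi, age); `preg` and `class` as labels for the first and last columns, the absence of a header row, and the three inline stand-in rows are assumptions made so that the example runs without a download. Point `read_csv` at your own copy of the file instead.

```python
from io import StringIO

import pandas as pd

# Short column names; 'preg' and 'class' are assumed labels for the first and last columns.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Stand-in rows in the nine-column layout so the sketch runs offline.
# Replace sample_csv with the path or URL of your own copy of the dataset.
sample_csv = StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
)

# The file is assumed to have no header row, so the names are supplied explicitly.
data = pd.read_csv(sample_csv, names=names)

print(data.shape)   # (rows, columns)
print(data.dtypes)  # every column should be numeric
```

Checking the shape and the column types straight after loading is a cheap way to confirm that the names line up with the columns before plotting anything.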

Univariate Plots

In this section we will look at techniques that you can use to understand each attribute independently.

Histograms

A fast way to get an idea of the distribution of each attribute is to look at histograms.

Histograms group data into bins and give you a count of the number of observations in each bin. From the shape of the bins you can quickly get a feeling for whether an attribute is Gaussian, skewed or even has an exponential distribution. Histograms can also help you see possible outliers.
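A minimal sketch of a histogram recipe, assuming the short column names used in this post and pandas' built-in `DataFrame.hist()`. The DataFrame here is filled with placeholder random values so that the snippet runs on its own; it will not reproduce the shapes of the real attributes, so swap in your own copy of the dataset to see those. The output file name is also just an illustrative choice.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Placeholder values (assumption) so the sketch is self-contained.
# With the real file: data = pd.read_csv(path_to_your_copy, names=names)
rng = np.random.default_rng(seed=7)
data = pd.DataFrame(rng.normal(size=(200, len(names))), columns=names)

# One histogram per column, arranged in a grid of subplots.
axes = data.hist(figsize=(10, 8))

plt.tight_layout()
plt.savefig('histograms.png')  # or plt.show() in an interactive session
```

The `bins` argument of `hist()` controls how finely the values are grouped, which can change how skewed or Gaussian an attribute appears.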

We can see that the attributes age, pedi and test may have an exponential distribution. We can also see that the mass, pres and plas attributes may have a Gaussian or nearly Gaussian distribution. This is interesting because many machine learning techniques assume a Gaussian univariate distribution on the input variables.

Univariate Histograms

Density Plots

Density plots are another way of getting a quick idea of the distribution of each attribute. The plots look like an abstracted histogram with a smooth curve drawn through the top of each bin, much like your eye tried to do with the histograms.
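A density plot recipe could look something like the sketch below, which uses `DataFrame.plot(kind='density')` with one subplot per attribute. It assumes SciPy is installed (pandas uses it for the kernel density estimate) and again uses placeholder random values and an illustrative output file name rather than the real dataset.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Placeholder values (assumption) so the sketch is self-contained.
# With the real file: data = pd.read_csv(path_to_your_copy, names=names)
rng = np.random.default_rng(seed=7)
data = pd.DataFrame(rng.normal(size=(200, len(names))), columns=names)

# One smoothed density curve per column; each subplot keeps its own x-axis
# because the attributes are on different scales.
data.plot(kind='density', subplots=True, layout=(3, 3), sharex=False,
          figsize=(10, 8))

fig = plt.gcf()
fig.tight_layout()
fig.savefig('density_plots.png')  # or plt.show() in an interactive session
```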

We can see that the distribution of each attribute is clearer than in the histograms.

Univariate Density Plots

Box and Whisker Plots

Another useful way to review the distribution of each attribute is to use Box and Whisker Plots or boxplots for short.

Boxplots summarize the distribution of each attribute, drawing a line for the median (middle value) and a box around the 25th and 75th percentiles (the middle 50% of the data). The whiskers give an idea of the spread of the data and dots outside of the whiskers show candidate outlier values (values that lie more than 1.5 times the spread of the middle 50% of the data beyond the edges of the box).
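The sketch below shows one way to draw a box and whisker plot per attribute with `DataFrame.plot(kind='box')`, followed by a numeric check of the 1.5 times rule using the interquartile range. The data is placeholder random values and the output file name is an illustrative choice, so treat the counts it prints as a demonstration of the calculation only.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Placeholder values (assumption) so the sketch is self-contained.
# With the real file: data = pd.read_csv(path_to_your_copy, names=names)
rng = np.random.default_rng(seed=7)
data = pd.DataFrame(rng.normal(size=(200, len(names))), columns=names)

# One boxplot per column, each with its own axes because the scales differ.
data.plot(kind='box', subplots=True, layout=(3, 3), sharex=False,
          sharey=False, figsize=(10, 8))
fig = plt.gcf()
fig.tight_layout()
fig.savefig('boxplots.png')  # or plt.show() in an interactive session

# The same rule checked numerically: flag values more than
# 1.5 x IQR below the 25th or above the 75th percentile.
q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1
is_outlier = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print(is_outlier.sum())  # candidate outliers per column
```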

We can see that the spread of the attributes is quite different. Some, like age, test and skin, appear quite skewed towards smaller values.

Univariate Box and Whisker Plots

Multivariate Plots

This section shows examples of plots with interactions between multiple variables.

Correlation Matrix Plot

Correlation gives an indication of how related the changes are between two variables. If two variables change in the same direction they are positively correlated. If they change in opposite directions (one goes up, one goes down), then they are negatively correlated.

You can calculate the correlation between each pair of attributes. This is called a correlation matrix. You can then plot the correlation matrix and get an idea of which variables have a high correlation with each other.

This is useful to know, because some machine learning algorithms like linear and logistic regression can have poor performance if there are highly correlated input variables in your data.
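A correlation matrix plot can be sketched with `DataFrame.corr()` and matplotlib's `matshow()`, as below. The data is placeholder random values, so the off-diagonal cells will be close to zero here; the column names and output file name are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Placeholder values (assumption) so the sketch is self-contained.
# With the real file: data = pd.read_csv(path_to_your_copy, names=names)
rng = np.random.default_rng(seed=7)
data = pd.DataFrame(rng.normal(size=(200, len(names))), columns=names)

# Pairwise Pearson correlation between every pair of columns.
correlations = data.corr()

# Draw the matrix as a colored grid on a fixed -1 to 1 scale.
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111)
cax = ax.matshow(correlations.to_numpy(), vmin=-1, vmax=1)
fig.colorbar(cax)

# Label the ticks from the matrix itself so labels and cells stay aligned.
ticks = np.arange(len(correlations.columns))
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(correlations.columns)
ax.set_yticklabels(correlations.columns)
fig.savefig('correlation_matrix.png')  # or plt.show() in an interactive session
```

Taking the tick labels from `correlations.columns` rather than a hand-written list keeps labels and cells aligned if the correlation matrix ends up with fewer columns than the original data, for example when non-numeric columns are excluded.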

We can see that the matrix is symmetrical, i.e. the bottom left of the matrix is the same as the top right. This is useful as we can see two different views on the same data in one plot. We can also see that each variable is perfectly positively correlated with itself (as you would expect) in the diagonal line from top left to bottom right.

Correlation Matrix Plot

Scatterplot Matrix

A scatterplot shows the relationship between two variables as dots in two dimensions, one axis for each attribute. You can create a scatterplot for each pair of attributes in your data. Drawing all these scatterplots together is called a scatterplot matrix.

Scatterplots are useful for spotting structured relationships between variables, like whether you could summarize the relationship between two variables with a line. Attributes with structured relationships may also be correlated and good candidates for removal from your dataset.
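A scatterplot matrix can be sketched with `pandas.plotting.scatter_matrix()`, as below. The data is placeholder random values, so no structured relationships will show up here; the column names and output file name are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']

# Placeholder values (assumption) so the sketch is self-contained.
# With the real file: data = pd.read_csv(path_to_your_copy, names=names)
rng = np.random.default_rng(seed=7)
data = pd.DataFrame(rng.normal(size=(200, len(names))), columns=names)

# One scatterplot for every pair of columns, returned as a grid of axes.
axes = scatter_matrix(data, figsize=(12, 12))

plt.savefig('scatter_matrix.png')  # or plt.show() in an interactive session
```

With nine attributes the grid has 81 panels, so a large `figsize` helps keep the individual plots readable.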

Like the Correlation Matrix Plot, the scatterplot matrix is symmetrical. This is useful for looking at the pair-wise relationships from different perspectives. Because there is little point in drawing a scatterplot of each variable with itself, the diagonal shows histograms of each attribute.

Scatterplot Matrix

Summary

In this post you discovered a number of ways that you can better understand your machine learning data in Python using Pandas.

Specifically, you learned how to plot your data using:

  • Histograms
  • Density Plots
  • Box and Whisker Plots
  • Correlation Matrix Plot
  • Scatterplot Matrix

Open your Python interactive environment and try out each recipe.

Do you have any questions about Pandas or the recipes in this post? Ask in the comments and I will do my best to answer.

6 Responses to Visualize Machine Learning Data in Python With Pandas

  1. saimadhu September 1, 2016 at 4:14 pm #

    Hi Jason Brownlee,

    Thanks for this post. Until now I have been using different Python visualization libraries like matplotlib, plotly or seaborn to get more out of the data I have loaded into a pandas dataframe. I was not aware that pandas itself could be used for visualization.

    From now onwards I am gonna use your recipe for visualization.

    • Jason Brownlee September 2, 2016 at 8:05 am #

      I’m glad the post was useful saimadhu.

    • naresh October 15, 2016 at 10:32 pm #

      Hello Jason,
      What can we deduce from the box plot of the class variable? Why and when do we get this kind of plot?

      • Jason Brownlee October 17, 2016 at 10:20 am #

        Great question naresh.

        Box plots are great for getting a snapshot of the spread of the data and where the meat of the data is on the scale. They also help you quickly spot outliers (points outside the whiskers, that is, more than 1.5 x IQR beyond the box).

  2. naresh October 16, 2016 at 12:31 am #

    Hello jason ,
    When I try to create a correlation matrix for my own dataset with 12 variables, only 7 variables are colored in the matrix and the remaining 5 are white. I only changed
    "ticks=np.arange(0,12,1)" from 9 to 12:

    import numpy as np
    import matplotlib.pyplot as plt
    names=['PassId','Sur','Pclas','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Emb']
    correlation=train.corr()
    #create a correlation matrix
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(correlation, vmin=-1, vmax=1)
    fig.colorbar(cax)
    ticks=np.arange(0,12,1)
    ax.set_xticks(ticks)
    ax.set_yticks(ticks)
    ax.set_xticklabels(names)
    ax.set_yticklabels(names)
    plt.show()

    The same happens with the scatter plot. Could you please let me know where the issue is?
    Also, how do we decide which scatter plots are the most valuable?

    • Jason Brownlee October 17, 2016 at 10:21 am #

      Great question naresh, I don’t know off the top of my head.

      I would suggest looking into how to specify your own color lists to the function. Perhaps the limit is 6-7 defaults.
