You must understand your data in order to get the best results.
In this post you will discover 7 recipes that you can use in Python to learn more about your machine learning data.
Kick-start your project with my new book Machine Learning Mastery With Python, including step-by-step tutorials and the Python source code files for all examples.
Let’s get started.
- Update Mar/2018: Added alternate link to download the dataset as the original appears to have been taken down.
Python Recipes To Understand Your Machine Learning Data
This section lists 7 recipes that you can use to better understand your machine learning data.
Each recipe is demonstrated by loading the Pima Indians Diabetes classification dataset.
Open your Python interactive environment and try each recipe out in turn.
1. Peek at Your Data
There is no substitute for looking at the raw data.
Looking at the raw data can reveal insights that you cannot get any other way. It can also plant seeds that may later grow into ideas on how to better preprocess and handle the data for machine learning tasks.
You can review the first 20 rows of your data using the head() function on the Pandas DataFrame.
```python
# View first 20 rows
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
peek = data.head(20)
print(peek)
```
You can see that the first column lists the row index, which is handy for referencing a specific observation.
```
    preg  plas  pres  skin  test  mass   pedi  age  class
0      6   148    72    35     0  33.6  0.627   50      1
1      1    85    66    29     0  26.6  0.351   31      0
2      8   183    64     0     0  23.3  0.672   32      1
3      1    89    66    23    94  28.1  0.167   21      0
4      0   137    40    35   168  43.1  2.288   33      1
5      5   116    74     0     0  25.6  0.201   30      0
6      3    78    50    32    88  31.0  0.248   26      1
7     10   115     0     0     0  35.3  0.134   29      0
8      2   197    70    45   543  30.5  0.158   53      1
9      8   125    96     0     0   0.0  0.232   54      1
10     4   110    92     0     0  37.6  0.191   30      0
11    10   168    74     0     0  38.0  0.537   34      1
12    10   139    80     0     0  27.1  1.441   57      0
13     1   189    60    23   846  30.1  0.398   59      1
14     5   166    72    19   175  25.8  0.587   51      1
15     7   100     0     0     0  30.0  0.484   32      1
16     0   118    84    47   230  45.8  0.551   31      1
17     7   107    74     0     0  29.6  0.254   31      1
18     1   103    30    38    83  43.3  0.183   33      0
19     1   115    70    30    96  34.6  0.529   32      1
```
2. Dimensions of Your Data
You must have a very good handle on how much data you have, both in terms of rows and columns.
- Too many rows and algorithms may take too long to train. Too few and perhaps you do not have enough data to train the algorithms.
- Too many features and some algorithms can be distracted or suffer poor performance due to the curse of dimensionality.
You can review the shape and size of your dataset by printing the shape property on the Pandas DataFrame.
```python
# Dimensions of your data
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
shape = data.shape
print(shape)
```
The results are listed in rows then columns. You can see that the dataset has 768 rows and 9 columns.
```
(768, 9)
```
3. Data Type For Each Attribute
The type of each attribute is important.
Strings may need to be converted to floating point values or integers to represent categorical or ordinal values.
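As a quick sketch of that conversion (a hypothetical example, not part of the recipes below), pandas.to_numeric() turns a string column into a numeric type:

```python
# Hypothetical example: a column read in as strings can be converted
# to a numeric dtype with pandas.to_numeric().
import pandas

s = pandas.Series(['1', '2', '3'])
print(s.dtype)        # object (strings)
numeric = pandas.to_numeric(s)
print(numeric.dtype)  # int64
```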
You can get an idea of the types of attributes by peeking at the raw data, as above. You can also list the data types used by the DataFrame to characterize each attribute using the dtypes property.
```python
# Data Types for Each Attribute
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
types = data.dtypes
print(types)
```
You can see that most of the attributes are integers and that mass and pedi are floating point values.
```
preg       int64
plas       int64
pres       int64
skin       int64
test       int64
mass     float64
pedi     float64
age        int64
class      int64
dtype: object
```
4. Descriptive Statistics
Descriptive statistics can give you great insight into the shape of each attribute.
Often you can create more summaries than you have time to review. The describe() function on the Pandas DataFrame lists 8 statistical properties of each attribute:
- Count
- Mean
- Standard Deviation
- Minimum Value
- 25th Percentile
- 50th Percentile (Median)
- 75th Percentile
- Maximum Value
```python
# Statistical Summary
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
pandas.set_option('display.width', 100)
pandas.set_option('display.precision', 3)  # recent pandas requires the 'display.' prefix
description = data.describe()
print(description)
```
You get a lot of output. Note the calls to pandas.set_option() in the recipe, which change the precision of the numbers and the preferred width of the output to make the results more readable for this example.
When describing your data this way, it is worth taking some time and reviewing observations from the results. This might include the presence of “NA” values for missing data or surprising distributions for attributes.
```
          preg     plas     pres     skin     test     mass     pedi      age    class
count  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000  768.000
mean     3.845  120.895   69.105   20.536   79.799   31.993    0.472   33.241    0.349
std      3.370   31.973   19.356   15.952  115.244    7.884    0.331   11.760    0.477
min      0.000    0.000    0.000    0.000    0.000    0.000    0.078   21.000    0.000
25%      1.000   99.000   62.000    0.000    0.000   27.300    0.244   24.000    0.000
50%      3.000  117.000   72.000   23.000   30.500   32.000    0.372   29.000    0.000
75%      6.000  140.250   80.000   32.000  127.250   36.600    0.626   41.000    1.000
max     17.000  199.000  122.000   99.000  846.000   67.100    2.420   81.000    1.000
```
5. Class Distribution (Classification Only)
On classification problems you need to know how balanced the class values are.
Highly imbalanced problems (a lot more observations for one class than another) are common and may need special handling in the data preparation stage of your project.
You can quickly get an idea of the distribution of the class attribute in Pandas.
```python
# Class Distribution
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
class_counts = data.groupby('class').size()
print(class_counts)
```
You can see that there are nearly double the number of observations with class 0 (no onset of diabetes) than there are with class 1 (onset of diabetes).
```
class
0    500
1    268
```
6. Correlation Between Attributes
Correlation refers to the relationship between two variables and how they may or may not change together.
The most common method for calculating correlation is Pearson’s Correlation Coefficient, which assumes a normal distribution of the attributes involved. A correlation of -1 or 1 shows a full negative or positive correlation respectively, whereas a value of 0 shows no correlation at all.
Some machine learning algorithms like linear and logistic regression can suffer poor performance if there are highly correlated attributes in your dataset. As such, it is a good idea to review all of the pair-wise correlations of the attributes in your dataset. You can use the corr() function on the Pandas DataFrame to calculate a correlation matrix.
```python
# Pairwise Pearson correlations
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
pandas.set_option('display.width', 100)
pandas.set_option('display.precision', 3)  # recent pandas requires the 'display.' prefix
correlations = data.corr(method='pearson')
print(correlations)
```
The matrix lists all attributes across the top and down the side, giving the correlation between all pairs of attributes (twice, because the matrix is symmetrical). The diagonal from the top left to the bottom right shows the perfect correlation of each attribute with itself.
```
        preg   plas   pres   skin   test   mass   pedi    age  class
preg   1.000  0.129  0.141 -0.082 -0.074  0.018 -0.034  0.544  0.222
plas   0.129  1.000  0.153  0.057  0.331  0.221  0.137  0.264  0.467
pres   0.141  0.153  1.000  0.207  0.089  0.282  0.041  0.240  0.065
skin  -0.082  0.057  0.207  1.000  0.437  0.393  0.184 -0.114  0.075
test  -0.074  0.331  0.089  0.437  1.000  0.198  0.185 -0.042  0.131
mass   0.018  0.221  0.282  0.393  0.198  1.000  0.141  0.036  0.293
pedi  -0.034  0.137  0.041  0.184  0.185  0.141  1.000  0.034  0.174
age    0.544  0.264  0.240 -0.114 -0.042  0.036  0.034  1.000  0.238
class  0.222  0.467  0.065  0.075  0.131  0.293  0.174  0.238  1.000
```
7. Skew of Univariate Distributions
Skew refers to a distribution that is assumed to be Gaussian (normal or bell curve) but is shifted or squashed in one direction or another.
Many machine learning algorithms assume a Gaussian distribution. Knowing that an attribute has a skew may allow you to perform data preparation to correct the skew and later improve the accuracy of your models.
You can calculate the skew of each attribute using the skew() function on the Pandas DataFrame.
```python
# Skew for each attribute
import pandas
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pandas.read_csv(url, names=names)
skew = data.skew()
print(skew)
```
The skew results show a positive (right) or negative (left) skew. Values closer to zero show less skew.
```
preg     0.901674
plas     0.173754
pres    -1.843608
skin     0.109372
test     2.272251
mass    -0.428982
pedi     1.919911
age      1.129597
class    0.635017
```
More Recipes
This was just a selection of the most useful summaries and descriptive statistics that you can use on your machine learning data for classification and regression.
There are many other statistics that you could calculate.
Is there a specific statistic that you like to calculate and review when you start working on a new data set? Leave a comment and let me know.
Tips To Remember
This section gives you some tips to remember when reviewing your data using summary statistics.
- Review the numbers. Generating the summary statistics is not enough. Take a moment to pause, read and really think about the numbers you are seeing.
- Ask why. Review your numbers and ask a lot of questions. How and why are you seeing specific numbers? Think about how the numbers relate to the problem domain in general and to the specific entities that the observations relate to.
- Write down ideas. Write down your observations and ideas. Keep a small text file or note pad and jot down all of the ideas for how variables may relate, for what numbers mean, and ideas for techniques to try later. The things you write down now while the data is fresh will be very valuable later when you are trying to think up new things to try.
Summary
In this post you discovered the importance of describing your dataset before you start work on your machine learning project.
You discovered 7 different ways to summarize your dataset using Python and Pandas:
- Peek At Your Data
- Dimensions of Your Data
- Data Types
- Class Distribution
- Data Summary
- Correlations
- Skewness
Action Step
- Open your Python interactive environment.
- Type or copy-and-paste each recipe and see how it works.
- Let me know how you go in the comments.
Do you have any questions about Python, Pandas or the recipes in this post? Leave a comment and ask your question, I will do my best to answer it.
Excellent write-up. I definitely appreciate this site. Continue the good work!
Thanks M. Willson, I’m glad you found it useful.
Hi Jason,
Thank you for the explanation. It is very clear and I understood many things from the article.
Perhaps I have a doubt about the purpose of each step, i.e. what message it conveys with respect to the data. For example, “Descriptive Statistics” gives so much output about the data, but what should I understand from that?
Seeking some clarification on the same.
We are understanding the univariate distribution of each feature.
This can help in data preparation and algorithm selection.
Sir, can you please explain how it helps us in algorithm selection?
E.g. a Gaussian distribution may suggest a linear model. A skewed Gaussian may suggest a power transform. Outliers, if present, can be removed. An exponential distribution could be power transformed. Discrete integer values may suggest decision trees; different units may suggest scaling prior to KNN and weighted-sum algorithms; etc.
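As a sketch of the power-transform idea (synthetic data, not the Pima dataset), a simple log transform can pull a heavy right tail back toward symmetry:

```python
# Synthetic example: a log transform reduces strong positive skew.
import numpy
import pandas

numpy.random.seed(1)
values = pandas.Series(numpy.random.exponential(scale=2.0, size=1000))
print(values.skew())               # strongly positive (around 2 for an exponential)
transformed = numpy.log1p(values)  # log(1 + x) compresses the long right tail
print(transformed.skew())          # much closer to zero
```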
Hey Jason, amazingly concise and effective post, as always.
Any suggestions on how to do exploratory analysis with binary features?
Please don’t stop your work, it’s immensely helpful. 🙂
Thanks, glad to hear it Neeraj.
A good start with binary and categorical variables is to look at proportions by level.
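A minimal sketch of that idea (with made-up binary data), using value_counts with normalize=True to get the proportion at each level:

```python
# Made-up binary feature: proportion of observations at each level.
import pandas

feature = pandas.Series([0, 1, 0, 0, 1, 0, 0, 1, 0, 0])
proportions = feature.value_counts(normalize=True)
print(proportions)  # 0 -> 0.7, 1 -> 0.3
```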
“Is there a specific statistic that you like to calculate and review when you start working on a new data set?”
> in addition to the descriptive stats given by .describe(), I like to calculate:
– median
– 95%-ile
oops. 50th percentile is median. already given by “.describe()”
Nice!
Hi, I want to thank you for your useful and well explained articles, specially this one.. But I would like to ask you about imbalanced data and resampling effectiveness if you don’t mind.
Well, as you have stated in your other article, in some cases the imbalance is so natural because like in my case where I have mainly 2 classes with this data distribution (A: 6418, B: 81) the class B is a rare phenomena that we are looking to understand its reasons.
I’m more interested in class B, but I’m afraid that resampling changed a lot in my data, since I noticed a change in correlation after applying undersampling for class A and oversampling for B to get 1000:1000 samples in the final dataset.
total correlation of my 6 features with the target feature passed from 0.14 to 0.67 which I find somehow artificial and not realistic.
If you can help me understand this, I will be so grateful. And thanks for the wonderful website 🙂
I have some ideas for handling imbalanced data in this post that might help as a start:
https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
I am building a linear regression model using a dataset which is a CSV file of numbers with 13 columns and 5299 rows. I am following your tutorials and applied the skew function and correlation, but it shows an empty array, and in the data types section it shows object for all 13 columns. Please help.
Perhaps try posting your code and data to stackoverflow?
Hi Jason!
Loved the blog post! You structured it very well. Thanks for all the guidance.
I have a small question. As you mentioned in the data types section, many machine learning algorithms take numerical attributes as input. Is there any specific reason why they do that?
Not really, more of an engineering reason so the same APIs can be used for many algorithms.
Many algorithms don’t care about data types in principle, e.g. decision trees, knn, etc.
Dear Sir,
I am new to Python. I am lucky to have found this website and started learning with it. I did a small practice myself based on your lesson above, as follows (I try to load the data with numpy.loadtxt):

```python
import pandas
import numpy
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataset = numpy.loadtxt("pima-indians-diabetes.csv", delimiter=",")
description = dataset.dtypes()
print(description)
```
However, there is an error and I don’t know how to correct.
"AttributeError: 'numpy.ndarray' object has no attribute 'dtypes'"
Please help. Thank you.
Looks like you have mixed up a Numpy data loading example with a pandas exploration example.
You can load the data as a DataFrame via pandas, or convert your loaded NumPy array to a DataFrame in order to access dtypes.
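For example (a sketch with two hypothetical rows), wrapping the array in a DataFrame gives you the dtypes property, which is an attribute, not a method, so no parentheses:

```python
# Sketch: wrap a NumPy array in a DataFrame to inspect dtypes.
import numpy
import pandas

names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
array = numpy.array([[6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
                     [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0]])
data = pandas.DataFrame(array, columns=names)
print(data.dtypes)  # all float64: a NumPy array has one dtype for every column
```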
Thank you so much. You are right. The problem is solved now.
Glad to hear it.
U r Awesome… Thanks a lot
Thanks!
Hello,
I have data for a project. I used the skew method, but some numbers go above 20 and some numbers stay at 0.5. What does that really mean?
I am not using classification; I use a score between 0 and 10. Is this still a good method to use? Or does someone have an idea about that? Please help.
Greetzzz…
Perhaps try modeling with and without some scaling/transforms and see if it matters based on model skill.
Hi Jason,
i like your way of teaching as always.
Could you advise some code to make the correlation more visual? By the numbers alone, I guess it is hard to read. But as far as I know, there are some heat maps we could utilize.
Thanks for your help in advance.
Yes, you can use a scatter plot matrix of each pair of variables.
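A sketch of both options (using a small synthetic DataFrame so the example stands alone), with pandas’ scatter_matrix and a matplotlib heatmap of the correlation matrix:

```python
# Synthetic data: visualize pairwise relationships and the correlation matrix.
import numpy
import pandas
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

numpy.random.seed(0)
data = pandas.DataFrame(numpy.random.rand(100, 3), columns=['a', 'b', 'c'])

# Scatter plot matrix of each pair of variables
scatter_matrix(data)
plt.savefig("scatter_matrix.png")

# Correlation matrix as a heatmap
correlations = data.corr(method='pearson')
fig, ax = plt.subplots()
cax = ax.matshow(correlations, vmin=-1, vmax=1)
fig.colorbar(cax)
ax.set_xticks(range(3))
ax.set_yticks(range(3))
ax.set_xticklabels(data.columns)
ax.set_yticklabels(data.columns)
fig.savefig("correlations_heatmap.png")
```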
Your analysis is good!
What if I want to know details regarding the features?
Where is DESCR used?
Sorry, what do you mean exactly, can you elaborate please?
What is DESCR? description?
All good thanks a lot
Great.
Thank you Jason for your post.
I wonder if you could help me. I’m trying to build a random forest model. I used pandas to read a CSV dataset, and after splitting the dataset into y and X I’m trying to print X’s columns, but I get this error:
AttributeError: ‘numpy.ndarray’ object has no attribute ‘columns’
It looks like you are trying to call .columns on a numpy array. You cannot.
Perhaps try the same code with a dataframe of your data.
Very good write-up, clearly explained.
Can you add a few plots like scatter, histogram etc. too?
Thanks.
For plots, see this tutorial:
https://machinelearningmastery.com/data-visualization-methods-in-python/
In machine learning, some say descriptive analysis is done on the whole dataset, while some say it is done on the training dataset only. Can you explain which one is the correct way?
Thank you.
It depends on the goal of your project.
I like the way you explain everything,
thank you so much dear sir.
Thank you for the step-by-step process for data preparation.
You are very welcome PriyaK! Thank you for the feedback and support!