Better Understand Your Data in R Using Descriptive Statistics

By Jason Brownlee on August 22, 2019 in R Machine Learning

You must become intimate with your data.

Any machine learning models that you build are only as good as the data that you provide them. The first step in understanding your data is to actually look at some raw values and calculate some basic statistics.

In this post, you will discover how you can quickly get a handle on your dataset with descriptive statistics examples and recipes in R.

These recipes are perfect for you if you are a developer just getting started using R for machine learning.

Kick-start your project with my new book Machine Learning Mastery With R, including step-by-step tutorials and the R source code files for all examples.

Let’s get started.

Update Nov/2016: As a helpful update, this tutorial assumes you have the mlbench and e1071 R packages installed. They can be installed by typing: install.packages(c("e1071", "mlbench"))

Understand Your Data in R Using Descriptive Statistics
Photo by Enamur Reza, some rights reserved.

You Must Understand Your Data

Understanding the data that you have is critically important.

You can run techniques and algorithms on your data, but it is not until you take the time to truly understand your dataset that you can fully understand the context of the results you achieve.

Better Understanding Equals Better Results

A deeper understanding of your data will give you better results.

Taking the time to study the data you have will help you in ways that are less obvious. You build an intuition for the data and for the entities that individual records or observations represent. These can bias you towards specific techniques (for better or worse), but you can also be inspired.

For example, examining your data in detail may trigger ideas for specific techniques to investigate:

Data Cleaning. You may discover missing or corrupt data and think of various data cleaning operations to perform such as marking or removing bad data and imputing missing data.
Data Transforms. You may discover that some attributes have familiar distributions, such as Gaussian or exponential, giving you ideas for scaling, log or other transforms you could apply.
Data Modeling. You may notice properties of the data, such as distributions or data types, that suggest using (or avoiding) specific machine learning algorithms.

Use Descriptive Statistics

You need to look at your data. And you need to look at your data from different perspectives.

Inspecting your data will help you to build up your intuition and prompt you to start asking questions about the data that you have.

Looking at the data in more than one way will challenge your assumptions about it, helping you to ask more and better questions.

Two methods for looking at your data are:

Descriptive Statistics
Data Visualization

The first and best place to start is to calculate basic summary descriptive statistics on your data.

You need to learn the shape, size, type and general layout of the data that you have.

Let’s look at some ways that you can summarize your data using R.

Summarize Data in R With Descriptive Statistics

In this section, you will discover 8 quick and simple ways to summarize your dataset.

Each method is briefly described and includes a recipe in R that you can run yourself or copy and adapt to your own needs.

1. Peek At Your Data

The very first thing to do is to just look at some raw data from your dataset.

If your dataset is small you might be able to display it all on the screen. Often it is not, so you can take a small sample and review that.

# load the library
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# display first 20 rows of data
head(PimaIndiansDiabetes, n=20)

The head function will display the first 20 rows of data for you to review and think about.

   pregnant glucose pressure triceps insulin mass pedigree age diabetes
1         6     148       72      35       0 33.6    0.627  50      pos
2         1      85       66      29       0 26.6    0.351  31      neg
3         8     183       64       0       0 23.3    0.672  32      pos
4         1      89       66      23      94 28.1    0.167  21      neg
5         0     137       40      35     168 43.1    2.288  33      pos
6         5     116       74       0       0 25.6    0.201  30      neg
7         3      78       50      32      88 31.0    0.248  26      pos
8        10     115        0       0       0 35.3    0.134  29      neg
9         2     197       70      45     543 30.5    0.158  53      pos
10        8     125       96       0       0  0.0    0.232  54      pos
11        4     110       92       0       0 37.6    0.191  30      neg
12       10     168       74       0       0 38.0    0.537  34      pos
13       10     139       80       0       0 27.1    1.441  57      neg
14        1     189       60      23     846 30.1    0.398  59      pos
15        5     166       72      19     175 25.8    0.587  51      pos
16        7     100        0       0       0 30.0    0.484  32      pos
17        0     118       84      47     230 45.8    0.551  31      pos
18        7     107       74       0       0 29.6    0.254  31      pos
19        1     103       30      38      83 43.3    0.183  33      neg
20        1     115       70      30      96 34.6    0.529  32      pos
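
The first rows of a file are not always representative, so it can also help to review a random sample and the last few rows. The recipe below is a minimal sketch of that idea. It assumes the built-in iris dataset as a stand-in for your own data, and the seed value and sample size are arbitrary choices.

```r
# Assumption: the built-in iris data frame stands in for your own dataset
data(iris)
# fix the random seed so the same rows are drawn on every run
set.seed(7)
# draw 10 row indices at random, without replacement
idx <- sample(nrow(iris), 10)
# keep only the sampled rows
sampled <- iris[idx, ]
# display the random sample
print(sampled)
# display the last 5 rows of the dataset
tail(iris, n=5)
```

Changing the seed gives you a different sample to review.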

2. Dimensions of Your Data

How much data do you have? You may have a general idea, but it is much better to have a precise figure.

If you have a lot of instances, you may need to work with a smaller sample of the data so that model training and evaluation are computationally tractable. If you have a vast number of attributes, you may need to select those that are most relevant. If you have more attributes than instances, you may need to select specific modeling techniques.

# load the libraries
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# display the dimensions of the dataset
dim(PimaIndiansDiabetes)

This shows the rows and columns of your loaded dataset.

[1] 768   9
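
If you want the two numbers separately, for example to check whether there are more attributes than instances, a sketch like the following may help. It uses the built-in mtcars dataset as an assumed stand-in for your own data.

```r
# Assumption: the built-in mtcars data frame stands in for your own dataset
data(mtcars)
# number of instances (rows) and attributes (columns)
n_rows <- nrow(mtcars)
n_cols <- ncol(mtcars)
cat("instances:", n_rows, "attributes:", n_cols, "\n")
# flag the case where there are more attributes than instances
if (n_cols > n_rows) {
  cat("More attributes than instances, consider feature selection.\n")
} else {
  cat("More instances than attributes.\n")
}
```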

3. Data Types

You need to know the types of the attributes in your data.

This is invaluable. The types will indicate the types of further analysis, types of visualization and even the types of machine learning algorithms that you can use.

Additionally, perhaps some attributes were loaded as one type (e.g. integer) and could in fact be represented as another type (a categorical factor). Inspecting the types helps expose these issues and spark ideas early.

# load library
library(mlbench)
# load dataset
data(BostonHousing)
# list types for each attribute
sapply(BostonHousing, class)

This lists the data type of each attribute in your dataset.

     crim        zn     indus      chas       nox        rm       age       dis       rad       tax   ptratio         b 
"numeric" "numeric" "numeric"  "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
    lstat      medv 
"numeric" "numeric"
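
When an attribute was loaded as a number but is really categorical, you can convert it to a factor. The sketch below illustrates this on the built-in mtcars dataset, which is an assumption made for the example. In that dataset cyl and am are stored as numbers, and am is coded 0 for automatic and 1 for manual.

```r
# Assumption: the built-in mtcars data frame stands in for your own dataset
data(mtcars)
# work on a copy so the original dataset is left untouched
cars <- mtcars
# every attribute is loaded as numeric
print(sapply(cars, class))
# number of cylinders only takes a few distinct values, so treat it as a factor
cars$cyl <- as.factor(cars$cyl)
# transmission is coded 0/1, so give the levels readable labels
cars$am <- factor(cars$am, levels=c(0, 1), labels=c("automatic", "manual"))
# list the types again to confirm the change
print(sapply(cars, class))
```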

4. Class Distribution

In a classification problem, you must know the proportion of instances that belong to each class value.

This is important because it may highlight an imbalance in the data that, if severe, may need to be addressed with rebalancing techniques. In the case of a multi-class classification problem, it may expose classes with few or zero instances that may be candidates for removal from the dataset.

# load the libraries
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# distribution of class variable
y <- PimaIndiansDiabetes$diabetes
cbind(freq=table(y), percentage=prop.table(table(y))*100)

This recipe creates a useful table showing the number of instances that belong to each class as well as the percentage that this represents from the entire dataset.

    freq percentage
neg  500   65.10417
pos  268   34.89583
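
For the multi-class case, the same table can be used to find classes with zero instances. The recipe below is a sketch that assumes the built-in iris dataset and deliberately filters out one class to create an empty level.

```r
# Assumption: the built-in iris data frame stands in for your own dataset
data(iris)
# remove all setosa rows to simulate a class with no instances
subset_iris <- iris[iris$Species != "setosa", ]
# the factor still remembers the unused level, so it shows up with a count of 0
counts <- table(subset_iris$Species)
print(cbind(freq=counts, percentage=prop.table(counts)*100))
# names of the classes that have no instances
empty <- names(counts)[counts == 0]
print(empty)
# drop unused levels so the empty class no longer appears
subset_iris$Species <- droplevels(subset_iris$Species)
print(table(subset_iris$Species))
```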

5. Data Summary

There is a most valuable function called summary() that summarizes each attribute in your dataset in turn.

The function creates a table for each attribute and lists a breakdown of values. Factors are described as counts next to each class label. Numerical attributes are described as:

Min
25th percentile
Median
Mean
75th percentile
Max

The breakdown also includes an indication of the number of missing values for an attribute (marked NA's).

# load the iris dataset
data(iris)
# summarize the dataset
summary(iris)

You can see that this recipe produces a lot of information for you to review. Take your time and work through each attribute in turn.

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :50  
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:50  
 Median :5.800   Median :3.000   Median :4.350   Median :1.300   virginica :50  
 Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199                  
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500
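
The iris dataset has no missing values, so the missing value count does not appear in the output above. The sketch below uses the built-in airquality dataset, chosen here only because it does contain missing values, to show the extra entry in the summary and a direct count per attribute.

```r
# Assumption: the built-in airquality data frame is used because it has missing values
data(airquality)
# attributes with missing values get an extra NA's entry in the summary
print(summary(airquality))
# count the missing values in each attribute directly
na_counts <- colSums(is.na(airquality))
print(na_counts)
```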

6. Standard Deviations

One thing missing from the summary() function above is the standard deviation.

The standard deviation along with the mean are useful to know if the data has a Gaussian (or nearly Gaussian) distribution. For example, it can be useful as a quick and dirty outlier removal tool, where any values that are more than three standard deviations from the mean fall outside of 99.7% of the data.

# load the libraries
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# calculate standard deviation for all attributes
sapply(PimaIndiansDiabetes[,1:8], sd)

This calculates the standard deviation for each numeric attribute in the dataset.

   pregnant     glucose    pressure     triceps     insulin        mass    pedigree         age 
  3.3695781  31.9726182  19.3558072  15.9522176 115.2440024   7.8841603   0.3313286  11.7602315
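
The three standard deviation rule of thumb mentioned above can be sketched in a few lines. This example assumes the built-in iris dataset and a single attribute, and it only flags values rather than removing them.

```r
# Assumption: the built-in iris data frame stands in for your own dataset
data(iris)
x <- iris$Sepal.Width
# mean and standard deviation of the attribute
m <- mean(x)
s <- sd(x)
# TRUE for values more than three standard deviations from the mean
is_outlier <- abs(x - m) > 3 * s
# number of flagged values and the values themselves
print(sum(is_outlier))
print(x[is_outlier])
```

Keep in mind that the rule assumes a roughly Gaussian attribute, so review the flagged values before deleting anything.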

7. Skewness

If a distribution looks roughly Gaussian but is pushed far to the left or right, it is useful to know the skew.

Getting a feeling for the skew is much easier with plots of the data, such as a histogram or density plot. It is harder to tell from looking at means, standard deviations and quartiles.

Nevertheless, calculating the skew up front gives you a reference that you can use later if you decide to correct the skew for an attribute.

# load libraries
library(mlbench)
library(e1071)
# load the dataset
data(PimaIndiansDiabetes)
# calculate skewness for each variable
skew <- apply(PimaIndiansDiabetes[,1:8], 2, skewness)
# display skewness, larger/smaller deviations from 0 show more skew
print(skew)

The further the skew value is from zero, the larger the skew to the left (negative skew value) or right (positive skew value).

  pregnant    glucose   pressure    triceps    insulin       mass   pedigree        age 
 0.8981549  0.1730754 -1.8364126  0.1089456  2.2633826 -0.4273073  1.9124179  1.1251880
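
To see how a skew value can serve as a before and after reference when correcting skew, here is a small sketch on simulated data. The moment_skewness() helper is hypothetical and written for this example. It uses the simple moment-based definition, so its values may differ slightly from those of skewness() in e1071. The simulated log-normal data and the log transform are also assumptions made for illustration.

```r
# Hypothetical helper: moment-based skewness (third central moment over variance^1.5)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Assumption: simulated right-skewed data stands in for a real attribute
set.seed(7)
x <- rlnorm(1000)
# skew before the transform, expected to be clearly positive
skew_raw <- moment_skewness(x)
# skew after a log transform, expected to be much closer to zero
skew_log <- moment_skewness(log(x))
print(c(raw=skew_raw, log=skew_log))
```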

8. Correlations

It is important to observe and think about how attributes relate to each other.

For numeric attributes, an excellent way to think about attribute-to-attribute interactions is to calculate correlations for each pair of attributes.

# load the libraries
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# calculate a correlation matrix for numeric variables
correlations <- cor(PimaIndiansDiabetes[,1:8])
# display the correlation matrix
print(correlations)

This creates a symmetrical table of all pairs of attribute correlations for numerical data. Deviations from zero show more positive or negative correlation. Values above 0.75 or below -0.75 are perhaps more interesting as they show a high correlation. Values of 1 and -1 show full positive or negative correlation.

            pregnant    glucose   pressure     triceps     insulin       mass    pedigree         age
pregnant  1.00000000 0.12945867 0.14128198 -0.08167177 -0.07353461 0.01768309 -0.03352267  0.54434123
glucose   0.12945867 1.00000000 0.15258959  0.05732789  0.33135711 0.22107107  0.13733730  0.26351432
pressure  0.14128198 0.15258959 1.00000000  0.20737054  0.08893338 0.28180529  0.04126495  0.23952795
triceps  -0.08167177 0.05732789 0.20737054  1.00000000  0.43678257 0.39257320  0.18392757 -0.11397026
insulin  -0.07353461 0.33135711 0.08893338  0.43678257  1.00000000 0.19785906  0.18507093 -0.04216295
mass      0.01768309 0.22107107 0.28180529  0.39257320  0.19785906 1.00000000  0.14064695  0.03624187
pedigree -0.03352267 0.13733730 0.04126495  0.18392757  0.18507093 0.14064695  1.00000000  0.03356131
age       0.54434123 0.26351432 0.23952795 -0.11397026 -0.04216295 0.03624187  0.03356131  1.00000000
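
With more than a handful of attributes, scanning the matrix by eye becomes tedious. The sketch below lists only the pairs whose correlation is above 0.75 or below -0.75. It assumes the built-in mtcars dataset, and the threshold is the rule of thumb from the text rather than a fixed standard.

```r
# Assumption: the built-in mtcars data frame stands in for your own dataset
data(mtcars)
correlations <- cor(mtcars)
# blank out the diagonal and lower triangle so each pair is listed once
correlations[lower.tri(correlations, diag=TRUE)] <- NA
# row and column positions of the strongly correlated pairs
high <- which(abs(correlations) > 0.75, arr.ind=TRUE)
# build a readable table of attribute pairs and their correlation
high_pairs <- data.frame(
  attr1=rownames(correlations)[high[, "row"]],
  attr2=colnames(correlations)[high[, "col"]],
  r=correlations[high]
)
# show the strongest relationships first
print(high_pairs[order(-abs(high_pairs$r)), ])
```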

More Recipes

This list of data summarization methods is by no means complete, but it is enough to quickly give you a strong initial understanding of your dataset.

Some data summarization that you could investigate beyond the list of recipes above would be to look at statistics for subsets of your data. Consider looking into the aggregate() function in R.
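
As a starting point, here is a minimal sketch of aggregate() computing statistics per subset. It assumes the built-in iris dataset, with Species as the grouping attribute.

```r
# Assumption: the built-in iris data frame stands in for your own dataset
data(iris)
# mean of every numeric attribute, calculated separately for each species
group_means <- aggregate(. ~ Species, data=iris, FUN=mean)
print(group_means)
# standard deviation of a single attribute for each species
group_sd <- aggregate(Sepal.Length ~ Species, data=iris, FUN=sd)
print(group_sd)
```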

Is there a data summarization recipe that you use that was not listed? Leave a comment below, I’d love to hear about it.

Tips To Remember

This section gives you some tips to remember when reviewing your data using summary statistics.

Review the numbers. Generating the summary statistics is not enough. Take a moment to pause, read and really think about the numbers you are seeing.
Ask why. Review your numbers and ask a lot of questions. How and why are you seeing specific numbers? Think about how the numbers relate to the problem domain in general and specific entities that observations relate to.
Write down ideas. Write down your observations and ideas. Keep a small text file or notepad and jot down all of the ideas for how variables may relate, for what numbers mean, and ideas for techniques to try later. The things you write down now while the data is fresh will be very valuable later when you are trying to think up new things to try.

You Can Summarize Your Data in R

You do not need to be an R programmer. Data summarization in R is very simple, as the recipes above can attest. If you are just getting started, you can copy and paste the recipes above and start learning how they work using the built-in help in R (for example: ?FunctionName).

You do not need to be good at statistics. The statistics used in this post are very simple, but you may have forgotten some of the basics. You can quickly browse Wikipedia for topics like Mean, Standard Deviation and Quartiles to refresh your knowledge.

For a related post, see: Crash Course in Statistics for Machine Learning.

You do not need your own datasets. Each example above uses a built-in dataset or a dataset provided by an R package. There are many interesting datasets in the datasets R package that you can investigate and play with. See the documentation for the datasets R package for more information.
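
One way to browse what is available is to ask R for the list of datasets in a package. This is a small sketch using the data() function with its package argument.

```r
# list the datasets that ship with the datasets package
available <- data(package="datasets")$results
# show the name and description of the first 10 datasets
print(head(available[, c("Item", "Title")], n=10))
# total number of datasets listed
print(nrow(available))
```

You can then load any of them by name, for example data(iris), and read about it with ?iris.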

Summary

In this post, you discovered the importance of describing your dataset before you start work on your machine learning project.

You discovered 8 different ways to summarize your dataset using R:

Peek At Your Data
Dimensions of Your Data
Data Types
Class Distribution
Data Summary
Standard Deviations
Skewness
Correlations

You also now have recipes that you can copy and paste into your project.

Action Step

Do you want to improve your skills using R or practicing machine learning in R?

Work through each example above.

Open the R interactive environment.
Type or copy-paste each recipe and understand how it works.
Dive deeper and use ?FunctionName to learn more about the specific functions used.

Report back and leave a comment, I’d love to hear how you went.

Do you have a question? Leave a comment and ask.

33 Responses to Better Understand Your Data in R Using Descriptive Statistics

Nader September 2, 2016 at 1:02 am #

Thank you for this Fantastic Blog !!!

- Jason Brownlee September 2, 2016 at 8:08 am #
  
  I’m glad you’re finding it useful Nader.
  
Marshall A Dyson September 2, 2016 at 1:46 am #

Nice article, I liked it.
The line in section 5 between the boxes that states “for your to review” should probably be “for you to review”.
First sentence in section 2 should probably end with a question mark. The same goes for the second sentence in the “Tips To Remember” section under the “Ask why” point.

- Marshall A Dyson September 2, 2016 at 1:49 am #
  
  Sorry, not section 5. Section 1.
  
- Jason Brownlee September 2, 2016 at 8:10 am #
  
  Thanks Marshall, fixed.
  
Ganesh September 15, 2016 at 3:43 pm #

Awesome Jason! This will help me to get my hands dirty in R!

- Jason Brownlee September 16, 2016 at 9:00 am #
  
  I’m glad to hear it Ganesh.
  
Bao December 17, 2016 at 5:21 am #

Thanks for the guidance. Very useful for beginner!

- Jason Brownlee December 17, 2016 at 11:15 am #
  
  I’m glad to hear that Bao.
  
bharat ram March 11, 2017 at 6:14 am #

Thanks a ton Jason. I have been looking for some structured way of doing descriptive analysis for quite a long time. Your article is surely a great answer to many of my questions.

Awesome collation and guide. Appreciate your help..

- Jason Brownlee March 11, 2017 at 8:02 am #
  
  Glad to hear it!
  
Alessandro Fortunato October 18, 2017 at 10:24 pm #

Which simple statistics do you recommend for a total random distribution?
I have the following problem, because I have 130.000 systems and every month I count the number of total errors for each of them and up to now I could not find any distribution that fits the data.
My interest is not on the monthly total amount but on the behavior of each of the machine.
Machine1, januaryerrors,februaryerrors, marcherrors,…..

- Dan Gustafsson November 27, 2020 at 7:46 am #
  
  What an interesting set-up! What is it that you want to learn about your 130 000 systems? Reduce the errors, or predict something??
  
Andrey February 21, 2018 at 1:59 pm #

Good note, but I would add visualization for clarity (histogram, boxplot and Q-Q plot)

- Jason Brownlee February 22, 2018 at 11:15 am #
  
  Nice.
  
Sima June 9, 2018 at 7:32 pm #

very useful thanks a lot

- Jason Brownlee June 10, 2018 at 5:59 am #
  
  I’m glad to hear that.
  
MIsab July 22, 2018 at 10:41 pm #

Very useful, as a beginner i learned a lot. thank u very much.

skewness function (skew <- apply(PimaIndiansDiabetes[,1:8], 2, skewness)) was not working for me. any clarification?

- Jason Brownlee July 23, 2018 at 6:10 am #
  
  Sorry to hear that, I’m not sure. Perhaps try posting on stackoverflow?
  
Toinét Cronjé August 2, 2018 at 9:30 pm #

Such a useful guide. Thank you very much!!!

- Jason Brownlee August 3, 2018 at 6:02 am #
  
  I’m glad to hear that.
  
Elena June 20, 2019 at 10:33 am #

If our data are from a clinical trial and there are a lot of missing values , this way of summary(my_data) works?

What about the missing values?

.Thank you very much.

- Jason Brownlee June 20, 2019 at 1:58 pm #
  
  I have some advice here that might help:
  https://machinelearningmastery.com/faq/single-faq/how-do-i-handle-missing-data
  
Anthony The Koala October 2, 2019 at 11:12 pm #

Dear Dr Jason,
I was able to get the simultaneous boxplots for the Pima Indians database

boxplot(PimaIndiansDiabetes[,1:8])

BUT
When I tried to get the simultaneous histograms, it was a different story:

> hist(PimaIndianDiabetes[,1:8])
Error in hist(PimaIndianDiabetes[, 1:8]) :
object 'PimaIndianDiabetes' not found

Do you have an idea please?

Thank you,
Anthony of Sydney

- Jason Brownlee October 3, 2019 at 6:50 am #
  
  Enumerate each column and create a histogram plot for each.
  
- Kim January 24, 2021 at 9:14 am #
  
  You have a typo in your dataset name – Indians – missing s
  

Anthony The Koala October 3, 2019 at 2:34 pm #

Dear Dr Jason,
Thank you,

> doo1  par(mfrow=c(2,4))
> hist(PimaIndiansDiabetes[,1])
> hist(PimaIndiansDiabetes[,2])
> hist(PimaIndiansDiabetes[,3])
> hist(PimaIndiansDiabetes[,4])
> hist(PimaIndiansDiabetes[,5])
> hist(PimaIndiansDiabetes[,6])
> hist(PimaIndiansDiabetes[,7])
> hist(PimaIndiansDiabetes[,8])

You get all 8 histograms on one plot.

It shows that in R you can plot a group of boxplots in one line, BUT cannot plot a group of histograms in one plot.

Jason Brownlee October 4, 2019 at 5:38 am #

I’m sure you can, there are thousands of packages out there.

Also, a loop would do the same thing in 2 lines.


Anthony The Koala October 3, 2019 at 2:36 pm #

Dear Dr Jason,
Apologies, the 1st line doo1 should not be there. Don’t know how it got there.
The first line should be

par(mfrow=c(2,4))

Thank you,
Anthony of Sydney

- Jason Brownlee October 4, 2019 at 5:38 am #
  
  Yes.
  
Anthony The Koala October 4, 2019 at 4:28 am #

Dear Dr Jason,
A more elegant solution, though it could be more elegant with labels.

x <- c(1,2,3,4,5,6,7,8)
par(mfrow=c(2,4))
for (i in x){
  hist(PimaIndiansDiabetes[,i])
}

But still for multiple boxplots you can use one line, BUT not for a group of histograms.

Thank you,
Anthony of Sydney

Chinasa Okonkwo September 27, 2022 at 10:37 pm #

Thanks Jason.

This is quite helpful.

- James Carmichael September 28, 2022 at 6:44 am #
  
  You are very welcome Chinasa! We appreciate your support!
  
