Basic Statistical Analysis with NumPy

By Jayita Gulati on August 23, 2024 in Statistics

Introduction

Statistical analysis is central to data science because it summarizes a dataset and reveals its patterns. NumPy is a Python library for numerical computing that provides fast array operations and mathematical functions, which makes these calculations simpler to write and quicker to run. In this article, we will explore several NumPy functions for basic statistical analysis.

To get started, import NumPy:

import numpy as np

By convention, we use np as an alias for NumPy. This makes it easier to call its functions.

Let’s now look at the key NumPy functions for basic statistical analysis.

Mean

The mean is a measure of central tendency. It is the sum of all values divided by the number of values. We use the mean() function to calculate the mean.

Syntax: np.mean(data)

# Sample data
data = np.array([1, 2, 3, 4, 5])

# Calculate the mean
mean = np.mean(data)

# Print the result
print(f"Mean: {mean}")

# Mean: 3.0
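
The same function also works on multi-dimensional arrays. The sketch below is illustrative only: the matrix and its variable names are assumptions, not data from this article. It uses the axis parameter of np.mean() to average down the columns or across the rows.

```python
import numpy as np

# Hypothetical 2-D data: 2 rows by 3 columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# Mean of all six values
overall_mean = np.mean(matrix)

# axis=0 averages down each column, axis=1 averages across each row
column_means = np.mean(matrix, axis=0)
row_means = np.mean(matrix, axis=1)

print(f"Overall mean: {overall_mean}")
print(f"Column means: {column_means}")
print(f"Row means: {row_means}")
```

Without axis, the array is flattened first, so the result is a single number.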

Average

The term average is often used interchangeably with mean. Called without weights, the average() function returns the same result as mean(): the sum of all values divided by the number of values. Its advantage is the optional weights parameter, which computes a weighted average in which each value contributes in proportion to its weight.

Syntax: np.average(data), np.average(data, weights=weights)

# Sample data
data = np.array([1, 2, 3, 4, 5])
weights = np.array([1, 2, 3, 4, 5])

# Calculate the average
average = np.average(data)

# Calculate the weighted average
weighted_average = np.average(data, weights=weights)

# Print the results
print(f"Average: {average}")
print(f"Weighted Average: {weighted_average}")

# Average: 3.0
# Weighted Average: 3.6666666666666665
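
To see where the weighted result comes from, the following sketch recomputes it by hand as the sum of each value times its weight, divided by the sum of the weights. The variable names manual_weighted_average and numpy_weighted_average are chosen here for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
weights = np.array([1, 2, 3, 4, 5])

# Weighted average by hand: sum(w_i * x_i) / sum(w_i) = 55 / 15
manual_weighted_average = np.sum(data * weights) / np.sum(weights)

# The same quantity computed by NumPy
numpy_weighted_average = np.average(data, weights=weights)

print(f"Manual:  {manual_weighted_average}")
print(f"NumPy:   {numpy_weighted_average}")
```

Because larger values carry larger weights in this example, the weighted average sits above the plain mean of 3.0.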

Median

The median is the middle value of an ordered dataset. When the dataset has an odd number of values, the median is the single middle value. When it has an even number of values, the median is the average of the two middle values. We use the median() function to calculate the median.

Syntax: np.median(data)

# Sample data
data = np.array([1, 2, 3, 4, 5])

# Calculate the median
median = np.median(data)

# Print the result
print(f"Median: {median}")

# Median: 3.0
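
The example above covers the odd-length case only. As a small sketch with made-up arrays, the code below shows the even-length case and confirms that the input does not need to be sorted beforehand.

```python
import numpy as np

# Even number of values: the median is the average of the two middle values (2 and 3)
even_data = np.array([1, 2, 3, 4])
even_median = np.median(even_data)

# Unsorted input: np.median() orders the values internally
unsorted_data = np.array([5, 1, 4, 2, 3])
unsorted_median = np.median(unsorted_data)

print(f"Median of even-length data: {even_median}")
print(f"Median of unsorted data: {unsorted_median}")
```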

Variance

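Variance measures how spread out the numbers are around the mean. It is the average of the squared differences between each value and the mean, so a higher variance means more spread. We use the var() function to calculate the variance. By default, var() computes the population variance, dividing by the number of values N; passing ddof=1 divides by N - 1 instead and gives the sample variance.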

Syntax: np.var(data)

# Sample data
data = np.array([1, 2, 3, 4, 5])

# Calculate the variance
variance = np.var(data)

# Print the result
print(f"Variance: {variance}")

# Variance: 2.0
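
The result of 2.0 can be checked by hand. The sketch below computes the squared deviations from the mean directly and also shows the ddof parameter of np.var(), which switches from dividing by N to dividing by N - 1. The variable names are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Variance by hand: the mean of the squared deviations from the mean
deviations = data - np.mean(data)
manual_variance = np.mean(deviations ** 2)

# Default (ddof=0): population variance, divides by N
population_variance = np.var(data)

# ddof=1: sample variance, divides by N - 1
sample_variance = np.var(data, ddof=1)

print(f"Manual variance: {manual_variance}")
print(f"Population variance: {population_variance}")
print(f"Sample variance: {sample_variance}")
```

The sample variance is the usual choice when the data is a sample drawn from a larger population.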

Standard Deviation

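Standard deviation shows how much the numbers vary from the mean. It is the square root of the variance, so a higher standard deviation means more spread. It is easier to interpret than variance because it is expressed in the same units as the data. We use the std() function to calculate the standard deviation. Like var(), std() computes the population value by default and accepts ddof=1 for the sample standard deviation.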

Syntax: np.std(data)

# Sample data
data = np.array([1, 2, 3, 4, 5])

# Calculate the standard deviation
std_dev = np.std(data)

# Print the result
print(f"Standard Deviation: {std_dev}")

# Standard Deviation: 1.4142135623730951
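
As a quick check on the relationship between the two measures, this sketch takes the square root of the variance and compares it with np.std(). It also shows the ddof=1 variant; the variable names are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Standard deviation is the square root of the variance
std_from_variance = np.sqrt(np.var(data))
std_dev = np.std(data)

# Sample standard deviation (divides by N - 1 before taking the root)
sample_std_dev = np.std(data, ddof=1)

print(f"sqrt(variance): {std_from_variance}")
print(f"np.std: {std_dev}")
print(f"Sample standard deviation: {sample_std_dev}")
```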

Minimum and Maximum

The minimum and maximum functions help identify the smallest and largest values in a dataset, respectively. We use the min() and max() functions to calculate these values.

Syntax: np.min(data), np.max(data)

# Sample data
data = np.array([1, 2, 3, 4, 5])

# Calculate the minimum and maximum
minimum = np.min(data)
maximum = np.max(data)

# Print the results
print(f"Minimum: {minimum}")
print(f"Maximum: {maximum}")

# Minimum: 1
# Maximum: 5
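
Sometimes the position of the extreme value matters as much as the value itself. The sketch below goes slightly beyond the functions covered in this article: it uses np.argmin() and np.argmax(), which return the index of the smallest and largest value. The readings array is made up for illustration.

```python
import numpy as np

# Hypothetical unsorted readings
readings = np.array([12, 7, 19, 3, 15])

# Smallest and largest values
minimum = np.min(readings)
maximum = np.max(readings)

# Positions (zero-based indices) of those values
min_index = np.argmin(readings)
max_index = np.argmax(readings)

print(f"Minimum {minimum} at index {min_index}")
print(f"Maximum {maximum} at index {max_index}")
```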

Percentiles

Percentiles show where a value stands within a dataset. For example, the 25th percentile is the value below which 25% of the data falls. Percentiles help us understand the distribution of the data. We use the percentile() function to calculate percentiles. When a percentile falls between two data points, NumPy interpolates linearly between them by default.

Syntax: np.percentile(data, percentile_value)

# Sample data
data = np.array([1, 2, 3, 4, 5])

# Calculate the 25th and 75th percentiles
percentiles = np.percentile(data, [25, 75])

# Print the results
print(f"25th Percentile: {percentiles[0]}")
print(f"75th Percentile: {percentiles[1]}")

# 25th Percentile: 2.0
# 75th Percentile: 4.0
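
Two common uses of percentiles can be sketched with the same data: the 50th percentile equals the median, and the difference between the 75th and 25th percentiles gives the interquartile range (IQR). The names q1, q2, q3, and iqr are chosen here for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Quartiles: the 25th, 50th, and 75th percentiles
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# Interquartile range: the spread of the middle 50% of the data
iqr = q3 - q1

print(f"Q1: {q1}, Q2: {q2}, Q3: {q3}")
print(f"IQR: {iqr}")
print(f"Median matches 50th percentile: {q2 == np.median(data)}")
```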

Correlation Coefficient

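The correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1. A value of 1 means a perfect positive linear relationship, a value of -1 means a perfect negative linear relationship, and a value of 0 means no linear relationship. We use the corrcoef() function to calculate it. The function returns a correlation matrix, and the coefficient between the two inputs is the off-diagonal entry at index [0, 1].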

Syntax: correlation_matrix = np.corrcoef(data1, data2), correlation_coefficient = correlation_matrix[0, 1]

# Sample data
data1 = np.array([1, 2, 3, 4, 5])
data2 = np.array([5, 4, 3, 2, 1])

# Calculate the correlation coefficient matrix
correlation_matrix = np.corrcoef(data1, data2)

# Extract the correlation coefficient between data1 and data2
correlation_coefficient = correlation_matrix[0, 1]
print(f"Correlation Coefficient: {correlation_coefficient}")

# Correlation Coefficient: -1.0
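
For intuition, the coefficient can also be computed by hand. The sketch below assumes the standard Pearson formula: the sum of the products of the deviations, divided by the square root of the product of the summed squared deviations. It then compares the result with np.corrcoef(); the variable names are illustrative.

```python
import numpy as np

data1 = np.array([1, 2, 3, 4, 5])
data2 = np.array([5, 4, 3, 2, 1])

# Deviations of each variable from its own mean
dev1 = data1 - np.mean(data1)
dev2 = data2 - np.mean(data2)

# Pearson correlation: sum(dev1 * dev2) / sqrt(sum(dev1^2) * sum(dev2^2))
manual_r = np.sum(dev1 * dev2) / np.sqrt(np.sum(dev1 ** 2) * np.sum(dev2 ** 2))

# NumPy's result for comparison
numpy_r = np.corrcoef(data1, data2)[0, 1]

print(f"Manual r: {manual_r}")
print(f"NumPy r: {numpy_r}")
```

Since data2 decreases by exactly one for every unit increase in data1, the relationship is perfectly negative.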

Range (Peak-to-Peak)

Range (Peak-to-Peak) measures the spread of data. It is the difference between the highest and lowest values, which shows how widely the data is spread. We use the ptp() function from NumPy to calculate the range.

Syntax: data_range = np.ptp(data)

# Sample data
data = np.array([1, 2, 3, 4, 5])

# Calculate the range (named data_range to avoid shadowing Python's built-in range)
data_range = np.ptp(data)

# Print the result
print(f"Range: {data_range}")

# Range: 4
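
Since the range is defined as the highest value minus the lowest, the result of np.ptp() should match np.max() minus np.min(). The following sketch checks this on a made-up, unsorted array.

```python
import numpy as np

# Hypothetical unsorted readings
readings = np.array([8, 3, 11, 6])

# Peak-to-peak range from NumPy
data_range = np.ptp(readings)

# The same quantity by hand: maximum minus minimum (11 - 3)
manual_range = np.max(readings) - np.min(readings)

print(f"np.ptp: {data_range}")
print(f"max - min: {manual_range}")
```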

Conclusion

NumPy covers the essentials of basic statistical analysis: measures of central tendency, spread, position, and correlation. For more advanced statistics, libraries such as SciPy can be used. Knowing these basics gives you a solid foundation for data analysis.
