8 Essential Python Tips For Descriptive Data Analysis

Ugur Comlekcioglu (PhD)
May 12, 2022

Today, as technology advances, data is collected from nearly every part of our lives. This data has to be analyzed before it yields insights or reveals trends, which creates strong demand for data scientists who can extract the information it holds. That’s why data science has become such a popular term. Data scientists explore data extensively and make sure it is carefully processed and reviewed.

Raw data is often not suitable for analysis and interpretation as it arrives. It first has to be transformed into a form that is easy to understand and interpret, i.e. reorganized, sorted and processed so that it yields useful information. This blog post shares code you can use to perform a descriptive analysis of a dataset with Python.

Descriptive Analysis

When you first encounter a dataset, it is a good idea to start with a descriptive analysis. Descriptive analysis is one of the essential steps in statistical data analysis. It gives you an overview of the distribution of your data, helps you detect typos and outliers, and lets you identify similarities between variables, preparing you for further statistical analysis.
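As a minimal sketch of the kind of first look described above, the snippet below builds a small made-up DataFrame and runs two quick checks. The values, the misspelled label "tr-1" and the missing entry are invented for illustration; only the column names Treatment and Result match the dataset used in this post.

```python
import pandas as pd

# Hypothetical toy data; not the dataset used in this post
toy = pd.DataFrame({
    "Treatment": ["Tr-1", "Tr-1", "tr-1", "Tr-2", "Tr-2", "Tr-2"],
    "Result": [0.04, 0.05, 0.03, 0.01, None, 0.02],
})

# Count rows per label: a misspelled label shows up as its own small group
label_counts = toy["Treatment"].value_counts()
print(label_counts)

# Count missing values per column
missing_counts = toy.isna().sum()
print(missing_counts)
```

A label that appears only once or twice next to a similar, much larger group is usually worth a second look before any statistics are computed.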

First, let’s read the data from an Excel file:

Code:

import pandas as pd
df = pd.read_excel("C:/Data_1.xlsx")
df.head(5)

Output:

1- Summarize a specific column

Code:

df.Result.describe()

Output:

count    160.000000
mean 0.035109
std 0.038749
min -0.009554
25% 0.009352
50% 0.025866
75% 0.040660
max 0.197265

2- Summarize per group in a specific column

Code:

df.groupby("Treatment").Result.describe()

Output:

3- Mean per group in a specific column

Code:

df.groupby("Treatment").Result.mean()

Output:

Treatment
Tr-1 0.044938
Tr-2 0.013218
Tr-3 0.049256
Tr-4 0.033024

If you don’t have groups, you can simply code as follows:

Code:

mean = df.Result.mean()
print('Mean: %.3f' % mean)

Output:

Mean: 0.035

4- Median per group in a specific column

Code:

df.groupby("Treatment").Result.median()

Output:

Treatment
Tr-1 0.030496
Tr-2 0.008880
Tr-3 0.034329
Tr-4 0.026141

If the median is close to the mean, the distribution is roughly symmetric, which is consistent with (but does not prove) a Gaussian distribution. If the data are skewed or otherwise non-Gaussian, the median may be very different from the mean and is often a better reflection of the central tendency of the underlying population.
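The comparison between mean and median can be sketched as follows. The values are hypothetical and deliberately include one large observation; the skewness check with the pandas skew() method is an addition for illustration and is not part of the original tips.

```python
import pandas as pd

# Hypothetical values, chosen to be right-skewed (one large value)
values = pd.Series([0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.20])

mean_value = values.mean()
median_value = values.median()
skewness = values.skew()  # sample skewness; close to 0 for symmetric data

print('Mean: %.3f' % mean_value)
print('Median: %.3f' % median_value)
print('Skewness: %.3f' % skewness)
```

Here the single large value pulls the mean above the median, and the positive skewness points the same way.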

5- Variance per group in a specific column

Code:

df.groupby("Treatment").Result.var()

Output:

Treatment
Tr-1 0.002645
Tr-2 0.000257
Tr-3 0.001346
Tr-4 0.001073

The variance describes how far the values of a distribution are from its mean. It is the mean of the squared deviations from the mean, which is equivalent to the mean of the squares minus the square of the mean. Note that the pandas var() method computes the sample variance by default, dividing by n - 1 rather than n. With a low variance the values are grouped closely around the mean, whereas with a high variance they are spread out from it.
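To make the definition concrete, here is a small sketch on hypothetical numbers. It computes the population variance two ways (mean of squared deviations, and mean of squares minus squared mean) and compares them with what pandas returns for ddof=0 and for its default ddof=1.

```python
import pandas as pd

# Hypothetical values; not the dataset used in this post
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Population variance: mean of squared deviations from the mean
pop_var = ((values - values.mean()) ** 2).mean()

# Equivalent shortcut: mean of squares minus squared mean
shortcut = (values ** 2).mean() - values.mean() ** 2

# pandas divides by n - 1 by default (sample variance)
sample_var = values.var()
# ddof=0 divides by n and matches the population formula above
pop_var_pandas = values.var(ddof=0)

print('Population variance: %.3f' % pop_var)
print('Shortcut formula: %.3f' % shortcut)
print('Sample variance (pandas default): %.3f' % sample_var)
```

The sample variance is always slightly larger than the population variance for the same numbers, and the gap shrinks as the number of observations grows.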

6- Standard deviation per group in a specific column

Code:

df.groupby("Treatment").Result.std()

Output:

Treatment
Tr-1 0.051429
Tr-2 0.016018
Tr-3 0.036690
Tr-4 0.032762

Standard deviation is a measure of variability: it is the square root of the variance and is expressed in the same units as the data. It describes how much the data vary around the arithmetic mean, so a small standard deviation indicates that the values are homogeneous and lie close to the mean. The standard deviation and the mean are the two parameters required to specify any Gaussian distribution.
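Tips 3 to 6 can also be combined into one call. The sketch below uses the pandas agg() method on a hypothetical toy DataFrame (the numbers are made up; only the column names follow this post) to get the mean, median, variance and standard deviation per group in a single table.

```python
import pandas as pd

# Hypothetical toy data; not the dataset used in this post
toy = pd.DataFrame({
    "Treatment": ["Tr-1"] * 3 + ["Tr-2"] * 3,
    "Result": [0.02, 0.04, 0.06, 0.01, 0.01, 0.04],
})

# One row per treatment, one column per statistic
summary = toy.groupby("Treatment")["Result"].agg(["mean", "median", "var", "std"])
print(summary)
```

In the resulting table, the std column is the square root of the var column for every group.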

7- Five-Number Summary

Code:

from numpy import percentile
quarters = percentile(df["Result"], [25, 50, 75])
min_data, max_data = df["Result"].min(), df["Result"].max()
print('- Min: %.5f' % min_data)
print('- Q1: %.5f' % quarters[0])
print('- Median: %.5f' % quarters[1])
print('- Q3: %.5f' % quarters[2])
print('- Max: %.5f' % max_data)

Output:

- Min: -0.00955
- Q1: 0.00935
- Median: 0.02587
- Q3: 0.04066
- Max: 0.19727

The five-number summary is a non-parametric data summary technique. It consists of five summary statistics:

1- Median: The middle value in the sample, also called the 50th percentile or the 2nd quartile.

2- 1st Quartile: The 25th percentile.

3- 3rd Quartile: The 75th percentile.

4- Minimum: The smallest observation in the sample.

5- Maximum: The largest observation in the sample.
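The same five numbers can also be obtained per group. As a sketch on hypothetical toy data, the snippet below reuses the describe() call from tip 2 and keeps only the five relevant columns; the renamed column labels are chosen here for readability.

```python
import pandas as pd

# Hypothetical toy data; not the dataset used in this post
toy = pd.DataFrame({
    "Treatment": ["Tr-1"] * 5 + ["Tr-2"] * 5,
    "Result": [1.0, 2.0, 3.0, 4.0, 5.0, 10.0, 20.0, 30.0, 40.0, 50.0],
})

# describe() already contains the five numbers; keep only those columns
described = toy.groupby("Treatment")["Result"].describe()
five_num = described[["min", "25%", "50%", "75%", "max"]].copy()
five_num.columns = ["Min", "Q1", "Median", "Q3", "Max"]
print(five_num)
```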

8- Descriptive statistics using Researchpy

Code:

import researchpy as rp
rp.summary_cont(df["Result"])

Output:

This function returns a compact table of summary statistics for the column: the variable name, the number of observations, the mean, the standard deviation, the standard error and the confidence interval for the mean.
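For readers who want to see where the standard error and confidence interval come from, here is a rough sketch using only pandas and the standard library. The values are hypothetical, and the interval uses the normal approximation (z = 1.96), which is an assumption made here for simplicity; Researchpy may compute its interval differently, so the numbers need not match its output exactly.

```python
import math
import pandas as pd

# Hypothetical values; not the dataset used in this post
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = values.count()
mean_value = values.mean()
sd = values.std()        # sample standard deviation (ddof=1)
se = sd / math.sqrt(n)   # standard error of the mean, same as values.sem()

# Approximate 95% interval, assuming a normal approximation (z = 1.96)
ci_low = mean_value - 1.96 * se
ci_high = mean_value + 1.96 * se

print('N: %d' % n)
print('Mean: %.3f, SD: %.3f, SE: %.3f' % (mean_value, sd, se))
print('Approx. 95%% CI: %.3f to %.3f' % (ci_low, ci_high))
```

For small samples, an interval based on the t distribution would be somewhat wider than this approximation.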

While it may seem like a trivial step, descriptive analysis is part of any study design. Before a data scientist can perform multivariate linear regression analysis or establish confidence intervals with estimators, they need to know what their data contain. That’s why descriptive analysis is such a useful step in data analysis.

Follow me here to be updated on new Python data analysis content.
