How to Calculate Descriptive Statistics in Python with Pandas DataFrame

Renesh Bedre    3 minute read

The purpose of descriptive statistics is to summarize the characteristics of a dataset in a meaningful way, without drawing inferences beyond the data itself.

Descriptive statistics summarize the central tendency (mean, median, mode), the spread (range, standard deviation, and variance), the shape, and the frequency of the data.
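Each of the measures named above has a direct counterpart in pandas. The following is a minimal sketch on a small hypothetical series (the values are made up for illustration and are not taken from any dataset in this article):

```python
import pandas as pd

# Hypothetical values, chosen only for illustration
ratings = pd.Series([2, 3, 4, 4, 5, 9])

# Central tendency
mean_value = ratings.mean()
median_value = ratings.median()
mode_values = ratings.mode()      # a Series, because several values can tie

# Spread
value_range = ratings.max() - ratings.min()
std_value = ratings.std()         # sample standard deviation (ddof=1)

# Shape
skewness = ratings.skew()         # positive when the right tail is longer
excess_kurtosis = ratings.kurt()  # 0 for a normal distribution

print(mean_value, median_value, mode_values.tolist(), value_range)
print(std_value, skewness, excess_kurtosis)
```

The mode and the shape measures are computed separately here because they are not part of the default describe() summary.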

The describe() function from pandas calculates the descriptive statistics for a DataFrame.

The basic syntax for the describe() function is,

# for all columns
df.describe()

# for a specific column
df["column_name"].describe()

Here, df is a pandas DataFrame.

The describe() function calculates the following descriptive statistics for numeric data from a pandas DataFrame,

  • Count
  • Mean
  • Standard deviation (std)
  • Minimum (min)
  • 25% (25th Percentile or First quartile)
  • 50% (50th percentile or Second quartile or Median)
  • 75% (75th Percentile or Third quartile)
  • Maximum (max)
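The three quartiles in this list are only the defaults. As a minimal sketch, the percentiles argument of describe() can request other cut points. The series below is hypothetical and used only for illustration:

```python
import pandas as pd

# Hypothetical values, chosen only for illustration
s = pd.Series(range(1, 101), name="value")

# Ask for the 10th and 90th percentiles in place of the default quartiles
summary = s.describe(percentiles=[0.1, 0.9])
print(summary)
```

Depending on the pandas version, a 50% (median) row may also appear in the output even though it was not requested.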

The describe() function calculates the following descriptive statistics for categorical data (e.g. strings) from a pandas DataFrame,

  • Count
  • Unique (number of unique values)
  • Top (most common value)
  • Frequency of the most common value (freq)
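The four categorical statistics can also be reproduced one at a time. The sketch below is one possible way to do this with Series methods. The values match the state column used in the categorical example later in this article, and the variable names are illustrative:

```python
import pandas as pd

# Same values as the "state" column in the categorical example
state = pd.Series(["TX", "TX", "CA", "CA", "CA"], name="state")

count = state.count()          # number of non-missing values
unique = state.nunique()       # number of distinct values
top = state.mode().iloc[0]     # most common value (mode() sorts ties, so this takes the first)
freq = (state == top).sum()    # occurrences of the most common value

print(count, unique, top, freq)
```

When several values tie for the highest count, describe() may report a different top value than mode().iloc[0] does, because the choice among tied values is arbitrary.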

Note: By default, the describe() function returns descriptive statistics only for the numerical columns when a DataFrame contains both numerical and categorical columns.

The following examples explain how to use the describe() function to get descriptive statistics from a pandas DataFrame.

Calculate descriptive statistics for numerical pandas DataFrame

Create a numerical pandas DataFrame,

# load package
import pandas as pd

# create a DataFrame
df = pd.DataFrame({'Age':[25, 30, 20, 35, 38], 'Height':[5.5, 6.2, 5, 4.9, 5.9]})

# view DataFrame
df

   Age  Height
0   25     5.5
1   30     6.2
2   20     5.0
3   35     4.9
4   38     5.9

Calculate descriptive statistics,

df.describe()

             Age    Height
count   5.000000  5.000000
mean   29.600000  5.500000
std     7.300685  0.561249
min    20.000000  4.900000
25%    25.000000  5.000000
50%    30.000000  5.500000
75%    35.000000  5.900000
max    38.000000  6.200000

The describe() function outputs the count, mean, standard deviation (std), minimum (min), first quartile (25%), median (50%), third quartile (75%), and maximum (max) of each column.
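To make the meaning of each row concrete, the following sketch recomputes the Age summary with individual Series methods and compares it with describe(). It reuses the article's numerical data. The names by_hand and summary are illustrative:

```python
import pandas as pd

# Same data as the article's numerical example
df = pd.DataFrame({'Age': [25, 30, 20, 35, 38], 'Height': [5.5, 6.2, 5, 4.9, 5.9]})

age = df["Age"]

# Recompute each row of describe() with the matching Series method
by_hand = pd.Series({
    "count": age.count(),
    "mean": age.mean(),
    "std": age.std(),            # sample standard deviation (ddof=1)
    "min": age.min(),
    "25%": age.quantile(0.25),   # linear interpolation by default
    "50%": age.median(),
    "75%": age.quantile(0.75),
    "max": age.max(),
}, dtype="float64")

summary = age.describe()
print((summary - by_hand).abs().max())   # largest difference between the two
```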

Calculate the variance using the var() function of the pandas DataFrame,

df.var()

Age       53.300
Height     0.315
dtype: float64
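The values above are sample variances: by default, var() and std() divide by n - 1 (ddof=1). As a minimal sketch on the same data, the population variance can be requested with ddof=0:

```python
import pandas as pd

# Same data as the article's numerical example
df = pd.DataFrame({'Age': [25, 30, 20, 35, 38], 'Height': [5.5, 6.2, 5, 4.9, 5.9]})

sample_var = df.var()            # default ddof=1 divides by n - 1
population_var = df.var(ddof=0)  # ddof=0 divides by n

print(sample_var)
print(df.std() ** 2)             # matches sample_var up to floating-point error
print(population_var)
```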

Calculate the range (difference between the maximum and minimum values) of a column in the pandas DataFrame,

# for Age column
df.Age.max() - df.Age.min()

18

# for Height column
df.Height.max() - df.Height.min()

1.2999999999999998

The exact range is 1.3; the trailing digits are floating-point rounding error.
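The range of every column can also be derived in one step from the describe() table. The sketch below is one way to do it, and it computes the interquartile range the same way. It reuses the article's numerical data. The names value_range and iqr are illustrative:

```python
import pandas as pd

# Same data as the article's numerical example
df = pd.DataFrame({'Age': [25, 30, 20, 35, 38], 'Height': [5.5, 6.2, 5, 4.9, 5.9]})

summary = df.describe()

# Rows of the summary are selected by label with .loc
value_range = summary.loc["max"] - summary.loc["min"]
iqr = summary.loc["75%"] - summary.loc["25%"]   # interquartile range

# round() hides floating-point noise in the differences
print(value_range.round(2))
print(iqr.round(2))
```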

Calculate descriptive statistics for categorical pandas DataFrame

Create a categorical pandas DataFrame,

# load package
import pandas as pd

# create a DataFrame
df = pd.DataFrame({'school':['A', 'B', 'C', 'D', 'E'], 'state':["TX", "TX", "CA", "CA", "CA"], 
                   'temp':["hot", "hot", "mild", "mild", "mild"]})

# view DataFrame
df

  school state  temp
0      A    TX   hot
1      B    TX   hot
2      C    CA  mild
3      D    CA  mild
4      E    CA  mild

Calculate descriptive statistics,

df.describe()

       school state  temp
count       5     5     5
unique      5     2     2
top         A    CA  mild
freq        1     3     3

By default, the describe() function outputs the count, the number of unique values (unique), the most common value (top), and the frequency of the most common value (freq).
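The top and freq rows describe only the most common value. To see the frequency of every value in a column, value_counts() can be used. The following is a minimal sketch on the same categorical data:

```python
import pandas as pd

# Same data as the article's categorical example
df = pd.DataFrame({'school': ['A', 'B', 'C', 'D', 'E'],
                   'state': ["TX", "TX", "CA", "CA", "CA"],
                   'temp': ["hot", "hot", "mild", "mild", "mild"]})

state_counts = df["state"].value_counts()                # absolute frequencies
state_share = df["state"].value_counts(normalize=True)   # relative frequencies

print(state_counts)
print(state_share)
```

In the school column every value occurs once, so the top value reported by describe() is just one of several tied values.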

Calculate descriptive statistics for mixed pandas DataFrame

By default, the describe() function returns descriptive statistics only for the numerical columns if the DataFrame has mixed data types (numerical and categorical).
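A short sketch of this default behaviour, using the mixed DataFrame from this section:

```python
import pandas as pd

# Same data as the article's mixed example
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E'],
                   'Age': [25, 30, 20, 35, 38],
                   'Height': [5.5, 6.2, 5, 4.9, 5.9]})

summary = df.describe()            # no include/exclude arguments
print(summary.columns.tolist())    # the string column "name" is absent
```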

You can pass the include='all' parameter to the describe() function to get descriptive statistics for every data type.

Create a mixed data type pandas DataFrame,

# load package
import pandas as pd

# create a DataFrame
df = pd.DataFrame({'name':['A', 'B', 'C', 'D', 'E'], 'Age':[25, 30, 20, 35, 38], 'Height':[5.5, 6.2, 5, 4.9, 5.9]})

# view DataFrame
df

  name  Age  Height
0    A   25     5.5
1    B   30     6.2
2    C   20     5.0
3    D   35     4.9
4    E   38     5.9

Calculate descriptive statistics for both numerical and categorical variables,

df.describe(include="all")

       name        Age    Height
count     5   5.000000  5.000000
unique    5        NaN       NaN
top       A        NaN       NaN
freq      1        NaN       NaN
mean    NaN  29.600000  5.500000
std     NaN   7.300685  0.561249
min     NaN  20.000000  4.900000
25%     NaN  25.000000  5.000000
50%     NaN  30.000000  5.500000
75%     NaN  35.000000  5.900000
max     NaN  38.000000  6.200000
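In the combined table, NaN marks a statistic that does not apply to a column's data type. If separate tables are easier to read, the include and exclude arguments can restrict the summary to one kind of column. The following is a minimal sketch on the same mixed data. The names numeric_only and non_numeric_only are illustrative:

```python
import numpy as np
import pandas as pd

# Same data as the article's mixed example
df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E'],
                   'Age': [25, 30, 20, 35, 38],
                   'Height': [5.5, 6.2, 5, 4.9, 5.9]})

numeric_only = df.describe(include=[np.number])       # Age and Height
non_numeric_only = df.describe(exclude=[np.number])   # name

print(numeric_only)
print(non_numeric_only)
```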

This work is licensed under a Creative Commons Attribution 4.0 International License
