# How to Calculate Descriptive Statistics in Python with Pandas DataFrame

Descriptive statistics summarize the characteristics of a dataset in a meaningful way, without drawing any inferences beyond the data itself.

They describe the central tendency (mean, median, mode), the spread (range, standard deviation, and variance), the shape, and the frequency of the data.

The `describe()` function from pandas calculates the descriptive statistics for a DataFrame.

The basic syntax for the `describe()` function is,

```
# for all columns
df.describe()
# for specific column
df["column_name"].describe()
```

Where `df` is a pandas DataFrame.

The `describe()` function calculates the following descriptive statistics for **numeric data** from a pandas DataFrame,

- Count
- Mean
- Standard deviation (`std`)
- Minimum (`min`)
- 25% (25th percentile or first quartile)
- 50% (50th percentile, second quartile, or median)
- 75% (75th percentile or third quartile)
- Maximum (`max`)
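As a rough cross-check, each row that `describe()` reports for numeric data can be reproduced with the corresponding individual pandas method. The sketch below uses a small made-up Series; the name `ages` and its values are assumptions chosen only for illustration.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch)
ages = pd.Series([25, 30, 20, 35, 38], name="Age")

summary = ages.describe()

# Reproduce each row of the summary with its own method
manual = pd.Series({
    "count": ages.count(),
    "mean": ages.mean(),
    "std": ages.std(),            # sample standard deviation (ddof=1)
    "min": ages.min(),
    "25%": ages.quantile(0.25),
    "50%": ages.median(),
    "75%": ages.quantile(0.75),
    "max": ages.max(),
})

print(summary)
print(manual)
```

Both printed Series should list the same eight statistics in the same order.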

The `describe()` function calculates the following descriptive statistics for **categorical data** (e.g. strings) from a pandas DataFrame,

- Count
- Unique (number of unique values)
- Top (most common value)
- Frequency (`freq`) of the most common value
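To make these four values concrete, they can also be computed by hand. This is a minimal sketch on a made-up Series (`temp` and its values are assumptions for illustration), not a description of how pandas computes them internally.

```python
import pandas as pd

# Illustrative categorical data (an assumption for this sketch)
temp = pd.Series(["hot", "hot", "mild", "mild", "mild"], name="temp")

counts = temp.value_counts()   # frequency of each category, most common first

n_values = temp.count()        # count: number of non-missing values
n_unique = temp.nunique()      # unique: number of distinct values
top = counts.index[0]          # top: the most common value
freq = counts.iloc[0]          # freq: how often the most common value occurs

print(n_values, n_unique, top, freq)  # prints: 5 2 mild 3
```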

Note: By default, the `describe()` function returns descriptive statistics only for the numerical columns if a DataFrame contains both numerical and categorical columns.

The following examples explain how to use the `describe()` function to get descriptive statistics from a pandas DataFrame.

### Calculate descriptive statistics for numerical pandas DataFrame

Create a numerical pandas DataFrame,

```
# load package
import pandas as pd
# create a DataFrame
df = pd.DataFrame({'Age':[25, 30, 20, 35, 38], 'Height':[5.5, 6.2, 5, 4.9, 5.9]})
# view DataFrame
df
   Age  Height
0   25     5.5
1   30     6.2
2   20     5.0
3   35     4.9
4   38     5.9
```

Calculate descriptive statistics,

```
df.describe()
             Age    Height
count   5.000000  5.000000
mean   29.600000  5.500000
std     7.300685  0.561249
min    20.000000  4.900000
25%    25.000000  5.000000
50%    30.000000  5.500000
75%    35.000000  5.900000
max    38.000000  6.200000
```

The `describe()` function outputs the count, mean, standard deviation (`std`), minimum, first quartile (25%), median (50%), third quartile (75%), and maximum for each column.
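Because the result of `describe()` is itself a DataFrame, individual statistics can be pulled out with `.loc`. The sketch below rebuilds the example DataFrame so that it runs on its own; the variable names `summary`, `mean_age`, and `iqr` are illustrative, and the interquartile range is included only as one possible use of the quartile rows.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 20, 35, 38],
                   'Height': [5.5, 6.2, 5, 4.9, 5.9]})

summary = df.describe()   # the result is itself a DataFrame

# Single value: row label first, then column name
mean_age = summary.loc['mean', 'Age']

# Interquartile range (75% minus 25%) for every column
iqr = summary.loc['75%'] - summary.loc['25%']

print(mean_age)
print(iqr)
```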

Calculate the **variance** using the `var()` function from pandas. By default, `var()` returns the sample variance (it divides by n - 1),

```
df.var()
Age       53.300
Height     0.315
dtype: float64
```
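The `ddof` argument of `var()` controls the divisor: the default `ddof=1` gives the sample variance, while `ddof=0` gives the population variance. A minimal sketch on the same data, with illustrative variable names, also shows that the standard deviation is the square root of the variance.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 20, 35, 38],
                   'Height': [5.5, 6.2, 5, 4.9, 5.9]})

sample_var = df.var()            # default ddof=1: divide by n - 1
population_var = df.var(ddof=0)  # ddof=0: divide by n

# The standard deviation is the square root of the variance
std_from_var = sample_var ** 0.5

print(sample_var)
print(population_var)
print(std_from_var)
```

The last Series should match the `std` row reported by `describe()`, which also uses `ddof=1`.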

Calculate the **range** (difference between max and min values) from pandas DataFrame,

```
# for Age column
df.Age.max() - df.Age.min()
18
# for Height column (rounded to hide floating-point noise)
round(df.Height.max() - df.Height.min(), 2)
1.3
```
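The range can also be computed for every numeric column in one step, since `max()` and `min()` on a DataFrame each return one value per column. This is a small sketch on the same data; `value_range` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 20, 35, 38],
                   'Height': [5.5, 6.2, 5, 4.9, 5.9]})

# Column-wise max minus column-wise min gives one range per column
value_range = df.max() - df.min()

# Round before printing to hide floating-point noise in the Height column
print(value_range.round(2))
```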

### Calculate descriptive statistics for categorical pandas DataFrame

Create a categorical pandas DataFrame,

```
# load package
import pandas as pd
# create a DataFrame
df = pd.DataFrame({'school':['A', 'B', 'C', 'D', 'E'], 'state':["TX", "TX", "CA", "CA", "CA"],
'temp':["hot", "hot", "mild", "mild", "mild"]})
# view DataFrame
df
  school state  temp
0      A    TX   hot
1      B    TX   hot
2      C    CA  mild
3      D    CA  mild
4      E    CA  mild
```

Calculate descriptive statistics,

```
df.describe()
       school state  temp
count       5     5     5
unique      5     2     2
top         A    CA  mild
freq        1     3     3
```

For categorical columns, the `describe()` function outputs the count, the number of unique values, the most common value (`top`), and the frequency (`freq`) of the most common value.
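`describe()` reports the frequency of the most common value only. If the frequency of every category is needed, `value_counts()` on a single column is one option. The sketch below rebuilds the categorical DataFrame so that it runs on its own; `state_counts` and `state_share` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'school': ['A', 'B', 'C', 'D', 'E'],
                   'state': ["TX", "TX", "CA", "CA", "CA"],
                   'temp': ["hot", "hot", "mild", "mild", "mild"]})

# Absolute frequency of each state, most common first
state_counts = df['state'].value_counts()

# Relative frequency (proportion of rows) of each state
state_share = df['state'].value_counts(normalize=True)

print(state_counts)
print(state_share)
```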

### Calculate descriptive statistics for mixed pandas DataFrame

By default, the `describe()` function returns descriptive statistics only for the numerical columns if the DataFrame has mixed data types (numerical and categorical).

You can pass the `include='all'` parameter to the `describe()` function to get descriptive statistics for every column, whatever its data type.

Create a mixed data type pandas DataFrame,

```
# load package
import pandas as pd
# create a DataFrame
df = pd.DataFrame({'name':['A', 'B', 'C', 'D', 'E'], 'Age':[25, 30, 20, 35, 38], 'Height':[5.5, 6.2, 5, 4.9, 5.9]})
# view DataFrame
df
  name  Age  Height
0    A   25     5.5
1    B   30     6.2
2    C   20     5.0
3    D   35     4.9
4    E   38     5.9
```

Calculate descriptive statistics for both numerical and categorical variables,

```
df.describe(include = "all")
       name        Age    Height
count     5   5.000000  5.000000
unique    5        NaN       NaN
top       A        NaN       NaN
freq      1        NaN       NaN
mean    NaN  29.600000  5.500000
std     NaN   7.300685  0.561249
min     NaN  20.000000  4.900000
25%     NaN  25.000000  5.000000
50%     NaN  30.000000  5.500000
75%     NaN  35.000000  5.900000
max     NaN  38.000000  6.200000
```
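With `include="all"`, statistics that do not apply to a column show up as `NaN`. To avoid the padding, the `include` and `exclude` parameters of `describe()` can restrict the summary to one kind of column. A minimal sketch on the same mixed DataFrame, with illustrative variable names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C', 'D', 'E'],
                   'Age': [25, 30, 20, 35, 38],
                   'Height': [5.5, 6.2, 5, 4.9, 5.9]})

# Only numeric columns (for this DataFrame, the same columns as the default)
numeric_only = df.describe(include=[np.number])

# Everything that is not numeric, here the 'name' column
categorical_only = df.describe(exclude=[np.number])

print(numeric_only)
print(categorical_only)
```

Each result contains only the rows that are meaningful for its column type, so no `NaN` padding is expected here.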

