Checking and handling missing values (NaN) in pandas

Renesh Bedre    3 minute read

  • In pandas dataframe the NULL or missing values (missing data) are denoted as NaN. Sometimes, Python None can also be considered as missing values.
  • Missing values in datasets can cause the complication in data handling and analysis, loss of information and efficiency, and can produce biased results.
  • It is necessary to check the missing data in datasets for feature engineering such as imputation of missing data, dropping the records with missing data, etc.

Check NULL or missing (NaN) values in pandas series

Create the pandas series with missing (NaN) values,

import pandas as pd
import numpy as np
data = {'a': 1, 'b': 2, 'c': 3, 'd': np.nan, 'e':4, 'f':np.nan}
ser = pd.Series(data=data)
ser
# output
a    1.0
b    2.0
c    3.0
d    NaN
e    4.0
f    NaN
dtype: float64

Check missing values in pandas series with isnull() function,

ser.isna()
# output
a    False
b    False
c    False
d     True
e    False
f     True
dtype: bool
# missing values are indicated by True

Count the missing values in pandas series using the sum() function

ser.isna().sum()
# output
2

Check NULL or missing (NaN) values in pandas dataframe

Create the pandas dataframe with missing (NaN) values,

import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['A', 'B', 'C', 'D', 'E'], 'col2':[1,2,3,4,np.nan], 'col3':[0.1,np.nan,0.3,0.4,0.5]})
df
# output
  col1  col2  col3
0    A   1.0   0.1
1    B   2.0   NaN
2    C   3.0   0.3
3    D   4.0   0.4
4    E   NaN   0.5

Check the missing values in pandas dataframe using isnull() function

df.isna()
# output
    col1   col2   col3
0  False  False  False
1  False  False   True
2  False  False  False
3  False  False  False
4  False   True  False
# missing values are indicated by True

# check if overall dataframe has any missing values
df.isna().any().any()
# output
True

Count the missing values in each column in the pandas dataframe using the sum() function

df.isna().sum()
# output
col1    0
col2    1
col3    1
dtype: int64

Handling missing values in pandas dataframe

Drop the missing values in pandas dataframe using the dropna() function

df.dropna()
# output
  col1  col2  col3
0    A   1.0   0.1
2    C   3.0   0.3
3    D   4.0   0.4
# it drops a complete row where missing value is present in any column

Fill the missing values with average or median values. This is also called the imputation of missing values.

# fill each column missing values with average value for that column
df.fillna(df.mean())
# output
  col1  col2  col3
0    A   1.0  0.10
1    B   2.0  0.35
2    C   3.0  0.30
3    D   4.0  0.40
4    E   2.5  0.50

# fill each column missing values with median value for that column
df.fillna(df.median())
  col1  col2  col3
0    A   1.0  0.10
1    B   2.0  0.35
2    C   3.0  0.30
3    D   4.0  0.40
4    E   2.5  0.50

For a categorical variable, the mode (most frequent value) can be used for filling the missing values

# create dataframe with a categorical variable
df1 = pd.DataFrame({'col1':['A', 'B', 'C', 'D', 'E'], 'col2':[1,2,3,4,np.nan], 'col3':[0,np.nan,0,0,1]})
# make col3 as a categorical variable
df1['col3'] = df1.col3.astype('category')
# impute missing value in col3 with mode
df1['col3'].fillna(value=df1['col3'].mode()[0], inplace=True)
df1
# output
  col1  col2 col3
0    A   1.0  0.0
1    B   2.0  0.0
2    C   3.0  0.0
3    D   4.0  0.0
4    E   NaN  1.0

Fill the missing values with any constant values

# fill col2 missing values with 0
df['col2'].fillna(value=0, inplace=True)
df
# output
  col1  col2  col3
0    A   1.0   0.1
1    B   2.0   NaN
2    C   3.0   0.3
3    D   4.0   0.4
4    E   0.0   0.5

Fill the missing value with the non-missing value that appears before the missing value

df.fillna(method='ffill', inplace=True)
df
# output
  col1  col2  col3
0    A   1.0   0.1
1    B   2.0   0.1
2    C   3.0   0.3
3    D   4.0   0.4
4    E   4.0   0.5

Fill the missing value with the non-missing value that appear after the missing value

df.fillna(method='bfill', inplace=True)
df
# output
  col1  col2  col3
0    A   1.0   0.1
1    B   2.0   0.3
2    C   3.0   0.3
3    D   4.0   0.4
4    E   NaN   0.5

See more parameters at pandas fillna usage

References

  1. Barnard J, Meng XL. Applications of multiple imputation in medical studies: from AIDS to NHANES. Statistical methods in medical research. 1999 Feb;8(1):17-36.

Learn more about Python

This work is licensed under a Creative Commons Attribution 4.0 International License