# Python: Why Does VIF Return an Inf Value?

## Background

Variance Inflation Factor (VIF) is used for detecting multicollinearity in regression models. It measures how much the variance of a regression coefficient is inflated due to multicollinearity with other independent variables in the model.

In Python, VIF can be calculated using the `variance_inflation_factor()` function from the statsmodels package. However, you may encounter a situation where you get `Inf` (infinity) values as the VIF for some of the independent variables.

Identifying and removing the multicollinearity issues is essential for robust predictive modeling in machine learning.

This article explains the reasons behind `Inf` values for the VIF with an example analysis.

## Why `inf` values for VIF?

You can get `inf` values for VIF due to perfect multicollinearity. This happens when two or more independent variables in a model are perfectly linearly dependent; that is, one independent variable can be entirely predicted from the others. The VIF for a variable is calculated as 1 / (1 - R²), where R² comes from regressing that variable on the remaining independent variables. Perfect linear dependence gives R² = 1, so the formula divides by zero and evaluates to `inf`.

If you have multiple identical columns in the input dataset, there will be perfect multicollinearity.

In addition, very high correlation (correlation coefficients close to 1 or -1) between the independent variables produces very large VIF values, and a correlation that is numerically equal to 1 or -1 can likewise produce `inf`.

## VIF Calculation Example

The following example shows how the `inf` values for VIF arise.

Create an example dataset,

``````# import package
import pandas as pd

# load the data into a DataFrame named df
# (loading step not shown here; e.g., df = pd.read_csv(...))

# view
df.head()
    BP  Age  Weight   BSA  Dur  Pulse  Stress
0  105   47    85.4  1.75  5.1     63      33
1  115   49    94.2  2.10  3.8     70      14
2  116   49    95.3  1.98  8.2     72      10
3  117   50    94.7  2.01  5.8     73      99
4  112   51    89.4  1.89  7.0     72      95
``````

Create separate datasets for the independent variables (Age, Weight, BSA, Dur, Pulse, Stress) and the dependent variable (BP),

``````# independent variables
X = df[['Age', 'Weight', 'BSA', 'Dur', 'Pulse', 'Stress']]

# dependent variable
y = df['BP']
``````

Add a duplicate of an independent variable to create perfect multicollinearity,

``````# add Age again under a new name to create a duplicate column
X = X.copy()  # work on a copy to avoid a pandas SettingWithCopyWarning
X['Age_dup'] = df['Age']

# view X
X.head()
   Age  Weight   BSA  Dur  Pulse  Stress  Age_dup
0   47    85.4  1.75  5.1     63      33       47
1   49    94.2  2.10  3.8     70      14       49
2   49    95.3  1.98  8.2     72      10       49
3   50    94.7  2.01  5.8     73      99       50
4   51    89.4  1.89  7.0     72      95       51
``````

Calculate the VIF for each independent variable using the `variance_inflation_factor()` function from the statsmodels package,

``````import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# fit the regression model
reg = sm.OLS(y, X).fit()

# get the Variance Inflation Factor (VIF) for each independent variable
pd.DataFrame({'variables': X.columns,
              'VIF': [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]})
  variables       VIF
0       Age       inf
1    Weight  8.417035
2       BSA  5.328751
3       Dur  1.237309
4     Pulse  4.413575
5    Stress  1.834845
6   Age_dup       inf
``````

You can see that the VIF values for the Age and Age_dup variables are `inf`. This is because Age and Age_dup are perfectly linearly dependent (they contain identical values). It means that `Age` and `Age_dup` have perfect multicollinearity.

Multicollinearity in regression models can be identified by checking for duplicate columns and by computing the pairwise correlations between the independent variables. It can be resolved by removing one of the variables causing the multicollinearity.
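As a sketch of those two checks (only a few columns of the example data are re-created here, using the values shown above, so the snippet is self-contained):

```python
import pandas as pd

# re-create a few columns of the example data (values from the rows shown above)
X = pd.DataFrame({
    'Age':     [47, 49, 49, 50, 51],
    'Weight':  [85.4, 94.2, 95.3, 94.7, 89.4],
    'Age_dup': [47, 49, 49, 50, 51],
})

# pairwise correlation: a perfectly collinear pair shows a coefficient of 1 or -1
print(X.corr())

# detect exact duplicate columns by checking for duplicated rows of the transpose
dup_cols = X.columns[X.T.duplicated()]
print(list(dup_cols))  # ['Age_dup']

# resolve the issue by dropping the duplicated variable before refitting the model
X_clean = X.drop(columns=dup_cols)
```

`DataFrame.T.duplicated()` only catches exact duplicates; the correlation matrix is the more general check, since it also flags near-collinear pairs that would produce very large (but finite) VIF values.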