
Mastering VIF in Machine Learning for Robust Model Performance
In this blog, you will learn about
- What is VIF?
- How to calculate VIF?
- How to interpret VIF values?
- How to apply VIF to a dataset?
VIF
Variance Inflation Factor (VIF) is a measure used to quantify the severity of multicollinearity in a multiple linear regression model. It indicates the extent to which the variance of a regression coefficient is inflated due to multicollinearity among the predictor variables. In other words, VIF helps assess the impact of multicollinearity on the stability and reliability of a model’s coefficients.
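For reference, the standard textbook result behind the name can be written as follows. This is a sketch in our own notation, not the original post's: R_j² is the R² obtained by regressing predictor X_j on the remaining predictors (the quantity computed in the next section), σ² is the error variance, and n is the number of observations. The variance of the j-th coefficient is the variance it would have with uncorrelated predictors, multiplied by the VIF.

```latex
\operatorname{Var}(\hat{\beta}_j)
  = \frac{\sigma^2}{(n-1)\,\widehat{\operatorname{Var}}(X_j)} \cdot \frac{1}{1 - R_j^2}
  = \frac{\sigma^2}{(n-1)\,\widehat{\operatorname{Var}}(X_j)} \cdot \mathrm{VIF}_j
```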
Calculating VIF
To calculate VIF, we perform the following steps for each predictor variable:
- Run a linear regression using the predictor of interest as the response variable and all other predictor variables as independent variables.
- Calculate the coefficient of determination (R²) for the regression.
- Compute the VIF value using the formula: VIF = 1 / (1 – R²)
For example, let’s assume we have a multiple linear regression model with three predictor variables: X₁, X₂, and X₃. To calculate the VIF for X₁, we would perform a linear regression using X₁ as the response variable and X₂ and X₃ as independent variables. We would then compute the R² value for this regression and use it in the VIF formula.
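The three steps above can be sketched directly in NumPy. This is a minimal illustration, not the implementation used later in this blog: the helper name vif_manual and the synthetic data are assumptions, and the helper adds an intercept to each auxiliary regression so that R² is the usual centered R².

```python
import numpy as np

def vif_manual(X, j):
    """VIF of column j: regress X[:, j] on the other columns plus an intercept."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    # Step 1: linear regression of the chosen predictor on all other predictors
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    # Step 2: coefficient of determination of that regression
    ss_res = np.sum((y - A @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot
    # Step 3: VIF = 1 / (1 - R²)
    return 1.0 / (1.0 - r_squared)

# Synthetic data (an assumption for illustration): X1 is largely driven by X2,
# while X3 is independent noise.
rng = np.random.default_rng(0)
x2 = rng.normal(size=500)
x3 = rng.normal(size=500)
x1 = 0.9 * x2 + 0.3 * rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

vifs = [vif_manual(X, j) for j in range(X.shape[1])]
for name, vif in zip(["X1", "X2", "X3"], vifs):
    print(f"VIF({name}) = {vif:.2f}")
```

In this setup X1 and X2 should receive large VIF values because each is well explained by the other, while X3 should stay close to 1.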
Interpretation of VIF values
VIF values help assess the extent of multicollinearity in a regression model. Higher VIF values indicate higher multicollinearity, which implies that the associated regression coefficient is less reliable.
Here is a general guideline for interpreting VIF values:
- VIF = 1: No multicollinearity.
- 1 < VIF < 5: Moderate multicollinearity, which might not be a severe issue depending on the context.
- VIF >= 5: High multicollinearity, which could be problematic and should be addressed.
These are general guidelines, and the specific threshold for determining multicollinearity may vary depending on the problem and domain knowledge.
Threshold for VIF
While there is no universally accepted threshold for VIF, a common rule of thumb is to consider a VIF value of 5 or greater as an indicator of high multicollinearity. In some cases, a more lenient threshold of 10 is used instead. It’s essential to consider the context of the analysis and domain knowledge when determining an appropriate threshold.
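The guideline above can be expressed as a small helper. The function name vif_severity, the category labels, and the configurable high_threshold parameter are illustrative assumptions, not part of any library.

```python
import math

def vif_severity(vif, high_threshold=5.0):
    """Map a VIF value to the rough categories of the guideline above."""
    # Treat values at (or, through rounding, marginally below) 1 as "none"
    if vif < 1.0 or math.isclose(vif, 1.0, abs_tol=1e-9):
        return "none"
    if vif < high_threshold:
        return "moderate"
    return "high"

for value in (1.0, 2.3, 5.0, 12.0):
    print(value, "->", vif_severity(value))

# The same value judged against the more lenient threshold of 10
print(7.5, "->", vif_severity(7.5, high_threshold=10.0))
```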
Dataset
Data Set Information:
This data is for the purpose of bias correction of next-day maximum and minimum air temperatures forecast of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea. This data consists of summer data from 2013 to 2017. The input data is largely composed of the LDAPS model’s next-day forecast data, in-situ maximum and minimum temperatures of present-day, and geographic auxiliary variables. There are two outputs (i.e. next-day maximum and minimum air temperatures) in this data. Hindcast validation was conducted for the period from 2015 to 2017.
Attribute Information:
1. station - used weather station number: 1 to 25
2. Date - Present day: yyyy-mm-dd ('2013-06-30' to '2017-08-30')
3. Present_Tmax - Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6
4. Present_Tmin - Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9
5. LDAPS_RHmin - LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5
6. LDAPS_RHmax - LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100
7. LDAPS_Tmax_lapse - LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5
8. LDAPS_Tmin_lapse - LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6
9. LDAPS_WS - LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9
10. LDAPS_LH - LDAPS model forecast of next-day average latent heat flux (W/m2): -13.6 to 213.4
11. LDAPS_CC1 - LDAPS model forecast of next-day 1st 6-hour split average cloud cover (0-5 h) (%): 0 to 0.97
12. LDAPS_CC2 - LDAPS model forecast of next-day 2nd 6-hour split average cloud cover (6-11 h) (%): 0 to 0.97
13. LDAPS_CC3 - LDAPS model forecast of next-day 3rd 6-hour split average cloud cover (12-17 h) (%): 0 to 0.98
14. LDAPS_CC4 - LDAPS model forecast of next-day 4th 6-hour split average cloud cover (18-23 h) (%): 0 to 0.97
15. LDAPS_PPT1 - LDAPS model forecast of next-day 1st 6-hour split average precipitation (0-5 h) (%): 0 to 23.7
16. LDAPS_PPT2 - LDAPS model forecast of next-day 2nd 6-hour split average precipitation (6-11 h) (%): 0 to 21.6
17. LDAPS_PPT3 - LDAPS model forecast of next-day 3rd 6-hour split average precipitation (12-17 h) (%): 0 to 15.8
18. LDAPS_PPT4 - LDAPS model forecast of next-day 4th 6-hour split average precipitation (18-23 h) (%): 0 to 16.7
19. lat - Latitude (°): 37.456 to 37.645
20. lon - Longitude (°): 126.826 to 127.135
21. DEM - Elevation (m): 12.4 to 212.3
22. Slope - Slope (°): 0.1 to 5.2
23. Solar radiation - Daily incoming solar radiation (wh/m2): 4329.5 to 5992.9
24. Next_Tmax - The next-day maximum air temperature (°C): 17.4 to 38.9
25. Next_Tmin - The next-day minimum air temperature (°C): 11.3 to 29.8
# Import libraries
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Data processing: load the data, drop rows with missing values, drop the date column
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00514/Bias_correction_ucl.csv")
df = df.dropna()
df = df.drop(columns=['Date'])
df
First, we import the libraries needed to compute VIF, in particular “variance_inflation_factor” from statsmodels.
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
Once the libraries are imported, we build the dataset on which to compute VIF by selecting every attribute except the target column, “Next_Tmin”, which is the last column of the data frame. Note that “Next_Tmax”, the other output variable, stays among the predictors in this example.
X = df[list(df.columns[:-1])]
Now that we have the dataset, we compute the VIF of each attribute, store the values in a data frame, and sort them in descending order using the “sort_values” method. Let us look at the code.
# Compute the VIF of every attribute in X
vif_df = pd.DataFrame()
vif_df['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif_df['Attributes'] = X.columns

# Sort from the highest to the lowest VIF
vif_df = vif_df.sort_values('VIF', ascending=False).reset_index(drop=True)
vif_df
Running the above code, we can see that the attributes “lat” and “lon” have the highest VIF, meaning that each of them is almost entirely explained by the other attributes. Now, we have two options to deal with multicollinearity:
- Removing all the attributes at once where the VIF value is greater than 5: Going forward with this strategy might lead to a greater loss of information.
- Removing attributes where the VIF is greater than 5 one by one and then calculating VIF again: With this strategy, information loss is reduced, because removing one attribute can lower the VIF of the remaining ones.
To remove the attributes one by one, we use a while loop: on each pass we recompute the VIF values, remove the attribute with the highest VIF if it is greater than 5, and stop once no VIF exceeds 5. Let us move forward and see the code.
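VIF_THRESHOLD = 5  # threshold used throughout this blog

while True:
    # Recompute the VIF of every remaining attribute
    vif_df = pd.DataFrame()
    vif_df['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif_df['attributes'] = X.columns
    vif_df = vif_df.sort_values('VIF', ascending=False)

    # Stop once the largest VIF no longer exceeds the threshold
    if vif_df['VIF'].values[0] <= VIF_THRESHOLD:
        break

    # Otherwise drop the attribute with the largest VIF and repeat
    worst = vif_df['attributes'].values[0]
    print("Removing:", worst, "as VIF value is", vif_df['VIF'].values[0])
    X = X.drop(columns=worst).reset_index(drop=True)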
Running the code above, we can see that we have removed a total of 13 attributes where the VIF value was greater than 5.
vif_df.reset_index(drop=True)
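One caveat about the VIF values above, offered as a sketch rather than a diagnosis of these results: statsmodels' variance_inflation_factor regresses each column on the remaining columns exactly as they are passed in and does not add an intercept. When the input has no constant column, the R² it uses is uncentered, so a column whose values barely vary around a large mean (lat, for example, ranges only from 37.456 to 37.645) can receive an enormous VIF even if it is unrelated to the other predictors. A common remedy is to add a constant column with sm.add_constant before computing VIF and to ignore the value reported for the constant itself. The NumPy-only example below reproduces both variants on synthetic data; the helper names and the data are assumptions for illustration.

```python
import numpy as np

def vif_no_intercept(X, j):
    """VIF from a regression of column j on the other columns, no constant."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    ss_res = np.sum((y - others @ coef) ** 2)
    r_squared = 1.0 - ss_res / np.sum(y ** 2)  # uncentered R²
    return 1.0 / (1.0 - r_squared)

def vif_with_intercept(X, j):
    """VIF from a regression of column j on the other columns plus a constant."""
    y = X[:, j]
    others = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, y, rcond=None)
    ss_res = np.sum((y - others @ coef) ** 2)
    r_squared = 1.0 - ss_res / np.sum((y - y.mean()) ** 2)  # centered R²
    return 1.0 / (1.0 - r_squared)

# Synthetic, mutually independent columns (an assumption for illustration):
# two have a large mean and a tiny spread, like latitude and longitude.
rng = np.random.default_rng(1)
n = 1000
lat_like = 37.5 + 0.05 * rng.normal(size=n)
lon_like = 127.0 + 0.08 * rng.normal(size=n)
noise = rng.normal(size=n)
X_demo = np.column_stack([lat_like, lon_like, noise])

vif_raw = vif_no_intercept(X_demo, 0)
vif_centered = vif_with_intercept(X_demo, 0)
print("VIF without intercept:", vif_raw)
print("VIF with intercept   :", vif_centered)
```

Because the three synthetic columns are independent, the intercept-based VIF should stay close to 1, while the variant without an intercept should be very large.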
Now we are going to create two models:
- Training the model with all the attributes and checking the MSE.
- Training the model with attributes where the VIF is less than 5.
First, we will see how our model performs with all the attributes.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Target column and full feature set (every attribute except the target)
y = df['Next_Tmin']
X_all = df.drop(columns=['Next_Tmin'])

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X_all, y)

# Data engineering: fit the scaler on the training data only
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Model implementation
model = LinearRegression()
model.fit(X_train, y_train)

# Model scoring
predictions = model.predict(X_test)
mean_squared_error(y_test, predictions)
Using all the attributes gives an MSE of 0.942; the exact value varies from run to run because the train/test split is random. Now we will check the model performance with the selected attributes.
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Target column and the attributes that survived the VIF filtering
y = df['Next_Tmin']
X_selected = df[vif_df['attributes'].values]

# Data splitting
X_train, X_test, y_train, y_test = train_test_split(X_selected, y)

# Data engineering: fit the scaler on the training data only
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Model implementation
model = LinearRegression()
model.fit(X_train, y_train)

# Model scoring
predictions = model.predict(X_test)
mean_squared_error(y_test, predictions)
Running the above code gives an MSE of 6.070, considerably higher than the 0.942 obtained with all the attributes. Removing the attributes has led to a loss of information and hampered the predictive power of the model.
NOTE: The result might change depending upon the selection of the machine learning algorithm and the dataset.
Summary
- We have learned about VIF.
- We have learned about calculating VIF.
- We have learned about interpreting VIF values.
- We have learned about performing VIF on the dataset.
Conclusion
Variance Inflation Factor (VIF) is an important tool for detecting multicollinearity in machine learning models. In this blog, we discussed VIF in detail, its importance, and how it can be calculated in Python using the statsmodels library. We also provided an example of using VIF in a linear regression model to predict the next-day minimum air temperature.
It is important to note that while VIF can help us identify the presence of multicollinearity, it does not provide a solution to the problem. To address multicollinearity, we may need to remove correlated predictor variables or use regularization techniques such as ridge regression or LASSO regression. However, VIF can be a valuable first step in identifying the presence of multicollinearity in our data and taking appropriate steps to address it.
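As a rough illustration of the regularization option mentioned above, here is a minimal sketch of ridge regression using its closed-form solution in NumPy. The helper name ridge_fit, the synthetic data, and the penalty strength alpha=10.0 are assumptions for illustration; in practice one would more likely use a library implementation such as scikit-learn's Ridge and choose the penalty by cross-validation.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    n_features = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Synthetic data (an assumption for illustration): x2 is nearly a copy of x1,
# and the response depends on x1 only.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X_demo = np.column_stack([x1, x2])
y_demo = 3.0 * x1 + rng.normal(size=n)

ols_coef = ridge_fit(X_demo, y_demo, alpha=0.0)     # alpha = 0 is ordinary least squares
ridge_coef = ridge_fit(X_demo, y_demo, alpha=10.0)  # penalized fit
print("OLS  :", ols_coef)
print("Ridge:", ridge_coef)
```

With two nearly identical predictors, the unpenalized coefficients can take large offsetting values, whereas the ridge penalty pulls them toward similar, smaller values whose sum still reflects the combined effect.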