
Uncovering the Hidden Dangers of Multicollinearity
In this blog, we are going to learn
- What is Multicollinearity?
- What causes Multicollinearity?
- What are the effects of Multicollinearity?
- What are the Remedies for Multicollinearity?
- What is Ridge regression?
- How can we remove Multicollinearity using Ridge regression?
Multicollinearity
Multicollinearity occurs when two or more predictor variables in a regression model are highly correlated. It is a common problem in statistical analysis and can lead to unreliable results. In this blog, we will discuss its causes, effects, and remedies.
What is Multicollinearity?
Multicollinearity is a phenomenon that occurs when two or more predictor variables in a regression model are highly correlated. In other words, the predictor variables are nearly linearly related: knowing the value of one tells you a great deal about the value of another. This can lead to inaccurate results in the regression model because the predictor variables are not independent of each other.
What Causes Multicollinearity?
Multicollinearity can occur for a variety of reasons. One common cause is including two predictor variables that measure essentially the same thing. For example, if you are trying to predict a person’s income and you include both their salary and their bonus as predictor variables, they are likely to be highly correlated. Another common cause is including predictor variables that are naturally related to each other. For example, if you are trying to predict a person’s height and you include both their age and their weight as predictor variables, the two are likely to be highly correlated.
What are the Effects of Multicollinearity?
The main effect of multicollinearity is that it makes the regression results unreliable. Because the predictor variables are not independent of each other, the model cannot cleanly separate the effect of each predictor on the outcome variable: the coefficient estimates become unstable and their standard errors are inflated. As a result, the estimated coefficients, and consequently the model’s interpretation and predictions, may be inaccurate.
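To make this concrete, here is a minimal sketch (not part of the original analysis) that simulates two almost identical predictors and refits an ordinary least-squares model on repeated samples; the coefficients swing wildly between runs, which illustrates the instability described above. The variable names and the simulation setup are illustrative assumptions.

# Illustrative simulation: unstable coefficients under multicollinearity (assumed setup)
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
for trial in range(3):
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=0.01, size=200)    # x2 is almost a copy of x1
    y = 3 * x1 + rng.normal(scale=0.1, size=200)  # the true effect comes from x1 only
    model = LinearRegression().fit(np.column_stack([x1, x2]), y)
    print(trial, model.coef_)  # the two coefficients vary wildly from run to run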
What are the Remedies for Multicollinearity?
- There are several remedies for multicollinearity. One of the most common is principal component analysis (PCA). PCA is a statistical technique that reduces the number of predictor variables in a regression model by combining them into a smaller set of components that are uncorrelated with each other. This can help reduce the effects of multicollinearity and improve the accuracy of the regression model (a short sketch follows this list).
- Another remedy is to use ridge regression. Ridge regression is a type of regression model that is designed to reduce the effects of multicollinearity. It works by adding a penalty term to the regression model that penalizes the model for including highly correlated variables. This helps reduce the effects of multicollinearity and can improve the accuracy of the model.
- Finally, another remedy is to use variable selection techniques. Variable selection techniques are used to identify the most important predictor variables in a regression model. These techniques can help reduce the effects of multicollinearity by selecting only the most important predictor variables. This can help improve the accuracy of the regression model.
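As an illustration of the PCA remedy mentioned above, here is a minimal sketch using scikit-learn. The feature matrix X is assumed to be a numeric DataFrame or array of predictors, and keeping 5 components is an arbitrary choice for illustration, not a recommendation.

# Minimal PCA sketch (assumes X is a numeric array/DataFrame of predictors)
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Standardize first so no single predictor dominates, then keep 5 components.
pca_pipeline = make_pipeline(StandardScaler(), PCA(n_components=5))
X_components = pca_pipeline.fit_transform(X)  # components are uncorrelated
# X_components can now be used as the predictors in an ordinary regression model.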
Ridge regression
Ridge Regression, also known as Tikhonov regularization or L2 regularization, is a regularization technique used to address multicollinearity in multiple linear regression models. It works by adding a penalty term to the linear regression loss function, which discourages the model from assigning large coefficients to the predictor variables. This penalty term is proportional to the square of the magnitude of the coefficients, hence the name L2 regularization.
Ridge Regression loss function
The Ridge Regression loss function is given by:
L(θ) = ∑(yᵢ – (θ₀ + θ₁x₁ᵢ + … + θₚxₚᵢ))² + λ∑(θⱼ²)
Here, L(θ) represents the loss function, yᵢ is the observed response for observation i, xⱼᵢ is the value of predictor j for observation i, θ₀ is the intercept, θⱼ is the coefficient for predictor j, λ is the regularization parameter, and p is the number of predictors. The first sum runs over the observations and the second over the p predictor coefficients (the intercept θ₀ is typically not penalized).
The regularization parameter λ determines the strength of the penalty for large coefficients. When λ is equal to zero, Ridge Regression is equivalent to linear regression. As λ increases, the coefficients are forced to shrink, which reduces the impact of multicollinearity on the model.
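The shrinkage effect is easy to see empirically. The following is a small illustrative sketch on a synthetic dataset (an assumption, not the blog’s dataset): as alpha grows, the overall size of the coefficient vector decreases.

# Illustrative sketch of coefficient shrinkage on synthetic data (assumed setup)
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)
for alpha in [0.01, 1.0, 10.0, 100.0, 1000.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    # The L2 norm of the coefficient vector shrinks as alpha increases.
    print(alpha, np.linalg.norm(ridge.coef_))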
Benefits and limitations
Ridge Regression has some key benefits:
- It reduces the effect of multicollinearity by shrinking the coefficients of correlated predictors.
- It provides a stable solution even when the predictor variables are highly correlated.
- It can prevent overfitting by regularizing the model complexity.
However, Ridge Regression also has some limitations:
- It does not perform variable selection, meaning that all predictors remain in the model, even if their contribution is small.
- The choice of the regularization parameter λ can significantly affect the model performance.
First, we are going to download the dataset. For this, we are going to use the pandas read_csv() method. Once we have the dataset, we are going to drop any null rows by using dropna().
Finally, we are going to drop the column 'Date' using the drop() method.
Dataset
Data Set Information:
This data is for the purpose of bias correction of next-day maximum and minimum air temperatures forecast of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea. This data consists of summer data from 2013 to 2017. The input data is largely composed of the LDAPS model’s next-day forecast data, in-situ maximum and minimum temperatures of present-day, and geographic auxiliary variables. There are two outputs (i.e. next-day maximum and minimum air temperatures) in this data. Hindcast validation was conducted for the period from 2015 to 2017.
Attribute Information:
1. station - used weather station number: 1 to 25
2. Date - Present day: yyyy-mm-dd ('2013-06-30' to '2017-08-30')
3. Present_Tmax - Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6
4. Present_Tmin - Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9
5. LDAPS_RHmin - LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5
6. LDAPS_RHmax - LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100
7. LDAPS_Tmax_lapse - LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5
8. LDAPS_Tmin_lapse - LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6
9. LDAPS_WS - LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9
10. LDAPS_LH - LDAPS model forecast of next-day average latent heat flux (W/m2): -13.6 to 213.4
11. LDAPS_CC1 - LDAPS model forecast of next-day 1st 6-hour split average cloud cover (0-5 h) (%): 0 to 0.97
12. LDAPS_CC2 - LDAPS model forecast of next-day 2nd 6-hour split average cloud cover (6-11 h) (%): 0 to 0.97
13. LDAPS_CC3 - LDAPS model forecast of next-day 3rd 6-hour split average cloud cover (12-17 h) (%): 0 to 0.98
14. LDAPS_CC4 - LDAPS model forecast of next-day 4th 6-hour split average cloud cover (18-23 h) (%): 0 to 0.97
15. LDAPS_PPT1 - LDAPS model forecast of next-day 1st 6-hour split average precipitation (0-5 h) (%): 0 to 23.7
16. LDAPS_PPT2 - LDAPS model forecast of next-day 2nd 6-hour split average precipitation (6-11 h) (%): 0 to 21.6
17. LDAPS_PPT3 - LDAPS model forecast of next-day 3rd 6-hour split average precipitation (12-17 h) (%): 0 to 15.8
18. LDAPS_PPT4 - LDAPS model forecast of next-day 4th 6-hour split average precipitation (18-23 h) (%): 0 to 16.7
19. lat - Latitude (°): 37.456 to 37.645
20. lon - Longitude (°): 126.826 to 127.135
21. DEM - Elevation (m): 12.4 to 212.3
22. Slope - Slope (°): 0.1 to 5.2
23. Solar radiation - Daily incoming solar radiation (wh/m2): 4329.5 to 5992.9
24. Next_Tmax - The next-day maximum air temperature (°C): 17.4 to 38.9
25. Next_Tmin - The next-day minimum air temperature (°C): 11.3 to 29.8
import pandas as pd
import numpy as np

# Load the dataset, drop rows with missing values, and drop the non-numeric date column
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00514/Bias_correction_ucl.csv")
df = df.dropna()
df = df.drop(columns=['Date'])
df
Scikit-learn is a popular Python library for machine learning, and it provides an easy-to-use implementation of Ridge Regression. Here’s an example of how to use it.
# Import Libraries
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Data Processing: use Next_Tmin as the target, everything else as predictors
y = df.pop('Next_Tmin')
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.3)

# Fit the Ridge model
ridge = Ridge(alpha=1.0)
ridge.fit(X_train, y_train)

# Evaluate the model
y_pred = ridge.predict(X_test)
ridge.score(X_test, y_test)
The next step is to select the regularization parameter (alpha) with cross-validation. For this, scikit-learn provides sklearn.linear_model.RidgeCV.
Choosing the right value for the regularization parameter alpha is crucial for the performance of the Ridge Regression model. One common method to determine the best alpha value is using cross-validation. Scikit-learn provides a convenient function called RidgeCV that performs cross-validation to find the optimal alpha value.
# Import Libraries
from sklearn.linear_model import RidgeCV

# Define the range of alpha values to consider
alphas = np.logspace(-5, 5, 100)

# Fit the RidgeCV model with 3-fold cross-validation
ridge_cv = RidgeCV(alphas=alphas, cv=3)
ridge_cv.fit(X_train, y_train)

# Get the best alpha value
best_alpha = ridge_cv.alpha_
print("Best alpha:", best_alpha)

# Evaluate the model with the best alpha value
y_pred = ridge_cv.predict(X_test)
ridge_cv.score(X_test, y_test)
Running the code above gives us an alpha value of about 11.49. Now we can fit a final Ridge regression model with this alpha to reduce the impact of multicollinearity.
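A minimal sketch of that final step might look as follows; it assumes the X_train/X_test split and the best_alpha value found above, and the exact score will depend on the random train/test split.

# Final Ridge model using the cross-validated alpha (sketch, assumes best_alpha from above)
final_ridge = Ridge(alpha=best_alpha)
final_ridge.fit(X_train, y_train)

# R^2 on the held-out test set
print("Test R^2:", final_ridge.score(X_test, y_test))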
Summary
- We have learned what multicollinearity is.
- We have learned what causes multicollinearity.
- We have learned about the effects of multicollinearity.
- We have learned about the remedies for multicollinearity.
- We have learned about Ridge regression.
- We have learned how to reduce multicollinearity using Ridge regression.
Steps to remove multicollinearity
- Data preparation
- Identifying multicollinearity (for example, by inspecting pairwise correlations; see the sketch after this list)
- Applying Ridge Regression
- Evaluating and comparing models
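As a quick way to carry out the "identifying multicollinearity" step, here is a minimal sketch. It assumes df is the predictor DataFrame prepared earlier, and the 0.9 cutoff for "highly correlated" is an arbitrary illustrative choice.

# Sketch: flag highly correlated predictor pairs (assumes df from the data-preparation step)
corr = df.corr().abs()
threshold = 0.9  # arbitrary cutoff for "highly correlated"
high_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(high_pairs)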
Conclusion
In conclusion, multicollinearity is a common problem in statistical analysis that arises when two or more predictor variables in a regression model are highly correlated, and it can lead to inaccurate results. Its effects can be reduced with techniques such as principal component analysis, Ridge regression, and variable selection, which help stabilize the model and improve the accuracy of the regression results.
Must Read
- Unlock the Power of Feature Selection.
- Top 8 Cross Validation methods!
- Non Linear Transformations.
- Feature Scaling: Data Normalization vs Data Standardization.
- Best Methods to Convert Categorical Data for Machine Learning.