
Regression Metrics Demystified
In this blog, we are going to learn about:
- What are Regression Metrics?
- How are Cross-validation and Regression Metrics used together?
- What are Advanced Regression Metrics?
Introduction
In this in-depth guide, we will explore the metrics used to evaluate the performance of regression models in machine learning. We will discuss the intuition behind each metric and its advantages and disadvantages, and provide practical examples using Python and Scikit-learn.
Why Regression Metrics Matter
Regression metrics are essential for assessing the quality of a regression model and comparing different models. They provide a quantitative measure of how well a model predicts continuous target variables and help identify areas of improvement. Understanding the various metrics and their characteristics will enable you to choose the most appropriate metric for your specific problem.
Common Regression Metrics
Mean Squared Error (MSE)
Mean squared error (MSE) is the average of the squared differences between the predicted and true values. It is widely used and straightforward to understand. However, since it squares the errors, MSE is sensitive to outliers and may not be the best choice when dealing with data containing extreme values.
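To make the definition concrete, here is a minimal from-scratch sketch of MSE in plain Python. The function name `mse` and the toy numbers are illustrative assumptions rather than part of the dataset used later; Scikit-learn's `mean_squared_error` computes the same quantity.

```python
def mse(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Toy values chosen only for illustration.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Errors are 1, 0, -2, -1, so the squares are 1, 0, 4, 1 and their mean is 1.5.
print(mse(y_true, y_pred))  # 1.5
```

Note how the single error of size 2 contributes 4 of the 6 total squared-error units, which is the outlier sensitivity described above.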
Root Mean Squared Error (RMSE)
Root mean squared error (RMSE) is the square root of the mean squared error. It has the same units as the target variable, making it easier to interpret. Like MSE, RMSE is also sensitive to outliers.
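The units argument is easiest to see with a small example. The sketch below assumes hypothetical house prices expressed in thousands of dollars; the helper name `rmse` is our own.

```python
import math


def rmse(y_true, y_pred):
    """Square root of the mean squared error, in the units of the target."""
    mean_sq = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mean_sq)


# Hypothetical house prices in thousands of dollars.
prices_true = [200.0, 250.0, 300.0]
prices_pred = [210.0, 240.0, 330.0]

error = rmse(prices_true, prices_pred)
# The MSE here is 1100 / 3 (in squared thousands of dollars); the RMSE is
# back in thousands of dollars, so it can be read against the prices directly.
print(f"RMSE: {error:.2f} thousand dollars")
```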
Mean Absolute Error (MAE)
Mean absolute error (MAE) is the average of the absolute differences between the predicted and true values. MAE is less sensitive to outliers than MSE and RMSE, making it more robust in the presence of extreme values.
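As a quick illustration, MAE can be written in a couple of lines of plain Python. The helper name `mae` and the toy values are assumptions for this sketch; Scikit-learn's `mean_absolute_error` gives the same result.

```python
def mae(y_true, y_pred):
    """Average of the absolute differences between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values chosen only for illustration.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

# Absolute errors are 1, 0, 2, 1, so each error counts in proportion to its size.
print(mae(y_true, y_pred))  # 1.0
```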
R-squared (Coefficient of Determination)
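R-squared measures the proportion of variance in the target variable that is explained by the model. For a linear model with an intercept evaluated on its own training data, it ranges from 0 to 1, with higher values indicating better model performance. On held-out data it can be negative, which means the model predicts worse than always using the mean of the target. R-squared can also be misleading when the model has many features or is overfitting, because adding features never lowers R-squared on the training data.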
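The sketch below computes R-squared from its definition, 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r2` and the toy values are assumptions. It also shows that the score is 0 for a model that always predicts the mean and can drop below 0 for a model that does worse than that.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

good = r2(y_true, [1.1, 1.9, 3.2, 3.8])      # close predictions
baseline = r2(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
bad = r2(y_true, [4.0, 3.0, 2.0, 1.0])       # predictions in reverse order

print(f"good: {good:.2f}, baseline: {baseline:.2f}, bad: {bad:.2f}")
```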
Adjusted R-squared
Adjusted R-squared is a modified version of R-squared that takes the number of features into account. It penalizes models with a large number of features, making it more suitable for feature selection and avoiding overfitting.
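As far as we know, Scikit-learn does not provide a dedicated function for adjusted R-squared, so it is usually computed by hand from the standard formula 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of samples and p the number of features. The helper below is a minimal sketch with hypothetical inputs.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared from plain R-squared, sample count, and feature count."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Same R-squared and sample size, different numbers of features (made-up values).
few_features = adjusted_r2(0.90, n_samples=50, n_features=5)
many_features = adjusted_r2(0.90, n_samples=50, n_features=20)

# The model that needs 20 features to reach R-squared = 0.90 is penalized more.
print(f"5 features: {few_features:.3f}, 20 features: {many_features:.3f}")
```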
Comparing Regression Metrics
Scale Sensitivity
MSE, RMSE, and MAE are sensitive to the scale of the target variable, meaning their values will change if the target variable is rescaled (for example, converted from dollars to thousands of dollars). On the other hand, R-squared and adjusted R-squared are unaffected by such a rescaling.
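A small experiment makes the difference visible. In this sketch (helper names and numbers are our own assumptions), the same targets and predictions are multiplied by 1000, as if changing units. MAE grows by the same factor while R-squared stays the same.

```python
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.5]

# Express the same quantities in a unit that is 1000 times smaller.
y_true_scaled = [1000 * v for v in y_true]
y_pred_scaled = [1000 * v for v in y_pred]

mae_orig, mae_scaled = mae(y_true, y_pred), mae(y_true_scaled, y_pred_scaled)
r2_orig, r2_scaled = r2(y_true, y_pred), r2(y_true_scaled, y_pred_scaled)

print(f"MAE: {mae_orig:.3f} -> {mae_scaled:.3f}")
print(f"R-squared: {r2_orig:.3f} -> {r2_scaled:.3f}")
```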
Sensitivity to Outliers
MSE and RMSE are sensitive to outliers due to squaring the errors, while MAE is more robust. If your data contains outliers, you may prefer to use MAE or consider preprocessing your data to address the outliers before using MSE or RMSE.
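To see the effect, the sketch below scores two sets of predictions that differ in a single value. All numbers are made up for illustration, and the helper names are our own.

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [10.0, 12.0, 11.0, 13.0, 12.0]
clean_pred = [11.0, 11.0, 12.0, 12.0, 13.0]    # every error has size 1
outlier_pred = [11.0, 11.0, 12.0, 12.0, 32.0]  # the last error has size 20

# With one extreme error, MAE rises from 1.0 to 4.8,
# while MSE rises from 1.0 to 80.8.
print(f"clean:   MSE={mse(y_true, clean_pred):.1f}, MAE={mae(y_true, clean_pred):.1f}")
print(f"outlier: MSE={mse(y_true, outlier_pred):.1f}, MAE={mae(y_true, outlier_pred):.1f}")
```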
Interpretability
RMSE and MAE are more interpretable than MSE because they have the same units as the target variable. R-squared and adjusted R-squared provide a relative measure of model performance, which can be useful for comparing models but may be less interpretable in isolation.
Practical Examples with Scikit-learn
Calculating Regression Metrics
In this example, we will demonstrate how to calculate various regression metrics using Scikit-learn.
Dataset
Data Set Information:
We use the Computer Hardware (CPU performance) dataset from the UCI Machine Learning Repository. The estimated relative performance (ERP) values were computed by the authors of the original article using a linear regression method. See their article (pp 308-313) for more details on how the relative performance values were set.
Attribute Information:
1. vendor name: 30 (adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cambex, cdc, dec, dg, formation, four-phase, gould, honeywell, hp, ibm, ipl, magnuson, microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, sratus, wang)
2. Model Name: many unique symbols
3. MYCT: machine cycle time in nanoseconds (integer)
4. MMIN: minimum main memory in kilobytes (integer)
5. MMAX: maximum main memory in kilobytes (integer)
6. CACH: cache memory in kilobytes (integer)
7. CHMIN: minimum channels in units (integer)
8. CHMAX: maximum channels in units (integer)
9. PRP: published relative performance (integer)
10. ERP: estimated relative performance from the original article (integer)
import pandas as pd

columns = ['vendor_name', 'Model', 'MYCT', 'MMIN', 'MMAX', 'CACH', 'CHMIN', 'CHMAX', 'PRP', 'ERP']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/cpu-performance/machine.data", names=columns)
df
from sklearn.preprocessing import OrdinalEncoder

# Encode the two categorical columns as numbers
ordinalencoder = OrdinalEncoder()
transformed_data = ordinalencoder.fit_transform(df[['vendor_name', 'Model']])
df[['vendor_name', 'Model']] = transformed_data
df
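from sklearn.model_selection import train_test_split

# Separate the target (ERP) from the features
y = df.pop('ERP')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.1, random_state=42)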
# Data normalization: fit the scaler on the training set only
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()
standard_scaler.fit(X_train)
X_train_sscaler = standard_scaler.transform(X_train)
X_test_sscaler = standard_scaler.transform(X_test)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Train a linear regression model
model = LinearRegression()
model.fit(X_train_sscaler, y_train)

# Make predictions
y_pred = model.predict(X_test_sscaler)

# Calculate regression metrics
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"MSE: {mse:.2f}, \nRMSE: {rmse:.2f}, \nMAE: {mae:.2f}, \nR-squared: {r2:.2f}")
In this example, we load the dataset, encode the categorical columns, standardize the features, fit a linear regression model, and calculate the mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and R-squared using Scikit-learn.
Cross-validation and Regression Metrics
Cross-validation is an essential technique for assessing the performance of a regression model on unseen data. In this example, we will demonstrate how to perform cross-validation and calculate regression metrics using Scikit-learn.
from sklearn.model_selection import cross_val_score

# Perform 5-fold cross-validation
scores = cross_val_score(model, df, y, cv=5, scoring='neg_mean_squared_error')

# Convert negative MSE to positive and calculate RMSE
rmse_scores = np.sqrt(-scores)

print(f"RMSE scores: {rmse_scores}")
print(f"Average RMSE: {np.mean(rmse_scores):.2f}")
In this example, we use the same dataset and linear regression model as before. We perform 5-fold cross-validation and calculate the root mean squared error (RMSE) for each fold. The average RMSE across the folds provides an estimate of the model’s performance on unseen data.
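When preprocessing such as scaling is part of the workflow, one common approach is to wrap the scaler and the model in a pipeline so that the scaler is refit on the training part of every fold. The sketch below is one way to do this, and it also collects several metrics in a single run with `cross_validate`. It uses synthetic data from `make_regression` as a stand-in (an assumption, so the snippet runs without downloading anything).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in your own X and y).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

# The scaler is fitted inside each training fold, never on the validation fold.
pipeline = make_pipeline(StandardScaler(), LinearRegression())

results = cross_validate(
    pipeline,
    X,
    y,
    cv=5,
    scoring=("neg_mean_squared_error", "neg_mean_absolute_error", "r2"),
)

# Scikit-learn reports errors as negative scores, so flip the sign.
rmse_per_fold = np.sqrt(-results["test_neg_mean_squared_error"])
mae_per_fold = -results["test_neg_mean_absolute_error"]
r2_per_fold = results["test_r2"]

print(f"Average RMSE: {rmse_per_fold.mean():.2f}")
print(f"Average MAE: {mae_per_fold.mean():.2f}")
print(f"Average R-squared: {r2_per_fold.mean():.2f}")
```

Reporting the spread across folds (for example with `rmse_per_fold.std()`) alongside the average gives a sense of how stable the estimate is.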
Advanced Regression Metrics
Mean Squared Logarithmic Error (MSLE)
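Mean squared logarithmic error (MSLE) is similar to MSE but is computed on the logarithm of the predicted and true values. Scikit-learn applies log(1 + x) to both, so targets and predictions must be non-negative. Because the comparison happens on a log scale, MSLE is less sensitive to large errors and is more suitable for datasets with a wide range of target values or when the relative error is more important than the absolute error.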
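Here is a minimal from-scratch sketch of MSLE using the log(1 + x) form that Scikit-learn's `mean_squared_log_error` uses. The helper name `msle` and the toy values are assumptions chosen so that the arithmetic is easy to follow.

```python
import math


def msle(y_true, y_pred):
    """Mean of squared differences between log(1 + true) and log(1 + predicted)."""
    squared_log_errors = [
        (math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return sum(squared_log_errors) / len(squared_log_errors)


# Toy values: in both pairs, (1 + true) is exactly twice (1 + predicted).
y_true = [9.0, 99.0]
y_pred = [4.0, 49.0]

# The absolute errors are 5 and 50, yet both pairs contribute the same
# squared log error, log(2) ** 2, because the ratio is the same.
value = msle(y_true, y_pred)
print(f"MSLE: {value:.4f}")
```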
Root Mean Squared Logarithmic Error (RMSLE)
Root mean squared logarithmic error (RMSLE) is the square root of MSLE. Like MSLE, RMSLE is less sensitive to large errors and is more appropriate for datasets with a wide range of target values or when the relative error is more important than the absolute error.
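The difference between absolute and relative error shows up clearly when the same percentage miss happens at two very different scales. In the sketch below (made-up values, helper names our own), both predictions are 20% too high. RMSE differs by a factor of 1000 between the two cases, while RMSLE is nearly the same.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmsle(y_true, y_pred):
    squared = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Both predictions overshoot the true value by 20%.
small_true, small_pred = [10.0], [12.0]
large_true, large_pred = [10000.0], [12000.0]

rmse_small, rmse_large = rmse(small_true, small_pred), rmse(large_true, large_pred)
rmsle_small, rmsle_large = rmsle(small_true, small_pred), rmsle(large_true, large_pred)

print(f"RMSE:  {rmse_small:.2f} vs {rmse_large:.2f}")
print(f"RMSLE: {rmsle_small:.3f} vs {rmsle_large:.3f}")
```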
Mean Absolute Percentage Error (MAPE)
Mean absolute percentage error (MAPE) is the average of the absolute percentage differences between the predicted and true values. MAPE is scale-independent and is often used to measure the relative performance of forecasting models. However, it can be undefined or biased if the true values contain zeros.
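The zero problem is easy to see once MAPE is written out. The sketch below is a from-scratch version that returns a fraction (0.10 means 10%), which is also the convention of Scikit-learn's `mean_absolute_percentage_error`. Raising an error on zero targets is our own design choice for this example, not what Scikit-learn does.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.10 means 10%)."""
    if any(t == 0 for t in y_true):
        # Dividing by a true value of zero is undefined, so fail loudly here.
        raise ValueError("MAPE is undefined when a true value is zero")
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values chosen only for illustration.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]

# Relative errors are 10%, 10%, and 0%, so the mean is about 6.67%.
value = mape(y_true, y_pred)
print(f"MAPE: {value:.2%}")
```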
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error, mean_absolute_percentage_error

# Train a random forest regression model
model = RandomForestRegressor()
model.fit(X_train_sscaler, y_train)

# Make predictions
y_pred = model.predict(X_test_sscaler)

# Calculate regression metrics
msle = mean_squared_log_error(y_test, y_pred)
rmsle = np.sqrt(msle)
mape = mean_absolute_percentage_error(y_test, y_pred)

print(f"MSLE: {msle:.2f}, \nRMSLE: {rmsle:.2f}, \nMAPE: {mape:.2f}")
Conclusion
Understanding regression metrics is crucial for evaluating and comparing machine learning models. This comprehensive guide, along with the provided Scikit-learn examples, serves as a foundation for understanding and applying regression metrics to your machine-learning projects. By exploring different metrics and their characteristics, you can choose the most appropriate metric for your specific problem and build better models that generalize well to unseen data.
Summary
- We have learned about regression metrics.
- We have learned why regression metrics are important.
- We have learned about common regression metrics.
- We have learned about advanced regression metrics.