
Harness the Power of PCA in Machine Learning
In this blog, you are going to learn:
- What is Principal Component Analysis (PCA)?
- How does PCA work?
- What are the applications of PCA?
- How to apply PCA for multicollinearity and dimensionality reduction?
Principal Component Analysis (PCA)
Introduction
Principal Component Analysis (PCA) is a powerful and widely used technique in machine learning. It is a linear dimensionality reduction technique that reduces the number of variables (features) in a dataset while preserving as much of the important information as possible.
PCA is used in a wide range of applications, including data compression, feature selection, feature extraction, and dimensionality reduction. It is a powerful tool for data analysis that can be used to identify patterns in data, reduce its complexity, and improve the accuracy of machine learning models.
In this blog, we will discuss the basics of PCA, how it works, and its applications in machine learning. We will also discuss the advantages and disadvantages of PCA and show how it can be used to remove multicollinearity from a real dataset.
What is PCA?
Principal Component Analysis (PCA) is a linear transformation technique that reduces the number of variables in a dataset while preserving as much of its variance, and therefore its information, as possible.
PCA achieves this by finding a set of orthogonal (uncorrelated) directions that explain the most variance in the data. These directions are called principal components, and they are used to represent the data in a lower-dimensional space.
How does PCA work?
PCA represents the data in a new coordinate system whose axes, the principal components, are orthogonal and ordered by how much variance they explain. The main steps are:
- First, center (and usually standardize) the data and compute its covariance matrix. The covariance matrix is a square matrix containing the pairwise covariances between all variables in the dataset.
- Next, compute the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors are the directions along which the data varies the most, and the eigenvalues measure how much variance is explained along each eigenvector.
- Finally, sort the eigenvectors by their eigenvalues to obtain the principal components. The first principal component explains the most variance, the second explains the second most, and so on. Projecting the data onto the top components gives the lower-dimensional representation, as sketched in the example below.
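To make these steps concrete, here is a minimal NumPy sketch of PCA computed "by hand" on a small synthetic matrix. The variable names (X, cov, eigvals, eigvecs, X_reduced) are illustrative and not part of the tutorial's code.

import numpy as np

# Small synthetic dataset: 6 samples, 3 features
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 1.2]])

# Step 1: center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 2: eigen-decomposition of the covariance matrix (eigh, since it is symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: sort components by explained variance (largest eigenvalue first)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained_ratio = eigvals / eigvals.sum()

# Project the data onto the top 2 principal components
X_reduced = X_centered @ eigvecs[:, :2]
print(explained_ratio)
print(X_reduced)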
Applications of PCA
PCA is a powerful tool for data analysis and can be used in a wide range of applications. Some of the most common are:
- Data Compression: PCA can represent a dataset with far fewer numbers by keeping only the top components, reducing storage and memory requirements with limited loss of information (see the sketch after this list).
- Feature Selection: the component loadings show which original variables contribute most to the variance, which can guide the selection of the most informative features.
- Feature Extraction: the principal components themselves are new features, linear combinations of the original variables, that can be fed to downstream models.
- Dimensionality Reduction: projecting the data onto the top components reduces the number of dimensions (variables) in a dataset while preserving most of the variance, which simplifies visualization and modelling.
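As an illustration of the compression use case, the following sketch (with hypothetical data and variable names, not from the tutorial) keeps only five components of a redundant ten-column matrix and then reconstructs an approximation of the original data with inverse_transform:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 10))
# Make the last 5 columns near-copies of the first 5 to create redundancy
data[:, 5:] = data[:, :5] + 0.1 * rng.normal(size=(100, 5))

pca = PCA(n_components=5)
compressed = pca.fit_transform(data)        # 100 x 5 instead of 100 x 10
reconstructed = pca.inverse_transform(compressed)

# Reconstruction error tells us how much information was lost
print("Explained variance:", pca.explained_variance_ratio_.sum())
print("Mean squared reconstruction error:", np.mean((data - reconstructed) ** 2))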
Dataset
Data Set Information:
This data is for the purpose of bias correction of the next-day maximum and minimum air temperatures forecast of the LDAPS model operated by the Korea Meteorological Administration over Seoul, South Korea. This data consists of summer data from 2013 to 2017. The input data is largely composed of the LDAPS model’s next-day forecast data, in-situ maximum and minimum temperatures of present-day, and geographic auxiliary variables. There are two outputs (i.e. next-day maximum and minimum air temperatures) in this data. Hindcast validation was conducted for the period from 2015 to 2017.
Attribute Information:
1. station - used weather station number: 1 to 25
2. Date - Present day: yyyy-mm-dd ('2013-06-30' to '2017-08-30')
3. Present_Tmax - Maximum air temperature between 0 and 21 h on the present day (°C): 20 to 37.6
4. Present_Tmin - Minimum air temperature between 0 and 21 h on the present day (°C): 11.3 to 29.9
5. LDAPS_RHmin - LDAPS model forecast of next-day minimum relative humidity (%): 19.8 to 98.5
6. LDAPS_RHmax - LDAPS model forecast of next-day maximum relative humidity (%): 58.9 to 100
7. LDAPS_Tmax_lapse - LDAPS model forecast of next-day maximum air temperature applied lapse rate (°C): 17.6 to 38.5
8. LDAPS_Tmin_lapse - LDAPS model forecast of next-day minimum air temperature applied lapse rate (°C): 14.3 to 29.6
9. LDAPS_WS - LDAPS model forecast of next-day average wind speed (m/s): 2.9 to 21.9
10. LDAPS_LH - LDAPS model forecast of next-day average latent heat flux (W/m2): -13.6 to 213.4
11. LDAPS_CC1 - LDAPS model forecast of next-day 1st 6-hour split average cloud cover (0-5 h) (%): 0 to 0.97
12. LDAPS_CC2 - LDAPS model forecast of next-day 2nd 6-hour split average cloud cover (6-11 h) (%): 0 to 0.97
13. LDAPS_CC3 - LDAPS model forecast of next-day 3rd 6-hour split average cloud cover (12-17 h) (%): 0 to 0.98
14. LDAPS_CC4 - LDAPS model forecast of next-day 4th 6-hour split average cloud cover (18-23 h) (%): 0 to 0.97
15. LDAPS_PPT1 - LDAPS model forecast of next-day 1st 6-hour split average precipitation (0-5 h) (%): 0 to 23.7
16. LDAPS_PPT2 - LDAPS model forecast of next-day 2nd 6-hour split average precipitation (6-11 h) (%): 0 to 21.6
17. LDAPS_PPT3 - LDAPS model forecast of next-day 3rd 6-hour split average precipitation (12-17 h) (%): 0 to 15.8
18. LDAPS_PPT4 - LDAPS model forecast of next-day 4th 6-hour split average precipitation (18-23 h) (%): 0 to 16.7
19. lat - Latitude (°): 37.456 to 37.645
20. lon - Longitude (°): 126.826 to 127.135
21. DEM - Elevation (m): 12.4 to 212.3
22. Slope - Slope (°): 0.1 to 5.2
23. Solar radiation - Daily incoming solar radiation (wh/m2): 4329.5 to 5992.9
24. Next_Tmax - The next-day maximum air temperature (°C): 17.4 to 38.9
25. Next_Tmin - The next-day minimum air temperature (°C): 11.3 to 29.8
First, we are going to download the dataset using the pandas read_csv() method. Once we have the dataset, we drop any rows with null values using dropna().
Finally, we drop the column "Date" using the drop() method, since PCA operates on numeric variables only.
# Import libraries
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')

# Data processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00514/Bias_correction_ucl.csv")
df = df.dropna()
df = df.drop(columns=['Date'])
df
Now we are going to create a copy of the original dataset using the copy() method and we are going to call it df_pca.
df_pca = df.copy()
df_pca
Now, we are going to separate our target column "Next_Tmin" from the dataset using the pop() method, split the data into training and test sets, and standardize the features with StandardScaler.
y = df_pca.pop('Next_Tmin')
X_train, X_test, y_train, y_test = train_test_split(df_pca, y, test_size=0.3)

# Data normalization
standardscaler = StandardScaler()
X_train_ss = standardscaler.fit_transform(X_train)
X_test_ss = standardscaler.transform(X_test)
# Data correlation
corr = df_pca.corr()
plt.figure(figsize=(18, 10))

# Heatmap
sns.heatmap(round(corr, 3), annot=True, vmin=-1, vmax=1, cmap="YlGnBu", linewidths=.4)
plt.grid(True, color='#f68c1f', alpha=0.2)  # pass True positionally; the 'b=' keyword is deprecated in recent matplotlib
plt.show()
Running the code above shows that the attributes "Next_Tmax", "Present_Tmax", "Present_Tmin", "LDAPS_Tmax_lapse", and "LDAPS_Tmin_lapse" are highly correlated, which indicates the presence of multicollinearity.
If we simply removed these attributes, we would lose information. In order to remove the multicollinearity, we are going to perform PCA on the dataset instead.
# PCA
pca = PCA(n_components=X_train_ss.shape[1])
pca.fit_transform(X_train_ss)

# Calculating variance
percent_variance_explained = pca.explained_variance_ / np.sum(pca.explained_variance_)
combined_variance_explained = np.cumsum(percent_variance_explained)

# Plotting variance
plt.plot(combined_variance_explained)
plt.grid()
plt.xlabel("Number of Components")
plt.ylabel("Cumulative proportion of variance explained")
plt.show()
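Note that scikit-learn also exposes this ratio directly, so the manual division above is optional; the following line (a small aside, not part of the original code) produces the same cumulative curve:

# Equivalent to the manual calculation above
combined_variance_explained = np.cumsum(pca.explained_variance_ratio_)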
combined_variance_explained
Running the code above, we can see that the first principal component captures about 25% of the total variance, the first two capture about 38%, and the first 20 capture about 98%.
pca.explained_variance_
Running the code above, we can see the variance captured by each individual component: about 5.76 for the first principal component, 3.07 for the second, 1.95 for the third, and so on.
Since the first 20 components explain around 98% of the total variance, we are going to keep the first 20 components.
pca = PCA(n_components=20)
pca_train = pca.fit_transform(X_train_ss)
pca_test = pca.transform(X_test_ss)
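As a side note, scikit-learn can also choose the number of components for us: passing a float between 0 and 1 to n_components keeps however many components are needed to reach that fraction of explained variance. A minimal sketch, using the same 0.98 threshold as above:

# Let PCA pick the number of components that explain 98% of the variance
pca_auto = PCA(n_components=0.98)
pca_train_auto = pca_auto.fit_transform(X_train_ss)
print(pca_auto.n_components_)  # number of components actually kept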
Now that we have performed PCA on the dataset, we are going to check whether any correlation remains between the new variables. We add back the target column "Next_Tmin" and then calculate the correlations using the pandas corr() method.
pca_train = pd.DataFrame(pca_train)
pca_train["Next_Tmin"] = y_train.values  # use .values so the original index of y_train does not misalign with the new DataFrame
corr = pca_train.corr()

plt.figure(figsize=(25, 15))
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, linewidths=.6)
plt.grid(True, color='#f68c1f', alpha=0.2)  # pass True positionally; the 'b=' keyword is deprecated in recent matplotlib
plt.show()
Running the code above shows that the principal components are uncorrelated with each other, so the multicollinearity has been removed.
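To see how the PCA-transformed features would be used downstream, here is a minimal sketch (not part of the original post) that fits a linear regression on the 20 components; LinearRegression is only an example, and any regressor could be substituted.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fit a simple model on the PCA features (drop the target column we added above)
X_pc_train = pca_train.drop(columns=["Next_Tmin"]).values
model = LinearRegression()
model.fit(X_pc_train, y_train)

# Evaluate on the PCA-transformed test set
y_pred = model.predict(pca_test)
print("R^2 on the test set:", r2_score(y_test, y_pred))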
Summary
- We have learned about Principal Component Analysis (PCA).
- We have learned how PCA works.
- We have learned about the applications of PCA.
- We have learned how PCA helps with multicollinearity and dimensionality reduction.
Advantages and Disadvantages of PCA
PCA is a powerful tool for data analysis that can be used in a wide range of applications, but it has both advantages and disadvantages that should be considered.
Advantages
- The main advantage of PCA is that it reduces the number of variables in a dataset while preserving most of the important information, which makes the data easier to analyze and can improve the accuracy of machine learning models.
- PCA is also fast and efficient, so it scales well to reducing the dimensionality of large datasets.
Disadvantages
- The main disadvantage of PCA is that it is a linear transformation, so it cannot capture non-linear relationships in the data; a sketch of kernel PCA, a common non-linear extension, follows this list.
- PCA also assumes that the data is roughly normally distributed and sensibly scaled. If the data is heavily skewed or contains strong outliers, the results of PCA can be unreliable.
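For non-linear structure, kernel PCA is one common workaround; scikit-learn provides it as KernelPCA. A minimal, illustrative sketch on synthetic data (the dataset and parameters here are hypothetical and not from the tutorial above):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: no linear projection can separate them
X_circles, y_circles = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# RBF kernel PCA maps the data into a space where the structure becomes (nearly) linear
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X_circles)
print(X_kpca.shape)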
Conclusion
In conclusion, PCA is a powerful and widely used technique in machine learning. It is a linear dimensionality reduction technique that reduces the number of variables in a dataset while preserving most of the important information.
PCA is used in a wide range of applications, including data compression, feature selection, feature extraction, and dimensionality reduction. As we saw above, it is also an effective way to remove multicollinearity, identify patterns in data, reduce its complexity, and improve the accuracy of machine learning models.
Must Read
- Uncovering the Hidden Dangers of Multicollinearity.
- Unlock the Power of Feature Selection.
- Top 8 Cross Validation methods!!!.
- Non Linear Transformations.
- Feature Scaling : Data Normalization vs Data Standardization.
- Best Methods to Convert Categorical Data for Machine Learning.
API’s