
Unlock the Power of Feature Selection
In this blog, we are going to learn
* What is Feature Selection?
* What are the different types of feature selection methods?
* How to select features by removing features with Low Variance?
* How to perform Univariate feature selection?
* How to perform Recursive feature elimination?
* How to select features using SelectFromModel?
* How to perform Sequential Feature Selection?
Feature selection
Feature selection is an important step in the machine learning process. It involves selecting the most relevant features from a dataset that can be used for training a machine learning model. Feature selection helps to reduce the complexity of the model, improve its accuracy and reduce the time taken to train the model. Feature selection also helps to reduce the risk of overfitting, as it reduces the number of features that the model has to learn from.
In this blog, we will discuss the various feature selection methods in machine learning. We will look at the different types of feature selection methods, their advantages and disadvantages, and when to use them. We will also discuss the importance of feature selection in machine learning and how it can improve the performance of a machine learning model.
What is Feature Selection?
Feature selection is the process of choosing the most relevant features from a dataset for training a machine learning model. It is also known as variable selection, attribute selection, or variable subset selection.
Selecting a good subset of features reduces the complexity of the model, shortens training time, and lowers the risk of overfitting, since the model has fewer features to learn from. It often improves accuracy as well, because irrelevant or redundant features no longer add noise.
Types of Feature Selection Methods
There are several different types of feature selection methods that can be used in machine learning. These include filter methods, wrapper methods, embedded methods and hybrid methods.
Filter Methods
Filter methods are one of the most commonly used feature selection methods in machine learning. Filter methods are based on the properties of the data and do not require any training. They are usually used as a pre-processing step before training a machine learning model.
Filter methods use statistical measures to evaluate the relevance of each feature. The most commonly used measures are correlation, information gain, chi-squared and mutual information. The features with the highest scores are selected for training the model.
Filter methods are fast and easy to implement, but because they score each feature in isolation they ignore interactions between features and tend to be less accurate than model-based approaches.
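As a minimal sketch of the filter idea (using a tiny synthetic DataFrame, not the dataset used later in this post), we can rank features by the absolute Pearson correlation of each column with the target and keep the top ones, without training any model:

import numpy as np
import pandas as pd

# Tiny synthetic dataset: three candidate features and a binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "f1": rng.normal(size=100),
    "f2": rng.normal(size=100),
    "f3": rng.normal(size=100),
})
y = (X["f1"] + 0.1 * rng.normal(size=100) > 0).astype(int)  # target driven mainly by f1

# Filter step: score every feature on its own by |Pearson correlation| with y.
scores = X.corrwith(y).abs().sort_values(ascending=False)
print(scores)

# Keep the two highest-scoring features; no model is trained at any point.
print("Selected:", list(scores.index[:2]))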
Wrapper Methods
Wrapper methods use a machine learning model itself to judge the usefulness of features. Candidate subsets of features are generated, the model is trained on each subset, and the subset is scored by the resulting model performance. The best-performing subset of features is then used to train the final model.
Wrapper methods are generally more accurate than filter methods because they account for interactions between features. However, they are time-consuming and computationally expensive.
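To make the idea concrete, here is a minimal wrapper-style sketch (an illustrative toy example on the iris dataset, not a method used later in this post): every candidate 2-feature subset is scored by cross-validating a model trained on just that subset, and the best-scoring subset wins.

from itertools import combinations

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Wrapper idea: exhaustively train and cross-validate the model on every
# 2-feature subset and keep the subset with the best score.
X, y = load_iris(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=1000)

best_score, best_subset = -1.0, None
for subset in combinations(X.columns, 2):
    score = cross_val_score(model, X[list(subset)], y, cv=5).mean()
    if score > best_score:
        best_score, best_subset = score, subset

print("Best subset:", best_subset, "CV accuracy:", round(best_score, 3))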
Embedded Methods
Embedded methods perform feature selection as part of model training. The model is trained on the dataset with all the features, and the features are then evaluated using the weights or importances the model assigns to them (for example, L1-regularized coefficients or tree-based importances). The features with the highest weights are selected for training the model.
Embedded methods are more accurate than filter methods, as they take into account the interactions between the features. They are also faster and less computationally expensive than wrapper methods.
Hybrid Methods
Hybrid methods are a type of feature selection method that combines two or more feature selection methods. The most common hybrid methods are the combination of filter and wrapper methods, or the combination of filter and embedded methods.
Hybrid methods aim to combine the strengths of both worlds: a fast filter stage narrows down the candidate features, and a wrapper or embedded stage then accounts for interactions between them. This usually makes them more accurate than filter methods alone while remaining cheaper than a pure wrapper approach.
Importance of Feature Selection
Feature selection is an important step in the machine learning process. It helps to reduce the complexity of the model, improve its accuracy and reduce the time taken to train the model. Feature selection also helps to reduce the risk of overfitting, as it reduces the number of features that the model has to learn from.
Feature selection is also important for improving the interpretability of a machine-learning model. By selecting the most relevant features, it is easier to understand the model and explain its predictions.
Feature Selection Methods:
- Removing features with low variance
- Univariate feature selection
- Recursive feature elimination
- Feature selection using SelectFromModel
- Sequential Feature Selection
Importing Libraries
import numpy as np
import pandas as pd
Dataset
Cervical Cancer Behavior Risk Data Set: The dataset contains 19 behavioral attributes plus the class label ca_cervix, where 1 and 0 indicate respondents with and without cervical cancer, respectively.
Attribute Information:
The dataset consists of 19 attributes derived from 8 variables; the variable name is the first word in each attribute name.
1) behavior_sexualRisk
2) behavior_eating
3) behavior_personalHygine
4) intention_aggregation
5) intention_commitment
6) attitude_consistency
7) attitude_spontaneity
8) norm_significantPerson
9) norm_fulfillment
10) perception_vulnerability
11) perception_severity
12) motivation_strength
13) motivation_willingness
14) socialSupport_emotionality
15) socialSupport_appreciation
16) socialSupport_instrumental
17) empowerment_knowledge
18) empowerment_abilities
19) empowerment_desires
20) ca_cervix (this is the class attribute, 1 = has cervical cancer, 0 = no cervical cancer)
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv")
df
y=df.pop('ca_cervix')
Feature Selection Methods
Removing features with Low Variance
Feature selection with low variance is a method of feature selection that involves removing features with low variance from a dataset. Variance is a measure of how much a feature’s values vary from the mean. Features with low variance have a small range of values and are not very useful for predicting the outcome of a model. Thus, it is important to remove such features from the dataset before building a model.
Advantages
Feature selection with low variance is important for several reasons. Firstly, it helps reduce the complexity of the model by removing redundant features. Redundant features are those that do not contribute to the prediction of the outcome of the model. By removing such features, the model can be more accurate and efficient.
Secondly, it helps reduce the amount of data required for training and testing. By removing features with low variance, the dataset becomes smaller and can be more easily handled. This reduces the time and resources required for training and testing the model.
Thirdly, it helps improve the accuracy of the model. By removing features with low variance, the model is less likely to overfit and can make more accurate predictions.
Finally, it helps improve the interpretability of the model. By removing features with low variance, the model becomes easier to interpret and understand. This can help in understanding the underlying relationships between the features and the output.
How to Implement Feature Selection with Low Variance?
There are several ways to implement feature selection with low variance. The most common approach is to use the variance threshold method. This method involves setting a threshold for the variance of the features. All features with a variance below the threshold are removed from the dataset.
Another approach is to use a supervised learning algorithm. In this approach, a supervised learning algorithm is used to select the most relevant features from the dataset. The algorithm uses the labels of the dataset to determine which features are most important for predicting the output.
Now let us take an example and understand this concept in detail. By default, VarianceThreshold removes only the features that have the same value in every sample (zero variance). Let us assume we have a dataset whose features take only two values, 1 and 2, and we want to remove every feature in which one of those values appears in more than 70% of the samples. We can do that by setting the variance threshold to 0.7 × (1 − 0.7) = 0.21, because for a feature that takes only two values the variance is given by
Var[X] = p(1 − p)
where p is the probability (relative frequency) of one of the values.
X = [[2, 2, 1], [2, 1, 2], [1, 2, 2], [2, 1, 1], [2, 1, 2], [2, 1, 1]]
In the sample provided above
- In the first column, the value 1 appears in 1/6 ≈ 16.7% of the samples and the value 2 in 5/6 ≈ 83.3%, so the variance is (5/6)(1/6) ≈ 0.139.
- In the second column, the value 1 appears in 4/6 ≈ 66.7% of the samples and the value 2 in 2/6 ≈ 33.3%, so the variance is (4/6)(2/6) ≈ 0.222.
- In the third column, the values 1 and 2 each appear in 3/6 = 50% of the samples, so the variance is (0.5)(0.5) = 0.25.
With a threshold of 0.7 × (1 − 0.7) = 0.21, any column in which one value appears in more than 70% of the samples (i.e. whose variance falls below 0.21) will be removed.
from sklearn.feature_selection import VarianceThreshold

sel = VarianceThreshold(threshold=(.7 * (1 - .7)))
sel.fit_transform(X)
Running the code above, we can see that the first column has been removed, since the value 2 appears in 83.3% of its samples and its variance (≈ 0.139) falls below the 0.21 threshold.
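We can double-check the per-column variances that VarianceThreshold computed by inspecting the variances_ attribute of the fitted selector from the snippet above:

# Per-column variances computed by the fitted selector:
# column 1: (5/6)*(1/6) ≈ 0.139, column 2: (4/6)*(2/6) ≈ 0.222, column 3: 0.25
print(sel.variances_)
# Only the columns whose variance is above 0.7 * (1 - 0.7) = 0.21 are kept.
print(sel.get_support())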
Univariate feature selection
Univariate feature selection is a technique used in machine learning to select the best features from a dataset. It scores each feature individually with a univariate statistical test and keeps the features with the strongest relationship to the target variable. This technique is useful for reducing the dimensionality of a dataset and improving the accuracy of a model.
Feature Selection Methods
- SelectKBest: Select the K highest scoring attributes.
- SelectPercentile: Selects the highest-scoring percentage of attributes.
- GenericUnivariateSelect: Performs univariate selection with a configurable strategy, allowing the selection mode and its parameter to be tuned with a hyper-parameter search.
Univariate feature selection works by evaluating each feature individually and measuring how much information it carries about the target variable. Every feature receives a score from the chosen statistical test, and the highest-scoring features are retained.
Statistical tests
- For regression
- r_regression: Calculates Pearson’s r correlation between each attribute and the target variable.
- f_regression: Calculates the F-statistic for each attribute and the corresponding p-value.
- mutual_info_regression: Calculates mutual information between variables.
- For classification:
- chi2: Performs the chi-squared test.
- f_classif: Computes the ANOVA F-value.
- mutual_info_classif: Calculates mutual information between variables.
The advantage of univariate feature selection is that it is simple and easy to implement. It can also be used as a pre-processing step before applying more complex feature selection techniques.
The main disadvantage of univariate feature selection is that it does not consider the interactions between features. This means that it may select features that are not the best predictors of the target variable. For this reason, it is important to use other feature selection techniques in combination with univariate feature selection.
Now we are going to understand each method in detail.
SelectKBest
Now we are going to select the K best features from the Cervical Cancer Behavior Risk Data Set. The dataset has a total of 19 features. We are going to use the chi2 test to measure the dependence between each feature and the target variable, and then select the top 10 features with the help of SelectKBest.
Working
- Initialize SelectKBest: Initialize the SelectKBest method and provide the type of Statistical test you want to use. In the second parameter provide the value of K.
- Fit and Transform: The fit_transform method first performs the test and then selects the top k features from the dataset.
from sklearn.feature_selection import SelectKBest, chi2

selectkbest = SelectKBest(chi2, k=10)
transformed_data = selectkbest.fit_transform(df, y)
transformed_data[:10]
Running the code above we have performed a chi2 test on the Cervical Cancer Behavior Risk Data Set and have selected the top 10 features. To check the names of the features/attributes we are going to use the method “get_feature_names_out”.
selectkbest.get_feature_names_out()
These are the 10 features that show the strongest dependence on the target variable. Now let us look at the complete code.
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Data Processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv")
y = df.pop('ca_cervix')

# Feature Selection
selectkbest = SelectKBest(chi2, k=10)
transformed_data = selectkbest.fit_transform(df, y)
selectkbest.get_feature_names_out()
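If you also want to see the chi-squared score and p-value behind each column, the fitted selector exposes the scores_ and pvalues_ attributes. A quick sketch, assuming the selectkbest object fitted above:

import pandas as pd

# Pair every column with its chi2 score and p-value, highest score first.
chi2_scores = pd.DataFrame({
    "feature": df.columns,
    "chi2_score": selectkbest.scores_,
    "p_value": selectkbest.pvalues_,
}).sort_values("chi2_score", ascending=False)
print(chi2_scores.head(10))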
SelectPercentile
We have understood the concept of SelectKBest, now we are going to see SelectPercentile. We are going to use the f_classif method to calculate the F-value and then select the highest-scoring percentage of features.
- Initialize SelectPercentile: Initialize the SelectPercentile method and provide the type of statistical test you want to use. In the second parameter, provide the percentage of features to keep.
- Fit and Transform: The fit_transform method first performs the test and then selects the highest scoring percentage of the features.
from sklearn.feature_selection import SelectPercentile, f_classif

selectpercentile = SelectPercentile(f_classif, percentile=40)
transformed_data = selectpercentile.fit_transform(df, y)
transformed_data[:10]
Running the code above we have performed the f_classif test on the Cervical Cancer Behavior Risk Data Set and have selected the 8 best scoring features. To check the names of the features/attributes we are going to use the method “get_feature_names_out”.
selectpercentile.get_feature_names_out()
These are the top 8 features that have the best score. Let us look at the complete code.
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectPercentile, f_classif

# Data Processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv")
y = df.pop('ca_cervix')

# Feature Selection
selectpercentile = SelectPercentile(f_classif, percentile=40)
transformed_data = selectpercentile.fit_transform(df, y)
selectpercentile.get_feature_names_out()
GenericUnivariateSelect
Let us understand the working of GenericUnivariateSelect. In this example, we are going to use mutual_info_classif to calculate the mutual information between the variables.
- Initialize GenericUnivariateSelect: Initialize the GenericUnivariateSelect method and provide the type of statistical test you want to use. In the mode parameter, you can select from {‘percentile’, ‘k_best’, ‘fpr’, ‘fdr’, ‘fwe’}. The last parameter, param, sets the number of features to retain (or a threshold, depending on the mode).
- Fit and Transform: The fit_transform method first performs the test and then selects the best k features specified in param.
from sklearn.feature_selection import GenericUnivariateSelect, mutual_info_classif

genericunivariateselect = GenericUnivariateSelect(mutual_info_classif, mode='k_best', param=10)
transformed_data = genericunivariateselect.fit_transform(df, y)
transformed_data[:10]
genericunivariateselect.get_feature_names_out()
These are the top 10 features.
Note: The value of the parameters can change depending on the test selection method and the dataset. Let us look at the complete code.
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.feature_selection import GenericUnivariateSelect, mutual_info_classif

# Data Processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv")
y = df.pop('ca_cervix')

# Feature Selection
genericunivariateselect = GenericUnivariateSelect(mutual_info_classif, mode='k_best', param=10)
transformed_data = genericunivariateselect.fit_transform(df, y)
genericunivariateselect.get_feature_names_out()
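Because the strategy is just a parameter, the same selector can switch modes without changing the rest of the code. As a sketch (assuming the df and y from above), here is the chi2 test with mode='fwe', where param is a p-value threshold rather than a feature count:

from sklearn.feature_selection import GenericUnivariateSelect, chi2

# mode='fwe' keeps the features whose chi2 p-value passes the family-wise
# error rate given in param (here 0.05), instead of keeping a fixed k.
fwe_select = GenericUnivariateSelect(chi2, mode='fwe', param=0.05).fit(df, y)
print(fwe_select.get_feature_names_out())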
Recursive feature elimination
Recursive Feature Elimination (RFE) is a feature selection method used in machine learning. It is a backward elimination technique that selects the most important features from a dataset by recursively removing attributes and rebuilding the model on those that remain. At each round, the model’s weights or feature importances are used to decide which attributes contribute the least to predicting the target attribute.
RFE starts by building a model on all available features and then iteratively removes the weakest features until the desired number of features is reached. The features with the smallest weights (or importances) are dropped at each iteration, the model is refitted on the remaining features, and the process repeats until the desired number of features is left.
RFE is a useful tool for feature selection as it helps to identify the most important features in a dataset. It is also useful for reducing the complexity of a model and improving its accuracy. Furthermore, it can be used to reduce the number of features in a dataset, which can speed up the training process and reduce the risk of overfitting.
Let us understand this with an example. We are going to train a Gradient Boosting Classifier model and then use RFE to select the best features.
Working:
- Initialize Machine Learning Model: Initialize the machine learning model depending on the type of problem you are working with.
- Initialize RFE: Pass the model initialized above as the estimator, then specify the number of features you want to select and the number of features to remove at each step.
- Fit: This method will remove the unimportant features and select the top k features specified in the “n_features_to_select” parameter.
In this example, we are going to select the top 10 features with step=2.
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier()
rfe = RFE(estimator=model, n_features_to_select=10, step=2)
rfe.fit(df, y)
print("Selected Features", rfe.n_features_)
Running the code above, we have selected the top 10 features of the Cervical Cancer Behavior Risk Data Set using the RFE method.
rfe.get_feature_names_out()
Let us look at the complete code.
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.ensemble import GradientBoostingClassifier

# Data Processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv")
y = df.pop('ca_cervix')

# Feature Selection
model = GradientBoostingClassifier()
rfe = RFE(estimator=model, n_features_to_select=10, step=2).fit(df, y)
rfe.get_feature_names_out()
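The fitted RFE object also exposes a ranking_ attribute, which shows the order in which features were eliminated. A quick sketch for inspecting it, assuming the rfe object fitted above:

import pandas as pd

# Rank 1 means the feature was kept; eliminated features get higher ranks
# (the larger the rank, the earlier the feature was dropped).
ranking = pd.DataFrame({"feature": df.columns, "rank": rfe.ranking_}).sort_values("rank")
print(ranking)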
Recursive feature elimination with cross-validation
The drawback of RFE is that we have to specify the number of features to select in advance. This can be resolved using RFE with cross-validation (RFECV). It is similar to RFE but adds a few parameters that help automate the feature selection process.
- Scoring: The scoring parameter specifies the scoring function to perform cross-validation.
- min_features_to_select: This specifies the minimum number of features that would be selected.
In this example, we are taking “accuracy” as the scoring parameter and a minimum of 8 features. Now let us run the code to perform RFE with cross-validation.
from sklearn.feature_selection import RFECV

rfecv = RFECV(
    estimator=model,
    step=2,
    scoring="accuracy",
    min_features_to_select=8,
    n_jobs=-1,
)
rfecv.fit(df, y)
print("Selected Features", rfecv.n_features_)
Running the code above, the RFECV method has selected 14 features out of 19. To find out which attributes those are, we are going to use the “get_feature_names_out” method.
rfecv.get_feature_names_out()
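To see how the cross-validated score changed as features were removed, the fitted RFECV object exposes a cv_results_ dictionary (available in recent scikit-learn versions); a quick sketch using the rfecv object from above:

import numpy as np

# Mean cross-validated accuracy for every candidate number of features
# (the exact keys in cv_results_ can vary between scikit-learn versions).
mean_scores = rfecv.cv_results_["mean_test_score"]
print("Mean CV accuracy per step:", np.round(mean_scores, 3))
print("Best number of features:", rfecv.n_features_)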
Feature selection using SelectFromModel
SelectFromModel works by evaluating the importance of each feature as estimated by a fitted model. It can rely on models with L1 or L2 regularization as well as tree-based models such as decision trees to determine the importance of each feature. The features that are deemed important are then selected for use in the model.
The advantage of using SelectFromModel is that it works with any estimator that exposes coefficients or feature importances after fitting, which covers a wide range of regression and classification models.
SelectFromModel also has the advantage of keeping the features that are important for the model and discarding the ones that are not. This helps to reduce the complexity of the model and improve its accuracy.
When using SelectFromModel, it is important to understand the underlying algorithms that are used to evaluate the importance of each feature. This will help to ensure that the features that are selected are the most important for the model. Additionally, it is important to understand the parameters that can be used to control the selection process.
Types
- Regularization-based feature selection
- Tree-based feature selection
Regularization-based feature selection
Linear models with L1 regularization can perform feature selection because the penalty forces some of the coefficients to be exactly zero. Combined with SelectFromModel, we can keep only the features whose coefficients are non-zero. Now let us understand this with an example.
- Initialize a linear model with the L1 norm.
- Specify SelectFromModel with a linear model to select the non-zero coefficients.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

model = LogisticRegression(penalty='l1', solver='liblinear')
regularization_based = SelectFromModel(estimator=model).fit(df, y)
transformed_data = regularization_based.transform(df)
transformed_data[:10]
regularization_based.get_feature_names_out()
Let us look at the complete code.
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

# Data Processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv")
y = df.pop('ca_cervix')

# Feature Selection
model = LogisticRegression(penalty='l1', solver='liblinear')
regularization_based = SelectFromModel(estimator=model).fit(df, y)
transformed_data = regularization_based.transform(df)
regularization_based.get_feature_names_out()
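To see why certain columns were kept, you can look at the coefficients of the L1-penalised model stored inside the fitted selector. A small sketch, assuming the regularization_based object and df from above:

import pandas as pd

# The fitted LogisticRegression lives in regularization_based.estimator_;
# features whose coefficient is (effectively) zero are the ones discarded.
coefs = pd.Series(regularization_based.estimator_.coef_[0], index=df.columns)
print(coefs[coefs != 0].sort_values(key=abs, ascending=False))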
Tree-based feature selection
- Initialize a tree-based machine learning algorithm to compute impurity-based feature importances.
- Specify SelectFromModel to discard irrelevant features.
In this example, we are going to use Gradient Boosting Classifier. Once we have initialized the model, then we will use SelectFromModel to remove irrelevant features.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

model = GradientBoostingClassifier()
tree_based = SelectFromModel(estimator=model).fit(df, y)
data_transformed = tree_based.transform(df)
tree_based.get_feature_names_out()
Let us have a look at the complete code.
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel

# Data Processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv")
y = df.pop('ca_cervix')

# Feature Selection
model = GradientBoostingClassifier()
tree_based = SelectFromModel(estimator=model).fit(df, y)
data_transformed = tree_based.transform(df)
tree_based.get_feature_names_out()
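Under the hood, SelectFromModel compares each feature's impurity-based importance against a threshold (the mean importance by default). A quick sketch for inspecting both, assuming the tree_based object and df from above:

import pandas as pd

# Impurity-based importances from the fitted GradientBoostingClassifier,
# and the cut-off SelectFromModel used (the mean importance by default).
importances = pd.Series(tree_based.estimator_.feature_importances_, index=df.columns)
print(importances.sort_values(ascending=False))
print("Threshold used:", tree_based.threshold_)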
Sequential Feature Selection
Sequential feature selection is a powerful tool used in machine learning to identify the most important features in a dataset. It is a type of feature selection algorithm that adds (or removes) features one at a time and evaluates their contribution to the model. In the forward variant, the algorithm starts with an empty set of features and adds one feature at a time, based on the evaluation criteria, until the desired number of features has been selected.
Sequential feature selection is a useful tool for reducing the complexity of a dataset and improving the accuracy of a model. It can be used to identify the most important features in a dataset and to reduce the number of features used in a model. This can help to reduce overfitting and improve the accuracy of the model.
Sequential feature selection algorithms can be divided into two categories: forward selection and backward selection. In forward selection, the algorithm starts with an empty set of features and adds one feature at a time, based on the evaluation criteria, until the desired number of features has been selected. In backward selection, the algorithm starts with all features and removes one feature at a time, based on the evaluation criteria, until only the desired number of the most important features remains.
The evaluation criteria used in sequential feature selection algorithms can vary depending on the type of problem being solved. Common evaluation criteria include accuracy, precision, recall, and F1 score.
In this example, we are going to initialize a Logistic Regression model. Then we are going to use Sequential Feature Selector to select the 10 best features by selecting one feature at a time and going forward. Now let us go through the example.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=3000).fit(df, y)
forward_sfs = SequentialFeatureSelector(model, n_features_to_select=10, direction="forward").fit(df, y)
forward_sfs
We have fitted our SequentialFeatureSelector. Now we can select the names of the features using the get_support() method.
df.columns[forward_sfs.get_support()]
These are the top 10 features selected using the Sequential Feature Selection method. The same process can be done in the backward direction, as shown in the sketch below.
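A backward pass looks the same except for the direction parameter. A small sketch, assuming the same df, y and model as above (note that backward selection starts from all 19 features and is usually slower here):

# Backward selection: start from all 19 features and drop one per round
# until only 10 remain.
backward_sfs = SequentialFeatureSelector(
    model, n_features_to_select=10, direction="backward"
).fit(df, y)
df.columns[backward_sfs.get_support()]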
Note: The value of the parameters can change depending on the test selection method and the dataset. Let us look at the complete code.
# Import Libraries
import numpy as np
import pandas as pd
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Data Processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00537/sobar-72.csv")
y = df.pop('ca_cervix')

# Feature Selection
model = LogisticRegression(max_iter=3000).fit(df, y)
forward_sfs = SequentialFeatureSelector(model, n_features_to_select=10, direction="forward").fit(df, y)
df.columns[forward_sfs.get_support()]
Questions
What is Feature Importance in Machine Learning?
Feature importance is a process of identifying and ranking the most important features in a dataset. It is a measure of how much each feature contributes to the model’s performance. Feature importance helps us to identify the most important features that can be used to train the model and make predictions. This is especially useful when dealing with large datasets that contain hundreds or thousands of features.
Feature importance can be measured using various methods such as correlation analysis, information gain, and Gini importance. Correlation analysis measures the linear relationship between a feature and the target. Information gain measures the reduction in entropy obtained by splitting on a feature. Gini importance measures the total reduction in Gini impurity that a feature contributes across the splits of a tree-based model.
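As a small illustration of Gini (impurity-based) importance, tree-based ensembles expose these values directly after fitting. A sketch, assuming the df and y loaded earlier in this post:

from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Gini importance: how much each feature reduces impurity, averaged over the trees.
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(df, y)
gini_importance = pd.Series(forest.feature_importances_, index=df.columns)
print(gini_importance.sort_values(ascending=False).head(10))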
Why is Feature Importance in Machine Learning Important?
Feature importance is an important concept in machine learning. It helps us to identify the most important features that can be used to train the model and make predictions. By understanding feature importance, we can reduce the complexity of the problem and focus on the most important features that have the greatest impact on the model’s performance.
Feature importance also helps us to identify redundant features that can be removed from the dataset. This can improve the performance of the model by reducing the number of features and reducing the complexity of the problem. Feature importance can also help us to identify features that can be used to create new features that can improve the performance of the model.
What are some challenges with Feature Importance?
Although feature importance is a powerful tool for improving the performance of machine learning models, there are some challenges associated with it. One of the biggest challenges is that feature importance is not always reliable. The results of feature importance can be affected by the choice of model, the data preprocessing steps, and the hyperparameter tuning.
Another challenge is that feature importance can be misleading. It is possible to have a feature that is highly important in one model but not important in another. This can lead to overfitting or underfitting of the model.
APIs
- sklearn.feature_selection.VarianceThreshold
- sklearn.feature_selection.SelectKBest
- sklearn.feature_selection.SelectPercentile
- sklearn.feature_selection.GenericUnivariateSelect
- sklearn.feature_selection.RFE
- sklearn.feature_selection.RFECV
- sklearn.feature_selection.SelectFromModel
- sklearn.feature_selection.SequentialFeatureSelector
Must Read
- 6 Best Methods to Convert Categorical Data for Machine Learning
- Feature Scaling : Data Normalization vs Data Standardization
- Non Linear Transformations
- Top 8 Cross Validation methods
Summary
Feature selection with low variance is an important step in the machine learning process. It helps reduce the complexity of the model, reduce the amount of data required for training and testing, improve the model’s accuracy, and improve the interpretability of the model. There are several ways to implement feature selection with low variance, such as using the variance threshold method or using a supervised learning algorithm.
Univariate feature selection is a useful technique for reducing the dimensionality of a dataset and improving the accuracy of a model. However, it should not be used as the only feature selection technique. It should be used in combination with other techniques to ensure the best results.
Recursive Feature Elimination is a powerful tool for feature selection in machine learning. It is used to identify the most important features in a dataset and reduce the complexity of a model. It can also be used to reduce the number of features in a dataset, which can speed up the training process and reduce the risk of overfitting.
SelectFromModel is a powerful feature selection technique that can be used with a wide range of machine learning models. It can be used to select the most important features for the model, reduce the complexity of the model, and improve its accuracy. Understanding the underlying algorithms and parameters that control the selection process is important for successful feature selection.
Sequential feature selection is an effective tool for reducing the complexity of a dataset and improving the accuracy of a model. It can identify the most important features in a dataset and reduce the number of features used in a model. This can help to reduce overfitting and improve the accuracy of the model.