
Feature Importance using XGBoost
In this blog, you are going to learn
1. What is Feature Importance?
2. How to import the dataset?
3. How to process the dataset for the machine learning model?
4. How to convert categorical data into numerical data?
5. How to split the data into testing and training datasets?
6. How to implement an XGBoost machine learning model?
7. How to predict output using a trained XGBoost model?
8. How to find feature importance using the XGBoost model?
9. How to build an XGBoost model using selected features?
XGBoost Model
1. Import Libraries
import pathlib
import numpy as np
import pandas as pd
from xgboost import XGBClassifier
from matplotlib import pyplot
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
2. Import Dataset
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00459/avila.zip
!unzip avila.zip
To load the data into a Pandas DataFrame, we use the built-in method
read_csv( ) : To read a CSV file into a pandas DataFrame.
columns_names = ['intercolumnar_distance', 'upper_margin', 'lower_margin', 'exploitation', 'row_number', 'modular',
                 'interlinear_spacing', 'weight', 'peak_number', 'modular_ratio', 'class']
dataset = pd.read_csv('avila/avila-ts.txt', names=columns_names, na_values="?",
                      comment='\t', sep=",", skipinitialspace=True)
dataset
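It can also help to check how the classes are distributed before modeling; a quick look (a minimal check, assuming the file loaded as expected):
# Class distribution of the Avila target column
dataset['class'].value_counts()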
3. Data Processing
Once we have the dataset, we need to separate the training data, i.e. X, from the target variable, i.e. y.
y = dataset[['class']]
dataset = dataset.drop(columns=['class'])
X = dataset
The drop( ) function removes the given column from the DataFrame.
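As a quick illustration (a toy frame made up for this example, not the Avila data), drop( ) returns a new DataFrame without the named column:
# Toy DataFrame, purely for illustration
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'class': ['x', 'y']})
print(toy.drop(columns=['class']).columns.tolist())  # ['a', 'b']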
4. Convert Categorical Columns to Numerical
To convert categorical data into numerical data, we use an Ordinal Encoder. The Ordinal Encoder assigns a unique integer to each distinct categorical value present in a column.
For example, if a column has two values ['a', 'b'], passing it to the Ordinal Encoder produces a column with the values [0.0, 1.0], where 0.0 represents the value 'a' and 1.0 represents the value 'b'.
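A minimal sketch of this behavior on a made-up two-value column (not the Avila labels):
# Encode a toy column: 'a' -> 0.0, 'b' -> 1.0
enc = OrdinalEncoder()
toy_col = [['a'], ['b'], ['a']]
print(enc.fit_transform(toy_col).flatten())  # [0. 1. 0.]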
ordinalencoder = OrdinalEncoder()
ordinalencoder.fit(y)
y = ordinalencoder.transform(y)
y = y.flatten()
y
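If you later need the original class labels back (for example, to report predictions in the Avila naming scheme), the fitted encoder can reverse the mapping; a minimal sketch:
# Map the encoded labels back to the original class names
original_labels = ordinalencoder.inverse_transform(y.reshape(-1, 1)).flatten()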
5. Split Data
We use Scikit-Learn's train_test_split( ) method to split the data into training and testing sets. The "test_size" parameter determines the fraction of data held out for testing.
X_train, X_val, y_train, y_val = train_test_split(dataset, y, test_size=0.05, random_state=42)
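With test_size=0.05, roughly 5% of the rows end up in the validation set; a quick sanity check (the exact shapes depend on the dataset size):
print(X_train.shape, X_val.shape)
print(y_train.shape, y_val.shape)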
6. Model Implementation
To implement an XGBoost model for classification, we use the XGBClassifier( ) class.
model = XGBClassifier()
model.fit(X_train, y_train)
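Here we rely on XGBoost's defaults; XGBClassifier also accepts the usual hyperparameters if you want to tune it. A minimal sketch (the values below are placeholders, not tuned for the Avila data):
# Example of passing common hyperparameters -- placeholder values, not tuned
model_tuned = XGBClassifier(n_estimators=200, max_depth=6, learning_rate=0.1)
model_tuned.fit(X_train, y_train)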
7. Model Prediction
To predict output using the trained XGBoost model, we use the predict( ) method and evaluate the result with classification_report( ).
prediction = model.predict(X_val)
print(classification_report(y_val, prediction))
8. Feature Importance
Feature Importance is defined as the impact of a particular feature on predicting the output. We can find feature importance in an XGBoost model using the feature_importances_ attribute.
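Before plotting, a minimal sketch that simply pairs each column with its importance score (assuming the model trained above):
# Print each feature next to its importance score
for name, score in zip(X.columns, model.feature_importances_):
    print(f"{name}: {score:.4f}")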
indices = np.argsort(model.feature_importances_)[::-1]
features = []
for i in range(10):
    features.append(X.columns[indices[i]])

fig, ax = plt.subplots(figsize=(15, 5))
sns.barplot(x=features, y=model.feature_importances_[indices[:10]],
            label="Important Features", palette="Blues_d", ax=ax).set_title('Feature Importance')
ax.set(xlabel="Columns", ylabel="Importance")
Visualizing the results of feature importance shows us that “peak_number” is the most important feature and “modular_ratio” and “weight” are the least important features.
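As an aside, XGBoost also ships a built-in helper for the same purpose; a minimal sketch (it plots the model's internal feature scores rather than the seaborn chart above):
from xgboost import plot_importance

# Plot the top 10 features as ranked by the booster itself
plot_importance(model, max_num_features=10)
plt.show()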
9. Model Implementation with Selected Features
We know the most important and the least important features in the dataset. Now we will build a new XGBoost model using only the important features.
X_new = X[['intercolumnar_distance', 'upper_margin', 'lower_margin', 'exploitation', 'row_number', 'modular',
           'interlinear_spacing', 'peak_number']]
X_train_new, X_val_new, y_train_new, y_val_new = train_test_split(X_new, y, test_size=0.05, random_state=42)
model_sel_features = XGBClassifier()
model_sel_features.fit(X_train_new, y_train_new)
prediction = model_sel_features.predict(X_val_new)
print(classification_report(y_val_new, prediction))
10. Results
We see that training the model with only the important features results in better accuracy. Hence, feature importance is an essential part of feature engineering.
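To compare the two models directly, you can compute the accuracy of each on its validation set; a minimal sketch using sklearn's accuracy_score (the actual numbers depend on your run):
from sklearn.metrics import accuracy_score

# Accuracy of the model trained on all features vs. the one trained on selected features
acc_all = accuracy_score(y_val, model.predict(X_val))
acc_sel = accuracy_score(y_val_new, model_sel_features.predict(X_val_new))
print(f"All features: {acc_all:.4f}  Selected features: {acc_sel:.4f}")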
Summary
1. drop( ) : To drop a column in a data frame.
2. OrdinalEncoder( ) : To convert categorical data into numerical data.
3. train_test_split( ) : To split the data into testing and training datasets.
4. XGBClassifier( ) : To implement an XGBoost machine learning model.
5. predict( ): To predict output using a trained XGBoost model.
6. feature_importances_ : To find the most important features using the XGBoost model.
7. classification_report( ) : To calculate Precision, Recall and Accuracy.