
Selecting the best categorical features in a dataset.
In this blog, you are going to learn:
1. How to import the necessary libraries
2. How to import the dataset
3. How to explore the dataset
4. How to convert categorical columns to numerical columns using ordinal encoding
5. How to select the best categorical features using the SelectKBest algorithm
6. How to plot a bar graph
1. Import the Libraries
import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
2. Import the Dataset
We are importing the Car Evaluation dataset from the UCI Machine Learning Repository. The data comes as a .data file without a header row, so we need to load it into a pandas DataFrame and supply the column names ourselves.
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data
columns = ["buying", "maint", "doors", "persons", "lug_boot", "safety", "class"]
dataset = pd.read_csv("car.data", header=None, names=columns)
dataset.head(5)
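If wget is not available in your environment, pandas can also read the file straight from the URL. This is just an alternative sketch, using the same column names defined above:

# Alternative: load the file directly from the UCI repository (same columns as above)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
dataset = pd.read_csv(url, header=None, names=columns)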
3. Explore the Dataset
We can see that we have a total of 7 columns in our dataset, all of which are of the object type.
The class column is the target column.
dataset.info()
dataset['class'].unique()
Our target column has four unique string values. We need to convert it into a numerical one.
def categorical_to_numerical(value):
    if value == 'unacc':
        return 0
    elif value == 'acc':
        return 1
    elif value == 'good':
        return 2
    else:
        return 3

dataset['class'] = dataset['class'].apply(lambda x: categorical_to_numerical(x))
We have used a lambda function together with apply to convert the categorical target column into a numerical one. The function categorical_to_numerical returns a numerical value based on the string value it receives.
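As an aside, the same mapping can be written more compactly with pandas' map. This is an equivalent alternative to the apply above (run one or the other, not both), assuming the same dataset variable:

# Equivalent alternative: map each class label to an integer in one step
class_mapping = {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3}
dataset['class'] = dataset['class'].map(class_mapping)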
y = dataset.pop('class')
X = dataset
4. Convert Categorical Features into Numerical Features
The input to the OrdinalEncoder is the categorical features. Each feature is converted into ordinal integers ranging from 0 to (number of categories - 1).
ordinalencoder = OrdinalEncoder()
ordinalencoder.fit(X)
X_transformed = ordinalencoder.transform(X)
X_transformed
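If you want to check how each column was encoded, the fitted encoder exposes the learned category order in its categories_ attribute. A minimal sketch using the objects defined above:

# Inspect the category order learned for each column (position in the list = integer code)
for column, categories in zip(X.columns, ordinalencoder.categories_):
    print(column, "->", list(categories))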
5. SelectKBest Implementation
The chi-square test scores each feature by how strongly it depends on the target column, so irrelevant features (those independent of the target) can be removed from the dataset.
The SelectKBest algorithm keeps the features with the K best scores; in our case, the scores come from the chi-square function.
selectkbest = SelectKBest(score_func=chi2, k=3)
selectkbest.fit(X_transformed, y)
best_columns = selectkbest.transform(X_transformed)
best_columns
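To see which columns were kept and how each one scored, the fitted selector exposes scores_ and get_support(). A short sketch using the objects from above:

# Chi-square score per column, and whether it was among the K selected columns
for column, score, selected in zip(X.columns, selectkbest.scores_, selectkbest.get_support()):
    print(f"{column}: score={score:.2f}, selected={selected}")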
6. Visualize with Barplot
We can visualize the results with a bar graph to get better insight. We are going to plot the categorical features in decreasing order of their importance.
indices = np.argsort(selectkbest.scores_)[::-1]
features = [X.columns[indices[i]] for i in range(6)]

fig, ax = plt.subplots(figsize=(20, 10))
sns.barplot(x=features, y=selectkbest.scores_[indices[:6]],
            label="Important Categorical Features", palette="Blues_d", ax=ax)
ax.set_title('Categorical Features Importance')
ax.set(xlabel="Columns", ylabel="Importance")
plt.show()