
How to get the best accuracy using Feature Engineering?
In this blog, you are going to learn
1. What is Feature Engineering?
2. How to download the dataset?
3. How to explore the dataset?
4. How to check for null values for effective feature engineering?
5. How to handle numerical columns with null values?
6. How to handle categorical columns with null values?
7. How to fill null values in the dataset?
8. How to remove unwanted columns?
9. How to perform data normalization?
10. How to convert categorical columns to numerical?
11. How to split the dataset into training and testing?
12. How to implement a LightGBM model for classification?
Feature Engineering
Feature engineering directly influences the result of the model. A good amount of time spent on feature engineering can result in wonders. It involves a few steps which can be followed to obtain the desired results. These steps are broadly classified in the following points.
1. Cleaning the data.
2. Processing the data.
3. Normalizing the data.
4. Model implementation and prediction.
1. Import the Libraries
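import pathlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# The steps below also need these imports
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from lightgbm import LGBMClassifier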
2. Download Dataset
We are going to download the Titanic dataset. In the dataset, we have 12 columns, of which “Survived” is our target column.
Download the Titanic training CSV (for example, from the Kaggle Titanic competition) and save it as titanic_dataset.csv in the working directory.
read_csv( ) loads the file into a DataFrame, and head( ) returns its first five rows.
df = pd.read_csv("titanic_dataset.csv")
df.head(5)
3. Explore the Dataset
The info( ) method gives us information about the columns in our DataFrame, their data types, and the total memory consumption of the dataset.
df.info()

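describe( ) summarizes each numerical column with its count, mean, standard deviation, minimum, quartiles, and maximum.
df.describe()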

Let’s see data distribution for some of the columns in our DataFrame.
sns.pairplot(df[["Fare", "Pclass", "Survived"]], diag_kind="kde")

4. Checking for Null Values
The info( ) method also reports the non-null count for each column. Null values are not included in that count, so any column whose count is lower than the total number of rows has missing values.
df.info()
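To get the number of missing values per column directly, isnull( ) can be combined with sum( ) or mean( ). The snippet below is a minimal sketch on a small made-up DataFrame (the name toy and its values are assumptions for illustration, not the real Titanic data).

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()

# Share of rows that are missing in each column (between 0 and 1).
null_share = toy.isnull().mean()

print(null_counts)
print(null_share.round(2))
```

A column with a large share of missing values is usually a candidate for dropping, while a column with only a few gaps is a candidate for filling.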

4.1 Handling Numerical Columns
There are several ways to handle null values in numerical columns, mainly:
1. Drop the rows with null values.
2. Fill the values with mean, median, or mode.
3. Use a machine learning algorithm like Linear Regression to fill the values.
4. Use an advanced approach like K-Nearest Neighbors to fill the missing values.
5. Use an imputer to fill the missing values.
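In this example, we fill the missing “Age” values with the column mean and drop the “Cabin” column.
# Fill missing ages with the mean age
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Drop the Cabin column rather than imputing it
df = df.drop(columns=['Cabin'])
df.head(5)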
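Options 4 and 5 from the list above can be sketched with scikit-learn's SimpleImputer and KNNImputer. This is only an illustration on a tiny made-up DataFrame (toy and its values are assumptions); the choice of the median strategy and of a single neighbor is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical data: one missing Age, all Fare values present.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 60.0],
    "Fare": [10.0, 12.0, 30.0, 50.0],
})

# Imputer approach: replace the gap with the median of the observed ages.
median_imputer = SimpleImputer(strategy="median")
age_median = median_imputer.fit_transform(toy[["Age"]]).ravel()

# KNN approach: copy Age from the row whose other features are closest.
# With n_neighbors=1 the row with the most similar Fare is the donor.
knn_imputer = KNNImputer(n_neighbors=1)
filled_knn = knn_imputer.fit_transform(toy[["Age", "Fare"]])

print(age_median)
print(filled_knn)
```

The median fill ignores the other columns, while the KNN fill uses them to pick a similar passenger, which is why the two approaches can give different values for the same gap.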

4.2 Handling Categorical Columns
To handle categorical columns we have fewer options, such as:
1. Drop the rows with missing values.
2. Populate the value using the most occurring value.
3. Impute the values using KNN imputer.
4. Fill the value using “Others”.
4.2.1 Filling Categorical Column Null Values
We are going to fill the missing values in the “Embarked” column with “Others”. We will use the fillna( ) method for this.
df['Embarked']=df['Embarked'].fillna('Others')
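Option 2 from the list, filling with the most frequent value, is a common alternative to “Others”. A minimal sketch on made-up values (toy is an assumption for illustration):

```python
import pandas as pd

# Hypothetical column with one missing entry.
toy = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q"]})

# mode() returns a Series because ties are possible; take the first entry.
most_common = toy["Embarked"].mode()[0]

# Replace the missing entry with the most frequent category.
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(most_common)
print(toy["Embarked"].tolist())
```

Filling with the mode keeps the number of categories unchanged, whereas “Others” adds a new category that marks the rows where the value was missing.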
5. Removing Unwanted Columns
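Once we check every column, we can conclude that a few columns are unnecessary. Columns such as “Name”, “Ticket”, and “PassengerId” are unique or nearly unique for each row, so they act as identifiers rather than features. Removing them can help in achieving better accuracy.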
print("Name Column", len(df['Name'].unique()))
print("Ticket Column", len(df['Ticket'].unique()))
print("PassengerID Column", len(df['PassengerId'].unique()))

df = df.drop(columns=['Name', 'Ticket', 'PassengerId'])
df.head(5)
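The uniqueness check can be applied to every column at once by comparing the number of distinct values with the number of rows. The sketch below uses a small made-up DataFrame (toy and the variable names are assumptions), and treating a ratio of exactly 1.0 as identifier-like is an arbitrary cutoff for illustration.

```python
import pandas as pd

# Hypothetical frame: two identifier-like columns and one real feature.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Name": ["A", "B", "C", "D"],
    "Sex": ["male", "female", "female", "male"],
})

# Fraction of distinct values per column (1.0 means every row is unique).
unique_ratio = toy.nunique() / len(toy)

# Columns where every value is unique behave like row identifiers.
id_like = unique_ratio[unique_ratio == 1.0].index.tolist()

print(unique_ratio)
print(id_like)
```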
6. Data Normalization
After data cleaning and pre-processing, the next major step in feature engineering is data normalization. We cannot give the raw values in the dataset directly to our model: string categories have to be converted to numbers first, and numerical features on very different scales can make some models slow to converge or unable to converge.
6.1 Converting Categorical Columns
Categorical columns have to be converted into numerical columns. There are several options for this:
1. LabelEncoder( ) : Encodes target labels with values between 0 and n_classes-1.
2. OrdinalEncoder( ) : Encodes categorical features as an integer array.
3. OneHotEncoder( ) : Encodes categorical features as a one-hot numeric array.
In this example, we are going to use an Ordinal Encoder.
ordinalencoder = OrdinalEncoder()
df[['Embarked', 'Sex']] = ordinalencoder.fit_transform(df[['Embarked', 'Sex']])
df[['Embarked', 'Sex']]
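To see how ordinal and one-hot encoding differ, the sketch below encodes the same made-up column both ways (toy and its values are assumptions for illustration). The ordinal encoder assigns integers in sorted category order, while the one-hot encoder creates one column per category.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

# Hypothetical column with three categories.
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Ordinal encoding: one column of category codes (C -> 0, Q -> 1, S -> 2).
ordinal = OrdinalEncoder()
codes = ordinal.fit_transform(toy[["Embarked"]])

# One-hot encoding: one 0/1 column per category, in the same sorted order.
onehot = OneHotEncoder()
matrix = onehot.fit_transform(toy[["Embarked"]]).toarray()

print(ordinal.categories_)
print(codes.ravel())
print(matrix)
```

The ordinal codes imply an order between categories that may not exist in the data, which is one reason to tell the model which columns are categorical later on.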

6.2 Normalizing Numerical Columns
Numerical columns have to be normalized. Two common options are:
1. StandardScaler( ) : Returns (x - mean) / std, where mean and std are the mean and standard deviation of the column x.
2. MinMaxScaler( ) : Rescales the values to a given range, [0, 1] by default.
In this example, we are going to use a Standard Scaler.
numerical_columns = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']
standardscaler = StandardScaler()
df[numerical_columns] = standardscaler.fit_transform(df[numerical_columns])
df[numerical_columns]
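The two scalers can be checked against their formulas by hand. The sketch below uses a made-up single column (fares is an assumption for illustration) and compares StandardScaler with (x - mean) / std computed in NumPy, then shows the MinMaxScaler result for the same values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical fare values as a single-column 2D array.
fares = np.array([[10.0], [20.0], [30.0], [40.0]])

# StandardScaler: subtract the mean, divide by the standard deviation.
scaled = StandardScaler().fit_transform(fares)

# The same formula by hand; np.std uses the population std (ddof=0),
# which is what StandardScaler uses as well.
manual = (fares - fares.mean()) / fares.std()

# MinMaxScaler: map the smallest value to 0 and the largest to 1.
minmax = MinMaxScaler().fit_transform(fares)

print(scaled.ravel())
print(minmax.ravel())
```

After standard scaling the column has mean 0 and standard deviation 1, while min-max scaling keeps the original spacing between values inside the chosen range.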

7. Splitting Data
Once the data is normalized, we need to split it into training and validation datasets. Here we hold out 5% of the rows for validation.
# Separate the target column from the features
y = df.pop('Survived')
X_train, X_val, y_train, y_val = train_test_split(df, y, test_size=0.05, random_state=42)
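It helps to know how many rows a given test_size actually holds out. The sketch below assumes 891 rows, the size of the standard Kaggle Titanic training file; the arrays are placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Assumption: 891 rows, as in the standard Kaggle Titanic training file.
n_rows = 891
X = np.arange(n_rows).reshape(-1, 1)  # placeholder feature matrix
y = np.arange(n_rows) % 2             # placeholder binary target

# Same split settings as in the post.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.05, random_state=42
)

print(len(X_train), len(X_val))
```

Under that assumption the validation set has only 45 rows, so a single misclassified passenger changes the reported accuracy by more than two percentage points. A larger test_size gives a steadier estimate at the cost of less training data.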
8. Model Implementation
We are going to implement a LightGBM classifier for this example. An advantage of LightGBM is that we can specify the categorical features with the “categorical_feature” parameter.
In this way, the model treats those columns as unordered categories rather than as ordered numeric values.
lgbmclassifier = LGBMClassifier()
lgbmclassifier.fit(X_train, y_train, categorical_feature=['Embarked', 'Sex'])

predictions = lgbmclassifier.predict(X_val)
predictions

# classification_report expects the true labels first, then the predictions
print(classification_report(y_val, predictions))
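scikit-learn's metric functions take their arguments in the order (y_true, y_pred). Accuracy is the same either way, but precision and recall swap when the order is reversed. A minimal sketch with made-up labels (y_true and y_pred are assumptions for illustration):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical true labels and predictions for six passengers.
y_true = [1, 0, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0]

# Correct order: true labels first, predictions second.
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

# Reversed order: the value reported as precision is really the recall.
swapped_precision = precision_score(y_pred, y_true)

print(accuracy, precision, recall, swapped_precision)
```

Here the model predicts one survivor too many, so precision is lower than recall. With the arguments reversed, the report would show the opposite.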

Results
We achieve an accuracy of 87% on the validation set with this model.
By following these feature engineering steps, we can achieve good accuracy.