
6 Best Methods to Convert Categorical Data for Machine Learning
In this blog, you will learn
- What are categorical variables?
- What are the different types of categorical variables?
- How to implement Ordinal Encoding?
- How to implement One Hot Encoding?
- How to implement Binary Encoding?
- How to implement Backward Differencing Encoding?
- How to implement Hash Encoding?
- How to implement Target Encoding?
Categorical data is data that consists of categories or labels rather than numbers. It is one of the most common types of data used in machine learning, but most algorithms expect numerical input, so categorical columns have to be encoded first. In this blog, I will discuss how to convert categorical data using scikit-learn and the category_encoders library.
The first step in converting categorical data is to identify the categorical columns. For example, if you have a dataset containing data about students, you might have columns such as gender, grade level, and favorite color. Once you have identified them, you will need to convert their values into numerical values.
Types of categorical variables:
- Nominal data: These are variables that have two or more categories, with no intrinsic ordering to the categories. Examples include gender, zip code, and eye color.
- Ordinal data: These are variables that have two or more categories with an intrinsic ordering. Examples include educational level (e.g. high school, college, graduate school) and satisfaction rating (e.g. very satisfied, satisfied, neutral, dissatisfied, very dissatisfied).
- Binary data: These are variables that have only two categories. Examples include yes/no responses, male/female, and alive/deceased.
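As a small illustration of the difference between nominal and ordinal data, pandas can mark a column as an unordered or an ordered categorical. The values below are made up for illustration and are not taken from the dataset used later in this blog.

```python
import pandas as pd

# Nominal: the categories have no order
eye_color = pd.Categorical(["brown", "blue", "green", "brown"])

# Ordinal: the order of the categories is stated explicitly
satisfaction = pd.Categorical(
    ["neutral", "satisfied", "dissatisfied"],
    categories=["dissatisfied", "neutral", "satisfied"],
    ordered=True,
)

print(eye_color.ordered)             # False
print(satisfaction.ordered)          # True
print(satisfaction.codes.tolist())   # position of each value in the stated order
print(satisfaction.max())            # min/max are only defined for ordered categories
```

Only the ordered categorical supports comparisons such as min and max, which is the property an ordinal encoding is meant to preserve.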
These variables can be encoded using a variety of methods. We are going to learn about the following encoding methods in this blog.
- Ordinal Encoding
- One-Hot Encoding
- Binary Encoding
- Backward Differencing Encoding
- Hash Encoding
- Target Encoding
In the first step, we are going to install the category_encoders library (for example, with pip install category_encoders).
Importing Libraries
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from category_encoders import TargetEncoder
from category_encoders import HashingEncoder
from category_encoders import BackwardDifferenceEncoder
from category_encoders import BinaryEncoder
import warnings
warnings.filterwarnings('ignore')
Data
We are going to work on Breast Cancer Dataset. This is one of three domains provided by the Oncology Institute that has repeatedly appeared in the machine learning literature. This data set includes 201 instances of one class and 85 instances of another class. The instances are described by 9 attributes, some of which are linear and some are nominal.
Attribute Information:
1. Class: no-recurrence-events, recurrence-events
2. age: 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99
3. menopause: lt40, ge40, premeno
4. tumor-size: 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59
5. inv-nodes: 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39
6. node-caps: yes, no
7. deg-malig: 1, 2, 3
8. breast: left, right
9. breast-quad: left-up, left-low, right-up, right-low, central
10. irradiat: yes, no
Now we are going to download the dataset using pandas. The dataset will be downloaded as a pandas dataframe.
columns = ['class', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps',
           'deg_malig', 'breast', 'breast_quad', 'irradiat']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data", names=columns)
df
Now we have the dataset, we are going to convert our target column “class” into a numerical column with a pandas lambda function.
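# Map the target to 0 (no recurrence) and 1 (recurrence)
df['class'] = df['class'].apply(lambda x: 0 if x == 'no-recurrence-events' else 1)
df['class']
# Keep only the text (object) columns; these are the ones we need to encode
categorical_data = df.select_dtypes(include=['object'])
categorical_data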
Ordinal Encoding
Ordinal Encoding: This method converts each unique categorical value into an integer value. Scikit-Learn’s OrdinalEncoder class provides this functionality in Python.
Let’s take an example to understand the concept. If we have a column with 3 category values, i.e. “Black”, “White”, and “Red”, an ordinal encoder will convert the values into 0, 1, and 2. By default the integers follow the sorted order of the categories (Black = 0, Red = 1, White = 2), not the order in which the data was received. We can specify a different order using the “categories” parameter.
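A minimal sketch of the color example, assuming a toy one-column DataFrame that is not part of the breast cancer dataset. It compares the encoder's default ordering with an order passed through the “categories” parameter.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data, made up for illustration
colors = pd.DataFrame({"color": ["Black", "White", "Red", "White"]})

# Default behaviour: categories are sorted before integers are assigned
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(colors).ravel()
print(default_encoder.categories_[0])   # learned category order
print(default_codes)

# Explicit order: Black < White < Red
custom_encoder = OrdinalEncoder(categories=[["Black", "White", "Red"]])
custom_codes = custom_encoder.fit_transform(colors).ravel()
print(custom_codes)
```

Passing the order explicitly is what makes the integers meaningful for a truly ordinal column such as a satisfaction rating.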
For this dataset, we will use an Ordinal Encoder and transform all the categorical columns into numerical columns.
df_oe=df.copy()
Steps to convert the data using an Ordinal Encoder
- Initialize the Ordinal Encoder.
- Train the encoder on the training set.
- Transform the data using the trained encoder.
The last two steps can be reduced to a single step using the “fit_transform” method.
ordinalencoder = OrdinalEncoder()
transformed_data = ordinalencoder.fit_transform(categorical_data)
transformed_data
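In this walkthrough the encoder is fitted on the whole table before the split. The sketch below shows the alternative of fitting on the training rows only and reusing the fitted encoder on new rows. The toy frames are made up, and mapping unseen categories to -1 is just one possible choice.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy training and test data, made up for illustration
train = pd.DataFrame({"size": ["small", "large", "medium"]})
test = pd.DataFrame({"size": ["medium", "huge"]})   # "huge" never appears in train

# Categories that were not seen during fit are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)                                  # learn the categories from train only

train_codes = encoder.transform(train).ravel()
test_codes = encoder.transform(test).ravel()
print(train_codes)
print(test_codes)
```

Fitting on the training rows keeps information from the test rows out of the encoder, at the cost of having to decide what to do with unseen categories.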
Now we have the transformed data, we are going to replace the categorical columns of the dataset with the new transformed numerical data.
df_oe[categorical_data.columns] = transformed_data
df_oe
Once we have our finalized dataset, we are going to separate the target column from the data frame using the “pop” method.
y=df_oe.pop('class')
X_train, X_test, y_train, y_test = train_test_split(df_oe, y, test_size=0.1, random_state=42)
Once we have split the dataset into training and testing sets, we are going to initialize a Random Forest Classifier model to check how the transformed data is performing.
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)
clf
predictions = clf.predict(X_test)
predictions
We have made our predictions using the trained model. In the next step, we are going to check the model’s performance using the classification report. It will give our model’s performance based on Precision, Recall, and F1-Score.
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, predictions))
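The argument order matters here: scikit-learn's metric functions take the true labels first and the predictions second. The toy labels below are made up to show that swapping the two arguments swaps precision and recall.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, made up for illustration
y_true = [0, 0, 0, 1]
y_pred = [0, 1, 1, 1]

# Correct order: (y_true, y_pred)
precision_correct = precision_score(y_true, y_pred)
recall_correct = recall_score(y_true, y_pred)

# Swapped order: precision and recall trade places
precision_swapped = precision_score(y_pred, y_true)
recall_swapped = recall_score(y_pred, y_true)

print(precision_correct, recall_correct)
print(precision_swapped, recall_swapped)
```

Accuracy is unaffected by the swap, but the per-class precision and recall columns of the report are.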
Have a look at the complete code.
## Importing Libraries
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
## Download the Dataset
columns = ['class', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps',
           'deg_malig', 'breast', 'breast_quad', 'irradiat']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data", names=columns)
## Data Processing
df['class'] = df['class'].apply(lambda x: 0 if x == 'no-recurrence-events' else 1)
categorical_data = df.select_dtypes(include=['object'])
df_oe = df.copy()
## Encoding
ordinalencoder = OrdinalEncoder()
transformed_data = ordinalencoder.fit_transform(categorical_data)
df_oe[categorical_data.columns] = transformed_data
## Splitting Dataset
y = df_oe.pop('class')
X_train, X_test, y_train, y_test = train_test_split(df_oe, y, test_size=0.1, random_state=42)
## Model Implementation
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
# True labels first, then predictions
print(classification_report(y_test, predictions))
Advantages:
- Ordinal encoding is a straightforward way to encode categorical variables.
- It is easy to use and understand, and when the category order is passed explicitly it assigns integers that preserve that order.
- It is a useful method for dealing with ordinal variables, where the values are not numeric but ordered in some meaningful way.
- It can be used for both supervised and unsupervised learning.
Disadvantages:
- It does not handle missing data well.
- It imposes an order and an equal spacing on the categories, which can mislead linear and distance-based models when the variable is actually nominal.
- It can lead to over-fitting if the model is not properly tuned.
- It does not take into account the importance of each category.
Since we are going to use the Random Forest Classifier model with all the encoding methods, we are going to create a function to perform this operation so that we do not need to write redundant code.
def create_model(df):
    # Note: pop removes the 'class' column from the dataframe that is passed in
    y = df.pop('class')
    X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.1, random_state=100)
    clf = RandomForestClassifier()
    clf = clf.fit(X_train, y_train)
    predictions = clf.predict(X_test)
    # True labels first, then predictions
    print(classification_report(y_test, predictions))
One Hot Encoding
One-hot encoding is a method of creating binary representations of categorical data. Each category gets its own column, which holds 1 for the rows that belong to that category and 0 otherwise. It is really helpful when there is no ordinal relationship between the categories.
Scikit-Learn provides the OneHotEncoder class to implement this.
Let’s understand this concept with an example. Suppose the categorical column has two values, White and Black. The One Hot Encoder first sorts the categories and then creates one binary column per category, so Black is represented as [1,0] and White as [0,1].
The number of new columns is equal to the number of unique values in the categorical column.
df_ohe = df.copy()
df_ohe
Let us try to implement One Hot Encoding on the “node_caps” column. In this column, we have three unique values, i.e. “no”, “yes” and “?” (the “?” marks a missing value in this dataset).
ohe = OneHotEncoder()
ohe.fit_transform(df[['node_caps']]).toarray()[:3]
Now we are going to implement OneHotEncoding on all the categorical columns. Since the number of columns is going to increase, we need to name all those columns. We are going to tackle this problem by taking the categories learned by the encoder (its categories_ attribute) and appending each one to the column name. We read them from the encoder rather than from unique(), because the encoder outputs its columns in sorted category order, which may differ from the order in which the values appear in the data.
For example, for the “node_caps” column used above, we are going to create three new columns: “node_capsno”, “node_capsyes” and “node_caps?”. We are going to apply the same strategy to all the columns.
for column_name in categorical_data.columns:
    onehotencoder = OneHotEncoder()
    transformed_data = onehotencoder.fit_transform(categorical_data[[column_name]]).toarray()
    # Name the new columns in the same (sorted) order as the encoder's output
    temp_col = [column_name + value for value in onehotencoder.categories_[0]]
    df_ohe[temp_col] = transformed_data
df_ohe
As you can see, we have added the converted columns to the dataframe, but our old columns still exist. We are going to drop the original categorical columns using the “drop” method.
df_ohe = df_ohe.drop(columns=categorical_data.columns)
df_ohe
Now we are going to use the function we have created above to test our model performance.
create_model(df_ohe)
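In this run, using OneHotEncoder increased our model accuracy from 66% to 72%. Keep in mind that the Random Forest is not seeded and the two experiments use different random_state values for the split, so the exact numbers will vary from run to run. Now we will move forward with other techniques.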
## Importing Libraries
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
## Download the Dataset
columns = ['class', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps',
           'deg_malig', 'breast', 'breast_quad', 'irradiat']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data", names=columns)
## Data Processing
df['class'] = df['class'].apply(lambda x: 0 if x == 'no-recurrence-events' else 1)
categorical_data = df.select_dtypes(include=['object'])
df_ohe = df.copy()
## Encoding
for column_name in categorical_data.columns:
    onehotencoder = OneHotEncoder()
    transformed_data = onehotencoder.fit_transform(categorical_data[[column_name]]).toarray()
    # Name the new columns in the same (sorted) order as the encoder's output
    temp_col = [column_name + value for value in onehotencoder.categories_[0]]
    df_ohe[temp_col] = transformed_data
## Splitting Dataset
df_ohe = df_ohe.drop(columns=categorical_data.columns)
y = df_ohe.pop('class')
X_train, X_test, y_train, y_test = train_test_split(df_ohe, y, test_size=0.1, random_state=42)
## Model Implementation
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
# True labels first, then predictions
print(classification_report(y_test, predictions))
Advantages:
- One Hot Encoding is a fast and efficient way to represent categorical variables in machine learning algorithms.
- One Hot Encoding can be used to create more meaningful features from categorical variables, allowing for better model performance.
Disadvantages:
- One Hot Encoding can create a large number of new features, which can lead to a very large feature space and increase the complexity of the model.
- One Hot Encoding can also lead to data sparsity, where some features may be rarely used. This can lead to overfitting and poor generalization performance.
Binary Encoding
The Binary Encoding technique is a combination of two steps. In the first step, the data is converted into an ordinal form, and then each ordinal integer is written in binary, with one column per binary digit. It usually needs fewer columns than One Hot Encoding: roughly log2(n) columns for n categories instead of n.
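The two steps can be sketched by hand in plain Python. This is a simplified illustration, not the library's implementation: the helper name binary_encode is hypothetical, and numbering the categories from 1 in order of first appearance is an assumption made for the example.

```python
def binary_encode(values):
    """Encode each value as the binary digits of its ordinal code."""
    # Step 1: ordinal codes, starting at 1, in order of first appearance
    categories = list(dict.fromkeys(values))
    codes = {category: index + 1 for index, category in enumerate(categories)}

    # Step 2: write every code with the same number of binary digits
    width = len(categories).bit_length()
    return [[int(bit) for bit in format(codes[value], f"0{width}b")] for value in values]


# Toy data, made up for illustration
encoded = binary_encode(["a", "b", "c", "a"])
print(encoded)   # [[0, 1], [1, 0], [1, 1], [0, 1]]
```

Three categories fit in two columns here, whereas One Hot Encoding would need three.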
The category_encoders library provides the BinaryEncoder class to apply Binary Encoding to one or more variables.
df_be = df.copy()
df_be
Let’s encode the “age” column with Binary Encoding. To perform this operation
- We need to initialize a Binary Encoder
- Train the binary encoder and Transform the column.
binary_encoder = BinaryEncoder()
transformed_data = binary_encoder.fit_transform(categorical_data['age'])
transformed_data
Running the above code showed the age column converted into 3 numerical columns. We are going to perform the same operation for all the categorical columns.
for column_name in categorical_data.columns:
    binary_encoder = BinaryEncoder()
    transformed_data = binary_encoder.fit_transform(categorical_data[column_name])
    df_be[transformed_data.columns] = transformed_data
df_be
Running the code above converted all the categorical columns into numerical columns.
The original categorical columns are also present in the original dataset. We have to remove those columns to train the model. We are going to use the “drop( )” method for the same.
df_be = df_be.drop(columns=categorical_data.columns)
df_be
Once we have our finalized dataset, we can use our function to train our RandomForestClassifier model and check its performance.
create_model(df_be)
We can see that Binary Encoding is performing better than Ordinal Encoding for this dataset.
Advantages:
- It works pretty well with a large number of categorical values.
- Memory Efficient.
Let’s have a look at the complete code.
## Importing Libraries
import pandas as pd
from category_encoders import BinaryEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
## Download the Dataset
columns = ['class', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps',
           'deg_malig', 'breast', 'breast_quad', 'irradiat']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data", names=columns)
## Data Processing
df['class'] = df['class'].apply(lambda x: 0 if x == 'no-recurrence-events' else 1)
categorical_data = df.select_dtypes(include=['object'])
df_be = df.copy()
## Encoding
for column_name in categorical_data.columns:
    binary_encoder = BinaryEncoder()
    transformed_data = binary_encoder.fit_transform(categorical_data[column_name])
    df_be[transformed_data.columns] = transformed_data
## Splitting Dataset
df_be = df_be.drop(columns=categorical_data.columns)
y = df_be.pop('class')
X_train, X_test, y_train, y_test = train_test_split(df_be, y, test_size=0.1, random_state=42)
## Model Implementation
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
# True labels first, then predictions
print(classification_report(y_test, predictions))
Backward Difference Encoding
Backward Difference Encoding is a contrast coding scheme for categorical variables. Each level of the variable is compared with the level before it: the mean of the dependent variable for one level is contrasted with the mean of the dependent variable for the prior level. It is therefore most meaningful for ordinal variables, where the order of the levels matters.
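In the contrast-coding sense used in statistics, backward difference coding replaces a variable with k levels by k-1 contrast columns. The sketch below builds the standard contrast matrix with NumPy. The helper name backward_difference_matrix and the level means are made up for illustration, and the library's output layout may differ (for example, it may add an intercept column).

```python
import numpy as np

def backward_difference_matrix(k):
    """Return the k x (k-1) backward difference contrast matrix."""
    matrix = np.zeros((k, k - 1))
    for j in range(1, k):                  # contrast j compares level j+1 with level j
        matrix[:j, j - 1] = -(k - j) / k   # levels 1..j
        matrix[j:, j - 1] = j / k          # levels j+1..k
    return matrix


contrasts = backward_difference_matrix(3)
print(contrasts)

# With an intercept, the coefficients recover the differences between
# the target means of adjacent levels (toy means, made up for illustration)
level_means = np.array([1.0, 4.0, 9.0])
design = np.column_stack([np.ones(3), contrasts])
coefficients = np.linalg.solve(design, level_means)
print(coefficients[1:])   # differences 4 - 1 and 9 - 4
```

This is why the scheme is called a backward difference: each coefficient is the mean of one level minus the mean of the level before it.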
This Backward Difference Encoding transform is available in the category_encoders library via the BackwardDifferenceEncoder class.
df_bde = df.copy()
df_bde
Let us implement Backward Difference Encoding on the “breast” column. To perform this operation
- We need to initialize a Backward Difference Encoder.
- Train the Backward Difference Encoder and Transform the column.
bde = BackwardDifferenceEncoder()
transformed_data = bde.fit_transform(categorical_data['breast'])
transformed_data
Now we have checked how a Backward Difference Encoder works, we are going to convert all the categorical columns using the same method.
for column_name in categorical_data.columns:
    bde = BackwardDifferenceEncoder()
    transformed_data = bde.fit_transform(categorical_data[column_name])
    df_bde[transformed_data.columns] = transformed_data
df_bde
Once we have converted all the columns, we can see that we still have the original categorical columns in the dataset. We need to remove those before sending the data to the estimator. To remove these columns we are going to use the “drop( )” method.
df_bde = df_bde.drop(columns=categorical_data.columns)
df_bde
create_model(df_bde)
Let us have a look at the whole code.
## Importing Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
from category_encoders import BackwardDifferenceEncoder
## Download the Dataset
columns = ['class', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps',
           'deg_malig', 'breast', 'breast_quad', 'irradiat']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data", names=columns)
## Data Processing
df['class'] = df['class'].apply(lambda x: 0 if x == 'no-recurrence-events' else 1)
categorical_data = df.select_dtypes(include=['object'])
df_bde = df.copy()
## Encoding
for column_name in categorical_data.columns:
    bde = BackwardDifferenceEncoder()
    transformed_data = bde.fit_transform(categorical_data[column_name])
    df_bde[transformed_data.columns] = transformed_data
## Splitting Dataset
df_bde = df_bde.drop(columns=categorical_data.columns)
y = df_bde.pop('class')
X_train, X_test, y_train, y_test = train_test_split(df_bde, y, test_size=0.1, random_state=42)
## Model Implementation
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
# True labels first, then predictions
print(classification_report(y_test, predictions))
Advantages:
- Backward Difference Encoding (BDE) is a standard contrast coding scheme, available through the category_encoders library.
- It represents a variable with k categories using k-1 contrast columns, one fewer than One Hot Encoding.
- BDE suits ordinal variables, since each contrast compares a level with the level before it.
- In a linear model, each coefficient can be read as the difference in the mean of the target between two adjacent levels.
Disadvantages:
- For nominal variables the order of the levels is arbitrary, so the contrasts between adjacent levels have no natural interpretation.
- The interpretability benefit applies mainly to linear models; tree-based models such as Random Forests do not use the contrasts in that way.
- BDE is also not suitable for data sets with high cardinality, as the number of output columns still grows with the number of categories.
Hash Encoding
This encoding method uses a hashing algorithm to convert a categorical column into a fixed number of numerical columns. It is a one-way process: once we have converted the data, we cannot recover the original categories from the hashed columns.
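The idea can be sketched by hand with Python's hashlib. This is a simplified illustration and not the library's implementation: the helper name hash_encode is hypothetical, and using MD5 with a modulo to pick the output column is an assumption made for the example.

```python
import hashlib

def hash_encode(values, n_components):
    """Map each value to one of n_components columns using a hash function."""
    rows = []
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_components   # column index for this value
        row = [0] * n_components
        row[bucket] = 1
        rows.append(row)
    return rows


# Toy data, made up for illustration
ages = ["30-39", "40-49", "30-39", "60-69"]
encoded = hash_encode(ages, n_components=4)
print(encoded)
```

The same category always lands in the same column, but two different categories can land in the same column too (a collision), which is why the original values cannot be recovered.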
The category_encoders library provides the HashingEncoder class to perform this operation.
df_he = df.copy()
df_he
Let’s move forward and convert the “age” column into a hashed numerical column.
Steps to convert the data using a HashingEncoder
- Initialize the Hashing Encoder.
- Train the encoder on the training set.
- Transform the data using the trained encoder.
The last two steps can be reduced to a single step using the “fit_transform” method.
df_he['age'].unique()
We have discovered that the age column has 6 unique values. Hence we are going to hash the age column into 6 columns.
hashencoder = HashingEncoder(cols=['age'], n_components=6)
transformed_data = hashencoder.fit_transform(df_he['age'])
transformed_data
We have understood how a hashing encoder works. Now we will transform all the categorical columns into hashed columns. We are going to perform the below steps:
- Choose one categorical column.
- Select the total unique values in that column.
- Train and transform the selected column with n_components= Number of unique values in the column.
- Rename all the transformed columns to avoid confusion.
for column_name in categorical_data.columns:
    unique_components = len(categorical_data[column_name].unique())
    hashencoder = HashingEncoder(cols=[column_name], n_components=unique_components)
    transformed_data = hashencoder.fit_transform(df_he[column_name])
    temp_col = transformed_data.columns
    transformed_data.columns = [column_name + col for col in temp_col]
    df_he[transformed_data.columns] = transformed_data
df_he
As you can see, we now have a total of 50 columns. We still have our original categorical columns in the dataset and we need to remove them using the “drop” method.
df_he = df_he.drop(columns=categorical_data.columns)
df_he
create_model(df_he)
Advantages:
- A feature/column with 100 categories can be converted into a fixed number of N columns, where N is chosen by us.
Disadvantages:
- Loss of information since the data is being converted into a lesser dimension.
- Collision: This happens when two or more different categories are hashed to the same output value.
- Data cannot be converted back into the original form.
Let’s have a look at the complete code.
## Importing Libraries
import pandas as pd
from category_encoders import HashingEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
## Importing Dataset
columns = ['class', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps',
           'deg_malig', 'breast', 'breast_quad', 'irradiat']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data", names=columns)
## Data Processing
df['class'] = df['class'].apply(lambda x: 0 if x == 'no-recurrence-events' else 1)
categorical_data = df.select_dtypes(include=['object'])
df_he = df.copy()
## Encoding
for column_name in categorical_data.columns:
    unique_components = len(categorical_data[column_name].unique())
    hashencoder = HashingEncoder(cols=[column_name], n_components=unique_components)
    transformed_data = hashencoder.fit_transform(df_he[column_name])
    temp_col = transformed_data.columns
    transformed_data.columns = [column_name + col for col in temp_col]
    df_he[transformed_data.columns] = transformed_data
## Splitting Dataset
df_he = df_he.drop(columns=categorical_data.columns)
y = df_he.pop('class')
X_train, X_test, y_train, y_test = train_test_split(df_he, y, test_size=0.1, random_state=42)
## Model Implementation
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
# True labels first, then predictions
print(classification_report(y_test, predictions))
Target Encoding
Target Encoding: This encoding technique belongs to the family of Bayesian encoders. Each category is replaced by a statistic of the target (dependent) variable, typically its mean, computed over the rows that belong to that category.
The algorithm works as below:
- The mean of the target variable for each category is calculated.
- The category value is replaced with the mean calculated in the step above.
- To keep rare categories from receiving extreme values, the category mean is usually blended with the overall mean of the target (a smoothed, posterior-style estimate), and this blended value replaces the original category.
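The steps above can be sketched by hand with pandas. This is a simplified illustration: the helper name smoothed_target_encode and its parameter m are hypothetical, and the additive smoothing used here is only one common variant, so the library's own weighting formula will give different numbers.

```python
import pandas as pd

def smoothed_target_encode(categories, target, m=1.0):
    """Replace each category with a smoothed mean of the target."""
    frame = pd.DataFrame({"category": categories, "target": target})
    global_mean = frame["target"].mean()

    # Step 1: mean of the target and number of rows for each category
    stats = frame.groupby("category")["target"].agg(["mean", "count"])

    # Step 2: blend the category mean with the global mean; m controls the pull
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

    # Step 3: replace every category with its blended value
    return frame["category"].map(smoothed)


# Toy data, made up for illustration
encoded = smoothed_target_encode(["a", "a", "b"], [1, 1, 0], m=1.0)
print(encoded.tolist())
```

With m=0 the function returns the plain category means; larger values of m pull rare categories towards the overall mean of the target.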
We can use the TargetEncoder class from the category_encoders library, which encodes each category as a floating-point number.
First, we are going to make a copy of the dataset and then we are going to perform Target Encoding on the copied dataset.
df_te = df.copy()
df_te
Let us take an example. We are going to select the “menopause” and “class” columns from the dataset and then we will train a Target Encoder on these two columns.
df_te[['menopause','class']].head(5)
Now we are going to perform Target Encoding in the following steps
- Initialize a Target Encoder.
- Train the encoder on the columns “class” and “menopause” and convert the data.
target_encoder = TargetEncoder()
target_encoder.fit_transform(df_te['menopause'], df_te['class'])
In a similar way, we are going to convert every categorical column in the dataset.
for col in df_te.select_dtypes(include=['object']).columns:
    target_encoder = TargetEncoder()
    df_te[col] = target_encoder.fit_transform(df_te[col], df_te['class'])
df_te
Now we have our finalized dataset, we can send the data to the “create_model” function to get the classification report.
create_model(df_te)
Disadvantages:
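- It may lead to model overfitting, because the encoding is computed from the target itself (target leakage). Fitting the encoder on the training split only and relying on the encoder's smoothing help to reduce this.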
- Biased data can lead to extreme values.
Let’s take a look at the complete code for the TargetEncoder.
## Importing Libraries
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier
## Importing Dataset
columns = ['class', 'age', 'menopause', 'tumor_size', 'inv_nodes', 'node_caps',
           'deg_malig', 'breast', 'breast_quad', 'irradiat']
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer/breast-cancer.data", names=columns)
## Data Processing
df['class'] = df['class'].apply(lambda x: 0 if x == 'no-recurrence-events' else 1)
categorical_data = df.select_dtypes(include=['object'])
df_te = df.copy()
## Encoding
for col in df_te.select_dtypes(include=['object']).columns:
    target_encoder = TargetEncoder()
    df_te[col] = target_encoder.fit_transform(df_te[col], df_te['class'])
## Splitting Dataset
y = df_te.pop('class')
X_train, X_test, y_train, y_test = train_test_split(df_te, y, test_size=0.1, random_state=42)
## Model Implementation
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
# True labels first, then predictions
print(classification_report(y_test, predictions))
Summary
- Ordinal Encoding: This method converts each unique categorical value into an integer value
- One-Hot Encoding: One-hot encoding is a method of creating binary representations of categorical data.
- Binary Encoding: In the first step, the data is converted into an ordinal form, and then it is converted into multiple binary columns.
- Backward Differencing Encoding: A contrast coding scheme that compares each level of a categorical variable with the level before it.
- Hash Encoding: This technique uses a hashing technique to encode the data.
- Target Encoding: Each category is replaced with a (usually smoothed) mean of the target variable for that category.
APIs
- OrdinalEncoder (scikit-learn)
- OneHotEncoder (scikit-learn)
- BinaryEncoder (category_encoders)
- BackwardDifferenceEncoder (category_encoders)
- HashingEncoder (category_encoders)
- TargetEncoder (category_encoders)