
Feature Scaling: Data Normalization vs Data Standardization
In this blog, you will learn
- What is Data Normalization?
- What is Feature Scaling?
- What are the different types of feature scaling?
- What is Min-Max Normalization?
- What is Z-Score Normalization?
- What is Data Standardization?
- What is Robust Scaling?
Feature Scaling
Feature scaling is an important step in the preprocessing of data for machine learning algorithms. It is a technique used to transform the range of independent variables or features of data to a common scale. Feature scaling is also known as data normalization or standardization.
The goal of feature scaling is to ensure that all the features of the data are on the same scale so that the model can learn from the data better. Without feature scaling, certain machine learning algorithms may not perform as expected. For example, the K-Nearest Neighbors algorithm relies on the distance between two points to determine the class of a data point. If the features are on different scales, then the distance between the two points will be dominated by the feature with the larger scale.
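To make this concrete, here is a minimal sketch (with made-up values for two hypothetical features, salary in dollars and age in years) showing how the larger-scale feature dominates a Euclidean distance until both features are rescaled:

import numpy as np

# Two hypothetical points: (salary in dollars, age in years)
a = np.array([50000, 25])
b = np.array([52000, 60])

# Without scaling, the distance is driven almost entirely by the salary difference
print(np.linalg.norm(a - b))  # about 2000.3

# After scaling both features to [0, 1] (assumed ranges: salary 30k-100k, age 18-80)
a_scaled = np.array([(50000 - 30000) / 70000, (25 - 18) / 62])
b_scaled = np.array([(52000 - 30000) / 70000, (60 - 18) / 62])
print(np.linalg.norm(a_scaled - b_scaled))  # the age difference now contributes meaningfully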
Why do we need to perform Feature Scaling?
Feature scaling can also help to reduce the time required to train a model. For gradient-based algorithms in particular, putting the features on a common scale lets the optimizer converge in fewer iterations, which shortens training time.
Feature scaling can also help to reduce the effect of outliers in the data. Outliers are data points whose values are significantly different from the majority of the data points. If the features are on different scales, the outliers can have a disproportionate impact on the model. Scaling the features, especially with methods based on robust statistics such as the robust scaling described below, lessens that impact.
Types of Feature Scaling
There are several different types of feature scaling that can be used. The most common type of feature scaling is min-max scaling, which is also known as normalization. Min-max scaling transforms the data so that all the values are between 0 and 1. This is done by subtracting the minimum value from each data point and then dividing by the range of the data.
Another type of feature scaling is standardization. Standardization transforms the data so that the mean of the data is 0 and the standard deviation is 1. This is done by subtracting the mean from each data point and then dividing by the standard deviation.
Finally, there is a type of feature scaling known as robust scaling. Robust scaling is similar to the other methods, but it is more robust to outliers. Instead of subtracting the minimum value and dividing by the range, the median is subtracted from each data point and the result is divided by the interquartile range. This helps to reduce the effect of outliers on the scaled data.
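To see how the three approaches differ, here is a minimal sketch (the single-column array with an outlier is made up purely for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler

# One feature with an outlier (100)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

print(MinMaxScaler().fit_transform(X).ravel())    # outlier pushes the other values towards 0
print(StandardScaler().fit_transform(X).ravel())  # mean 0, unit variance, but still affected by the outlier
print(RobustScaler().fit_transform(X).ravel())    # based on median and IQR, inliers keep a usable spread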
No matter which type of feature scaling is used, it is important to remember to scale the data before the model is trained; otherwise, the model may not perform as expected.
Now we will learn about each category of feature scaling in depth. For this blog, we are going to work on an Internet firewall dataset.
Importing the Libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import MaxAbsScaler
from sklearn.preprocessing import RobustScaler
Importing dataset
Internet Firewall Data Data Set
Data Set Information:
There are 12 features in total. The Action feature is used as the class label. There are 4 classes in total: allow, deny, drop, and reset-both.
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00542/log2.csv")
df.head(5)
In order to work on the data, we need to convert the categorical column into a numerical column. We have multiple options to perform this operation.
- One-Hot Encoding: It encodes each categorical value as a binary vector. The string values are sorted in alphabetical order and then converted into binary vectors. It can be achieved using the OneHotEncoder class in Scikit-Learn. Let's take an example. Suppose we have three values: Apple, Pear, and Banana. One-hot encoding works in the following steps.
- The string values will be sorted in alphabetical order: Apple, Banana, and Pear.
- Binary vectors will be created for each string value. Apple will be given [1,0,0], Banana will be given [0,1,0], and Pear will be given [0,0,1].
Note: The position of the 1 identifies the value.
- Ordinal Encoding: This method converts each unique categorical value into an integer. Scikit-Learn's OrdinalEncoder class lets us perform this conversion.
In the previous example, ordinal encoding would convert the values into 0.0, 1.0, and 2.0 (again in alphabetical order), even though the categories have no inherent ordering.
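Before applying this to the dataset, here is a minimal sketch showing both encoders on the toy fruit values used above:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

fruits = np.array([["Apple"], ["Pear"], ["Banana"]])

# One-hot encoding: each category becomes a binary column (categories sorted: Apple, Banana, Pear)
onehot = OneHotEncoder()
print(onehot.fit_transform(fruits).toarray())

# Ordinal encoding: each category becomes an integer (Apple -> 0.0, Banana -> 1.0, Pear -> 2.0)
ordinal = OrdinalEncoder()
print(ordinal.fit_transform(fruits))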
For this dataset, we will use an OrdinalEncoder to transform our target column 'Action' into a numerical one. Before using the OrdinalEncoder, we need to convert the 1D array into a 2D array, which we do with the np.reshape() method.
ordinalencoder = OrdinalEncoder()
df['Action'] = ordinalencoder.fit_transform(np.reshape(df['Action'].values, (-1, 1)))
df.head(5)
Once we have transformed our column, we can have a look at the values.
df['Action'].unique()
Now that the dataset is prepared, the next step is to separate the features from the target column, which we do with the pop() method.
In the next step, we will split the data into training and testing sets using the train_test_split() method.
y = df.pop("Action")
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42, test_size=0.1)
Data Normalization
Data normalization reduces the complexity of the data by transforming all features into a consistent, uniform format and range. This makes the data easier to work with, lessens the influence of noise and extreme values, and helps a model treat the features on an equal footing.
Types of Data Normalization Methods
The most common type of data normalization is min-max normalization. This process involves transforming all data points to the same range, usually between 0 and 1. This is done by subtracting the minimum value from each data point and then dividing by the range (maximum value minus minimum value). This ensures that all data points are on the same scale, which can help to improve the accuracy of the machine learning algorithm.
Another type of data normalization is z-score normalization, which is also called data standardization. This process involves transforming data points to have a mean of 0 and a standard deviation of 1. This is done by subtracting the mean from each data point and then dividing the result by the standard deviation. This ensures that the data points are on a common scale and that the machine learning algorithm is not biased toward any particular feature.
Why is Data Normalization necessary?
Data normalization is an important step in the preprocessing of data before it is used in a machine learning algorithm. It helps to reduce the effect of outliers and noise on the data, and it can help to improve the accuracy of the machine learning model. There are several different types of data normalization, and each one has its own advantages and disadvantages. It is important to choose the right type of data normalization for your data in order to ensure that the machine learning algorithm is not biased towards any particular data point.
Data normalization is also important for feature selection. Feature selection is the process of choosing the most relevant features for the machine learning algorithm. By normalizing the data, we can ensure that all features are treated equally and that the machine learning algorithm is not biased toward any particular feature. This can help to improve the accuracy of the machine learning model.
Min-Max Normalization
Each attribute is scaled to the range [0, 1]. This can be achieved using Scikit-Learn's MinMaxScaler. The transformation is given by
X_scaled = (X - X_min) / (X_max - X_min)
Let us take an example to understand this properly. Let’s assume we have a data frame with attribute values [10,20,30]. Now we will apply the MinMaxScaler to this attribute
X_min = 10
X_max = 30
10 -> (10 - 10) / (30 - 10) = 0.0
20 -> (20 - 10) / (30 - 10) = 0.5
30 -> (30 - 10) / (30 - 10) = 1.0
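These hand calculations can be verified with a small sketch (the single-column array is just the toy attribute above):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.array([[10.0], [20.0], [30.0]])
print(MinMaxScaler().fit_transform(values).ravel())  # [0.  0.5 1. ]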
As we can see, the minimum value maps to 0 and the maximum value maps to 1.0. In our case, we are going to apply MinMaxScaler to our dataset to scale our features.
Algorithm: There are three steps involved in this process
- Initialize the Min-Max Scaler: The first step is to use the Min-Max Scaler class and initialize the Min-Max Scaler for the dataset.
- Fit the Min-Max Scaler with the fit() method: Once we have initialized the Min-Max Scaler, we need to find the minimum and maximum values for every attribute. This is achieved by using the fit() method. Once the Min-Max Scaler is fitted, it can be reused later.
- Transform the data with the fitted Min-Max Scaler: The fit() method calculates the minimum and maximum of every column, and the transform() method applies the scaling to every column of the data provided.
min_max_scaler = MinMaxScaler()
min_max_scaler.fit(X_train)
X_train_minmax = X_train.copy()
X_train_minmax[X_train.columns] = min_max_scaler.transform(X_train)
X_train_minmax.head(5)
We can see that our data has been scaled in a range of [0,1]. We could also check the maximum values for each attribute and see how these values have been scaled.
X_train.max()
Before the Normalization, each attribute has a different range of values.
X_train_minmax.max()
After normalization, each attribute has been scaled to the range [0, 1]. To reduce the number of steps needed to perform the next two operations:
- Transform the data using a Min-Max Scaler
- Initialize and train a Logistic Regression Model
We are going to use a Pipeline, which allows us to chain multiple steps together and run them as a single object.
Steps to implement a Pipeline:
- Call make_pipeline() and list the steps in the order you want them to be executed.
- Use the fit() method to execute the steps.
- Use the score() method to calculate your model’s performance.
pipe_minmax = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=2000))
pipe_minmax.fit(X_train, y_train)
pipe_minmax.score(X_test, y_test)
Let us combine these elements; the complete example is listed below.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

# Data processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00542/log2.csv")
ordinalencoder = OrdinalEncoder()
df['Action'] = ordinalencoder.fit_transform(np.reshape(df['Action'].values, (-1, 1)))
y = df.pop("Action")
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42, test_size=0.1)

# Feature scaling
pipe_minmax = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=2000))
pipe_minmax.fit(X_train, y_train)
pipe_minmax.score(X_test, y_test)
Running the example calculates the model performance after performing Min-Max Normalization on the dataset.
Note: Your results may vary if you are using a different dataset or if you are using a different model with optimized parameters.
Z-Score Normalization or Data Standardization
Z score normalization, also known as standard score normalization, is a process used in machine learning to transform a given set of data points so that they have a mean of zero and a standard deviation of one. This process is also known as standardization, and it is a key step in many machine-learning algorithms. The purpose of z-score normalization is to make the data more consistent and easier to interpret, as well as to make it easier to compare different datasets.
Z-Score Algorithm
To understand z-score normalization, it is important to first understand what a z-score is. A z-score is a measure of how far a given data point is from the mean of a dataset. It is calculated by subtracting the mean of the dataset from the data point and then dividing the result by the standard deviation of the dataset. The result is a number that indicates how many standard deviations the data point is away from the mean. A positive z-score indicates that the data point is above the mean, while a negative z-score indicates that the data point is below the mean.
Now that we understand what a z-score is, let's look at how it is used in z-score normalization. The process begins by calculating the mean and standard deviation of the dataset. Then each data point is replaced by its z-score: the mean is subtracted and the result is divided by the standard deviation. The transformed data points have a mean of zero and a standard deviation of one, and these are the values fed to the machine learning algorithm. This makes the data more consistent and easier to interpret, and it makes different datasets easier to compare.
Z-Score Implementation
Scikit-learn provides us with StandardScaler which standardizes the features for us.
The z-score of a sample value x is given by:
z = (x - u) / s
where u is the mean and s is the standard deviation.
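As a quick worked example on the same toy values used earlier, [10, 20, 30] has a mean of 20 and a population standard deviation of about 8.165, so the z-scores are roughly -1.22, 0, and 1.22. A minimal sketch confirming this with StandardScaler (which uses the population standard deviation):

import numpy as np
from sklearn.preprocessing import StandardScaler

values = np.array([[10.0], [20.0], [30.0]])
# StandardScaler centers on the mean and divides by the population standard deviation (ddof=0)
print(StandardScaler().fit_transform(values).ravel())  # approx [-1.2247, 0., 1.2247]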
Algorithm: There are three steps involved in this process
- Initialize the Standard Scaler: The first step is to use the StandardScaler class and initialize the Standard Scaler for the dataset.
- Fit the Standard Scaler with the fit() method: Once we have initialized the Standard Scaler, we need to find the mean and standard deviation of every attribute. This is achieved by using the fit() method. Once the Standard Scaler is fitted, it can be reused later.
- Transform the data with the fitted Standard Scaler: The fit() method calculates the mean and standard deviation of every column, and the transform() method applies the standardization to every column of the data provided.
In the example, we will initialize the Standard Scaler, fit it on our training set, and transform the training set to inspect the result.
standard_scaler = StandardScaler()
standard_scaler.fit(X_train)
X_train_sscaler = X_train.copy()
X_train_sscaler[X_train.columns] = standard_scaler.transform(X_train)
X_train_sscaler.head(5)
Running the above code gives us a scaled dataset. Each attribute has been standardized to zero mean and unit variance.
print("Mean value before Standardizationn") print(X_train.mean())
print("Mean value after Standardizationn") print(round(X_train_sscaler.mean(axis=0),5))
We can see that before standardization the means of the attributes cover a wide range of values. Most estimators expect each feature to be roughly centered at zero with unit variance. If a feature has a much larger variance than the others, it can dominate the objective function, effectively receiving more importance than the rest and biasing the results.
Just like the mean, we can also check the standard deviation before and after applying the standardization.
print("Variance before Standardizationn") print(X_train.std(axis=0))
print("Variance after Standardizationn") print(round(X_train_sscaler.std(axis=0),4))
All the above steps can be reduced into two lines by using a Pipeline. A pipeline allows a combination of several steps which can be cross-validated together by tweaking a range of parameters.
In this example, we create a pipeline with standardization as the first step and model training as the second.
pipe_standard_scaler = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
pipe_standard_scaler.fit(X_train, y_train)
Once the pipeline has been created, we can check how the model performs on the test set by using the score() method, which reports the mean accuracy for a classifier.
pipe_standard_scaler.score(X_test, y_test)
Running the above code gives us a score of 98%. This tells us that standardizing the dataset has helped our model to learn better. Let’s have a look at the complete code.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Data processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00542/log2.csv")
ordinalencoder = OrdinalEncoder()
df['Action'] = ordinalencoder.fit_transform(np.reshape(df['Action'].values, (-1, 1)))
y = df.pop("Action")
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42, test_size=0.1)

# Feature scaling
pipe_standard_scaler = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
pipe_standard_scaler.fit(X_train, y_train)
pipe_standard_scaler.score(X_test, y_test)
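Since score() reports accuracy for a classification pipeline, you can compute the F1 score explicitly if that is the metric you care about. A minimal sketch using the f1_score and classification_report helpers imported at the top of the post, assuming the fitted pipe_standard_scaler from above:

from sklearn.metrics import classification_report, f1_score

# Predict on the held-out test set with the fitted pipeline
y_pred = pipe_standard_scaler.predict(X_test)

# Weighted F1 accounts for the imbalance across the four Action classes
print(f1_score(y_test, y_pred, average="weighted"))
print(classification_report(y_test, y_pred))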
MaxAbsScaler
MaxAbsScaler similarly scales the data, but the scaled values lie in the range [-1, 1]. It achieves this by dividing each value by the maximum absolute value of the respective attribute. We can use Scikit-learn's MaxAbsScaler to perform this operation.
The transformation is given by
X_scaled = X/X_max
Let us take an example to understand this properly. Let’s assume we have a data frame with attribute values [10,20,30]. Now we will apply the MaxAbsScaler to this attribute
X_max = 30
10 -> 10 / 30 = 0.33
20 -> 20 / 30 = 0.67
30 -> 30 / 30 = 1.0
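A minimal sketch confirming these values with MaxAbsScaler (the single-column array is the toy attribute above):

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

values = np.array([[10.0], [20.0], [30.0]])
print(MaxAbsScaler().fit_transform(values).ravel())  # approx [0.333 0.667 1.]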
Advantages:
- Useful in sparse data i.e. when a lot of data points are zero.
Algorithm: There are three steps involved in this process
- Initialize the Max Absolute Scaler: The first step is to use the Max Absolute Scaler class and initialize the Max Absolute Scaler for the dataset.
- Fit the Max Absolute Scaler with the fit() method: Once we have initialized the Max Absolute Scaler, we need to find the maximum absolute value of every attribute. This is achieved by using the fit() method. Once the Max Absolute Scaler is fitted, it can be reused later.
- Transform the data with the fitted Max Absolute Scaler: The fit() method calculates the maximum absolute value of every column, and the transform() method applies the scaling to every column of the data provided and returns the scaled data.
Now we will scale our data using MaxAbsScaler and check the resulting model's performance.
max_abs_scaler = MaxAbsScaler()
max_abs_scaler.fit(X_train)
X_train_max_abs = X_train.copy()
X_train_max_abs[X_train.columns] = max_abs_scaler.transform(X_train)
X_train_max_abs.head(5)
Running the example scales the data using MaxAbsScaler. Now we can check the effect of this transformation by training a Logistic Regression model and evaluating its score on the test set.
pipe_maxabs = make_pipeline(MaxAbsScaler(), LogisticRegression(max_iter=2000))
pipe_maxabs.fit(X_train, y_train)
pipe_maxabs.score(X_test, y_test)
Note: We can see that the performance of our model has decreased as compared to standardizing the data using a Standard Scaler. This may vary depending on the dataset you are using.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Data processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00542/log2.csv")
ordinalencoder = OrdinalEncoder()
df['Action'] = ordinalencoder.fit_transform(np.reshape(df['Action'].values, (-1, 1)))
y = df.pop("Action")
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42, test_size=0.1)

# Feature scaling
pipe_maxabs = make_pipeline(MaxAbsScaler(), LogisticRegression(max_iter=2000))
pipe_maxabs.fit(X_train, y_train)
pipe_maxabs.score(X_test, y_test)
Robust Scaling
Scaling is important for many machine learning estimators and is typically achieved by removing the mean and scaling to unit variance. But when outliers are present in the data, the mean and variance are themselves distorted, which can negatively influence the model. In these scenarios, working with the median and the interquartile range gives better results.
RobustScaler removes the median of each attribute and scales the data according to the interquartile range, i.e., the range between the 1st quartile (25th percentile) and the 3rd quartile (75th percentile).
Scikit-Learn provides us with RobustScaler to scale the features.
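A minimal sketch checking this behavior against a hand calculation on made-up values:

import numpy as np
from sklearn.preprocessing import RobustScaler

values = np.array([[10.0], [20.0], [30.0], [40.0], [50.0]])
# median = 30, Q1 = 20, Q3 = 40, so IQR = 20
# e.g. 10 -> (10 - 30) / 20 = -1.0 and 50 -> (50 - 30) / 20 = 1.0
print(RobustScaler().fit_transform(values).ravel())  # [-1.  -0.5  0.   0.5  1. ]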
Disadvantages:
- With its default centering it cannot be applied to sparse data, because subtracting the median would destroy sparsity.
Algorithm: There are three steps involved in this process
- Initialize the Robust Scaler: The first step is to use the RobustScaler class and initialize the Robust Scaler for the dataset.
- Fit the Robust Scaler with the fit() method: Once we have initialized the Robust Scaler, we need to find the median and quartiles of every attribute. This is achieved by using the fit() method. Once the Robust Scaler is fitted, it can be reused later.
- Transform the data with the fitted Robust Scaler: The fit() method calculates the median and quartiles of every column, and the transform() method applies the scaling to every column of the data provided and returns the scaled data.
In this example, we are going to scale our data using a RobustScaler and see whether it is impacting our model’s performance or not.
robust_scaler = RobustScaler()
robust_scaler.fit(X_train)
X_train_robust = X_train.copy()
X_train_robust[X_train.columns] = robust_scaler.transform(X_train)
X_train_robust.head(5)
Running the above code will scale the features with RobustScaler, removing the median from each attribute. Let us check whether the median has been removed from the data.
print("Before applying RobustScaler") X_train.median(axis=0)
print("After applying RobustScaler") X_train_robust.median(axis=0)
We can see that after applying the RobustScaler, the median has been removed from the attributes. In the next step, we are going to create a pipeline that performs robust scaling and model training in sequence.
pipe_robust = make_pipeline(RobustScaler(), LogisticRegression(max_iter=2000))
pipe_robust.fit(X_train, y_train)
pipe_robust.score(X_test, y_test)
Running the above code, we can see that our model achieves the highest score so far.
Note: This may not hold for other datasets. You need to check which kind of scaling works best for your data.
Exercise: Try to use different kinds of scalers on your dataset and comment on which one is giving the best score.
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Data processing
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00542/log2.csv")
ordinalencoder = OrdinalEncoder()
df['Action'] = ordinalencoder.fit_transform(np.reshape(df['Action'].values, (-1, 1)))
y = df.pop("Action")
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42, test_size=0.1)

# Feature scaling
pipe_robust = make_pipeline(RobustScaler(), LogisticRegression(max_iter=2000))
pipe_robust.fit(X_train, y_train)
pipe_robust.score(X_test, y_test)
Questions
Why do we need Data standardization?
- Data standardization can help reduce the risk of overfitting. Overfitting occurs when a model is fitted too closely to the training data, resulting in poor performance on unseen data. Standardizing the data makes it more consistent, which can help the model generalize better to new data.
- Data standardization helps in improving the accuracy of machine learning models. Standardizing the data helps to ensure that the data is consistent across different datasets, making it easier for the model to learn. This can help to improve the accuracy of the model, as well as its generalizability.
- Finally, data standardization is important for reducing the time and resources required to train a machine-learning model. Standardizing the data helps to reduce the amount of time and resources required to train a model, as the data is already in a consistent format. This can help to reduce the cost of training a model, as well as the time it takes to train it.
Summary
Feature scaling is an essential step in data preprocessing for machine learning algorithms. It is a technique used to transform the range of independent variables or features of data to a standard scale. Feature scaling can help to reduce the time required to train a model, reduce the effect of outliers in the data, and ensure that all the features of the data are on the same scale so that the model can learn from the data better. No matter which type of feature scaling is used, it is important to remember that the data should be scaled before the model is trained.
Data normalization is an important step in the preprocessing of data before it is used in a machine learning algorithm. It helps to reduce the effect of outliers and noise on the data, and it can help to improve the accuracy of the machine learning model. There are several different types of data normalization, and each one has its own advantages and disadvantages. It is important to choose the right type of data normalization for your data in order to ensure that the machine learning algorithm is not biased towards any particular data point.
Data standardization is an important step in the machine-learning process. It is used to transform data into a common format that can be used by different algorithms and models. Standardization helps to ensure that data is consistent across different datasets, making it easier to compare and analyze. The process of standardization can involve a variety of techniques, such as normalization, scaling, and discretization. Data standardization is also important for reducing the risk of overfitting, improving the accuracy of machine learning models, and reducing the time and resources required to train a model.