Non-Linear Transformations
In this blog, you are going to learn
- What are Non-linear Transformations?
- Why do we need Non-linear Transformations?
- How to apply Quantile Transformation?
- How to apply Power Transformation?
Non-linear transformations in machine learning are a powerful tool for data analysis and modeling. They allow us to explore complex relationships between variables and uncover hidden patterns in data. Non-linear transformations can be used to transform data into a more suitable form for analysis and to reduce the dimensionality of the data. They can also be used to improve the accuracy of machine learning models.
Why do we need Non-linear transformations?
A transformation applies a mathematical function to the values of a feature. A linear transformation, such as min-max scaling or standardization, only shifts and rescales the values, so the shape of the distribution stays the same. A non-linear transformation, such as a logarithm, a power function, or a quantile mapping, changes the spacing between values. This lets it reduce skew, compress long tails, and bring a feature closer to a uniform or Gaussian shape, which is often a more suitable form for analysis.
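To make the difference concrete, here is a minimal sketch on synthetic data (the log-normal sample and the `skewness` helper are assumptions made for illustration, not part of the dataset used later). It compares how a linear rescaling and a logarithm affect the skewness of a feature.

```python
import numpy as np

def skewness(values):
    """Sample skewness: the third standardized moment."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # synthetic right-skewed feature

# Linear transformation (min-max scaling): changes the range, not the shape.
x_linear = (x - x.min()) / (x.max() - x.min())

# Non-linear transformation (ln(x + 1)): compresses large values, changing the shape.
x_log = np.log1p(x)

print(f"original skewness: {skewness(x):.2f}")
print(f"min-max skewness:  {skewness(x_linear):.2f}")  # identical to the original
print(f"log1p skewness:    {skewness(x_log):.2f}")     # noticeably smaller
```

The min-max version keeps exactly the same skewness as the original, while the logarithm reduces it.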
Non-linear transformations can also be used to reduce the dimensionality of the data. Dimensionality reduction is the process of reducing the number of variables in a dataset. This can be done by applying a non-linear transformation to the data. The transformation can be used to reduce the number of variables by combining multiple variables into one or by removing redundant variables. This can help to reduce the complexity of the data and make it easier to analyze.
Non-linear transformations can also be used to improve the accuracy of machine learning models. This is done by applying the transformation to the data before it is fed into the model. Many models work better when the features are roughly symmetric and free of extreme values, so reducing skew and compressing outliers can help the model pick up patterns that would otherwise be dominated by a few large values.
Non-linear transformations can also help with the interpretability of machine learning models. When features share a comparable, well-behaved distribution, it is easier to compare their effects and to understand how the model is making predictions.
Importing the Libraries
import warnings

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import QuantileTransformer

warnings.filterwarnings('ignore')
Data Set Information:
The output ‘area’ was first transformed with a ln(x+1) function. Then, several Data Mining methods were applied. After fitting the models, the outputs were post-processed with the inverse of the ln(x+1) transform. Four different input setups were used. The experiments were conducted using a 10-fold (cross-validation) x 30 runs. Two regression metrics were measured: MAD and RMSE. A Gaussian support vector machine (SVM) fed with only 4 direct weather conditions (temp, RH, wind and rain) obtained the best MAD value: 12.71 +- 0.01 (mean and confidence interval within 95% using a t-student distribution). The best RMSE was attained by the naive mean predictor. An analysis to the regression error curve (REC) shows that the SVM model predicts more examples within a lower admitted error. In effect, the SVM model predicts better small fires, which are the majority.
X – x-axis spatial coordinate within the Montesinho park map: 1 to 9
Y – y-axis spatial coordinate within the Montesinho park map: 2 to 9
month – month of the year: ‘jan’ to ‘dec’
day – day of the week: ‘mon’ to ‘sun’
FFMC – FFMC index from the FWI system: 18.7 to 96.20
DMC – DMC index from the FWI system: 1.1 to 291.3
DC – DC index from the FWI system: 7.9 to 860.6
ISI – ISI index from the FWI system: 0.0 to 56.10
temp – temperature in Celsius degrees: 2.2 to 33.30
RH – relative humidity in %: 15.0 to 100
wind – wind speed in km/h: 0.40 to 9.40
rain – outside rain in mm/m2 : 0.0 to 6.4
area – the burned area of the forest (in ha): 0.00 to 1090.84 (this output variable is very skewed towards 0.0, thus it may make sense to model with the logarithm transform).
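The ln(x+1) transform mentioned in the description, and its inverse, can be sketched with NumPy's `log1p` and `expm1`. The sample values below are made up for illustration, apart from the maximum of 1090.84 quoted above.

```python
import numpy as np

# Illustrative burned areas in hectares: mostly zeros plus a few large fires.
area = np.array([0.0, 0.0, 0.36, 2.14, 10.73, 1090.84])

area_log = np.log1p(area)       # ln(x + 1); defined at x = 0, unlike ln(x)
area_back = np.expm1(area_log)  # inverse transform: exp(y) - 1

print(area_log.round(3))
print(np.allclose(area_back, area))  # the round trip recovers the original values
```

Adding 1 before taking the logarithm keeps the many zero-area rows at exactly 0 instead of producing negative infinity.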
# df is assumed to already hold the dataset as a pandas DataFrame
ordinalencoder = OrdinalEncoder()
ordinalencoder.fit(df[['month', 'day']])
df[['month', 'day']] = ordinalencoder.transform(df[['month', 'day']])
Once we have transformed our column, we can have a look at the values.
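One detail worth checking when looking at the encoded values: by default, OrdinalEncoder sorts string categories alphabetically, so the codes do not follow calendar order. The sketch below uses a toy column (an assumption for illustration) to show the default behaviour and how an explicit order could be passed through the `categories` parameter.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

months = np.array([["jan"], ["feb"], ["mar"], ["apr"]])  # toy column, not the real data

# Default behaviour: categories are sorted alphabetically.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(months).ravel()
print(default_encoder.categories_[0])  # alphabetical: apr, feb, jan, mar
print(default_codes)

# Explicit order: codes follow the list we pass in.
calendar_order = ["jan", "feb", "mar", "apr"]
ordered_encoder = OrdinalEncoder(categories=[calendar_order])
ordered_codes = ordered_encoder.fit_transform(months).ravel()
print(ordered_codes)
```

Whether the order matters depends on the model. Tree-based models are fairly tolerant of arbitrary codes, while models that treat the code as a magnitude are not.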
Now that the dataset is prepared, the next step is to separate the target column from the features. To achieve that, we are going to use the pop() method, which removes the column from the data frame and returns it.
We will then split the data into training and testing sets using the train_test_split() method.
y = df.pop("area")
X_train, X_test, y_train, y_test = train_test_split(df, y, random_state=42, test_size=0.1)
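As a small illustration of what these two calls do, here is a sketch on a toy data frame (the column values are made up, and the names `toy`, `target`, `X_tr`, and so on are only for this example).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the prepared dataset; values are made up.
toy = pd.DataFrame({
    "temp": [8.2, 18.0, 14.6, 8.3, 11.4, 22.2, 24.1, 8.0, 13.1, 22.8],
    "wind": [6.7, 0.9, 1.3, 4.0, 1.8, 5.4, 3.1, 2.2, 5.4, 4.0],
    "area": [0.0, 0.0, 0.4, 0.0, 1.9, 0.0, 6.4, 0.0, 0.0, 10.1],
})

target = toy.pop("area")  # removes "area" from toy in place and returns it as a Series

X_tr, X_te, y_tr, y_te = train_test_split(toy, target, test_size=0.1, random_state=42)

print(list(toy.columns))     # only the feature columns remain
print(len(X_tr), len(X_te))  # 90% / 10% of the 10 rows
```

Because pop() modifies the data frame in place, running it twice on the same frame raises a KeyError.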
There are two types of non-linear transformations available in Scikit-Learn:
- Quantile Transformation
- Power Transformation
Now we will discuss both in detail.
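Quantile Transformation
Quantile transformation maps each feature to a uniform distribution with values in the range [0,1]. Scikit-Learn provides QuantileTransformer to perform this operation on our data. It can also produce a normal output distribution instead, through its output_distribution parameter.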
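Algorithm: It is a three-step algorithm
- An estimate of each feature's cumulative distribution function, computed from the training data, is used to map the data points to a uniform distribution.
- The values obtained in the previous step are then mapped to the desired output distribution using its quantile function.
- Values of new data that fall below or above the range seen during fitting are mapped to the bounds of the output distribution.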
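The three steps can be sketched in plain NumPy. This is a simplified illustration under stated assumptions, not Scikit-Learn's actual implementation (which, for example, treats ties more carefully). The helper names `fit_quantiles`, `to_uniform`, and `to_normal` are hypothetical, and the data is synthetic.

```python
from statistics import NormalDist

import numpy as np

def fit_quantiles(train_values, n_quantiles=100):
    """Estimate the CDF by storing evenly spaced quantiles of the training data."""
    references = np.linspace(0.0, 1.0, n_quantiles)    # target probabilities
    quantiles = np.quantile(train_values, references)  # feature values at those probabilities
    return references, quantiles

def to_uniform(values, references, quantiles):
    """Step 1: map values to [0, 1] by interpolating the estimated CDF.

    np.interp clamps inputs outside the fitted range to the end points,
    so unseen values below or above the training range become 0 or 1 (step 3).
    """
    return np.interp(values, quantiles, references)

def to_normal(uniform_values, eps=1e-7):
    """Step 2 for a normal output: apply the normal quantile function."""
    clipped = np.clip(uniform_values, eps, 1.0 - eps)  # avoid infinities at 0 and 1
    return np.array([NormalDist().inv_cdf(p) for p in clipped])

rng = np.random.default_rng(0)
train = rng.exponential(scale=2.0, size=5_000)  # synthetic skewed feature
references, quantiles = fit_quantiles(train)

uniform = to_uniform(train, references, quantiles)
normal = to_normal(uniform)
out_of_range = to_uniform(np.array([-5.0, 1e6]), references, quantiles)

print(np.percentile(uniform, [0, 25, 50, 75, 100]).round(2))
print(out_of_range)  # values outside the fitted range hit the bounds
```

For a uniform output, step 2 is the identity, which is why `to_uniform` alone already gives the final result in that case.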
quantiletransformer = QuantileTransformer()
quantiletransformer.fit(X_train)
X_train_quantiletransformer = X_train.copy()
X_train_quantiletransformer[X_train.columns] = quantiletransformer.transform(X_train)
X_train_quantiletransformer.head(5)
Running the code above maps every feature to an approximately uniform distribution on [0,1]. Now let us check how the data has been transformed by comparing the percentiles of the temp column before and after.
np.percentile(X_train['temp'], [0, 25, 50, 75, 100])
np.percentile(X_train_quantiletransformer['temp'], [0, 25, 50, 75, 100])
We can see that the transformed values are spread roughly evenly between 0 and 1 in the X_train_quantiletransformer data frame. Now we will implement a pipeline that applies the transformation and fits a model in a single step.
pipe_quantiletransformer = make_pipeline(QuantileTransformer(), RandomForestRegressor())
pipe_quantiletransformer.fit(X_train, y_train)
# For a regressor, score() returns the R^2 on the given data
pipe_quantiletransformer.score(X_test, y_test)
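Now that we have seen how QuantileTransformer works, let us discuss some of its features in detail.
- It helps in reducing the impact of outliers, because each value is replaced by its position in the estimated distribution. The mapping is monotonic, so the ordering of the values is preserved.
- It can be used even in the case of missing values: NaNs are ignored during fitting and stay NaN in the output.
- It is a non-linear transformation. If we have linear correlations between our variables, these would be distorted, as would the distances between data points.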
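The first two points can be checked with a small sketch on a synthetic feature (the values, including the outlier of 10,000, are assumptions chosen for illustration). It compares where the largest ordinary value ends up under min-max scaling and under QuantileTransformer.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic feature: values 1..99, one extreme outlier, and one missing value.
feature = np.concatenate([np.arange(1.0, 100.0), [10_000.0, np.nan]]).reshape(-1, 1)

transformer = QuantileTransformer(n_quantiles=100, output_distribution="uniform")
transformed = transformer.fit_transform(feature)

# Min-max scaling for comparison, ignoring the NaN.
minmax_99 = (99.0 - 1.0) / (10_000.0 - 1.0)

print(f"99.0 under min-max scaling:   {minmax_99:.4f}")          # squashed near 0 by the outlier
print(f"99.0 under quantile mapping:  {transformed[-3, 0]:.4f}")  # stays near the top of the range
print(f"outlier under quantile mapping: {transformed[-2, 0]}")    # mapped to the upper bound
print(f"NaN stays NaN: {np.isnan(transformed[-1, 0])}")
```

Under min-max scaling the single outlier pushes all ordinary values towards 0, while the quantile mapping keeps them spread across the range.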
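Power Transformation
When we want our data to be as close to a Gaussian distribution as possible, we can use a power transformer. It applies a parametric, monotonic power function to each feature, choosing the parameter (lambda) by maximum likelihood so as to stabilize variance and minimize skewness. By default, the result is then standardized to zero mean and unit variance.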
Scikit-Learn's PowerTransformer supports two methods
- Yeo-Johnson transform: It works on both positive and negative data points.
- Box-Cox transform: It only works on strictly positive data points.
In our example, we are going to use the Yeo-Johnson transform method, since some of our features contain zeros (for example rain and ISI), which Box-Cox cannot handle.
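To show what the Yeo-Johnson method computes, here is a sketch that writes out the transform for a given lambda and compares it with PowerTransformer on synthetic data. The `yeo_johnson` helper is a hypothetical name for this illustration, and the data is generated so that it contains negative values. The sketch also shows Box-Cox rejecting the same data.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform of a 1-D array for a fixed lambda."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    pos = x >= 0
    # Non-negative values: ((x + 1)^lambda - 1) / lambda, or ln(x + 1) when lambda = 0.
    if abs(lmbda) > 1e-12:
        out[pos] = ((x[pos] + 1.0) ** lmbda - 1.0) / lmbda
    else:
        out[pos] = np.log1p(x[pos])
    # Negative values: -((1 - x)^(2 - lambda) - 1) / (2 - lambda), or -ln(1 - x) when lambda = 2.
    if abs(lmbda - 2.0) > 1e-12:
        out[~pos] = -((1.0 - x[~pos]) ** (2.0 - lmbda) - 1.0) / (2.0 - lmbda)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

rng = np.random.default_rng(0)
data = rng.exponential(size=(500, 1)) - 0.5  # synthetic feature with some negative values

# standardize=False returns the raw power transform without zero-mean/unit-variance scaling.
yj = PowerTransformer(method="yeo-johnson", standardize=False).fit(data)
manual = yeo_johnson(data[:, 0], yj.lambdas_[0])
matches = np.allclose(manual, yj.transform(data)[:, 0])
print("manual formula matches PowerTransformer:", matches)

# Box-Cox needs strictly positive input, so fitting it on this data fails.
box_cox_error = None
try:
    PowerTransformer(method="box-cox").fit(data)
except ValueError as err:
    box_cox_error = err
print("Box-Cox error:", box_cox_error)
```

The fitted lambda for each feature is available in the `lambdas_` attribute after calling fit.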
powertransformer = PowerTransformer(method='yeo-johnson')
powertransformer.fit(X_train)
X_train_powertransformer = X_train.copy()
X_train_powertransformer[X_train.columns.tolist()] = powertransformer.transform(X_train)
X_train_powertransformer.head(5)
Running the above code brings each feature closer to a Gaussian distribution. Because PowerTransformer standardizes its output by default, we can also check that the transformed training data has zero mean and unit variance.
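A check of this kind could look like the following sketch, shown here on synthetic skewed data so that it runs on its own (the log-normal sample is an assumption for illustration, not the dataset above).

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=(2_000, 2))  # synthetic skewed features

transformed = PowerTransformer(method="yeo-johnson").fit_transform(skewed)

column_means = transformed.mean(axis=0)
column_stds = transformed.std(axis=0)  # NumPy uses the population formula (ddof=0)

print(column_means.round(3))  # close to 0 for every column
print(column_stds.round(3))   # close to 1 for every column
```

On a pandas data frame, `.std()` uses ddof=1 by default, so the reported values come out marginally above 1 unless `ddof=0` is passed.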
We can see that the data has been properly transformed. Now we will create a pipeline that applies PowerTransformer and fits a random forest regressor on our dataset.
pipe_powertransformer = make_pipeline(PowerTransformer(), RandomForestRegressor())
pipe_powertransformer.fit(X_train, y_train)
# For a regressor, score() returns the R^2 on the given data
pipe_powertransformer.score(X_test, y_test)
Non-linear transformations are a powerful tool for data analysis and modeling. They can be used to transform data into a more suitable form for analysis, to reduce the dimensionality of the data, and to improve both the accuracy and the interpretability of machine learning models. In Scikit-Learn, QuantileTransformer maps features to a uniform distribution, and PowerTransformer brings them closer to a Gaussian distribution.