Pandas Corr() to find the most important numerical features.
In this blog, you are going to learn
1. How to import the Pandas library?
2. How to Import the dataset?
3. How to explore the dataset?
4. How to use the Pandas corr( ) method to calculate feature importance?
5. How to build a heatmap based on the correlation calculated using the Pandas corr( ) method?
6. How to extract important features using a correlation threshold?
1. Import the Libraries
import pathlib
import numpy as np
import pandas as pd
import tensorflow as tf
import keras
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
2. Import the Dataset
We are importing the Breast Cancer Wisconsin dataset. The file comes with a .data extension and contains comma-separated values, so we need to read it into a Pandas DataFrame.
path = keras.utils.get_file(
    "breast-cancer-wisconsin.data",
    "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data")
path
We read the file into a Pandas DataFrame with read_csv( ), and then use built-in methods such as
dropna( ) : To drop the rows with null values.
drop( ) : To drop a specific column in a DataFrame.
column_names = ['id', 'Clump_Thickness', 'Cell_Size', 'Cell_Shape',
                'Marginal_Adhesion', 'Epithelial_Cell_Size', 'Bare_Nuclei',
                'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class']
dataset = pd.read_csv(path, names=column_names, na_values="?",
                      comment='\t', sep=",", skipinitialspace=True)
dataset = dataset.dropna()
dataset = dataset.drop(columns=['id'])
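To see what na_values, dropna( ) and drop( ) each do, here is a minimal sketch on a tiny in-memory file. The rows, the ids, and the shortened column list are made up for illustration; only the "?" marker for missing values mirrors the real file.

```python
import io
import pandas as pd

# Made-up rows in the same comma-separated layout; "?" marks a missing value.
raw = io.StringIO(
    "101,5,1,1,2\n"
    "102,5,4,?,4\n"
    "103,3,1,2,2\n"
)
toy_columns = ["id", "Clump_Thickness", "Cell_Size", "Bare_Nuclei", "Class"]

# na_values="?" turns every "?" into NaN while reading.
toy = pd.read_csv(raw, names=toy_columns, na_values="?")
missing_per_column = toy.isna().sum()
print(missing_per_column)

toy = toy.dropna()               # drop the row that has a missing value
toy = toy.drop(columns=["id"])   # drop the identifier column
print(toy.shape)
```

Both methods return a new DataFrame, which is why the result is assigned back to the same variable.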
3. Explore the Dataset
After dropping the id column, we have a total of 10 columns in our dataset, of which 9 are of integer type and 1 is of float type.
The Class column is the target column.
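The column types can be checked with the dtypes attribute. The sketch below uses made-up values to show why one column can end up as float: a column that contained missing values while it was being read is stored as float, and it stays float after dropna( ). In this dataset that is most likely the column where "?" appeared.

```python
import numpy as np
import pandas as pd

# Made-up values; only the resulting dtypes matter here.
toy = pd.DataFrame({
    "Cell_Size": [1, 4, 1],
    "Bare_Nuclei": [1, np.nan, 2],   # one missing entry
})
print(toy.dtypes)        # Cell_Size is an integer column, Bare_Nuclei is float

cleaned = toy.dropna()
print(cleaned.dtypes)    # Bare_Nuclei is still float after dropna()

# Optional: cast back to integers once no missing values remain.
as_int = cleaned.astype({"Bare_Nuclei": "int64"})
print(as_int.dtypes)
```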
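# In this dataset, Class is 2 for benign and 4 for malignant; map it to 0 and 1.
dataset['Class'] = dataset['Class'].apply(lambda x: 0 if x == 2 else 1)
# Separate the target from the features.
y = dataset.pop("Class")
X = dataset.copy()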
4. Correlation using Pandas corr( )
We will use the Pandas corr( ) method, which computes the pairwise Pearson correlation between the columns of a DataFrame.
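The later steps use a correlation matrix named co_relation that contains a Class column. One way to obtain it, sketched here on made-up numbers, is to put the target back next to the features before calling corr( ). The feature names and values below are hypothetical, and the assign( ) step is an assumption about how the matrix could be built, not necessarily how it was built originally.

```python
import pandas as pd

# Made-up feature values and labels, for illustration only.
X = pd.DataFrame({
    "feature_a": [1, 2, 8, 10, 1, 7],
    "feature_b": [1, 1, 7, 9, 2, 8],
    "feature_c": [1, 1, 1, 2, 1, 1],
})
y = pd.Series([0, 0, 1, 1, 0, 1], name="Class")

# Put the target back next to the features so that corr() also
# reports how strongly each feature correlates with "Class".
co_relation = X.assign(Class=y).corr(method="pearson")
print(co_relation.round(2))
```

The result is a square, symmetric DataFrame with one row and one column per variable and 1.0 on the diagonal. Pearson is the default method, so corr( ) without arguments gives the same matrix.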
5. Build the Heatmap
A heatmap makes the strongest correlations in the dataset easy to spot. Seaborn's heatmap( ) function draws the correlation matrix as a color-coded grid and supports a wide choice of color maps.
# co_relation is the correlation matrix returned by corr( ).
fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(co_relation,
            xticklabels=co_relation.columns,
            yticklabels=co_relation.columns,
            annot=True, ax=ax)
6. Important Features using Pandas Corr( )
Once we have the correlation between every pair of columns, we can pick out the features that correlate most strongly with the target and keep them in our training dataset.
co_relation_target = abs(co_relation['Class'])
important_features = co_relation_target[co_relation_target > 0.8]
important_features
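We can select important features by specifying a threshold on the absolute correlation with the target. Note that this value is a correlation coefficient, not a P-Value. In our example, we have a selection threshold value of 0.8.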
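Two details are worth a quick sketch. The target always correlates perfectly with itself, so it passes any threshold unless it is removed first, and a fixed number of features can be picked with nlargest( ) instead of a threshold. The feature names and correlation values below are hypothetical and are not results from the dataset.

```python
import pandas as pd

# Hypothetical absolute correlations with the target, for illustration only.
co_relation_target = pd.Series(
    {"feature_a": 0.82, "feature_b": 0.81, "feature_c": 0.42,
     "feature_d": 0.83, "Class": 1.00}
)

# The target correlates perfectly with itself, so remove it first.
feature_scores = co_relation_target.drop("Class")

# Option 1: keep every feature above a correlation threshold.
above_threshold = feature_scores[feature_scores > 0.8]

# Option 2: keep the n strongest features, whatever their values.
top_n = feature_scores.nlargest(2)

print(above_threshold.index.tolist())
print(top_n.index.tolist())
```

The threshold keeps a variable number of features, while nlargest( ) always keeps exactly n, which is closer to the idea of choosing the top n features.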
1. corr( ) : To calculate the correlation between every feature in the dataset.
2. heatmap( ) : To build and visualize the results of the Pandas corr( ) method in the form of a heatmap.