3. Data Pre-processing Tasks Using Python with Data Reduction Techniques

PRASHIL VAISHNANI
5 min read · Nov 16, 2021


The amount of detail captured in datasets is growing rapidly. This can become a problem: with so much data, the important features of a dataset may end up buried in useless ones. That is why data pre-processing is crucial for any dataset. In this post we discuss different data reduction methods that remove unnecessary data from a dataset and make it more efficient for our models to work with.

The scikit-learn documentation lists several feature selection methods. Here, we apply different feature selection methods to the same dataset and compare their results.

Dataset Used

The dataset used for carrying out data reduction is the ‘Iris’ dataset, available through the sklearn.datasets module.

Loading the Iris dataset
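A minimal sketch of this step; the variable names X and y are my own choices, not dictated by the post:

```python
from sklearn.datasets import load_iris

# Load the Iris dataset: 150 samples, 4 features, 3 classes.
iris = load_iris()
X, y = iris.data, iris.target
print(X.shape)           # (150, 4)
print(iris.feature_names)
```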

The data have four features. To test the effectiveness of different feature selection methods, we add some noise features to the data set.

Adding noise
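One way to add the noise features, continuing from the loading step above (the seed and the uniform range are arbitrary choices for this sketch):

```python
import numpy as np

# Append 4 random noise columns so the selectors have something to discard.
# Uniform noise in [0, 1) keeps every column non-negative, which the chi2
# test used later requires.
rng = np.random.RandomState(42)
noise = rng.uniform(0, 1, size=(X.shape[0], 4))
X_noisy = np.hstack([X, noise])
print(X_noisy.shape)  # (150, 8)
```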

The dataset now has 8 features: the 4 original, informative features and 4 noise features.

Principal Component Analysis (PCA)

Principal component analysis (PCA) is a technique for reducing the dimensionality of large datasets, increasing interpretability while minimizing information loss.

For many machine learning applications it helps to be able to visualize your data. Visualizing 2- or 3-dimensional data is not that challenging. However, even the Iris dataset used in this part of the tutorial is 4-dimensional. You can use PCA to reduce that 4-dimensional data to 2 or 3 dimensions so that you can plot it and, hopefully, understand it better. Here we apply PCA to the original data.

Importing libraries
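A sketch of the imports and the scaling step; df and scaled_df are assumed names for the raw and standardized DataFrames:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Put the 4 original features into a DataFrame and standardize them
# (zero mean, unit variance) before running PCA.
df = pd.DataFrame(X, columns=iris.feature_names)
X_scaled = StandardScaler().fit_transform(df)
scaled_df = pd.DataFrame(X_scaled, columns=iris.feature_names)
print(scaled_df.head())
```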

The DataFrame after applying StandardScaler

PCA Projection to 2D

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original 4-dimensional data into 2 dimensions. The new components are simply the two main directions of variation.

PCA for 2D projection
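A sketch of the 2D projection, assuming the standardized data X_scaled from the previous step:

```python
from sklearn.decomposition import PCA

# Project the 4 standardized features onto 2 principal components.
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_scaled)
principal_df = pd.DataFrame(
    principal_components,
    columns=['principal component 1', 'principal component 2'],
)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```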

The 4 original columns are converted into 2 principal component columns

We concatenate the DataFrames along axis=1; resultant_Df is the final DataFrame used for plotting the data.

Concatenating target column into dataframe
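Something like the following; resultant_Df matches the name used above, while target_df is an assumed helper:

```python
# Attach the target column so each projected point keeps its class label.
target_df = pd.DataFrame(iris.target, columns=['target'])
resultant_Df = pd.concat([principal_df, target_df], axis=1)
print(resultant_Df.head())
```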

Now, let's visualize the DataFrame:
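A possible plotting sketch using matplotlib (the colours and figure size are arbitrary choices):

```python
import matplotlib.pyplot as plt

# Scatter plot of the 2 principal components, coloured by class.
fig, ax = plt.subplots(figsize=(8, 6))
for target, color in zip([0, 1, 2], ['r', 'g', 'b']):
    subset = resultant_Df[resultant_Df['target'] == target]
    ax.scatter(subset['principal component 1'],
               subset['principal component 2'],
               c=color, label=iris.target_names[target])
ax.set_xlabel('principal component 1')
ax.set_ylabel('principal component 2')
ax.set_title('2 component PCA')
ax.legend()
plt.show()
```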

2D representation of dataframe

PCA Projection to 3D

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original 4-dimensional data into 3 dimensions. The new components are simply the three main directions of variation.

Getting 3 principal component columns
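A sketch of the 3-component projection, reusing X_scaled and target_df from the earlier steps:

```python
# Same idea as before, but keep 3 principal components instead of 2.
pca_3d = PCA(n_components=3)
components_3d = pca_3d.fit_transform(X_scaled)
df_3d = pd.concat(
    [pd.DataFrame(components_3d, columns=['PC1', 'PC2', 'PC3']), target_df],
    axis=1,
)
print(pca_3d.explained_variance_ratio_)
```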

Now, let's visualize the 3D graph:
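A possible 3D plot with matplotlib; again, the styling choices are my own:

```python
# 3D scatter plot of the three principal components, coloured by class.
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(projection='3d')
for target, color in zip([0, 1, 2], ['r', 'g', 'b']):
    subset = df_3d[df_3d['target'] == target]
    ax.scatter(subset['PC1'], subset['PC2'], subset['PC3'],
               c=color, label=iris.target_names[target])
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
ax.legend()
plt.show()
```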

3D graph

Variance Threshold

Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance does not meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so the data is not affected here.

Variance Threshold
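A minimal sketch, applied to the 8-column noisy data built earlier:

```python
from sklearn.feature_selection import VarianceThreshold

# With the default threshold of 0.0 only constant (zero-variance) columns
# are dropped; the Iris data has none, so all 8 features survive.
selector = VarianceThreshold()  # threshold=0.0 by default
X_vt = selector.fit_transform(X_noisy)
print(X_vt.shape)  # still (150, 8)
```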

Univariate Feature Selection

Univariate feature selection works by selecting the best features based on univariate statistical tests. We compare each feature to the target variable to see whether there is a statistically significant relationship between them. One commonly used test is the analysis of variance (ANOVA) F-test. When we analyze the relationship between one feature and the target variable, we ignore the other features; that is why it is called ‘univariate’. Each feature gets its own test score.
Finally, all the test scores are compared, and the features with the top scores are selected.

  1. f_classif

Also known as the ANOVA F-test, f_classif computes the F-statistic between each feature and the target classes.

ANOVA Test
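A sketch using SelectKBest with f_classif; k=4 is chosen here because we know 4 informative features exist:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 4 features with the highest ANOVA F-scores against the target.
anova_selector = SelectKBest(score_func=f_classif, k=4)
X_anova = anova_selector.fit_transform(X_noisy, y)
print(anova_selector.scores_)        # F-score per feature
print(anova_selector.get_support())  # True for the selected features
```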

2. chi2

This score can be used to select the features with the highest values for the test chi-squared statistic from data, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes.

chi2 Test
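The same pattern with the chi-squared score; this works here because all 8 columns are non-negative:

```python
from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative features; the Iris measurements and the
# uniform noise columns added earlier satisfy this.
chi2_selector = SelectKBest(score_func=chi2, k=4)
X_chi2 = chi2_selector.fit_transform(X_noisy, y)
print(chi2_selector.scores_)
print(chi2_selector.get_support())
```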

3. mutual_info_classif

Estimate mutual information for a discrete target variable.

Mutual information (MI) between two random variables is a non-negative value that measures the dependency between the variables. It is equal to zero if and only if the two random variables are independent, and higher values mean higher dependency.

The function relies on nonparametric methods based on entropy estimation from k-nearest neighbors distances.

mutual_info_classif Test
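And the same pattern with mutual information:

```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Mutual information between each feature and the class labels;
# higher scores indicate stronger dependency.
mi_selector = SelectKBest(score_func=mutual_info_classif, k=4)
X_mi = mi_selector.fit_transform(X_noisy, y)
print(mi_selector.scores_)
print(mi_selector.get_support())
```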

Recursive Feature Elimination

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. RFE requires the number of features to keep to be specified in advance; however, it is often not known beforehand how many features are actually relevant.

RFE using Random Forest Classifier
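A sketch of RFE wrapped around a random forest; n_features_to_select=4 again reflects the 4 known informative columns:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Recursively drop the weakest feature (by importance) until 4 remain.
rfe = RFE(estimator=RandomForestClassifier(random_state=0),
          n_features_to_select=4)
rfe.fit(X_noisy, y)
print(rfe.support_)  # True for kept columns, False for the noise columns
print(rfe.ranking_)  # rank 1 = selected
```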

Here, only the original columns are marked True, while all of the added noise columns are marked False.

In this blog post, we have seen how to apply different feature selection methods to the same data and compared their results.
