Data Pre-processing using Scikit-learn

PRASHIL VAISHNANI
4 min read · Sep 18, 2021

Data pre-processing is a data-mining technique for converting raw data into an understandable format. In this practical, we will take one dataset and perform the following tasks:

  1. Standardization
  2. Normalization
  3. Encoding
  4. Discretization
  5. Imputation of missing values

We will use the Students Performance dataset.

Dataset is here

Standardization

Data standardization is the process of bringing data into a uniform format. Each value is centered around the mean and scaled to unit standard deviation, i.e. a value x becomes (x - mean) / standard deviation.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
# keep only the numeric columns of the Students Performance DataFrame (data)
numeric_columns = [c for c in data.columns if data[c].dtype != np.dtype('O')]
temp_data = data[numeric_columns]
# scale each column to zero mean and unit variance
standard_scaler = StandardScaler()
standardized_data = standard_scaler.fit_transform(temp_data)
pd.DataFrame(standardized_data, columns=temp_data.columns)

Here we collect the numeric columns of the dataset and standardize them using StandardScaler().

Normalization

Normalization is another data pre-processing technique. It rescales each feature to a fixed range, typically [0, 1], using min-max scaling: a value x becomes (x - min) / (max - min), which keeps all features on a comparable scale.

from sklearn.preprocessing import MinMaxScaler
# rescale each numeric column to the [0, 1] range
normalizer = MinMaxScaler()
normalized_data = normalizer.fit_transform(temp_data)
pd.DataFrame(normalized_data, columns=temp_data.columns)

Encoding

Encoding converts categorical variables into numerical or binary counterparts.

List of encoding techniques

  • Label Encoding
  • One hot Encoding
  • Dummy Encoding
  • Effect Encoding
  • Binary Encoding
  • BaseN Encoding
  • Hash Encoding
  • Target Encoding

In this practical, we will cover Label Encoding and One hot Encoding.

Label Encoding

Label Encoding converts each category label into an integer value, so the feature becomes ordinal.

from sklearn.preprocessing import LabelEncoder
# replace each category in the Status column with an integer code
le = LabelEncoder()
data['Status'] = le.fit_transform(data['Status'])
data['Status'].value_counts()

One hot Encoding

One hot encoding creates a new binary variable for each level of a categorical feature; the variable is 1 when the observation belongs to that level and 0 otherwise.

from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder()
# fit on race/ethnicity; toarray() turns the sparse output into a dense array
transformed = one_hot.fit_transform(data['race/ethnicity'].values.reshape(-1, 1)).toarray()
# use the learned categories as the new column names
transformed_data = pd.DataFrame(transformed, columns=one_hot.categories_[0])
transformed_data.head()
transformed_data.iloc[90]
data['race/ethnicity'][90]

Here we one hot encode the race/ethnicity column of the dataset using OneHotEncoder() and compare one encoded row with its original label.

Discretization

Discretization is a technique for converting continuous variables, models or functions into a discrete form.

For discretization, we take three columns: math score, reading score and writing score.

Uniform Discretization Transform

A uniform discretization transform will preserve the probability distribution of each input variable but will make it discrete with the specified number of ordinal groups or labels. We can apply the uniform discretization transform using the KBinsDiscretizer class.
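A minimal sketch of the uniform transform, assuming the score columns are named 'math score', 'reading score' and 'writing score' as in the earlier snippet; encode='ordinal' keeps the result as integer bin labels rather than one-hot columns.

from sklearn.preprocessing import KBinsDiscretizer

score_columns = ['math score', 'reading score', 'writing score']
# cut each score into 5 equal-width bins labelled 0..4
uniform_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
uniform_scores = uniform_discretizer.fit_transform(data[score_columns])
pd.DataFrame(uniform_scores, columns=score_columns).head()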

KMeans Discretization Transform

A K-means discretization transform will attempt to fit k clusters for each input variable and then assign each observation to a cluster.
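A sketch reusing score_columns from above; with strategy='kmeans' the bin edges come from 1-D k-means clusters fitted on each column.

# bin edges come from k-means clusters fitted separately on each column
kmeans_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='kmeans')
kmeans_scores = kmeans_discretizer.fit_transform(data[score_columns])
pd.DataFrame(kmeans_scores, columns=score_columns).head()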

Quantile Discretization Transform

A quantile discretization transform will attempt to split the observations for each input variable into k groups, where the number of observations assigned to each group is approximately equal.
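A sketch of the quantile transform, again reusing score_columns; each bin receives roughly the same number of students.

# bin edges are the quantiles of each column, so bins have roughly equal counts
quantile_discretizer = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
quantile_scores = quantile_discretizer.fit_transform(data[score_columns])
pd.DataFrame(quantile_scores, columns=score_columns).head()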

Imputation of missing values

Missing values affect data analysis, so we have to handle them. There are two common ways: delete the rows (or columns) that contain missing values, or fill them in with a predicted or estimated value.

SimpleImputer is a class in sklearn.impute that we can use to fill in missing values.
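A minimal sketch of mean imputation on the numeric columns selected earlier; the mean strategy and the choice of columns are illustrative assumptions, not fixed by the dataset.

import numpy as np
from sklearn.impute import SimpleImputer

# replace every NaN in the numeric columns with that column's mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed_data = imputer.fit_transform(data[numeric_columns])
pd.DataFrame(imputed_data, columns=numeric_columns).head()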

Here is the link to my GitHub repository.

https://github.com/prashil-vaishnani/data-science
