Data Preprocessing Using Python

Data preprocessing is the process of preparing and cleaning data for further analysis. It is the first and key step in machine learning, and it covers all the activities needed to construct the final dataset. There are several techniques to preprocess data; we can perform the following operations on a dataset.

  • Drop the features that are irrelevant for the goal.
  • Find the duplicate values in the dataset.
  • Identify the missing values and handle them.

Import the required libraries and read the dataset.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

dataset = pd.read_csv('credit_train.csv')
dataset.info()  # shows the data information: row count, column names, non-null counts and data types

We can use dataset.dtypes to check the data types only.
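For example, a minimal one-line check on the dataset loaded above:

dataset.dtypes  # pandas Series mapping each column name to its data type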

Drop the features that are irrelevant for the goal.

Here, Loan ID and Customer ID are not required for the loan analysis, so we can drop those features.

dataset.drop(labels=['Loan ID', 'Customer ID'], axis=1, inplace=True)

The inplace=True argument changes the dataset permanently, which means Loan ID and Customer ID are removed from the dataset itself. If you only want to remove them temporarily, set inplace=False (the default), which returns a new DataFrame and leaves the original untouched.
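For comparison, a minimal sketch of the non-inplace form, which you would call instead of the line above if you wanted to keep the original dataset intact:

reduced = dataset.drop(labels=['Loan ID', 'Customer ID'], axis=1)  # returns a new DataFrame without the two columns
# dataset still contains Loan ID and Customer ID; reduced does not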

Find the duplicate values in the dataset

We can use the duplicated() function in pandas to identify the duplicate rows in the dataset.

dataset.duplicated() 
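A small sketch of how the result is typically used to count the duplicates and then remove them with drop_duplicates():

print(dataset.duplicated().sum())      # number of rows that are exact duplicates of an earlier row
dataset.drop_duplicates(inplace=True)  # remove the duplicates, keeping the first occurrence of each row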

Identify the missing values and handle them

isnull() is used to identify the null values in each column/feature, and sum() is used to find the total number of nulls in each column.

dataset.isnull() #find the null values in columns

dataset.isnull().sum() # find the sum of the null values in columns.

After identifying the missing values, we can find the percentage of missing values in the whole dataset. There are several methods to handle missing values; the right technique depends on the dataset and on what you are trying to find out from the model. Replacing missing values with the mean, median, or mode is the technique most commonly used in data science.
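A minimal sketch of the percentage calculation on the dataset loaded above:

missing_pct = dataset.isnull().sum() / len(dataset) * 100  # percentage of missing values per column
print(missing_pct.sort_values(ascending=False))            # columns with the most missing values first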

mean = dataset['Credit Score'].mean()
dataset['Credit Score'] = dataset['Credit Score'].fillna(mean)  # replace the nulls with the column mean

Here we calculate the mean of Credit Score and fill all the null values in that column with the calculated value.

Similarly, we can replace missing values with the median or the mode, depending on the values in your dataset. While replacing values we need to be careful, because imputation can distort the distribution and push the model towards overfitting or underfitting.
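As a sketch, using the same Credit Score column purely for illustration (in practice you would pick one of the two, not both):

median = dataset['Credit Score'].median()
dataset['Credit Score'] = dataset['Credit Score'].fillna(median)  # median replacement, robust to outliers

mode = dataset['Credit Score'].mode()[0]                          # mode() returns a Series, take its first entry
dataset['Credit Score'] = dataset['Credit Score'].fillna(mode)    # mode replacement, the most frequent value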

In the case of text/categorical features we can replace missing values with the mode (the most frequent text value in the dataset) or with any constant value.

dataset.fillna('10+ years', inplace=True)  # fill the remaining null values with '10+ years'
missing_values_table(dataset)              # custom helper (defined elsewhere) that summarizes the remaining missing values

In the above dataset, the null values have been replaced with '10+ years' because most of the employees have more than 10 years of work experience; this is the mode replacement technique. In practice you would usually restrict fillna to the relevant column rather than the whole dataset.
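A quick final check, assuming the steps above have been applied:

dataset.isnull().sum().sum()  # total nulls remaining across all columns; 0 means every missing value has been handled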
