
Data Massaging

  • Writer: Sarita Upadhya
  • Jul 24
  • 5 min read

For any machine learning task, the first step is to ensure that the data is clean, reliable, aligned with the business objective, and drawn from the recent past so that it reflects the current business scenario, that is, fit for the purpose it is being used for. In real-life situations, however, data quality is usually a challenge: data may be collected manually, may not be standardized as per compliance or business rules, or may get corrupted during data migration, among other reasons. Hence, checking the data and massaging it before proceeding with further analysis or implementation becomes crucial.

In this article, we will pick up a dataset from the finance domain which is available here, identify its data quality issues, and fix them using different possible techniques.

About Data:

The data is published by MIT on Kaggle, and the objective is to build a classification model to predict credit defaulters. The dataset has 150000 observations and 12 columns or features, all of numeric data types; each observation corresponds to a customer who has either defaulted or not defaulted. Let us massage this data, identify all possible challenges, and arrive at appropriate cleanup solutions.

| Field Name | Description |
| --- | --- |
| ID | Unique identifier |
| Default | Binary variable indicating whether a customer has defaulted (1 indicates default, 0 indicates no default). Target variable to be predicted. |
| CC_Usage_last_month | Percentage of credit card usage in the last month |
| Age | Age of the customer |
| Num_loans_payment_delays_3_months | Number of loans with payment delays of more than 3 months in the last 3 years |
| Debt-to-income_ratio | Financial debt-to-income ratio |
| Monthly income amount | Monthly income amount |
| Num_loans_and_Num_cc_last_year | Number of loans (including vehicle or mortgage loans) and number of credit cards last year |
| Num_loans_payment_delays_60_days | Number of times the client has been over 60 days late in the last 3 years |
| Num_mortgage_loans | Number of mortgage loans |
| Num_loans_payment_delays_30_days | Number of payment delays over 30 days in the last 3 years |
| Num_dependents | Number of dependents |

Data Quality Issues:


Observations from Statistical Summary:

| Field | Mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CC_Usage_last_month | 6.05 | 249.75 | 0.0 | 0.03 | 0.15 | 0.56 | 50708 |
| Age | 52.3 | 14.77 | 0.0 | 41 | 52 | 63 | 109 |
| Num_loans_payment_delays_3_months | 0.42 | 4.19 | 0.0 | 0 | 0 | 0 | 98 |
| Debt-to-income_ratio | 353.00 | 2037.81 | 0.0 | 0.18 | 0.37 | 0.87 | 329664 |
| Monthly income amount | 6670.22 | 14384.67 | 0.0 | 3400 | 5400 | 8249 | 3008750 |
| Num_loans_and_Num_cc_last_year | 8.45 | 5.15 | 0.0 | 5 | 8 | 11 | 58 |
| Num_loans_payment_delays_60_days | 0.27 | 4.17 | 0.0 | 0 | 0 | 0 | 98 |
| Num_mortgage_loans | 1.02 | 1.13 | 0.0 | 0 | 1 | 2 | 54 |
| Num_loans_payment_delays_30_days | 0.24 | 4.16 | 0.0 | 0 | 0 | 0 | 98 |
| Num_dependents | 0.76 | 1.12 | 0 | 0 | 0 | 1 | 20 |
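A summary like the one above can be produced with pandas' `describe()`. A minimal sketch on a toy frame (the real dataset would be loaded with something like `pd.read_csv("credit_default.csv")`; the file name is an assumption):

```python
import pandas as pd

# Toy stand-in for the real 150000-row dataset so the snippet runs on its own.
df = pd.DataFrame({
    "Age": [45, 52, 0, 63, 41],
    "Monthly income amount": [3400, 5400, 0, 8249, 6000],
})

# Transpose describe() so each field becomes a row, as in the table above.
summary = df.describe().T[["mean", "std", "min", "25%", "50%", "75%", "max"]]
print(summary)
```

Even on this toy data, the `min` column already exposes the suspicious zeros in "Age" and "Monthly income amount".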

Below is the list of observations for every field:

| Field Name | Observation |
| --- | --- |
| CC_Usage_last_month | The maximum value seems very high. As this is a percentage field, anything beyond a value of 100 seems to be incorrect. |
| Age | The minimum value is 0, which is incorrect or bad data. |
| Debt-to-income_ratio | The maximum value is too high. 75% of the records lie in the range of 0 through 0.87; valid values should be between 0 and 1. |
| Monthly income amount | The minimum value is 0, which seems incorrect; the maximum value is too high. |

Missing Values:

| Field Name | Count of Null Values |
| --- | --- |
| Monthly income amount | 29731 null values out of 150000 records |
| Num_dependents | 3924 null values out of 150000 records |
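Null counts like these come straight from `isnull().sum()`; a minimal sketch on toy data:

```python
import numpy as np
import pandas as pd

# Toy frame with deliberately missing values in the two affected fields.
df = pd.DataFrame({
    "Monthly income amount": [3400.0, np.nan, 5400.0, np.nan],
    "Num_dependents": [0.0, 1.0, np.nan, 2.0],
})

# Count of nulls per column, as in the table above.
null_counts = df.isnull().sum()
print(null_counts)
```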

Imbalance in Data (Classification Problem):


A classification model trains well when the data is balanced, that is, when there is a near-equal number of records in each class. With balanced data, the model being trained sees a good representation of each class and can learn the differences between them effectively.

In the data we are working on, the "Default" field is the target variable to be predicted using a classification technique, and it is binary in nature. So we need to check whether defaulters (value 1) and non-defaulters (value 0) are equally represented.

It turns out that 139974 records, almost 93% of the total data, belong to the non-defaulter class (value 0), and only ~7% of the data belongs to the defaulter class (value 1), which is the class we are interested in. Hence, there is a huge imbalance in the data which must be addressed.
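The class split can be checked with `value_counts()`; a sketch on toy data whose proportions mirror the article's ~93%/~7% split:

```python
import pandas as pd

# Toy target column with the same class proportions as the article.
df = pd.DataFrame({"Default": [0] * 93 + [1] * 7})

counts = df["Default"].value_counts()             # absolute counts per class
shares = df["Default"].value_counts(normalize=True)  # class proportions
print(counts)
print(shares)
```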


Observations from Data Exploration:


  1. 223 records have “CC_Usage_last_month” greater than 100.

  2. 35065 records have “Debt-to-income_ratio” greater than 1.

  3. 4 customers have “Monthly income amount” greater than 1000000.

  4. 165 records have “Monthly income amount” less than 100, including 0 values.

  5. 1 record has “Age” of 0.

Boxplot of "Monthly Income Amount"
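Counts like these fall out of simple boolean masks; a minimal sketch on toy data (the article's counts come from the full 150000-row dataset):

```python
import pandas as pd

# Toy frame containing examples of each out-of-range pattern.
df = pd.DataFrame({
    "CC_Usage_last_month": [0.15, 250.0, 0.56, 101.0],
    "Debt-to-income_ratio": [0.37, 1.5, 0.87, 2.1],
    "Age": [52, 0, 41, 63],
})

# Summing a boolean mask counts the rows where the condition holds.
n_usage_bad = int((df["CC_Usage_last_month"] > 100).sum())
n_dti_bad = int((df["Debt-to-income_ratio"] > 1).sum())
n_age_zero = int((df["Age"] == 0).sum())
print(n_usage_bad, n_dti_bad, n_age_zero)
```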


Suggestions to Fix the Data Quality Issues:

| Field | Issue | Impact | Suggestion |
| --- | --- | --- | --- |
| CC_Usage_last_month | The maximum value is very high; as this is a percentage field, anything beyond 100 appears incorrect. | 223 records (~0.15% of total) | Separate these records from the original dataset and share them with the business for correction. |
| Age | The minimum value is 0, which is bad data. | 1 record | As it is just one record, replace the 0 with the median value of 52. |
| Debt-to-income_ratio | The maximum value is too high. | 35065 records with a value greater than 1 (~23%) | As the number of bad records is high and this is an important feature for predicting defaulters, these records need to be shared with the business for correction. |
| Num_dependents | Null values. Around 50% of the data has 0 dependents. | 3924 records (~2.6%) | As the impact is low, apply a logic based on “Age”: group the data into age groups, find the median number of dependents for each group, and impute that value into the missing records. |
| Monthly income amount | Null values; a minimum value of 0 also seems incorrect. | 29896 records (29731 nulls + 165 low values, ~20%) | Convert the implausibly low values to null, then build a KNN prediction model on the good data to predict monthly income and fill in the missing values in the original dataset. |
| Monthly income amount | The maximum value is too high. | 4 records | Exclude these 4 records, as they are all non-defaulters. Also check for possible clusters in the data, separate them, and build a classification model for each cluster. |
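Two of the suggested fixes can be sketched on toy data as follows. The age bin edges and group labels here are assumptions for illustration; the real median age of 52 and the actual groupings would come from the full dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one bad Age (0) and missing Num_dependents values.
df = pd.DataFrame({
    "Age": [25, 0, 45, 48, 70, 72],
    "Num_dependents": [0, 1, 2, np.nan, 0, np.nan],
})

# Fix 1: replace the Age == 0 record with the median of the valid ages.
df.loc[df["Age"] == 0, "Age"] = df.loc[df["Age"] > 0, "Age"].median()

# Fix 2: impute missing Num_dependents with the median of the row's age group.
age_group = pd.cut(df["Age"], bins=[0, 30, 60, 120],
                   labels=["young", "mid", "senior"])
df["Num_dependents"] = df["Num_dependents"].fillna(
    df.groupby(age_group)["Num_dependents"].transform("median")
)
print(df)
```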


Missing Data imputation using KNN Regressor:


Imputing missing data in "Monthly Income Amount" field using KNN Regressor
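A minimal sketch of this step, using scikit-learn's `KNNImputer` as a convenient stand-in for the KNN regression model described above (toy data; `n_neighbors` would be tuned on the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame: two rows are missing "Monthly income amount".
df = pd.DataFrame({
    "Age": [30, 31, 55, 56, 40],
    "Num_dependents": [0, 0, 2, 2, 1],
    "Monthly income amount": [3000.0, np.nan, 8000.0, 8200.0, np.nan],
})

# Each missing value is filled with the mean income of the k nearest
# neighbours, measured on the remaining (observed) features.
imputer = KNNImputer(n_neighbors=2)
df[:] = imputer.fit_transform(df)
print(df["Monthly income amount"])
```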

Concluding Notes:


  • All suggestions should be implemented only after careful consideration, discussion with the business, and written approvals.

  • While implementing the above suggestions, we need to keep a constant check on the class imbalance and ensure it does not increase further, since our primary objective is to build a classification model to predict credit defaulters.

  • For the imbalance issue, under-sampling can be done to obtain balanced datasets. Multiple such datasets can be extracted through random sampling, a classification model built for each dataset, and the final predictions arrived at using an ensemble technique.
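The under-sampling idea can be sketched as follows: keep all defaulters and draw an equal-sized random sample of non-defaulters, repeating with different seeds to obtain several balanced datasets (the per-dataset models and the final ensemble vote are omitted for brevity):

```python
import pandas as pd

# Toy frame with the article's ~7%/93% class split and one feature column.
df = pd.DataFrame({"Default": [1] * 7 + [0] * 93, "x": range(100)})

defaulters = df[df["Default"] == 1]
balanced_samples = []
for seed in range(3):  # three balanced datasets via random under-sampling
    non_def = df[df["Default"] == 0].sample(n=len(defaulters),
                                            random_state=seed)
    balanced_samples.append(pd.concat([defaulters, non_def]))

print([len(s) for s in balanced_samples])
```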

Collect, scrape, or pick up data from a domain you are interested in; explore it, identify issues, fix them, and proceed with forecasting, prediction, or identifying patterns in the data that will be useful to your business.
