Data Massaging

Sarita Upadhya
Jul 24
5 min read

For any machine learning task, the first step is to ensure that the data being considered is clean, reliable, aligned with the business objective to be achieved, taken from the recent past reflecting the current business scenario thus validating that it is fit for the purpose it is being used for. However, in real life situations data quality is usually a challenge, may be because it is collected manually, not standardized as per compliance or underlying rules, corrupted during data migration or many other reasons. Hence, checking the data and massaging it before proceeding with further analysis or implementation becomes crucial.

In this article we will pick up data from the finance domain which is available here. Further we will identify data quality issues and try and fix them using different possible techniques.

About Data:

Data is published by MIT on Kaggle and the objective is to build a classification model to predict credit defaulters. There are 12 columns or features in the dataset and all are numeric data types. Let us massage this data, identify all possible challenges, and arrive at appropriate cleanup solutions. Data has 150000 observations and each observation corresponds to a customer who has either defaulted or not defaulted.

Field Name	Description
ID	Unique Identifier
Default	Binary variable to indicate if a customer has defaulted (1 indicates default and 0 indicates no default). Target variable to be predicted.
CC_Usage_last_month	Percentage of credit card usage in the last month
Age	Age of the customer
Num_loans_payment_delays_3_months	Number of loans with payment delays of more than 3 months in the last 3 years
Debt-to-income_ratio	Financial debt-to-income ratio
Monthly income amount	Monthly income amount
Num_loans_and_Num_cc_last_year	Number of loans (including vehicle or mortgage loans) and number of credit cards last year
Num_loans_payment_delays_60_days	Number of times the client has been over 60 days late in the last 3 years
Num_mortgage_loans	Number of mortgage loans
Num_loans_payment_delays_30_days	Number of payment delays over 30 days in the last 3 years
Num_dependents	Number of dependents

Data Quality Issues:

Observations from Statistical Summary:

	Mean	std	min	25%	50%	75%	max
CC_Usage_last_month	6.05	249.75	0.0	0.03	0.15	0.56	50708
Age	52.3	14.77	0.0	41	52	63	109
Num_loans_payment_delays_3_months	0.42	4.19	0.0	0	0	0	98
Debt-to-income_ratio	353.00	2037.81	0.0	0.18	0.37	0.87	329664
Monthly income amount	6670.22	14384.67	0.0	3400	5400	8249	3008750
Num_loans_and_Num_cc_last_year	8.45	5.15	0.0	5	8	11	58
Num_loans_payment_delays_60_days	0.27	4.17	0.0	0	0	0	98
Num_mortgage_loans	1.02	1.13	0.0	0	1	2	54
Num_loans_payment_delays_30_days	0.24	4.16	0.0	0	0	0	98
Num_dependents	0.76	1.12	0	0	0	1	20

Below is the list of observations for every field:

Field Name	Observation
CC_Usage_last_month	The maximum value of “CC_Usage_last_month” seems to be very high. As it is a percentage field, anything beyond a value of 100 seems to be incorrect.
Age	The minimum value in “Age” field is 0 which is incorrect or bad data.
Debt-to-income_ratio	The maximum value of “Debt-to-income_ratio” is too high. 75% of the records lie in the range of 0 through 0.87. Valid values should be between 0 through 1
Monthly income amount	· Minimum value of “Monthly income amount” is 0 which seems incorrect. · Maximum value of “Monthly income amount” is too high.

Missing Values:

Field Name	Count of Null values
Monthly income amount	29731 null values out of 150000 records
Num_dependents	3924 null values out of 150000 records

Imbalance in Data (Classification Problem):

A classification model can be trained well, if the data is balanced. That is, there should be near to equal number of records in each class. If the data is balanced, the model being trained will have a good representation of data belonging to each class and will be able to learn the difference between the them effectively.

In the data we are working on, “Default” field is the target variable which has to be predicted using a classification technique and is binary in nature. So, we need to check if there is an equal representation of data for defaulters (value of 1) and non-defaulters (value of 0).

It is observed that 139974 records almost 93% of the total data belong to non-defaulters (value of 0) and only ~7% of the data belongs to defaulter class (value of 1) which we are interested in. Hence, there is a huge imbalance in the data which must be addressed.

Observations from Data Exploration:

223 records have “CC_Usage_last_month” greater than value of 100.
35065 records have “Debt-to-income_ratio” greater than value of 1.

There are 4 customers with “Monthly income amount” more than 1000000

165 records with “Monthly income amount” less than 100 including 0 values.
1 record with “Age” field as 0.

Suggestion to Fix the Data Quality Issue:

Field	Issues	Impact	Suggestion
CC_Usage_last_month	The maximum value in “CC_Usage_last_month” seems to be very high. As it is a percentage field, anything beyond a value of 100 seems to be incorrect.	223 records (0.14% of total records)	Can be separated from the original dataset and shared with business for correction.
Age	The minimum value in “Age” field is 0 which is incorrect or bad data.	1 record	As it is just 1 record, we can replace 0 with median value 52 to this record.
Debt-to-income_ratio	The maximum value of “Debt-to-income_ratio” is too high.	35065 records with value greater than 1 (23%)	As the number of bad records is high, and it is an important feature for predicting defaulters, these records need to shared with business for correction.
Num_dependents	Null Values	3924 (2.6%)	We see that around 50% of the data has number of dependents as 0. As the impact is low, we can apply a logic based on “Age”. By grouping our data for different age-group we can find out the median value of dependents for each group and impute that in the missing records.
Monthly income amount	- Null Values - Minimum value of “Monthly income amount” is 0 which seems incorrect.	29896 records (29731 + 165 ~ 20%)	Convert low value data points to null value. With the good data that is available build a KNN prediction model to predict monthly income field and fill in the missing data in the original dataset
- Maximum value of “Monthly income amount” is too high.	4 records	Exclude the 4 records as they are non-defaulters. Also, check for possible clusters in the data, separate them and then build the classification model for each cluster.

Missing Data imputation using KNN Regressor:

Imputing missing data in "Monthly Income Amount" field using KNN Regressor

Concluding Notes:

All suggestions are implemented only after careful considerations, discussion with business and after written approvals.
While implementing the above suggestions we need to keep a constant check on the imbalance in the data and ensure it does not further increase as our primary objective is to build a classification model to predict credit defaulters.
For the issue related to imbalance in data, under-sampling can be done to get balanced datasets. Multiple such datasets can be extracted through random sampling; classification prediction model can be built for each dataset and final predictions to be arrived using ensemble technique.

Collect, scrape or pick up data from the domain you are interested in, explore, identify issues, fix them and proceed further with forecasting, predictions or identifying patterns in data which will be useful to your business.