Calculating the Probability of Loan Repayment — Using MLOps for Credit Scoring

Royal Cyber Inc.
Jan 14, 2022

The banking sector is a risky business; credit risk is a major concern for financial institutions and the entire business world.

Banks want to lend only when they are reasonably sure a loan will be repaid. Therefore, when a company receives a loan application, it has to make the right approval decision based on the applicant’s profile.

Introduction

Credit risk management is an essential exercise for every financial institution. For example, banks need to study how credit scoring relates to the likelihood of loan repayment. Two types of risk are associated with the bank’s decision:

  • Rejecting a loan request from an applicant who would have repaid it is a significant loss of business for the company.
  • Approving a loan request from a borrower who will not repay it is a significant financial loss for the company.

This case study discusses real-world business problems and provides excellent insights into how EDA and machine learning models help minimize losses in the banking and finance sector.

The given data contains information about past loan applicants and whether they ‘defaulted’ or not.

This case study aims to identify patterns that indicate whether a person is likely to default, which may be used to take actions such as refusing the loan, reducing the loan amount, or lending at a higher interest rate.

When a person applies for a loan, there are two types of decisions that the company could take:

Loan accepted: If the company approves the loan, there are 3 possible situations described below:

  • Fully paid: Applicant has fully paid the loan.
  • Current: Installment payments are ongoing, i.e., the loan term is not yet complete. These applicants are not labeled as ‘defaulted.’
  • Charged-off: Applicant has not paid the installments in due time for an extended period, i.e., they have defaulted on the loan.

Loan rejected: The company rejected the loan, so there is no transactional history for these applicants, which makes them irrelevant to this case study.

Dataset description

The dataset consists of 74 columns and 887,973 rows. The columns represent different features and records of past borrowers, such as the loan amount, interest rate, installment, grade, employment length, home ownership, and loan status.

Exploratory Data Analysis and Feature Engineering

The goal of EDA is to understand which variables are important, visualize the data, and summarize it. Feature engineering is the process of selecting and transforming input features. The Pandas, seaborn, and matplotlib libraries are used for EDA.
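
As a minimal sketch of this setup (the file name loan_data.csv is a hypothetical placeholder for wherever the dataset is stored):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical file name; point this at the actual loan dataset.
df = pd.read_csv("loan_data.csv")

print(df.shape)       # expected: (887973, 74)
print(df.dtypes)      # column types
print(df.describe())  # summary statistics for the numeric columns
```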

Out of the 74 columns in the dataset, ‘loan_status’ is our target variable. Its unique values are the following (a short snippet to inspect them appears after the list):

  1. Fully Paid
  2. Charged off
  3. Current
  4. Default
  5. Late (31–120 days)
  6. In Grace Period
  7. Late (16–30 days)
  8. Does not meet the credit policy. Status: Fully Paid
  9. Does not meet the credit policy. Status: Charged Off
  10. Issued
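
As a quick check, these values and their relative frequencies can be listed directly (assuming the DataFrame df from the earlier sketch):

```python
# Distribution of the target column
print(df["loan_status"].unique())
print(df["loan_status"].value_counts(normalize=True))
```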

A loan enters default when a borrower fails to pay the lender per the terms in the initial loan agreement. Charged off is a debt that a creditor has given up trying to collect on after the debtor has missed payments for several months.

So, by definition, we merge Charged Off, Late (31–120 days), and Default into a single ‘defaulted’ class (a sketch of this merge follows the proportions below). After merging, the class proportions are:

  • Fully paid: 0.7815
  • Defaulted: 0.2184
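
One way to build this binary target, shown as a sketch (the exact label strings and the decision to keep only fully paid and defaulted loans are assumptions inferred from the proportions above):

```python
# Merge Charged Off, Default, and Late (31-120 days) into a single default label
default_labels = ["Charged Off", "Default", "Late (31-120 days)"]

# Keep only completed outcomes: fully paid or defaulted (assumption)
df = df[df["loan_status"].isin(default_labels + ["Fully Paid"])].copy()
df["default"] = df["loan_status"].isin(default_labels).astype(int)

print(df["default"].value_counts(normalize=True))  # roughly 0.78 vs 0.22
```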

Installment and loan amount show a strong correlation, so let’s explore these features next.

Installment is the monthly payment owed by the borrower if the loan originates. The loan amount is the listed amount of the loan applied for by the borrower.

If the credit department reduces the loan amount at some point in time, it will be reflected in this value.
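
The correlation can be checked directly; the column names loan_amnt and installment follow LendingClub-style naming and should be treated as assumptions if your copy of the data differs:

```python
# Correlation between the listed loan amount and the monthly installment
print(df[["loan_amnt", "installment"]].corr())

# Visual sanity check on a random sample of rows
sns.scatterplot(data=df.sample(5000, random_state=42), x="loan_amnt", y="installment")
plt.show()
```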

Figure 1.1 & Figure 1.2

Figure 1.1 and Figure 1.2 show histograms of installment and loan amount, respectively, broken down by loan status.

Figure 2.1 & Figure 2.2

Figure 2.1 and Figure 2.2 show that the default rate and the interest rate, respectively, are higher for the riskier grades.
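
A sketch of how these per-grade rates can be computed (the column names grade and int_rate are assumptions, and the binary default column comes from the merge above):

```python
# Average default rate and interest rate per loan grade
per_grade = df.groupby("grade")[["default", "int_rate"]].mean().sort_index()
print(per_grade)

per_grade["default"].plot(kind="bar", title="Default rate by grade")
plt.show()
```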

For feature engineering, drop the irrelevant columns: identifiers such as id and member_id, columns with a low correlation with loan status, and categorical features that are not relevant to predicting the loan status.

Check the percentage of null values in each column; fill columns where the percentage of null values is low with the median value, and drop columns with a significant number of NaNs. A sketch of this step follows Table 1.

Table 1: the percentage of null values in different columns
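
A minimal sketch of the column dropping and null handling described above (the 50% drop threshold and the exact identifier columns removed are assumptions):

```python
# Drop identifier columns that carry no predictive signal
df = df.drop(columns=["id", "member_id"], errors="ignore")

# Percentage of null values per column
null_pct = df.isnull().mean().sort_values(ascending=False)
print(null_pct.head(20))

# Drop mostly-empty columns (threshold is an assumption), then fill the
# remaining numeric nulls with each column's median
df = df.drop(columns=null_pct[null_pct > 0.5].index)
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())
```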

The unique values in the emp_length column are ‘10+ years’, ‘< 1 year’, ‘3 years’, ‘9 years’, ‘4 years’, ‘5 years’, ‘1 year’, ‘6 years’, ‘2 years’, ‘7 years’, ‘8 years’, and NaN. After encoding, these become 10, 0, 3, 9, 4, 5, 1, 6, 2, 7, 8, and NaN.
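
One possible encoding, shown as a sketch (it maps ‘10+ years’ to 10 and ‘< 1 year’ to 0 as stated above; the regex-based approach itself is an assumption):

```python
# Encode emp_length as a number of years: '10+ years' -> 10, '< 1 year' -> 0
df["emp_length"] = (
    df["emp_length"]
    .replace({"10+ years": "10 years", "< 1 year": "0 years"})
    .str.extract(r"(\d+)", expand=False)
    .astype(float)  # keeps NaN for missing values
)
```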

Another interesting column is home ownership. As seen in Figure 3, OTHER, NONE, and ANY do not have many values, so they are merged into RENT. The default rate is roughly the same for RENT and MORTGAGE.

Figure 3: Histogram of Home Ownership with Loan Status
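
A sketch of this merge (the column name home_ownership is an assumption):

```python
# Fold the sparse categories into RENT
df["home_ownership"] = df["home_ownership"].replace(
    {"OTHER": "RENT", "NONE": "RENT", "ANY": "RENT"}
)
print(df["home_ownership"].value_counts())
```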

Categorical columns do not contain numerical values, so dummy (one-hot encoded) variables are created for them.

Figure 4: Code for dummy values
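
Since the code in Figure 4 is not reproduced here, the following is a sketch of how the dummy variables can be created with pandas (dropping the raw loan_status label in favor of the binary default target is an assumption):

```python
# The binary 'default' target replaces the raw label, so drop loan_status first
df = df.drop(columns=["loan_status"], errors="ignore")

# One-hot encode the remaining categorical (object-typed) columns
cat_cols = df.select_dtypes(include="object").columns
df = pd.get_dummies(df, columns=cat_cols, drop_first=True)
```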

Model building/implementation

Several machine learning models were implemented and compared to see which predicts most accurately: an Artificial Neural Network, XGBoost, and Random Forest.

First of all, the data is divided into input features and a target feature. In our case, loan status is the target feature, and we want the model to predict whether a borrower will default or fully pay the loan, making this a binary classification problem.

The scikit-learn library was used to split the data into training and test sets.

X contains all the input features, and y contains the target feature. The dataset is now ready to be fed into the models.
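
A minimal sketch of this split (the 80/20 split ratio and the stratification are assumptions):

```python
from sklearn.model_selection import train_test_split

X = df.drop(columns=["default"])
y = df["default"]

# Stratify on the target so both splits keep the ~78/22 class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```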

Artificial Neural Network

An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons of a biological brain. The TensorFlow library was used with Keras layers and the Keras Sequential model to implement the artificial neural network.

Input features were scaled with scikit-learn’s MinMaxScaler, the optimizer was Adam, and the evaluation metric was the area under the curve (AUC).
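
A sketch of such a network (the layer sizes, activations, epochs, and batch size are assumptions; the article only specifies MinMax scaling, the Adam optimizer, and AUC):

```python
from sklearn.preprocessing import MinMaxScaler
from tensorflow import keras

# Scale inputs to [0, 1] as described above
scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Layer sizes and training settings below are illustrative assumptions
model = keras.Sequential([
    keras.layers.Input(shape=(X_train_s.shape[1],)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[keras.metrics.AUC(name="auc")])
model.fit(X_train_s, y_train, validation_split=0.2, epochs=10, batch_size=256)
print(model.evaluate(X_test_s, y_test))
```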

Table 2: Artificial Neural Network performance metric on the test set

XGBoost

XGBoost provides parallel gradient tree boosting that solves many data science problems quickly and accurately.
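
A sketch of an XGBoost classifier for this task (the hyperparameter values are assumptions, not tuned settings):

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    n_estimators=300,   # illustrative settings, not tuned values
    max_depth=6,
    learning_rate=0.1,
    eval_metric="auc",
)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))  # accuracy on the test set
```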

Table 3: XGBoost performance on the test set

Random Forest

Random forests, or random decision forests, are an ensemble learning method for classification, regression, and other tasks that operates by building many decision trees at training time. For classification tasks, the output of the random forest is the class selected by most trees.
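
A sketch of a random forest for this task (the number of trees is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
```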

Table 4: Random Forest Performance on the test set

Conclusion

This case study is challenging for several reasons. One of the significant issues with the data is class imbalance: the default class, which is the crucial outcome, has relatively few data points in the dataset.

For imbalanced datasets, accuracy is not a good metric for evaluating model performance. Although none of the models performed exceptionally well, the accuracy and macro average of the artificial neural network are higher than those of XGBoost and Random Forest.

There are several techniques for handling the class imbalance issue, and hyperparameter tuning can also improve model performance. The next step is to implement this case study in Kubeflow and use Kubeflow components such as Katib for hyperparameter tuning.
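
As one illustration of handling the imbalance (not part of the pipeline above), class weights can be computed and passed to most of these models:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Weight the minority (default) class more heavily
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
class_weight = dict(zip(classes, weights))

# e.g. RandomForestClassifier(class_weight=class_weight) or
# model.fit(..., class_weight=class_weight) for the Keras network
print(class_weight)
```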

Author Bio

Hassan Sherwani is the Head of Data Analytics and Data Science at Royal Cyber. He holds a PhD in IT and data analytics and has acquired a decade’s worth of experience in the IT industry, startups, and academia. Hassan is also gaining hands-on experience in machine (deep) learning for the energy, retail, banking, law, telecom, and automotive sectors as part of his professional development.

