Will the Customer Pay Back the Loan? Using Feast to Analyze Credit Scoring Cases

Introduction

As machine learning models are increasingly used in real-life applications, the data features they depend on, and the management of those features, have become more important, because the same features are often reused across models and pipelines.

Computing and maintaining these features again and again can put a strain on data scientists. A feature in machine learning is an individual measurable property of the data.

For example, in insurance claim detection, features may be age, alcohol, region, etc., and in credit scoring, features may include purchase history, loan history, mortgage, etc. Feature stores address the problem of managing such features.

This blog discusses feature stores and how they make the process more efficient. We also cover Feast as a feature store, its installation, and how to access it through Kubeflow.

Finally, we walk through a hands-on use of Feast as a feature store in an end-to-end application.

What is a feature store?

Feature stores are used to manage datasets and pipelines in production. A feature store acts as a central place to store and serve curated features across multiple pipeline branches, so that resources are used efficiently. In general, feature stores help with:

1. Automation of feature computation, backfills, and logging

2. Production of new features

3. Sharing of feature pipelines across teams

4. Achieving consistency between training and serving data

5. Tracking feature versions and lineage

Components of a feature store

1. Serving

2. Storage

3. Transformation

4. Monitoring

5. Registry

Serving:

The definition of a feature must be exactly the same when training a model and when serving it; otherwise, training-serving skew is introduced, which can be hard to debug.
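One common way to avoid this skew is to keep a single function per feature and call it from both the training path and the serving path. The sketch below illustrates the idea with a hypothetical debt-to-income feature; the function and field names are our own assumptions and are not part of Feast.

```python
from typing import Dict, List


def debt_to_income(record: Dict[str, float]) -> float:
    """Single definition of the feature, shared by training and serving."""
    income = record["annual_income"]
    if income <= 0:
        return 0.0  # assumed convention for missing or zero income
    return record["total_debt"] / income


def build_training_rows(history: List[Dict[str, float]]) -> List[float]:
    # Offline path: compute the feature over historical records.
    return [debt_to_income(r) for r in history]


def build_serving_row(request: Dict[str, float]) -> float:
    # Online path: compute the same feature for one incoming request.
    return debt_to_income(request)
```

Because both paths call the same function, a change to the feature logic reaches training and serving together.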

Storage:

Feature data can be stored in two different ways, online and offline, and each serves a different purpose. Feature stores therefore contain both an online and an offline storage layer to support the different systems.
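To make the distinction concrete, here is a minimal in-memory sketch, not how any particular feature store implements it. The offline layer keeps the full history of values for training, while the online layer keeps only the latest value per entity for fast lookup. The customer ids and values are made up.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Offline rows: (entity_id, event_time, value). The full history is kept.
OfflineRow = Tuple[str, datetime, float]


def materialize(offline_rows: List[OfflineRow]) -> Dict[str, float]:
    """Build an online view holding only the latest value per entity."""
    latest: Dict[str, Tuple[datetime, float]] = {}
    for entity_id, event_time, value in offline_rows:
        if entity_id not in latest or event_time > latest[entity_id][0]:
            latest[entity_id] = (event_time, value)
    return {entity_id: value for entity_id, (_, value) in latest.items()}


offline = [
    ("customer_1", datetime(2021, 1, 1), 1200.0),
    ("customer_1", datetime(2021, 6, 1), 900.0),
    ("customer_2", datetime(2021, 3, 1), 300.0),
]
online = materialize(offline)  # {"customer_1": 900.0, "customer_2": 300.0}
```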

Transformation:

ML models use data coming from different sources, such as streaming data, real-time data, or batch data. Feature stores transform this raw data from the various sources into feature values.

Monitoring:

When a problem arises in an ML model or its accuracy degrades, the cause is usually a problem with the data. Feature stores calculate metrics on the feature data that they store and serve in order to surface data quality issues.
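As an illustration of the kind of metric meant here, the sketch below summarizes a single feature column by its share of missing values and a few basic statistics. The choice of metrics is our assumption; real feature stores may track different ones.

```python
from statistics import mean
from typing import Dict, List, Optional


def feature_quality(values: List[Optional[float]]) -> Dict[str, float]:
    """Summarize one feature column: missing-value rate and basic statistics."""
    present = [v for v in values if v is not None]
    return {
        "null_rate": 1 - len(present) / len(values) if values else 0.0,
        "mean": mean(present) if present else float("nan"),
        "min": min(present) if present else float("nan"),
        "max": max(present) if present else float("nan"),
    }
```

Comparing these summaries between the training data and the data seen at serving time is one simple way to notice that a feature has drifted.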

Registry:

The registry acts as an interface to interact with the feature store. Different data science teams use the registry as a common platform to explore, develop, collaborate, and publish new definitions within and across the teams.

Tools Utilized

We now understand what a feature store is and how it works. Next, we discuss the tools needed to implement an end-to-end application that makes efficient use of a feature store.

Feast

Feast is an open-source operational data system for managing and serving machine learning features to models in production. It combines an online store (for low-latency, real-time applications) and an offline store (for batch scoring or model training) into one tool.

Feast can perform the following functions:

· Load streaming and batch data: Feast can load data from a range of sources such as streams, object stores, databases, or notebooks.

· Standardization of feature definitions: Feast serves as a single source for all feature definitions for the entire organization.

· Historical serving: Feast can retrieve feature values that already exist in the system. Feast also ensures point-in-time correctness of the data, so that each training row only sees feature values that were known at that row's timestamp.

· Online serving: Feast can serve the latest feature values to low-latency applications that need real-time features.

· Sharing and reuse of features: Feast offers a centralized directory where different teams can share, reuse, and track features across various projects.
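Point-in-time correctness is easiest to see with a small example. The sketch below is plain Python rather than Feast's implementation, and the balance history is hypothetical. For a given event timestamp, it returns the latest feature value recorded at or before that time, so later values never leak into a training row.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def point_in_time_value(
    history: List[Tuple[datetime, float]], as_of: datetime
) -> Optional[float]:
    """Return the latest value recorded at or before `as_of`.

    `history` must be sorted by timestamp in ascending order.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical credit card balance history for one customer.
credit_card_due = [
    (datetime(2021, 1, 1), 500.0),
    (datetime(2021, 4, 1), 800.0),
    (datetime(2021, 9, 1), 200.0),
]

# A loan issued on 2021-05-15 may only see the value known at that time.
value_at_loan = point_in_time_value(credit_card_due, datetime(2021, 5, 15))  # 800.0
```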

The Feast Architecture:

Feast serves features from an online store (for real-time data) and from an offline store (for batch data), providing them to model serving and model training, respectively. This separation keeps the process simple and effective. To understand the architecture, see figure 1.

Figure 1 Architecture of Feast as a Feature Store

Terraform

Terraform is a tool for building, changing, and versioning infrastructure efficiently. This includes low-level components such as compute instances, storage, and networking, as well as high-level components such as DNS entries and SaaS features. Terraform can manage both existing service providers and custom in-house solutions.

To install Terraform, download the appropriate package as a zip archive and unzip it. Terraform runs as a single binary named terraform. For details on downloading and installing the package, see How to Install Terraform on Ubuntu 18.04 LTS (howtoforge.com), listed in the references.

Kubeflow

Kubeflow provides a mechanism for deploying machine learning (ML) workflows on Kubernetes in a simple, scalable, and portable manner.

The goal of Kubeflow is to provide a straightforward way to deploy best-of-breed open-source systems for ML to different infrastructures. In addition, Kubeflow can be run anywhere you can run Kubernetes.

Installing Feast with Kubeflow

To use Feast with Kubeflow, follow these steps:

1. Install Feast

2. Create a feature repository

3. Deploy your feature store

4. Build a training dataset

5. Load features into the online store

6. Read features from the online store

Here we use a tutorial to demonstrate Feast as part of a real-time credit scoring system.

We assume the following:

• A primary training dataset, which is a loan table. The dataset consists of historical loan data with associated features and a target variable (whether a user has defaulted on the loan).

• Feast is used during training to enrich the dataset with features such as zip code and credit history from S3 files (in this case, the S3 files are queried through Redshift).

• Feast is used again to serve the latest features for online credit scoring using DynamoDB.

Requirements

· Python 3.7 (your Kubeflow environment must have Python 3.7 or higher)

· Terraform (v1.0 or later)

· AWS CLI (v2.2 or later)

Setup

Setup can be divided into two parts:

1. Setting up Redshift and S3

2. Setting up Feast

Setting up Redshift and S3

To begin, we set up data infrastructure to simulate the production environment. We deploy a Redshift cluster and an S3 bucket containing the desired features (zip code and credit history). You will also need to clone the tutorial repository from the pull request named “fix infra scripts and use variables for the repo config #4” (fix infra scripts and use variables for the repo config by alikefia · Pull Request #4 · feast-dev/real-time-credit-scoring-on-aws-tutorial · GitHub).

1. Initialize Terraform

Command:

$ cd infra
$ terraform init

2. Set up the Terraform variables

Run the following after the initialization command:

$ export TF_VAR_region="us-west-2"
$ export TF_VAR_project_name="your-project-name"
$ export TF_VAR_admin_password="$(openssl rand -base64 32)"

3. To preview the Terraform plan, use the following command:

$ terraform plan

4. To deploy your infrastructure, use the following command:

$ terraform apply

5. After deploying the infrastructure, Terraform prints the following outputs:

redshift_cluster_identifier = "my-feast-project-redshift-cluster"
redshift_spectrum_arn = "arn:aws:iam::<Account>:role/s3_spectrum_role"
credit_history_table = "credit_history"
zipcode_features_table = "zipcode_features"

6. Next, we create a mapping from the Redshift cluster to the external catalog. Replace the cluster identifier placeholder with the value from the Terraform outputs; the shell variable tf_redshift_spectrum_arn is expected to hold the redshift_spectrum_arn output.

$ aws redshift-data execute-statement \
--region us-west-2 \
--cluster-identifier [SET YOUR redshift_cluster_identifier HERE] \
--db-user admin \
--database dev \
--sql "create external schema spectrum from data catalog database 'dev' iam_role '${tf_redshift_spectrum_arn}' create external database if not exists;"

7. We should now be able to query the zip code features by executing the following statement. Here the shell variable tf_redshift_cluster_identifier is expected to hold the redshift_cluster_identifier output.

$ aws redshift-data execute-statement \
--region "${TF_VAR_region}" \
--cluster-identifier "${tf_redshift_cluster_identifier}" \
--db-user admin \
--database dev \
--sql "SELECT * from spectrum.zipcode_features LIMIT 1;"

8. To print the results:

$ aws redshift-data get-statement-result --id [SET YOUR STATEMENT ID HERE]

9. Finally, return to the root of the repository.

Command:

$ cd ..
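Steps 6 to 8 can also be scripted. The sketch below assumes the boto3 library and its Redshift Data API client; the helper names run_query and make_client are our own. It submits a statement, polls until the statement finishes, and returns the first page of result records, which saves copying the statement id by hand.

```python
import time
from typing import Any, List


def run_query(
    client: Any, cluster_id: str, sql: str, poll_seconds: float = 1.0
) -> List[list]:
    """Submit a statement, wait for it to finish, and return its records."""
    statement = client.execute_statement(
        ClusterIdentifier=cluster_id,
        Database="dev",
        DbUser="admin",
        Sql=sql,
    )
    statement_id = statement["Id"]
    while True:
        status = client.describe_statement(Id=statement_id)["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"Statement {statement_id} ended as {status}")
        time.sleep(poll_seconds)
    # Only the first page of results is returned; large results are paginated.
    return client.get_statement_result(Id=statement_id)["Records"]


def make_client(region: str = "us-west-2") -> Any:
    # boto3 is a third-party package, imported here so that run_query can be
    # exercised without it. AWS credentials are assumed to be configured.
    import boto3

    return boto3.client("redshift-data", region_name=region)
```

For example, run_query(make_client(), "my-feast-project-redshift-cluster", "SELECT * from spectrum.zipcode_features LIMIT 1;") would mirror steps 7 and 8.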

Setting up Feast

In this blog, we focus mainly on setting up Feast, which can be done by following these steps.

1. Install Feast using pip.

Command:

$ pip install 'feast[aws]'

2. For this case, we have already set up a feature repository in feature_repo/, so creating a new repository isn’t needed. However, if you want to initialize one, it can be done using the following command.

Command:

$ feast init -t aws feature_repo # For reference only.

3. Since we don’t need to create a new feature repository, we only need to configure feature_store.yaml in the feature repository.

Set the fields under offline_store to the configuration you received when deploying your S3 bucket and Redshift cluster.

Deploy the feature store by running feast apply from within the feature_repo/ folder.

Command:

$ cd feature_repo/
$ feast apply

Output:

Running feast apply produces output as shown in the figure below:

Figure 2 Output of ‘feast apply’

4. Next, we shall load desired features into the online store using the materialize-incremental command. This command will load the values of the latest features from the data source to the online store.

Command:

$ CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
$ feast materialize-incremental "$CURRENT_TIME"

5. Finally, return to the root of the repository.

Command:

$ cd ..
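As a rough guide to the configuration in step 3, the sketch below renders a feature_store.yaml from the Terraform outputs. The field layout is our assumption, based on Feast's AWS provider with a Redshift offline store and a DynamoDB online store at the time of writing; the file shipped in feature_repo/ is authoritative. The project name and the S3 staging location are placeholders.

```python
# Assumed layout of feature_store.yaml for the AWS provider. Verify it against
# the file in feature_repo/ before relying on it.
FEATURE_STORE_TEMPLATE = """\
project: {project}
registry: registry.db
provider: aws
online_store:
    type: dynamodb
    region: {region}
offline_store:
    type: redshift
    cluster_id: {cluster_id}
    region: {region}
    user: admin
    database: dev
    s3_staging_location: {s3_staging_location}
    iam_role: {iam_role}
"""


def render_feature_store_yaml(
    project: str,
    region: str,
    cluster_id: str,
    s3_staging_location: str,
    iam_role: str,
) -> str:
    """Fill the template with values taken from the Terraform outputs."""
    return FEATURE_STORE_TEMPLATE.format(
        project=project,
        region=region,
        cluster_id=cluster_id,
        s3_staging_location=s3_staging_location,
        iam_role=iam_role,
    )


config_text = render_feature_store_yaml(
    project="credit_scoring_aws",  # hypothetical project name
    region="us-west-2",
    cluster_id="my-feast-project-redshift-cluster",
    s3_staging_location="s3://<your-bucket>/<staging-prefix>",  # placeholder
    iam_role="arn:aws:iam::<Account>:role/s3_spectrum_role",
)
```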

Case Study: Loan Scoring

When you ask a bank or a financial institution for a loan, your application is evaluated by a statistical model. The model uses information from the customer’s profile to estimate the probability that they will repay the loan. The entire process is termed “credit scoring.”

For this use case, we’ll discuss how a credit scoring system functions in real time using Feast as the feature store. The system must accept a customer’s request and respond within 100 ms with a decision on whether the loan request should be approved.

For this case study, we have three datasets. The first is historical loan data, shown in the table in figure 3 below, containing features based on previous loans for the current customer. It includes a “loan_status” column denoting whether a customer has defaulted on a loan in the past.

Figure 3 Historic Loan Data

The second dataset is the credit history dataset, as shown in figure 4. This contains the credit history on a per-person basis and is updated frequently by the credit institution.

Figure 4 Credit history

The third dataset is a zip code dataset, as shown in figure 5. This enhances the first dataset with additional features about specific geographic locations.

Figure 5 Enrich Features

Features from all these datasets are used as a single training dataset to build a credit-scoring model. We will assume that the incoming request contains the loan application features, as shown in figure 6.

Figure 6 Feast features for the application

Next, we retrieve the historical features from Feast, as shown in figure 7.

Figure 7 Feast getting the historical features
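In code, this retrieval might look like the sketch below. It assumes a Feast version whose FeatureStore.get_historical_features accepts entity_df and features (older releases used feature_refs). The feature view names match the tables created earlier, while the individual feature names and the helper names are illustrative assumptions.

```python
from typing import Any, List, Optional

# Feature references in "<feature_view>:<feature>" form. The feature names
# after the colon are assumed for illustration.
FEATURE_REFS: List[str] = [
    "zipcode_features:population",
    "zipcode_features:total_wages",
    "credit_history:credit_card_due",
    "credit_history:missed_payments_1y",
]


def build_training_frame(
    store: Any, loans_df: Any, features: Optional[List[str]] = None
) -> Any:
    """Join Feast features onto the loan table with point-in-time correctness.

    `loans_df` is the entity dataframe: one row per loan, holding the entity
    key columns and an `event_timestamp` column.
    """
    job = store.get_historical_features(
        entity_df=loans_df, features=features or FEATURE_REFS
    )
    return job.to_df()


def load_store(repo_path: str = "feature_repo/") -> Any:
    # feast is a third-party package, imported lazily so the helper above can
    # be tried with a stand-in store object.
    from feast import FeatureStore

    return FeatureStore(repo_path=repo_path)
```

The returned dataframe contains the original loan columns plus one column per requested feature, ready to be split into training and test sets.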

After getting the historical features, it is time to train and test the model. Model training based on historical features and model testing parameters is shown below in figure 8.

Figure 8 Model training and testing

Feast is also used at prediction time to fetch the latest feature values from the online store, as shown in figure 9.

Figure 9 Feast getting online features
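A minimal sketch of the online lookup follows, assuming FeatureStore.get_online_features with features and entity_rows arguments. The entity keys zipcode and dob_ssn, and the feature names, are our assumptions about how the feature views are keyed; adjust them to the definitions in feature_repo/.

```python
from typing import Any, Dict, List

# Assumed feature references; the names after the colon are illustrative.
ONLINE_FEATURES: List[str] = [
    "zipcode_features:population",
    "credit_history:credit_card_due",
]


def fetch_online_features(
    store: Any, zipcode: int, dob_ssn: str, features: List[str]
) -> Dict[str, Any]:
    """Look up the latest feature values for one loan applicant."""
    response = store.get_online_features(
        features=features,
        entity_rows=[{"zipcode": zipcode, "dob_ssn": dob_ssn}],
    )
    # to_dict() maps each name to a list with one value per entity row;
    # we asked for a single row, so we take the first element.
    return {name: values[0] for name, values in response.to_dict().items()}
```

The resulting dictionary can be merged with the fields of the incoming loan request to form the model input.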

Finally, a prediction is made: the model either accepts or rejects the loan request, as shown in figure 10.

Figure 10 Model prediction

This model can be presented as a shareable web app using an open-source tool called Streamlit.

Streamlit

Streamlit is an open-source Python library used by data science and machine learning professionals. This app framework allows users to easily create and share beautiful, custom web apps for machine learning and data science.

To install Streamlit, run the following command:

$ pip install streamlit

To run the app, use the following command:

$ streamlit run streamlit_app.py

For this case, Streamlit provides a visual interface to vary the features and see the prediction of a loan being approved or rejected. For example, figures 11 and 12 show the predictions for two customers whose features are similar except for their yearly income: one loan is approved and the other is rejected.

Figure 11 Streamlit representation for Loan approved
Figure 12 Streamlit representation for Loan rejected

Conclusion

Using a real-time loan scoring application, we have demonstrated how Feast as a feature store can be useful to a data scientist, how to install and set up the entire project, and how it can be used to implement practical real-time applications.

Above all, this opens avenues for further discussion of commercial applications and implementations of Feast.

References

1. Amazon Web Services. (2021). Getting started with Feast, an open source feature store running on AWS Managed Services. [online] Available at: https://aws.amazon.com/blogs/opensource/getting-started-with-feast-an-open-source-feature-store-running-on-aws-managed-services/ [Accessed 13 Dec. 2021].

2. HowtoForge. (n.d.). How to Install Terraform on Ubuntu 18.04 LTS. [online] Available at: https://www.howtoforge.com/how-to-install-terraform-on-ubuntu-1804/ [Accessed 13 Dec. 2021].

3. Zhang, N. (2021). Feature Store: Data Platform for Machine Learning. [online] Medium. Available at: https://towardsdatascience.com/feature-store-data-platform-for-machine-learning-455122c48229 [Accessed 13 Dec. 2021].

4. Pienaar, W. (2021). Real-time Credit Scoring with Feast on AWS. [online] GitHub. Available at: https://github.com/feast-dev/real-time-credit-scoring-on-aws-tutorial [Accessed 13 Dec. 2021].

Author Bio

Hassan Sherwani is the Head of Data Analytics and Data Science at Royal Cyber. He holds a PhD in IT and Data Analytics and has acquired a decade's worth of experience in the IT industry, startups, and academia. Hassan is also gaining hands-on experience in machine (deep) learning for the energy, retail, banking, law, telecom, and automotive sectors as part of his professional development endeavors.
