Understanding a Health Insurance Case Study through Iguazio
This is the third and final part of our data science case study series on Iguazio. Before reading on, we recommend covering the first two blogs in the series:
1. Introduction to MLOps with Iguazio’s Feature Store
2. Case Study on Insurance claim Dataset
In this blog, we move our model into production using the Iguazio platform. Along the way, you will get to know various aspects of the Iguazio environment and its connected technologies, namely MLRun and Nuclio.
Before the main part, let us briefly revisit the concept of Iguazio covered in the first blog. Iguazio is a platform designed for data scientists that eases end-to-end development and deployment.
In addition, Iguazio’s feature store lets a data scientist work with, monitor, and process large amounts of data in both offline and online mode. It comes with an ML pipeline that includes:
1. Ingest data and build online and offline features from any source
2. Train and evaluate models continuously
3. Deploy models in production
4. Monitor your model and data
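As a conceptual sketch, the four stages above can be chained into one loop. Every function and data value below is a hypothetical placeholder for illustration, not an Iguazio or MLRun API:

```python
# Conceptual sketch of the four pipeline stages; every name here is a
# hypothetical placeholder, not an Iguazio or MLRun API.

def ingest():
    # Stage 1: build features from a raw source (here, one hard-coded record)
    return [{"age": 45, "claim_amount": 12000, "label": 1}]

def train(features):
    # Stage 2: "train" a trivial threshold model and report its accuracy
    model = {"threshold": 10_000}
    correct = sum(
        (row["claim_amount"] > model["threshold"]) == bool(row["label"])
        for row in features
    )
    return model, correct / len(features)

def deploy(model):
    # Stage 3: expose the model as a callable prediction endpoint
    return lambda row: int(row["claim_amount"] > model["threshold"])

def monitor(predict, live_rows):
    # Stage 4: collect predictions on live data for follow-up analysis
    return [predict(row) for row in live_rows]

features = ingest()
model, accuracy = train(features)
endpoint = deploy(model)
print(monitor(endpoint, features), accuracy)
```

On a real platform, each of these stages would be a managed, monitored job rather than a plain function call, which is exactly the gap Iguazio fills.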
Nuclio for Iguazio
For serving models in production, Iguazio provides a serverless framework named Nuclio. It is an open-source product that simplifies data management during model serving; it is also a high-performance, low-latency framework that runs on CPUs/GPUs and responds to event triggers.
Nuclio is designed to automate the data science CI/CD pipeline with serverless functions.
The functions of Nuclio can be used to:
1. Gather data from multiple streams on an ongoing basis and query it, using built-in connectors for storage and real-time streaming sources such as databases, Apache Kafka, and Amazon Kinesis.
2. Serve machine learning inference on serverless machines with low latency and high throughput.
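To make the serving role concrete: a Nuclio function is essentially a small handler with the signature `handler(context, event)`. The sketch below uses that signature, but the threshold "model" is a hypothetical stand-in, and the minimal `Context`/`Event` classes are local substitutes so the snippet runs without the Nuclio runtime:

```python
# Sketch of a Nuclio-style handler. The handler(context, event) signature is
# Nuclio's convention; the Context and Event classes below are minimal local
# stand-ins, and the threshold rule is a hypothetical model substitute.
import json

class Event:            # stand-in for Nuclio's event object
    def __init__(self, body):
        self.body = body

class Context:          # stand-in for Nuclio's context object
    class logger:
        @staticmethod
        def info(msg):
            print(msg)

def handler(context, event):
    """Score one insurance-claim record sent as JSON."""
    record = json.loads(event.body)
    # hypothetical rule standing in for the deployed model
    score = 1 if record.get("claim_amount", 0) > 10_000 else 0
    context.logger.info(f"scored record -> {score}")
    return json.dumps({"prediction": score})

print(handler(Context(), Event('{"claim_amount": 25000}')))
```

In a real deployment, Nuclio wires this handler to an HTTP or stream trigger and scales it automatically.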
Like Nuclio, MLRun is an open-source framework from Iguazio that is widely used within the data science community. Through this framework, we can track the code and the inputs/outputs that are fed into the model for inference.
It provides abstraction layers over the underlying technologies, freeing data scientists and engineers from managing infrastructure.
MLRun consists of four layers:
· Feature and Artifact Store
Through this layer, data and features from multiple streams and repositories can be ingested, processed, stored, and handled in a unified way.
· Elastic Serverless Runtimes
Elastic Serverless Runtimes help manage microservice workloads with dedicated runtimes and convert simple code into scalable services.
· ML Pipeline Automation
It provides the service to create an end-to-end machine learning orchestration. The pipeline includes data preparation and pre-processing, model training and testing, model deployment on real-time production server, and end-to-end monitoring.
· Central Management
The central management layer provides a unified portal, accessible from anywhere. The portal includes the user interface, command line interface, and an SDK.
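The tracking idea behind MLRun's artifact store can be illustrated with a few lines of plain Python. This is a conceptual stand-in, not the MLRun API: the decorator records each run's function name, inputs, and outputs into an in-memory list, which is roughly what MLRun's run database does for real jobs:

```python
# Stdlib illustration of what MLRun's experiment tracking does conceptually.
# The decorator and RUN_DB are hypothetical stand-ins, not the MLRun API.
import functools
import json

RUN_DB = []  # stand-in for MLRun's run database

def track(fn):
    @functools.wraps(fn)
    def wrapper(**params):
        result = fn(**params)
        RUN_DB.append({"function": fn.__name__,
                       "inputs": params,
                       "outputs": result})
        return result
    return wrapper

@track
def train(max_depth, n_estimators):
    # pretend training run that returns a score
    return {"accuracy": 0.93}

train(max_depth=5, n_estimators=100)
print(json.dumps(RUN_DB[0], indent=2))
```

MLRun does this (and much more: artifact versioning, distributed execution, UI views) transparently for any function it runs.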
After covering Nuclio and MLRun, it is time to focus on deploying our model through an automated workflow in Iguazio.
The Iguazio environment is well structured. The project dashboard is the starting point; it contains the operations and tools that can be applied to each project.
Iguazio's project view provides a summary of all the projects created. In the second row, third column lies our insurance claim project, named rcml-insurance-newpipeline-admin.
Each project card displays attributes with relevant information about that project, such as:
· Running models
· Models failed in the last 24 hours
· Total models attached
· Feature sets
· ML functions
Within each project's environment, Iguazio provides various functionalities through which users can operate on their datasets and models. Here, we will cover the aspects that are most important and relevant for a learner.
The Jobs view lists all jobs and workflows that are currently running in this project or have run in the past.
The get_data() function is the job that acquires data from different sources. The Artifacts section holds the insurance claim dataset our model is trained on. get_data gathers information from the different sources and accumulates it into one dataset; building a function around this step also lets us visualize the data for further needs.
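The role of such a step can be sketched in a few lines. The function below is a hypothetical stand-in for the project's actual get_data job, and the sample records are illustrative, not taken from the real dataset:

```python
# Hypothetical sketch of what a get_data step does: pull records from
# several sources and accumulate them into one dataset. The function name
# mirrors the job's name, but the body and sample data are illustrative.

def get_data(sources):
    dataset = []
    for source in sources:
        dataset.extend(source)   # accumulate rows from each source
    return dataset

# two pretend sources: a database extract and a stream batch
claims_db = [{"policy": "A1", "claim_amount": 5400}]
claims_stream = [{"policy": "B7", "claim_amount": 12800}]

dataset = get_data([claims_db, claims_stream])
print(len(dataset))
```

In the real pipeline, MLRun logs the resulting dataset as an artifact so downstream jobs (summary, train) can consume it.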
The train job consists of different attributes that provide the necessary information about model training, from running the training to visualizing its performance metrics.
The Overview tab provides some metadata about the model training process (see Fig 1.4), such as who ran the job (UID) and a link to the actual code for the job (Function).
One of the important aspects of the training job is the Artifacts tab (Fig 1.3). While a job runs, it generates all the information relevant to assessing its performance. In the insurance claim notebook, the most important artifact among many is the feature importance (Fig 1.6).
Feature importance tells us how relevant each feature is, based on its coefficient, when the model makes a prediction or classification. In the plot, the features are arranged in ascending order of importance.
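The ranking idea can be sketched without a trained model. In the case study the importances come from the model itself; here, as a hedged approximation on toy data, we rank features by the absolute correlation of each feature with the label and print them in ascending order, as in the blog's plot:

```python
# Hedged sketch of ranking features by importance. The real importances come
# from the trained model; here we approximate them with each feature's
# absolute Pearson correlation to the label, using illustrative toy data.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

features = {
    "age":          [25, 40, 60, 35],
    "claim_amount": [1000, 8000, 20000, 4000],
}
label = [0, 0, 1, 0]

importance = {name: abs(pearson(vals, label))
              for name, vals in features.items()}

# arranged in ascending order, as in the blog's plot
for name, score in sorted(importance.items(), key=lambda kv: kv[1]):
    print(f"{name}: {score:.2f}")
```

Correlation is only one crude proxy; tree-based models like the Random Forest used later expose richer importances directly.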
Apart from feature importance, the artifacts contain other information, such as scoring metrics like binary precision-recall and the confusion matrix. It is up to the user to save the plots or values of the scoring metrics relevant to the model.
In Iguazio's notebook, several models are trained and compared automatically, and the best one is selected by accuracy. In the results section (Fig 1.7), we can see the different models with their information.
The red arrow marks the best-performing model, selected automatically. The green box denotes the package name of each model; at the top is the Random Forest Classifier from the sklearn.ensemble module. The dark blue box denotes the accuracy metric for all the models. The best-performing model is the Random Forest Classifier, with 93% accuracy.
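The selection logic itself is simple. Below, the Random Forest score is the 93% reported in the blog; the other model names and scores are illustrative placeholders, not recomputed results:

```python
# Sketch of automatic model selection by accuracy, mirroring Fig 1.7.
# RandomForestClassifier's 0.93 is the score from the blog; the other
# entries are illustrative placeholders, not real results.

results = {
    "sklearn.ensemble.RandomForestClassifier": 0.93,   # from the blog
    "sklearn.linear_model.LogisticRegression": 0.88,   # placeholder
    "sklearn.ensemble.AdaBoostClassifier":     0.90,   # placeholder
}

best_model = max(results, key=results.get)
print(best_model, results[best_model])
```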
Fig 1.8 shows the model training logs section, where we can see the various training logs, error reports, etc.
After train comes the test job, which carries attributes similar to those of the train job.
The test job's attributes likewise provide metadata and information related to the test results. The Overview tab shows details about the user working on the job, while the Results tab displays the model's performance across different scoring metrics, such as accuracy and F1-score.
In Fig 2.0, the trained-model overview shows us the metadata of the best-performing model, the Random Forest Classifier, including its scoring metrics: accuracy, F1-score, precision, etc.
Since we compare the models by accuracy, it is the Random Forest Classifier that appears as the trained model.
Real-time functions, also known as Nuclio functions, serve as serverless machines that run your machine learning models in the inference layer.
In Fig 2.1, we use a Nuclio serverless function to run the model in the inference layer on the Iguazio platform. Here you can also check some metadata about the function, such as memory consumption, CPU/GPU usage, environment runtime type, etc.
Fig 2.2 shows a handy feature of Iguazio: testing our model on different feature values. On the right side, we can test the model by passing values as an input parameter in JSON format; the input format is user-defined. Below the input section is the output the model returned.
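A test payload like the one typed into that panel can be built as follows. The field names, the payload shape, and the endpoint URL are all hypothetical; a real call would POST the JSON body to the Nuclio function's HTTP trigger, so here we only build and print the body so the snippet runs offline:

```python
# Sketch of a test request for a deployed model endpoint. Field names,
# payload shape, and the endpoint URL are hypothetical examples.
import json

payload = {
    "inputs": [
        {"age": 45, "policy_tenure": 3, "claim_amount": 12000}
    ]
}
body = json.dumps(payload)

# A real request would look roughly like this (not executed here):
# requests.post("http://<nuclio-endpoint>/predict", data=body,
#               headers={"Content-Type": "application/json"})

print(body)
```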
To perform this testing, we need a framework that automates receiving inputs/outputs and other tasks as well, and for this we use MLRun.
Machine Learning Pipeline
At the end of the day, our main goal is to build an automated machine learning pipeline. Here, we have used Kubeflow Pipelines as part of the solution.
In this figure, we can see our project pipeline, from get-data to summary and train, all the way down to model deployment and testing. The designed pipeline comprises several distinct steps.
Through it we manage the flow of data and model training. In our pipeline, each job has its own assigned task. The get-data step, for example, holds and processes information about the machine learning pipeline: metadata, scoring metrics, the insurance claim dataset, logs, visualizations, etc.
Above, in Fig 2.4, the get-data section shows information about our dataset. This information can be visualized in different plots based on the variables, or on whatever information a data scientist requires.
These visualizations help us understand the data in real time and can help detect any drift occurring within the dataset. The pipeline also contains other jobs: summary, train, model deployment, and test.
Figure 2.5 shows a summarized version of the model pipeline, metadata, etc. We can see different plots relating to our dataset that give us more information about the data. It is clear from figures 2.3 and 2.4 that the get-data job feeds both summary and train.
The information processed in get-data, together with the metadata and visualizations, is transferred to the summary section, where it is stored. Meanwhile, the required information is sent to the train job, where it is pre-processed, feature-engineered, and used for training.
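The dependency flow just described (get-data feeding summary and train, train feeding the final nodes) can be sketched as a tiny dependency graph. The step bodies are omitted and the scheduler below is a stdlib stand-in; the real pipeline is orchestrated by Kubeflow Pipelines:

```python
# Minimal stdlib sketch of the pipeline's dependency flow. The step names
# mirror the blog's pipeline; the scheduler is a hypothetical stand-in for
# what Kubeflow Pipelines does for real.

steps = {
    "get-data": [],
    "summary":  ["get-data"],
    "train":    ["get-data"],
    "deploy":   ["train"],
    "test":     ["train"],
}

def run_order(steps):
    """Return one valid execution order respecting the dependencies."""
    done, order = set(), []
    while len(done) < len(steps):
        for name, deps in steps.items():
            if name not in done and all(d in done for d in deps):
                order.append(name)
                done.add(name)
    return order

print(run_order(steps))
```

Kubeflow additionally runs independent steps (like summary and train) in parallel, each in its own container.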
The train job is a first-class citizen in this machine learning pipeline, carrying significant responsibilities (see Fig 2.6), including selecting the best model according to accuracy or any other metric defined.
All the training is done inside an online Jupyter Notebook provided by Iguazio. The best model is sent on for testing and deployment.
The train job connects to the two final nodes, model deployment and testing. Figure 2.7 relates to model testing: once training finishes, the best-performing model makes its way to the test and deploy steps.
On the right-hand side, we can see the same attributes for the test job as for train, summary, and get-data, and we can check the inputs and outputs of the inference.
The last section of this Kubeflow pipeline is model deployment. As with training, deploying the best-performing model is crucial: the best model for our insurance claim detection is deployed on a Nuclio serverless function for inference.
As we discussed in the first blog, MLOps differs from DevOps because the job isn't done after deployment. Real-world data changes over time, so the model must be monitored to keep producing the best results. Iguazio provides functions and operations to detect drift in our data and inference, making it easy for a data scientist to update the pre-processing steps and the model. And so the MLOps cycle repeats.
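The core idea behind drift detection can be shown with a toy check: compare a live window of a feature against its training-time baseline and flag a large shift in the mean. The threshold and data are illustrative, and Iguazio's actual drift detection is considerably more sophisticated:

```python
# Hedged sketch of the idea behind drift detection: flag a feature whose
# live mean has shifted far from the training baseline. The threshold and
# data are illustrative; Iguazio's drift detection is more sophisticated.
import statistics

def drifted(baseline, live, z_threshold=3.0):
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    live_mu = statistics.mean(live)
    # flag drift when the live mean moves many baseline std-devs away
    return abs(live_mu - mu) / sigma > z_threshold

baseline_claims = [5000, 5200, 4900, 5100, 5050]
live_claims     = [9000, 9500, 8800, 9100, 9300]

print(drifted(baseline_claims, live_claims))
```

When such a check fires in production, the MLOps loop closes: the data scientist revisits pre-processing, retrains, and redeploys.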
With this, we have completed our MLOps-with-Iguazio series across three blogs, covering not only an end-to-end data science life cycle but MLOps as a whole.
In this three-part series, we covered core MLOps concepts and got hands-on with one of its platforms, Iguazio. Many articles cover building a proper machine learning model, but deploying the model to production, building a pipeline to automate the job, and handling data in real time are tedious tasks, and that is where it gets hard.
For MLOps, the Iguazio platform takes good care of data scientists across the machine learning lifecycle of model building, deployment, and monitoring, saving their time by cleverly automating pipelines with built-in frameworks.
Hassan Sherwani is the Head of Data Analytics and Data Science at Royal Cyber. He holds a PhD in IT and data analytics and has a decade's worth of experience across the IT industry, startups, and academia. Hassan also gains hands-on experience in machine (deep) learning for the energy, retail, banking, law, telecom, and automotive sectors as part of his professional development.