MLOps Implementation Using Health Insurance — A Case Study
By now you should be familiar with MLOps, its key components, and Iguazio with its Feature Store. Continuing forward, this blog covers different aspects of data science, from data preparation through model training and testing, by way of a case study.
Before promoting a model to the staging and production environments, it is necessary to pick the best-performing one. The model should be able to distinguish users by class or to predict a continuous value.
Machine learning comes in different forms: Supervised, Unsupervised, Semi-Supervised, and Reinforcement Learning.
This blog falls into the category of Supervised Learning, where we map input features X to target values Y. Within Supervised Learning, there are two types of problems: classification and regression.
As mentioned above, we will explore different aspects of data science through a case study, which is a binary classification problem.
In binary classification there are only two possible classes: True or False, Yes or No, will it rain or not, and so on. Here we will look at a fraud insurance claim detection problem.
Providing insurance to individuals is important for an insurance company, but the handling of claims is crucial for its financial stability. To separate valid from invalid insurance claims, these companies are now turning to data science.
This also helps them adapt to the 21st century and provide better services to their customers.
The dataset contains both demographic and non-demographic features, which are:
1. Age: the age of the policyholder.
2. Sex: the gender of the policyholder, where female = 0 and male = 1.
3. BMI: body mass index, a measure of body fat based on a person's weight and height, expressed in kg/m². BMI values vary, but a score between 18.5 and 24.9 is generally considered ideal.
4. Children: the number of children the policyholder has.
5. Alcohol: a binary feature, where 0 = non-alcoholic and 1 = alcoholic.
6. Region: the policyholder's residential area in the US, where 0 = northeast, 1 = northwest, 2 = southeast and 3 = southwest.
7. Charges: the medical costs billed to the health insurance.
8. Insurance Claim: our target variable, a binary value where 1 = valid insurance claim and 0 = fraudulent insurance claim.
After understanding the data description, we now move on to the preprocessing and exploratory data analysis part.
Exploratory Data Analysis
To perform Exploratory Data Analysis, we first need to import our dataset and it’s supporting modules.
In figure 1.0 above, we import libraries such as Seaborn and Matplotlib for visualization and NumPy and Pandas for data processing. First, we load the dataset and look at its statistical summary.
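Since the notebook cell in figure 1.0 is shown as an image, here is a minimal sketch of the same setup. The rows below are synthetic stand-ins because the actual CSV file is not included with this post; only the eight column names come from the data description above.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in rows with the eight columns described above;
# in the notebook this would be pd.read_csv(...) on the real file.
df = pd.DataFrame({
    "age": [19, 33, 45, 62],
    "sex": [0, 1, 1, 0],
    "bmi": [27.9, 22.7, 33.0, 26.3],
    "children": [0, 1, 3, 0],
    "alcoholic": [1, 0, 0, 1],
    "region": [3, 1, 2, 0],
    "charges": [16884.92, 1725.55, 4449.46, 27808.73],
    "insuranceclaim": [1, 0, 0, 1],
})

# describe() reports count/mean/std/min/quartiles/max; the median
# (the 50% row) can also be appended as its own labelled row.
stats = df.describe()
stats.loc["median"] = df.median()
print(stats)
```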
In the describe() output above, one additional row has been added: the median. The median row shows the median value for every feature, but it is most valuable for the continuous features such as age, BMI, and charges, where comparing it with the mean helps us guess whether a feature contains outliers.
Before moving on to graph visualization, we need to check whether any feature contains null values. To do this, we use the isnull() method, Fig 1.2.
In figure 1.2, the features are listed on the left-hand side and the count of null values for each column on the right-hand side.
The output tells us that there are zero null values in all the columns. This is a clear signal that we can now proceed to data visualization and explore some hidden information.
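The null check from Fig 1.2 can be sketched as below; again the small DataFrame is a synthetic stand-in for the real dataset.

```python
import pandas as pd

# Stand-in frame; in the notebook, df is the loaded insurance dataset.
df = pd.DataFrame({
    "age": [19, 33, 45],
    "bmi": [27.9, 22.7, 33.0],
    "charges": [16884.92, 1725.55, 4449.46],
})

# isnull() marks missing cells True; summing counts them per column,
# producing the feature-name / null-count listing shown in Fig 1.2.
null_counts = df.isnull().sum()
print(null_counts)
```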
Since our problem statement is about health insurance claims, it is vital to explore the BMI column, because it describes a person's health. BMI, or body mass index, has well-defined ranges that indicate how overweight a person is.
BMI can be categorized in four different classes, having certain BMI range:
1. If the BMI of a person is below 18.5, then he/she is underweight
2. If the BMI of a person is between 18.5 and 24.9, then he or she is healthy
3. If the BMI of a person is between 25 and 29.9, then he or she is overweight
4. If the BMI of a person is between 30 and 39.9, then he or she is obese
To visualize BMI, we can check the distribution of the column through a histogram.
The diagram above shows the histogram of the BMI feature, whose values range from a minimum of 15.96 to a maximum of 53.13.
We can also check the distribution with respect to the four BMI ranges. The X-axis shows the BMI values and the Y-axis their frequency.
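A sketch of this histogram, with the four category boundaries marked, might look as follows; the BMI values here are randomly generated stand-ins for the real column.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
bmi = rng.normal(30, 6, 500).clip(16, 53)  # synthetic stand-in values

# Histogram of BMI with the three boundaries separating the four ranges.
fig, ax = plt.subplots()
ax.hist(bmi, bins=30, edgecolor="black")
for cut in (18.5, 25, 30):
    ax.axvline(cut, color="red", linestyle="--")
ax.set_xlabel("BMI (kg/m²)")
ax.set_ylabel("Frequency")
fig.savefig("bmi_hist.png")

# The same boundaries can label each person with a BMI category.
labels = pd.cut(bmi, bins=[0, 18.5, 25, 30, np.inf],
                labels=["underweight", "healthy", "overweight", "obese"])
print(pd.Series(labels).value_counts())
```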
It can be observed from the graph that the obese category has the highest frequency. Therefore, to understand BMI with respect to the insurance claim, we can plot a pie chart showing each category's share of both valid and fraudulent claims.
In the fraudulent claims, all the classes are almost evenly distributed except for the underweight category. In the valid claims, we can observe that 66.67 percent come from the obese category and 27.20 percent from the overweight category. It follows that 93.87% of policyholders with valid claims have a BMI score above 25.
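These per-class pie charts could be produced roughly like this; the data below is synthetic, so the percentages will not match the 66.67/27.20 figures from the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt

# Synthetic stand-in for the BMI and claim columns.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "bmi": rng.normal(30, 6, 400).clip(16, 53),
    "insuranceclaim": rng.integers(0, 2, 400),
})
df["bmi_cat"] = pd.cut(df["bmi"], [0, 18.5, 25, 30, np.inf],
                       labels=["underweight", "healthy", "overweight", "obese"])

# One pie per claim class (0 = fraud, 1 = valid), showing each BMI
# category's share of that class.
fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, (claim, grp) in zip(axes, df.groupby("insuranceclaim")):
    shares = grp["bmi_cat"].value_counts(normalize=True)
    ax.pie(shares, labels=shares.index, autopct="%.2f%%")
    ax.set_title(f"insuranceclaim = {claim}")
fig.savefig("bmi_pie.png")
```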
Apart from the BMI score, are there any other factors that may lead the model to classify a claim as invalid?
For this, we can check the relation between the X variables and Y. A heatmap plot is a good choice for this, because it displays the correlation of every feature with every other.
Briefly, correlation measures the linear relationship between two quantitative variables (e.g., age and salary) and ranges from -1.0 to +1.0.
If the correlation between two variables:
1. is negative, they are negatively/inversely correlated;
2. is positive, they are positively/directly correlated;
3. is zero, there is no linear relation between the two.
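A heatmap like the one in Fig 1.4 can be sketched with pandas and Seaborn as below. The data is synthetic (with the target deliberately tied to BMI so that the last row shows a visible signal), so the actual correlation values will differ from the figure.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in features.
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 5, n),
    "charges": rng.normal(13000, 9000, n),
})
# Fake target loosely tied to BMI so the last row shows a signal.
df["insuranceclaim"] = (df["bmi"] > 29).astype(int)

corr = df.corr()  # Pearson correlation matrix, values in [-1, +1]
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.savefig("heatmap.png")

# The last row/column is what we read off in Fig 1.4: each feature's
# correlation with the target.
print(corr["insuranceclaim"])
```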
In Fig 1.4 above, we want to understand the relation between the X variables and Y. In the last row, we can see that the BMI, children, charges, and alcoholic features have a significant relationship with the insurance claim: BMI, charges, and alcoholic are directly correlated, while children is inversely correlated.
From this analysis, we can infer that there may be a criterion that the policyholder should not have more than some number n of children. The charges feature suggests that a person is more likely to need insurance when their medical bills rise above a certain margin.
Alcohol consumption is also quite common in the West, so being alcoholic does not by itself mean a person will not claim insurance.
BMI is directly related to the claim because it tells us about a person's weight condition: if a person is overweight or obese, he or she may have other health issues as well.
After this exploration, we can now move forward to the machine learning section, so that we can classify new incoming data points.
To build a model that can classify the two classes, we will use six different machine learning algorithms. All of them will be trained and evaluated on the same dataset, so that we can pick the best one and tune its performance.
The six models are below:
1. Logistic Classifier
2. Random Forest
3. Support Vector Machine
4. Naïve Bayes
5. K-Nearest Neighbor
6. Decision Tree
Before getting started, we need to import the dependent libraries to bring all the algorithms under one roof.
In Fig 1.5 above, we use the sklearn module to import all the algorithms along with their accuracy metrics. The data has been separated into train and test splits with an 80:20 ratio.
We will validate all six models using the accuracy score. Each model follows the same conventional workflow and, as a baseline, is trained with its default parameters. To find out which model performs best, we aggregate all the scores into a single DataFrame sorted in ascending order.
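The comparison described above can be sketched as follows. A synthetic dataset from make_classification stands in for the insurance data, so the resulting scores will not match the 95.1% reported below.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the seven insurance features and binary target.
X, y = make_classification(n_samples=500, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # the 80:20 split from the text

# The six baseline models, all with default parameters.
models = {
    "Logistic Classifier": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Support Vector Machine": SVC(),
    "Naive Bayes": GaussianNB(),
    "K-Nearest Neighbor": KNeighborsClassifier(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Train each model and score it on the held-out 20%.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Aggregate into one DataFrame, sorted ascending as described above.
leaderboard = pd.Series(scores, name="accuracy").sort_values()
print(leaderboard.to_frame())
```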
In the figure above, we can clearly see that the Random Forest Classifier achieved the highest accuracy score of 95.1%. We now select this model and optimize it through hyper-parameter tuning.
Hyper-parameter Tuning for Random Forest
Briefly, a Random Forest Classifier uses an ensemble technique, combining many decision trees and aggregating their predictions to produce the output. Random Forest has many parameters, some of which are:
1. Max Features: the size of the random subset of features considered when splitting a node.
2. Min Samples Leaf: the minimum number of samples required at a last/leaf node.
3. Max Depth: the maximum depth of each tree.
Random Forest has many more parameters besides these, but here we will tune the three above with GridSearchCV.
In the third cell of Fig 1.5 above, the grid search uses the following arguments:
1. rf: the Random Forest model
2. parameters: a dict mapping Random Forest parameter names to their candidate values
3. n_jobs: the number of CPU cores to use (-1 = all cores)
4. cv: the cross-validation strategy; here we use 5 splits
5. scoring: the metric used to evaluate the model on the held-out folds
After fitting, GridSearchCV reports the best score found while validating the model over every parameter combination in the "parameters" dict.
The last cell shows the best estimator: the Random Forest with the parameter combination that achieves an accuracy of 97.6%.
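A sketch of this grid search is below. The candidate values in the grid are illustrative assumptions (the actual values used in the notebook are not visible), and the data is again a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestClassifier(n_estimators=50, random_state=0)

# Hypothetical candidate values for the three parameters discussed above.
parameters = {
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 2, 4],
    "max_depth": [4, 8, None],
}

# 5-fold cross-validated search over all combinations, using all cores.
grid = GridSearchCV(rf, parameters, n_jobs=-1, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print(grid.best_score_)      # best mean cross-validation accuracy
print(grid.best_estimator_)  # the Random Forest with the best parameters
```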
Finding the best features
Just as we used Exploratory Data Analysis above to find relevant features that could impact the results, here, to cross-check, we use the rf_grid_search.best_estimator_.feature_importances_ attribute to find the top-performing features along with their importance scores.
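Reading off the importances can be sketched as below; with synthetic data the ranking will of course differ from the real dataset's, and the feature names are simply the eight columns minus the target.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# The seven input features from the data description above.
feature_names = ["age", "sex", "bmi", "children",
                 "alcoholic", "region", "charges"]

# Synthetic stand-in data; in the notebook this would be the fitted
# rf_grid_search.best_estimator_ instead of a fresh model.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)
rf = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ sums to 1; larger values mean the feature
# contributed more to the forest's splits.
importances = pd.Series(rf.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```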
To conclude, we are now at the end of the data science part. In the next step, we will look at how to upload our machine learning model to the cloud; for that, we will use the Iguazio platform to move the model into the production and deployment stage.
We will also take a look at Iguazio's Feature Store and its other interactive features that any data scientist will enjoy. So stay tuned!
Hassan Sherwani is the Head of Data Analytics and Data Science at Royal Cyber. He holds a PhD in IT and data analytics and has acquired a decade's worth of experience in the IT industry, startups, and academia. Hassan also has hands-on experience in machine (deep) learning in the energy, retail, banking, law, telecom, and automotive sectors as part of his professional development endeavors.