MLOps Tools and Feature Engineering in Petrol Consumption
In one of our previous blogs, we introduced the concept of MLOps and how MLflow could be an effective tool for tracking an end-to-end machine learning lifecycle.
In this blog, we will discuss a case study of petrol consumption in the US and explain various data science tools and concepts along the way. The discussion starts with data exploration and analysis and then moves to the feature engineering phase.
Once the features are extracted, we will use them as input for our models. In model building, we implement Random Forest and Linear Regression. Finally, we evaluate which model, with which features, performs best.
We aim to predict the petrol consumption (in millions of gallons/year) in 48 of the US states based on some key features.
· Petrol tax (in cents)
· Per capita income (in US dollars)
· Paved highways (in miles)
· Proportion of the population with a driving licence (%)
Data used for the said problem can be found here:
Exploratory Data Analysis (EDA)
We will be performing Exploratory Data Analysis to find out the following:
· Missing values (if any)
· Distribution of the numerical variables (mean, median, standard deviation, etc.)
· Normality of the data
· Relationship between independent and dependent features
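The first two checks can be sketched in Python with pandas; the sample rows below are illustrative values in the dataset's format, not the full 48-state table:

```python
import pandas as pd

# Illustrative sample in the same format as the petrol consumption
# dataset (the full data has 48 rows, one per state)
df = pd.DataFrame({
    "Petrol_tax": [9.0, 8.5, 7.5, 7.0],
    "Average_income": [3571, 4092, 5002, 4296],
    "Paved_Highways": [1976, 1250, 9794, 4512],
    "Population_Driver_licence(%)": [0.525, 0.572, 0.593, 0.618],
    "Petrol_Consumption": [541, 524, 561, 714],
})

# 1. Missing values (if any)
missing = df.isnull().sum()

# 2. Distribution of the numerical variables
# (count, mean, std, min, quartiles, max)
summary = df.describe()
```

df.describe() reports the same measures that Minitab's basic statistics output shows.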
For EDA, we begin with the basic statistics feature of Minitab (a statistics package). The output is shown in Table 1. The basic statistics show that there are no missing values and give a rough idea of the distribution of the data through measures such as the mean, standard deviation, minimum, median and maximum.
To further observe the distribution of the data, we plotted the histograms and box plots of the data as shown in Fig 1 and Fig 2.
Histograms and box plots both show that Petrol_tax, Average_income and Paved_Highways are skewed, whereas Population_Driver_licence(%) and Petrol_Consumption have outliers.
Given the extreme standard deviation values, we decided to check the normality of Average_income and Paved_Highways.
Results of the Anderson-Darling normality tests for these features are shown in Fig 3 and Fig 4. The p values are greater than 0.05, so we cannot reject the hypothesis that the data follows a normal distribution.
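The tests here were run in Minitab, but a similar check can be sketched in Python with scipy. Note that scipy.stats.anderson reports the Anderson-Darling statistic against critical values rather than a p value, so the Shapiro-Wilk test is also shown for a direct p value; the sample is synthetic, standing in for a column such as Average_income:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-in for a roughly normal income column (48 states)
sample = rng.normal(loc=4300, scale=500, size=48)

# Anderson-Darling: reject normality at a given level if the
# statistic exceeds the critical value for that significance level
ad = stats.anderson(sample, dist="norm")

# Shapiro-Wilk gives a p value directly: p > 0.05 means we cannot
# reject the hypothesis that the data is normally distributed
stat, p = stats.shapiro(sample)
```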
After the normality test, we explored relationships between the independent features (petrol tax in cents, per capita income in US dollars, paved highways in miles, and the percentage of the population with a driving licence) and the dependent feature (petrol consumption). To determine these relationships, we plotted a heatmap of all the available features, as shown in Fig 5.
The heatmap shows a strong positive correlation between Petrol_Consumption and Population_Driver_licence(%), and a strong negative correlation between Petrol_Consumption and Petrol_tax (refer to Fig 5).
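The heatmap in Fig 5 is a rendering of the pairwise correlation matrix, which can be computed directly with pandas; the rows below are illustrative values chosen to mirror the correlations described above, and seaborn's sns.heatmap would draw the figure itself:

```python
import pandas as pd

# Illustrative values mirroring the dataset's correlations:
# consumption rises with licence share and falls with petrol tax
df = pd.DataFrame({
    "Petrol_tax": [9.0, 8.0, 7.5, 7.0],
    "Population_Driver_licence(%)": [0.52, 0.55, 0.58, 0.62],
    "Petrol_Consumption": [500, 550, 600, 700],
})

# Pairwise Pearson correlations between all features
corr = df.corr()
# sns.heatmap(corr, annot=True) renders this matrix as a figure
```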
Looking further at the relationships between the features, we observed that states with higher tax rates had fewer paved highways, which is contrary to what one would expect (refer to Fig 6).
We also observed that the relationship between the percentage of the population with a driving licence and petrol consumption is more or less linear (refer to Fig 7).
As a part of feature engineering, we decided to do the following operations:
Creation of new features
1. Income Range:
This column is created from Average_income. Based on the quartile ranges, we segregated the Average_income column into 4 categories and created a new column from it.
Example of dataset is shown in Table 2.
2. Petrol tax range (Ptax_range):
This column is created from Petrol_tax. Based on the quartile ranges, we segregated the Petrol_tax column into 4 categories and created a new column from it.
Example of dataset is shown in Table 3.
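Both derived columns can be sketched with pandas' qcut, which bins a column by its quartiles; the label names here are assumptions for illustration, and the values are a small sample in the dataset's format:

```python
import pandas as pd

df = pd.DataFrame({
    "Average_income": [3571, 4092, 5002, 4296, 3865, 4870, 4399, 5342],
    "Petrol_tax": [9.0, 8.5, 7.5, 7.0, 8.0, 6.5, 9.0, 8.0],
})

# Quartile-based binning into 4 labelled categories
quartile_labels = ["low", "lower-mid", "upper-mid", "high"]
df["Income Range"] = pd.qcut(df["Average_income"], q=4, labels=quartile_labels)
df["Ptax_range"] = pd.qcut(df["Petrol_tax"], q=4, labels=quartile_labels)
```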
As observed during EDA, the features vary in magnitude, range and units. To bring them into a fixed range, we use feature scaling.
Feature scaling standardizes the features within a fixed range. If it is not applied, the machine learning algorithm weighs greater values higher and smaller values lower, regardless of their units.
Normalization (Min-Max )
Here we have applied the normalization technique to our dataset, which transforms the data to the range [0, 1] (or any other specified range).
scaler = MinMaxScaler()
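As a self-contained sketch of this step (the sample rows are illustrative values in the dataset's format):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative rows: tax, income, highways, licence share
X = np.array([[9.0, 3571.0, 1976.0, 0.525],
              [8.5, 4092.0, 1250.0, 0.572],
              [7.5, 5002.0, 9794.0, 0.593],
              [7.0, 4296.0, 4512.0, 0.618]])

scaler = MinMaxScaler()             # default feature_range=(0, 1)
X_scaled = scaler.fit_transform(X)  # each column rescaled to [0, 1]
```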
After feature engineering, we proceeded with the development of the ML model, which can be traced through the following steps:
· Split data into train and test sets
· Model training
· Model testing
· Model performance evaluation
Step 1: Split the data into train and test
In order to train our model, we used an 80–20 split, where 80% of the dataset is used as training data and 20% as test data for making predictions and evaluating model performance.
Features used for training the ML model:
X (input): "Petrol_tax", "Average_income", "Paved_Highways", "Population_Driver_licence(%)"
y (output): "Petrol_Consumption"
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train :- (38, 4)
X_test :- (10, 4)
y_train :- (38, 1)
y_test :- (10, 1)
Step 2: Build ML model by passing the training data
In this step, we feed the 80% training split to the model to train it.
Step 3: Model testing
In this step, we will predict the test set with the help of our trained model.
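Steps 1–3 can be sketched end to end; the feature matrix below is synthetic, standing in for the scaled dataset, and Linear Regression stands in for whichever model is being trained:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in: 48 rows, 4 features, target driven by two of them
X = rng.random((48, 4))
y = 600 + 200 * X[:, 3] - 50 * X[:, 0] + rng.normal(0, 5, 48)

# Step 1: 80-20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: train the model on the 80% split
model = LinearRegression()
model.fit(X_train, y_train)

# Step 3: predict on the held-out 20%
pred = model.predict(X_test)
```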
Step 4: Model performance and evaluation metrics
MSE: mean_squared_error(y_test, pred)
RMSE: np.sqrt(mean_squared_error(y_test, pred))
R-Square: metrics.r2_score(y_test, pred)
MAE: mean_absolute_error(y_test, pred)
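With the imports spelled out, the four metrics can be computed as below; the y_test and pred arrays are illustrative values, not the actual model outputs:

```python
import numpy as np
from sklearn import metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Illustrative actual vs. predicted consumption values
y_test = np.array([534.0, 410.0, 577.0, 571.0, 577.0])
pred = np.array([540.0, 420.0, 560.0, 580.0, 570.0])

mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, pred)
mae = mean_absolute_error(y_test, pred)
```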
Following evaluation metrics were used:
Root Mean Squared Error
It gives an idea of the average distance between the observed values and predicted data values. It tells you how concentrated the data is around the line of best fit.
Formula: RMSE = √( Σ(Pi − Oi)² / n )
• Pi is the predicted value for the ith observation in the dataset
• Oi is the observed value for the ith observation in the dataset
R-Squared (R²)
It tells us the proportion of variance in the response variable that can be explained by the predictor variables in the model.
Mean Absolute Error
It tells us, on average, how far the predicted values are from the true values.
Mean Squared Error (MSE)
MSE is the average of the squares of the errors, i.e., the average squared difference between the predicted and actual values.
Description of the ML Models
The implemented models are described below:
Linear Regression (LR1) with the original features (without adding any new features)
Linear Regression (LR2) model after removing Paved_highways from Model 1 (LR1)
In this model, based on the p values of the X variables, we removed Paved_Highways as it showed no significant relationship with the Y variable. We then reran the model with the remaining 3 variables.
What does the P value mean?
Here, the p value is the probability of obtaining test results at least as extreme as the results observed, under the assumption that the null hypothesis is true.
Null hypothesis: There is no relationship between the X variables and Y variables
Alternate hypothesis: There is a relationship
If p ≤ 0.05, we reject the null hypothesis; otherwise, we fail to reject it.
Since the p value of Paved_Highways is greater than 0.05, we fail to reject the null hypothesis and hence remove the variable from the model.
Linear Regression (LR3) model with new features
In this model, we have also used the newly created variables for modeling.
New features included: “Income Range”, “Ptax_range”
Random Forest Regressor (R1) model with original features (without adding any new features)
Here we have used the basic Random Forest Regressor model with default parameters and the original features (without adding any new features).
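A minimal sketch of R1 on synthetic stand-in data (the real run uses the original four features of the dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Synthetic stand-in for the 4 original features and the target
X = rng.random((48, 4))
y = 600 + 200 * X[:, 3] - 50 * X[:, 0] + rng.normal(0, 5, 48)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Default parameters, as in model R1
rf = RandomForestRegressor(random_state=0)
rf.fit(X_train, y_train)
pred_r1 = rf.predict(X_test)
```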
Random Forest Regressor (R2) model with new features
Here we have used the basic Random Forest Regressor model with default parameters and have also added the new features.
New features included: “Income Range”, “Ptax_range”
Predictions of All Models
The prediction table shows the outputs of all five models on the test set.
Actual: the actual test values on which the predictions were made.
LR1: predictions of Model 1.
LR2: predictions of Model 2.
LR3: predictions of Model 3.
R1: predictions of Model 4.
R2: predictions of Model 5.
Model Evaluation Metrics
Performance and Evaluation metrics of all models can be seen in Table 5.
We use MSE to evaluate model performance, and from the results we observe that Model 2 (LR2) and Model 4 (R1) perform best: the lower the MSE, the better the model.
Although the results of Model 2 (LR2) are better than those of Model 4 (R1), we believe that using Model 4 (R1) for hyperparameter tuning will yield better results, as the Random Forest Regressor has a wide range of parameters that, when tuned, can produce good results.
In the next blog, we will discuss Model Deployment using MLflow. Stay tuned!
Hassan Sherwani is the Head of Data Analytics and Data Science at Royal Cyber. He holds a PhD in IT and data analytics and has a decade's worth of experience across the IT industry, startups, and academia. As part of his professional development, Hassan has also gained hands-on experience in machine (deep) learning for the energy, retail, banking, law, telecom, and automotive sectors.