Introduction to Iguazio with Feature Store

Understanding MLOps

When we talk about MLOps, one thing is certain: MLOps is the discipline of developing and deploying data science projects. It covers the storage and processing of complex data of different kinds, whether structured or unstructured.

A continuously integrated pipeline is created to automate the workflow, from preparing the dataset to deploying the model.

The lifecycle of a data science project consists of the following steps.

1. Requirement Gathering

2. Exploratory Data Analysis

3. Feature Engineering

4. Feature Selection

5. Model Creation

6. Hyperparameter Tuning

7. Model Deployment

8. Retraining Approach

Fig 1.0 (Life Cycle of a Data Science Project)

Key components of MLOps

MLOps touches many different areas of a company and of the data science life cycle. To understand how, we must first understand the key components of MLOps. According to Practical MLOps: Operationalizing Machine Learning Models by Alfredo Deza and Noah Gift, there are five key components of MLOps:

1. Development

2. Deployment

3. Monitoring

4. Iteration

5. Governance

It is almost indispensable to have prior knowledge of ML algorithms before looking at the key components of MLOps (Alfredo Deza, 2021). Such knowledge helps in selecting an algorithm for a data science project, which in turn affects the model's performance and the results it produces on unseen inputs.

Development: It is essential to develop the right model, as it has an impact on the whole data science life cycle. The machine learning development cycle consists of establishing the business objective, exploratory data analysis, feature engineering, model selection, and model training and evaluation.
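To make the training and evaluation stages concrete, here is a minimal sketch using scikit-learn on a synthetic dataset; it is purely illustrative and not tied to any particular platform.

```python
# A minimal sketch of the model training and evaluation stages,
# using scikit-learn on a synthetic dataset (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

preds = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("f1:", f1_score(y_test, preds))
```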

Deployment: Deployment can become tricky because an organization works as a set of teams, each with its own domain area. Effective communication and exchange of information between them is necessary to avoid crashes and failures.

Getting the model ready for deployment can be rigorous, because the combined team must take care of every requirement the model has.

Monitoring: The term model drift becomes common once the model is deployed, because the model can face new kinds of input that degrade its performance.

This typically happens when the distribution of the inputs changes. This and other problems can be tackled through continuous monitoring, so that the model can be retrained on the new samples that were saved in the feature store.
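A common way to catch such a change in input distribution is a two-sample statistical test that compares training data against recent production data. The sketch below uses SciPy's Kolmogorov-Smirnov test on a single synthetic feature; the data and the p-value threshold are assumptions for illustration.

```python
# Sketch: detect data drift on one feature by comparing the training
# distribution with recently observed production inputs (illustrative).
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(loc=0.0, scale=1.0, size=5_000)  # feature values seen at training time
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)   # recent production values (shifted)

stat, p_value = ks_2samp(train_feature, live_feature)
DRIFT_P_THRESHOLD = 0.01  # assumed threshold
if p_value < DRIFT_P_THRESHOLD:
    print(f"Drift detected (KS statistic={stat:.3f}, p={p_value:.4f}) - consider retraining.")
else:
    print("No significant drift detected.")
```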

Iteration: It is now common practice to generate new training data from time to time to enhance the existing model and to compare the metrics of both model versions. By doing so, we can measure the difference between the two and decide whether to proceed with the new one.
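For instance, the decision can come down to comparing an evaluation metric of the currently deployed model against the retrained candidate on the same held-out data. The helper below is a hypothetical sketch; the function name, arguments, and promotion threshold are all assumptions.

```python
# Hypothetical sketch: decide whether a retrained model should replace the
# currently deployed version, based on held-out F1 score (illustrative only).
from sklearn.metrics import f1_score

def should_promote(current_model, candidate_model, X_holdout, y_holdout, min_gain=0.01):
    """Return True if the candidate beats the current model by at least `min_gain` F1."""
    current_f1 = f1_score(y_holdout, current_model.predict(X_holdout))
    candidate_f1 = f1_score(y_holdout, candidate_model.predict(X_holdout))
    print(f"current F1={current_f1:.3f}, candidate F1={candidate_f1:.3f}")
    return candidate_f1 - current_f1 >= min_gain
```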

Governance: To understand data governance, we must first understand the basics of data management. Data management is defined as “The business function of planning for, controlling and delivering data and information assets.”

According to DAMA (the Data Management Association), data management has 11 components, shown in Fig 1.1, with data governance sitting in the middle and relating to all the other areas. Data governance, then, is the discipline that provides the policies, roles, responsibilities, and process standards needed to ensure that data is managed as an asset [2].

Fig 1.1 (Reference from George F. about Data Governance and Data Management)

Data governance plays a vital role in MLOps by managing the exchange of data from one part of the pipeline to another in a seamless, safe, and controlled manner, a property known as data interoperability.

Through this, data can be shared across different teams in a secure and private manner, preserving the integrity of the data.

Importance of MLOps in the Data Science Community

Developing a model is just the tip of the iceberg, and staging the model for production is as important as it is complex. It requires knowledge of multiple domains, i.e., software engineering, DevOps, and data science, to get the project ready.

The diagram below, Fig 1.2, will help you better understand the vital role of MLOps for the data science community.

Fig 1.2 (MLOps and Data Science)

The work of data preprocessing and model development is intensive, but putting the model into production, monitoring the model's performance and the system's resources, performing error analysis, etc., is just as crucial.

The DevOps expert might not know how to analyze the data, and the data scientist may not be fully aware of the deployment and management steps. That is why MLOps becomes so important: it reduces and simplifies this complexity so that the desired results can be achieved.

What is a Feature Store?

MLOps teams continuously store the data coming in from all their streams. A unified platform is required to perform operations such as data munging, data cleaning, and feature extraction.

This platform is known as a Feature Store, an important component used in MLOps to store, monitor, and manage the incoming data.

It also helps teams share features across the organization, saving the time that would otherwise be spent rebuilding them before they can be incorporated into a model. Even a simple feature store is more than just a store, and its qualities include:

1. AI scalability

2. Monitors and analyzes features to detect drift

3. Stores statistics and metadata of the features so they can be shared and used as a notion of ground truth

4. Serves features for training and inference

5. Beneficial for organizations that work with sensitive data, such as hospitals and public organizations

Fig 1.3 (Flow of pipeline from Data Handling to monitoring)
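Iguazio's feature store is accessed programmatically through MLRun, the open-source MLOps framework that Iguazio maintains. As a minimal sketch, assuming an MLRun environment and made-up feature set and entity names, defining a feature set and ingesting a pandas DataFrame might look roughly like this (the exact ingestion call varies by MLRun version):

```python
# Minimal sketch: defining and ingesting a feature set with the MLRun feature
# store API (assumes an MLRun/Iguazio environment; all names are illustrative).
import pandas as pd
import mlrun.feature_store as fstore

# Raw transactional data keyed by customer (made-up example data).
df = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "amount": [120.5, 34.0, 560.9],
    "num_purchases": [3, 1, 8],
})

# Define a feature set keyed by customer_id and ingest the DataFrame into the
# store. Newer MLRun versions expose ingestion as a method on the feature set.
customer_set = fstore.FeatureSet("customer-transactions",
                                 entities=[fstore.Entity("customer_id")])
fstore.ingest(customer_set, df)
```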

Feature stores are becoming one of the core parts of the data flow in MLOps. Many tech organizations are planting their flags to provide feature stores with the best infrastructure and CI/CD pipelines for MLOps. One of them is Iguazio, which provides an automated CI/CD pipeline for the data science community [3].

Iguazio

Iguazio, a platform designed for MLOps, makes deployment and production easier for data scientists. Iguazio brings many benefits to the community, such as:

1. Transforming the ML projects into business outcomes

2. Scaling the development, deployment, and management of the ML projects

3. Automating the end-to-end MLOps pipeline

Iguazio comes with a machine learning pipeline that consists of the operations below.

1. Ingest data and build online and offline features from any source

2. Train and evaluate models continuously

3. Deploy models in production

4. Monitor your model and data

Fig 1.4 (Iguazio Pipeline)

Ingest data and build online and offline features from any source

By using the Iguazio feature store, you have the power to ingest and unify data coming from different streams, whatever format it arrives in.

The Iguazio feature store is an integral part of both the data science and engineering domains, providing advanced data transformation, feature building, model monitoring, and governance under one umbrella.

With the help of real-time serverless engines, processing complex logic and reading features from both online and offline sources becomes easy. The online feature store retains the latest feature values and supports low-latency reads, typically in milliseconds, together with high-throughput writes.

Offline stores, in contrast, are intended for model training and batch predictions; large numbers of features are saved there for further processing, such as model training.
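Continuing the hedged MLRun sketch from the feature store section above, offline features for training are typically pulled through a feature vector, while online features are served through a low-latency service; the feature and vector names below are illustrative.

```python
# Sketch: reading features from the offline store (training / batch) and the
# online store (real-time inference) with MLRun (names are illustrative).
import mlrun.feature_store as fstore

# A feature vector joins features coming from one or more feature sets.
vector = fstore.FeatureVector(
    "customer-vector",
    features=["customer-transactions.amount", "customer-transactions.num_purchases"],
)
vector.save()

# Offline store: materializes a full dataset, e.g. for model training.
train_df = fstore.get_offline_features(vector).to_dataframe()

# Online store: millisecond lookups keyed by entity, e.g. inside a serving function.
svc = fstore.get_online_feature_service(vector)
features = svc.get([{"customer_id": 1}])
svc.close()
```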

Train and evaluate models continuously

Through Iguazio, training and evaluating models with new features from different streams becomes quite easy.

It also provides the opportunity to run different experiments on scalable serverless machine learning and deep learning runtimes and to track the whole process through automation. By doing so, you can also keep track of the versions of your data.
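As a hedged sketch of what such automated, tracked experimentation can look like with MLRun (the project name, code file, handler, and parameters are assumptions), a training script is wrapped as a serverless job, and every run records its parameters, metrics, and artifacts for later comparison.

```python
# Sketch: running a training script as a tracked, serverless MLRun job
# (assumes an MLRun/Iguazio environment; file and handler names are made up).
import mlrun

project = mlrun.get_or_create_project("demo-mlops", context="./")

# Wrap an existing training script (trainer.py with a `train` handler) as a job.
project.set_function("trainer.py", name="trainer", kind="job",
                     image="mlrun/mlrun", handler="train")

# Each run logs its parameters, metrics, and model artifacts, so different
# experiments and data versions can be compared side by side.
run = project.run_function("trainer", params={"n_estimators": 200, "max_depth": 5})
print(run.outputs)
```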

Deploy models in production

Deploying the best version of the model to production is both essential and crucial, and how great it would be if that could be done in a matter of clicks. Once a model is ready for production, Iguazio lets you put it into production in a few steps and then monitor its performance.
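As an illustration of what such a few-step deployment can look like with MLRun's serving runtime (the model name and store path below are made up), a generic model server is imported from the function hub, a trained model is attached, and the function is deployed as a real-time endpoint:

```python
# Sketch: deploying a trained model as a real-time serving endpoint with MLRun
# (assumes an MLRun/Iguazio environment; model name and path are illustrative).
import mlrun

# Import MLRun's generic v2 model server from the function hub and attach a model.
serving_fn = mlrun.import_function("hub://v2_model_server")
serving_fn.add_model("customer-model",
                     model_path="store://models/demo-mlops/customer-model:latest")

# Deploy as a serverless (Nuclio) endpoint and send a test request.
serving_fn.deploy()
serving_fn.invoke("/v2/models/customer-model/infer", body={"inputs": [[120.5, 3]]})
```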

By monitoring continuously, the team will be able to detect drift in the model and take the necessary actions when required.

Monitor your model and data

Making the ML model live in production does not conclude the project. Drift, whether concept drift or data drift, needs to be detected and monitored to keep the predictions stable.

Real-time statistics, model inaccuracies, and drift can all be detected and monitored. Keeping the features under data lineage also adds a layer of governance to the system.

Although data governance is implemented at every step, during monitoring it is necessary to keep track of features according to their data lineage and to store them in accordance with data quality policies [3]. Monitoring also makes retraining convenient, since retraining workflows can be triggered when required.

Drift detection and real-time feature monitoring come with an integrated dashboard attached to the feature store, making the platform a viable MLOps tool for the data science community.
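In MLRun terms, this monitoring is typically switched on by enabling tracking on the serving function before it is deployed; the snippet below is a hedged sketch that mirrors the deployment example above.

```python
# Sketch: enabling model monitoring / drift detection on an MLRun serving
# function (assumes an MLRun/Iguazio environment; names mirror the example above).
import mlrun

serving_fn = mlrun.import_function("hub://v2_model_server")
serving_fn.add_model("customer-model",
                     model_path="store://models/demo-mlops/customer-model:latest")

# Stream prediction events to the platform's model-monitoring backend so that
# feature statistics and drift metrics show up in the integrated dashboard.
serving_fn.set_tracking()
serving_fn.deploy()
```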

Closing Remarks

In the 21st century, technology is advancing and accelerating day by day, and it has become necessary for us to equip ourselves to keep pace in the race. MLOps is one such area; even though this domain is only now starting to get the spotlight, it is still quite new, having grown out of AI.

AI itself only gained its due recognition about a decade ago and is still advancing at pace.

New tools, frameworks, and packages are being built and released for the MLOps community so that the work of data science and DevOps becomes easier, more productive, and better able to meet business goals.

“The simplest way to define governance and MLOps is best practices and policies for business to successfully use machine learning in an explainable, repeatable and production ready manner. [4]”

References

1. Treveil, Mark. “3. Key MLOps Features — Introducing MLOps [Book].” Www.oreilly.com, learning.oreilly.com/library/view/introducing-mlops/9781492083283/ch03.html#a_primer_on_machine_learning. Accessed 1 Nov. 2021.

2. Firican, George. “What Is the Difference between Data Management and Data Governance?” Www.youtube.com, 7 Sept. 2021, www.youtube.com/watch?v=oU0RlsXunlw. Accessed 1 Nov. 2021.

3. “Iguazio’s Integrated Feature Store.” Iguazio, www.iguazio.com/feature-store/. Accessed 1 Nov. 2021.

4. Drenik, Gary. “Importance of Data Governance and MLOps When Using Machine Learning to Drive Successful Business Outcomes.” Forbes, 17 July 2021, www.forbes.com/sites/garydrenik/2021/06/17/importance-of-data-governance-and-mlops-when-using-machine-learning-to-drive-successful-business-outcomes/?sh=253c9dd59f3c. Accessed 1 Nov. 2021.

Author Bio

Hassan Sherwani is the Head of Data Analytics and Data Science at Royal Cyber. He holds a PhD in IT and data analytics and has a decade's worth of experience in the IT industry, startups, and academia. Hassan is also gaining hands-on experience in machine (deep) learning for the energy, retail, banking, law, telecom, and automotive sectors as part of his professional development.


Royal Cyber Inc. is one of North America's leading technology solutions providers, based in Naperville, IL. We have grown and transformed over the past 20+ years.
