Presenting Feast: An Open-Source Feature Store for Machine Learning
Machine Learning Operations (MLOps) is a relatively new practice that revolves around models and automation. The model alone is not enough: much of the value lies in everything else required to make that model useful, including an automated development and deployment pipeline, monitoring, lifecycle management, and governance.
When it comes to MLOps, the top priority is to build models that scale easily and are easy to deploy to production. Many platforms can help make the process easier, and one such platform is Kubeflow.
Implement MLOps with Kubeflow
Kubeflow is a well-suited platform for building and experimenting with ML pipelines. It can work as an essential tool for ML engineers and operational teams to deploy ML systems to various environments for development, testing, and production-level serving. Most machine learning projects or models can be deployed with Kubeflow.
However, during the process, teams often run into duplicated effort and end up storing vast amounts of redundant data.
For example, a team we worked with took daily snapshots of Apache Parquet files. This wasted a lot of storage, and it also meant that any fix had to be applied manually and retroactively to every affected column in every file.
Therefore, if your machine learning project will scale even moderately, we believe you should have a feature store.
Before looking into Feast, we first need to understand the goals of a feature store and the additional advantage of using Feast with Kubeflow.
Goals of Feature Store
· Share features between teams and use cases and reduce duplicated effort
· Reduce complexity when deploying a model in production
· Decouple feature engineering from model development
· Work with tools people are familiar with
· Support real-time primitive features
Why do you need a feature store?
One of the most crucial and frequently underestimated parts of machine learning solutions is feature extraction and storage. Machine learning models rely on features to interpret and understand datasets for training and production.
In modern machine learning solutions, feature stores are becoming increasingly common. In principle, a feature store is a collection of features that can be used to train and evaluate machine learning models.
As noted above, we believe a feature store pays off once a machine learning project needs to scale even moderately. Many smaller projects, however, do not require one.
GO-JEK, like other fast-developing data science companies, is continually faced with feature extraction and discovery issues. If you wanted to use a feature store outside of a very large organization, you would have to build one from scratch.
Fortunately, the open-source community is already working to change that. For now, however, many machine learning teams still maintain their own pipelines for fetching data, creating features, and storing and serving them.
In this article, I’ll introduce Feast, an open-source feature store for ML, and show you how it resolves these difficulties.
What is Feast?
To operate machine learning systems at scale, teams need access to a wealth of feature data to train their models and serve them in production.
To address this need, we discuss Feast, an open-source feature store that allows teams to manage, store, and discover features for use in machine learning projects and to serve those features to models in production. Feast is an essential component in building end-to-end machine learning systems.
Get features to production
In large teams and environments, how features are maintained and served can diverge significantly across projects, introducing infrastructure complexity and resulting in duplicated work. The difficulties are as follows:
· ML pipelines are slow to iterate on: Engineering features is one of the most time-consuming activities in building an end-to-end ML system, but many teams develop features in silos. This results in a lot of redevelopment and duplication of work across teams and projects.
· Training and serving features are inconsistent: Models need data that can come from various sources, including event streams, data lakes, warehouses, or notebooks. Training requires access to historical data, while a model serving predictions requires the latest values. When data is isolated in many independent systems that require separate tools, inconsistencies occur.
· Data quality monitoring and validation are absent: General-purpose data systems were not built with the ML use case in mind, and they do not provide correct point-in-time lookups of feature data.
· Lack of feature reuse and sharing: This is a critical problem that we identified when a notebook evolves into a production system: features are not shared, and teams are limited in their ML opportunities.
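To make the gap between training and serving lookups concrete, here is a minimal sketch in plain Python of a point-in-time lookup. It is not Feast's API; the `history` data, the driver example, and the `value_as_of` helper are hypothetical names chosen for illustration. The idea is that training and serving call the same lookup and differ only in the timestamp they pass in.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one driver: (event time, trips completed),
# sorted by event time.
history = [
    (datetime(2021, 1, 1, 8), 2),
    (datetime(2021, 1, 1, 12), 5),
    (datetime(2021, 1, 1, 18), 9),
]


def value_as_of(rows, ts):
    """Return the latest feature value recorded at or before ts, or None."""
    times = [t for t, _ in rows]
    i = bisect_right(times, ts)  # number of rows with event time <= ts
    return rows[i - 1][1] if i else None


# Training: look up the value as it was when the label was observed.
training_value = value_as_of(history, datetime(2021, 1, 1, 13))

# Serving: the same function called with a current timestamp gives the latest value.
serving_value = value_as_of(history, datetime(2021, 1, 2))

# A label observed before any feature was recorded gets no value,
# so future data never leaks into the training set.
too_early = value_as_of(history, datetime(2021, 1, 1, 7))
```

Because both paths share one lookup, a feature cannot be computed one way for training and another way for serving, which is the inconsistency described in the list above.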
Feast as a one-stop solution
Feast is a system built to solve the critical difficulties of production machine learning. It does so by providing a centralized platform that standardizes the definition, storage, and access of features for training and serving. It acts as a bridge between data engineering and machine learning.
Feast handles the ingestion of feature data from both batch and streaming sources. It also manages both the warehouse and the serving databases, for historical data and for the latest data respectively.
Using a Python SDK, users are able to generate training datasets from the feature warehouse. Once their model is deployed, they can use a client library to access feature data from the Feast Serving API.
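The split between a historical warehouse and a serving store can be sketched as follows. This is a toy in-memory model written under stated assumptions, not the Feast SDK: `ToyFeatureStore`, its method names, and the integer timestamps are all hypothetical. It only illustrates why a single ingestion path that writes to both stores keeps training data and served values consistent.

```python
from collections import defaultdict


class ToyFeatureStore:
    """Hypothetical in-memory stand-in for a feature store; not Feast's API."""

    def __init__(self):
        self.offline = defaultdict(list)  # (entity, feature) -> [(ts, value), ...]
        self.online = {}                  # (entity, feature) -> (ts, value)

    def ingest(self, entity, feature, ts, value):
        """Write one row to both stores through a single code path."""
        key = (entity, feature)
        self.offline[key].append((ts, value))
        # Only overwrite the online value if this row is at least as recent.
        if key not in self.online or ts >= self.online[key][0]:
            self.online[key] = (ts, value)

    def get_historical(self, entity, feature):
        """Full time-ordered history, for building training datasets."""
        return sorted(self.offline[(entity, feature)])

    def get_online(self, entity, feature):
        """Latest value only, for low-latency serving."""
        row = self.online.get((entity, feature))
        return row[1] if row else None


store = ToyFeatureStore()
store.ingest("driver_1", "trips_today", ts=2, value=5)  # e.g. from a stream
store.ingest("driver_1", "trips_today", ts=1, value=2)  # late-arriving batch row

history = store.get_historical("driver_1", "trips_today")
latest = store.get_online("driver_1", "trips_today")
```

Note that the late-arriving batch row is added to the history but does not overwrite the newer online value, so the serving side still returns the most recent observation.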
What does Feast provide?
· Registry: A general catalog for exploring, developing, collaborating, and publishing new feature definitions within and across teams.
· Ingestion: A method of continuously ingesting batch and streaming data into the offline and online stores and keeping the copies consistent.
· Serving: A feature retrieval interface, which provides time-consistent feature views for training and online services.
· Monitoring: A tool that allows the operations team to monitor the quality and accuracy of the data reaching the model and take action.
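As a rough illustration of the registry and monitoring roles in the list above, the following sketch uses plain Python only. It does not reproduce Feast's registry or monitoring interfaces; `FeatureDefinition`, `Registry`, `null_rate`, the example feature names, and the 20% alert threshold are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureDefinition:
    """Hypothetical catalog entry describing one feature."""
    name: str
    entity: str
    dtype: str
    owner: str


class Registry:
    """Toy catalog that lets teams publish and discover feature definitions."""

    def __init__(self):
        self._features = {}

    def register(self, definition):
        """Publish a definition; reject a conflicting redefinition of the same name."""
        existing = self._features.get(definition.name)
        if existing is not None and existing != definition:
            raise ValueError(f"feature {definition.name!r} is already defined differently")
        self._features[definition.name] = definition

    def search(self, entity):
        """List the names of all features registered for an entity."""
        return sorted(d.name for d in self._features.values() if d.entity == entity)


def null_rate(values):
    """Fraction of missing values in a batch of feature values."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


registry = Registry()
registry.register(FeatureDefinition("trips_today", "driver", "int", "team_a"))
registry.register(FeatureDefinition("avg_rating", "driver", "float", "team_b"))
names = registry.search("driver")

# Monitoring: flag a batch when too many values are missing (threshold is assumed).
rate = null_rate([5, None, 7, 9])
alert = rate > 0.2
```

A second team searching the registry for `"driver"` would find both features and could reuse them instead of rebuilding them, while the null-rate check stands in for the kind of data quality signal an operations team might act on.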
MLOps enhances quality, simplifies management, and automates the deployment of machine learning and deep learning models in large-scale production environments.
Kubeflow also deploys ML systems to various settings for development, testing, and production-level serving.
Feast, in turn, provides a consistent way to access features, both online for models being served and in batch for training.
Hassan Sherwani is the Head of Data Analytics and Data Science at Royal Cyber. He holds a PhD in IT and Data Analytics and has a decade's worth of experience in the IT industry, startups, and academia. Hassan is also gaining hands-on experience in machine (deep) learning for the energy, retail, banking, law, telecom, and automotive sectors as part of his professional development.