Feature engineering is one of the most challenging aspects of the machine learning (ML) lifecycle and the phase where the most time is spent: data scientists and ML engineers spend 60–70% of their time on it. At AWS re:Invent 2020, AWS introduced Amazon SageMaker Feature Store, a purpose-built, fully managed, centralised store for features and associated metadata.
Features are signals extracted from data to train ML models. The advantage of Feature Store is that the feature engineering logic is authored one time, and the generated features are stored on a central platform. This central store of features can be used for both training and inference, and can be reused across different data engineering teams.
Features in Feature Store are stored in a collection called a feature group. A feature group is analogous to a database table schema where columns represent features and rows represent individual records. Feature groups have been immutable since Feature Store was introduced. If we had to add features to an existing feature group, the process was cumbersome: we had to create a new feature group, backfill it with historical data, and modify downstream systems to use it. ML development is an iterative process of trial and error in which we may continuously identify new features that improve model performance. Not being able to add features to feature groups therefore complicates the ML model development lifecycle.
Feature Store recently introduced the ability to add new features to existing feature groups. A feature group schema evolves over time, either because of new business requirements or because new features have been identified that yield better model performance. Data scientists and ML engineers need to be able to add features to an existing feature group easily. This ability reduces the overhead of creating and maintaining multiple feature groups and therefore lends itself to iterative ML model development. Model training and inference can take advantage of new features in the same feature group with minimal changes.
In this post, we demonstrate how to add features to a feature group using the newly released UpdateFeatureGroup API.
Overview of solution
Feature Store acts as a single source of truth for feature-engineered data that is used in ML training and inference. When we store features in Feature Store, we store them in feature groups.
We can enable feature groups for offline-only mode, online-only mode, or both online and offline modes.
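As a rough illustration of the three modes, the following sketch builds the keyword arguments for the SageMaker CreateFeatureGroup call. The helper name `build_create_request` and the feature names (`customer_id`, `sex`, `is_married`, `age_range`, `active_since`, `event_time`) are assumptions made for this example and are not taken from the post's notebook.

```python
from typing import Any, Dict, List, Optional

# Hypothetical feature definitions for the customer dataset (names are assumptions).
customer_features: List[Dict[str, str]] = [
    {"FeatureName": "customer_id", "FeatureType": "Integral"},
    {"FeatureName": "sex", "FeatureType": "String"},
    {"FeatureName": "is_married", "FeatureType": "Integral"},
    {"FeatureName": "age_range", "FeatureType": "String"},
    {"FeatureName": "active_since", "FeatureType": "Integral"},
    {"FeatureName": "event_time", "FeatureType": "String"},
]


def build_create_request(
    name: str,
    feature_definitions: List[Dict[str, str]],
    record_id_feature: str,
    event_time_feature: str,
    online: bool = True,
    offline_s3_uri: Optional[str] = None,
    role_arn: Optional[str] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for a CreateFeatureGroup request (illustrative helper)."""
    if not online and offline_s3_uri is None:
        raise ValueError("enable the online store, the offline store, or both")
    request: Dict[str, Any] = {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": record_id_feature,
        "EventTimeFeatureName": event_time_feature,
        "FeatureDefinitions": feature_definitions,
    }
    if online:
        # Online mode: low-latency store holding the latest record per identifier.
        request["OnlineStoreConfig"] = {"EnableOnlineStore": True}
    if offline_s3_uri is not None:
        # Offline mode: historical records persisted to Amazon S3.
        # The role is used to write data into the offline store.
        if role_arn is None:
            raise ValueError("role_arn is needed when the offline store is enabled")
        request["OfflineStoreConfig"] = {"S3StorageConfig": {"S3Uri": offline_s3_uri}}
        request["RoleArn"] = role_arn
    return request
```

With real credentials, the resulting dictionary could be passed as `boto3.client("sagemaker").create_feature_group(**request)`. Omitting `offline_s3_uri` gives online-only mode, and passing `online=False` with an S3 URI gives offline-only mode.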
An online store is a low-latency data store that always holds the latest snapshot of the data. An offline store holds a historical set of records persisted in Amazon Simple Storage Service (Amazon S3). Feature Store automatically creates an AWS Glue Data Catalog for the offline store, which enables us to run SQL queries against the offline data using Amazon Athena.
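To make the Athena path concrete, here is a minimal sketch that derives a SQL query from the output of DescribeFeatureGroup, which reports the Glue database and table created for the offline store. The helper name `build_offline_query` and the sample column names are assumptions for illustration only.

```python
from typing import Any, Dict, Sequence, Tuple


def build_offline_query(
    describe_result: Dict[str, Any],
    columns: Sequence[str] = (),
    limit: int = 10,
) -> Tuple[str, str]:
    """Return (database, query) for the offline store table of a feature group."""
    # DescribeFeatureGroup exposes the Glue catalog details of the offline store.
    catalog = describe_result["OfflineStoreConfig"]["DataCatalogConfig"]
    database = catalog["Database"]
    table = catalog["TableName"]
    # Quote identifiers so unusual column or table names stay valid in Athena SQL.
    column_list = ", ".join(f'"{c}"' for c in columns) if columns else "*"
    query = f'SELECT {column_list} FROM "{database}"."{table}" LIMIT {int(limit)}'
    return database, query
```

The query could then be submitted with the Athena client, for example `athena.start_query_execution(QueryString=query, QueryExecutionContext={"Database": database}, ResultConfiguration={"OutputLocation": output_s3_uri})`, where `output_s3_uri` is a results location you choose.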
The following diagram illustrates the process of feature creation and ingestion into Feature Store.
The workflow contains the following steps:
- Define a feature group and create the feature group in Feature Store.
- Ingest data into the feature group, which writes to the online store immediately and then to the offline store.
- Use the offline store data stored in Amazon S3 for training one or more models.
- Use the offline store for batch inference.
- Use the online store supporting low-latency reads for real-time inference.
- To update the feature group to add a new feature, we use the new Amazon SageMaker UpdateFeatureGroup API. This also updates the underlying AWS Glue Data Catalog. After the schema has been updated, we can ingest data into this updated feature group and use the updated offline and online store for inference and model training.
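The last step of the workflow can be sketched with the boto3 SageMaker client, whose `update_feature_group` method accepts a `FeatureAdditions` list. The wrapper `add_features`, its polling parameters, and the example feature name `has_kids` are assumptions for this illustration; the post's notebook may structure this differently.

```python
import time
from typing import Any, Dict, List


def add_features(
    sm_client: Any,
    feature_group_name: str,
    feature_additions: List[Dict[str, str]],
    poll_seconds: float = 5.0,
    max_polls: int = 60,
) -> str:
    """Add features to an existing feature group and wait for the update to finish.

    sm_client is expected to behave like boto3.client("sagemaker").
    """
    sm_client.update_feature_group(
        FeatureGroupName=feature_group_name,
        FeatureAdditions=feature_additions,
    )
    # The schema update is asynchronous, so poll DescribeFeatureGroup for its status.
    for _ in range(max_polls):
        description = sm_client.describe_feature_group(
            FeatureGroupName=feature_group_name
        )
        last_update = description.get("LastUpdateStatus", {})
        status = last_update.get("Status")
        if status == "Successful":
            return status
        if status == "Failed":
            reason = last_update.get("FailureReason", "unknown reason")
            raise RuntimeError(f"feature group update failed: {reason}")
        time.sleep(poll_seconds)
    raise TimeoutError("feature group update did not finish in time")


# Hypothetical new feature; the name and type are assumptions for this sketch.
new_features = [{"FeatureName": "has_kids", "FeatureType": "Integral"}]
```

A call such as `add_features(boto3.client("sagemaker"), "my-feature-group", new_features)` would then return once the update reports success. Records ingested before the update simply have no value for the added feature.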
To demonstrate this new functionality, we use a synthetically generated customer dataset. Each record has a unique customer ID along with the customer's sex, marital status, age range, and how long they have been actively purchasing.
Let’s assume a scenario where a business is trying to predict the propensity of a customer to purchase a certain product, and data scientists have developed a model to predict this outcome. Let’s also assume that the data scientists have identified a new signal for the customer that could improve model performance and better predict the outcome. We work through this use case to understand how to update the feature group definition to add the new feature, ingest data for this new feature, and finally explore the online and offline stores to verify the changes.
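Ingesting a record that carries the new feature and reading it back from the online store might look like the following sketch. It relies on the `put_record` and `get_record` operations of the `sagemaker-featurestore-runtime` boto3 client; the helper names and the sample values are hypothetical.

```python
from typing import Any, Dict, List, Optional


def to_record(values: Dict[str, Any]) -> List[Dict[str, str]]:
    """Convert a plain dict into the list-of-features shape that PutRecord expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in values.items()
    ]


def ingest(runtime_client: Any, feature_group_name: str, values: Dict[str, Any]) -> None:
    """Write one record to the feature group.

    runtime_client is expected to behave like
    boto3.client("sagemaker-featurestore-runtime").
    """
    runtime_client.put_record(
        FeatureGroupName=feature_group_name,
        Record=to_record(values),
    )


def read_latest(
    runtime_client: Any, feature_group_name: str, record_id: Any
) -> Optional[Dict[str, str]]:
    """Fetch the latest record from the online store, or None if it is absent."""
    response = runtime_client.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=str(record_id),
    )
    record = response.get("Record")
    if record is None:
        return None
    return {feature["FeatureName"]: feature["ValueAsString"] for feature in record}


# Hypothetical record including the newly added feature (all values are made up).
sample = {
    "customer_id": 573291,
    "sex": "F",
    "is_married": 1,
    "age_range": "25-34",
    "active_since": 7,
    "has_kids": 1,
    "event_time": "2022-07-01T12:00:00Z",
}
```

After `ingest(client, "my-feature-group", sample)`, calling `read_latest(client, "my-feature-group", 573291)` should return a dictionary that includes the new feature, which is one way to confirm the schema change reached the online store.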
For this walkthrough, you should have the following prerequisites:
- An AWS account.
- A SageMaker Jupyter notebook instance. Access the code from the Amazon SageMaker Feature Store Update Feature Group GitHub repository and upload it to your notebook instance.
- You can also run the notebook in the Amazon SageMaker Studio environment, which is an IDE for ML development. You can clone the GitHub repo via a terminal inside the Studio environment using the following command:
git clone https://github.com/aws-samples/amazon-sagemaker-feature-store-update-feature-group.git
Don’t forget to clean up the resources created as part of this post to avoid incurring ongoing charges.
- Delete the S3 objects in the offline store:
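# Assumes s3 = boto3.resource('s3'), and that describe_feature_group_result
# and default_bucket were set earlier in the notebook.
s3_config = describe_feature_group_result['OfflineStoreConfig']['S3StorageConfig']
s3_uri = s3_config['ResolvedOutputS3Uri']
# Drop the 's3://<bucket>/' part of the URI to get the object key prefix.
full_prefix = '/'.join(s3_uri.split('/')[3:])
bucket = s3.Bucket(default_bucket)
offline_objects = bucket.objects.filter(Prefix=full_prefix)
offline_objects.delete()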
- Delete the feature group:
- Stop the SageMaker Jupyter notebook instance. For instructions, refer to Clean Up.
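For the feature group deletion step, one possible approach is the `delete_feature_group` operation of the boto3 SageMaker client. The wrapper below is a minimal sketch with a hypothetical name, and the notebook in the repository may use a different call (for example, through the SageMaker Python SDK).

```python
from typing import Any


def delete_feature_group(sm_client: Any, feature_group_name: str) -> None:
    """Delete a feature group by name.

    sm_client is expected to behave like boto3.client("sagemaker").
    Objects already written to the offline store in Amazon S3 are not removed
    by this call, which is why they are deleted in a separate step.
    """
    sm_client.delete_feature_group(FeatureGroupName=feature_group_name)
```

With real credentials this could be invoked as `delete_feature_group(boto3.client("sagemaker"), "my-feature-group")`, where the group name is a placeholder.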
Feature Store is a fully managed, purpose-built repository to store, share, and manage features for ML models. Being able to add features to existing feature groups simplifies iterative model development and removes the overhead of creating and maintaining multiple feature groups.
In this post, we showed you how to add features to existing feature groups via the newly released SageMaker UpdateFeatureGroup API. The steps shown in this post are available as a Jupyter notebook in the GitHub repository.