11 Mar Tooling up with a Feature Store
A Feature Store is a useful tool to help manage the Features used by Machine Learning models. In my previous blog I discussed the challenges of Feature management, how to curate Features, and the desirable capabilities of a Feature Store. In this blog I will build upon this by looking at how you can create and use a Feature Store. I will look at all the capabilities highlighted in my earlier blog and compare two different options for a Feature Store against them.
There are several Feature Stores available for you to use and they typically break down into four types: build your own, SaaS (like Amazon SageMaker Feature Store), Open Source, or commercial products that you must host. We will look at two of them:
- Build-Your-Own: If you’re just starting or want to use the basic capabilities, then I recommend using DynamoDB and S3.
- SageMaker Feature Store: If you need the advanced capabilities, then I would recommend looking at Amazon SageMaker Feature Store.
The following illustrates how you can build a basic feature store using DynamoDB and S3:
- Online Data: For online data in the above architecture, we are using DynamoDB tables. DynamoDB is an excellent choice as it offers single-digit millisecond latency for data access and it’s schema-less. For major changes to a feature, we would recommend using a different table per version. Please make sure you enable point-in-time recovery or back up your tables.
- Offline data: For offline data in the above architecture, we are using S3 buckets. An S3 bucket is again an excellent choice as it can store a lot of data at very low cost. It also supports versioning, tight access controls, archiving and logging to provide extra capabilities.
- Shared account and access controls: The most important element of this is a shared account that holds the S3 buckets and DynamoDB tables. This allows features to be shared between accounts and environments. However, it is important to consider the following:
- We recommend the use of IAM roles and Bucket and Table tags to control how changes are made to features during the lifecycle. For example, you do not want to allow direct changes to a production table from the experiment environment. If changes are needed, then maybe a new version of a table or bucket is created.
- We recommend encrypting all features at rest. Where you have different levels of data categorisation, consider using custom KMS keys for each level.
- We recommend using Service Catalog to allow Data Scientists and Data Engineers to create S3 buckets and DynamoDB tables. Service Catalog will use CloudFormation Templates to create S3 buckets and DynamoDB tables aligned to best practice such that the requirements on the Feature Store are consistently applied.
- Manual Updates: Jupyter Notebooks (especially those within SageMaker Studio) are a great tool for manually creating features and this should be done in a dedicated experiments account that is separated from the account that holds your feature store, to provide you the controls required.
- Regular Updates: AWS Glue and/or SageMaker Processing are powerful technologies for undertaking data preparation and feature engineering at scale. We recommend that, once features are used in production, these processes are automated and run from a pre-production or production account.
- Real-Time: There are two main streaming services in AWS that you can use here – Kinesis or Kafka. If you choose to use Kinesis, you can use Kinesis Data Analytics to do time window-based aggregations. Then you can use a Lambda to update your feature in real time (best to use an online feature for this). We recommend that these processes are automated and run from a pre-production account or production account once the feature logic is stabilised.
We would also recommend using long retention periods (7 days), storing the original messages in S3 with extra validation, and ensuring sufficient alerting and monitoring is in place to quickly spot any issues with the real-time update.
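The online part of the Build-Your-Own pattern above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the `_v<N>` table naming convention, the `entity_id` and `updated_at` attribute names, and all function names are hypothetical. The client is passed in so that you can supply `boto3.client("dynamodb")`; the only AWS call used is DynamoDB `put_item` with its low-level attribute format.

```python
def versioned_table_name(feature_group, version):
    """Name one table per major feature version, e.g. 'customer_features_v2'.

    The '_v<N>' suffix is an assumed convention, not an AWS requirement.
    """
    return f"{feature_group}_v{version}"


def to_attribute(value):
    """Convert a Python value to DynamoDB's low-level attribute format."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"BOOL": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # DynamoDB numbers are sent as strings
    return {"S": str(value)}


def build_put_item(table_name, entity_id, features, updated_at):
    """Build the keyword arguments for a DynamoDB put_item call."""
    # 'entity_id' and 'updated_at' are hypothetical attribute names.
    item = {"entity_id": {"S": entity_id}, "updated_at": {"S": updated_at}}
    for name, value in features.items():
        item[name] = to_attribute(value)
    return {"TableName": table_name, "Item": item}


def write_online_features(dynamodb_client, feature_group, version,
                          entity_id, features, updated_at):
    """Write one entity's features to the versioned online table."""
    table_name = versioned_table_name(feature_group, version)
    request = build_put_item(table_name, entity_id, features, updated_at)
    dynamodb_client.put_item(**request)
    return request
```

A Lambda function doing real-time updates could call `write_online_features` with a client created once outside the handler, while the IAM role attached to that Lambda decides which versioned tables it is allowed to write to.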
Amazon SageMaker Feature Store
AWS released Amazon SageMaker Feature Store as part of Amazon SageMaker Studio at re:Invent 2020; for brevity I am going to call it AWS SMFS. The following illustrates how you can use AWS SMFS in a multi-account architecture:
- Offline Features: AWS SMFS provides offline data access by saving your curated data inside S3 buckets. This is great, as you can put it in your data lake and use Athena with the Glue Data Catalog to query the data. You could also use Redshift Spectrum to access it in Redshift.
- Online Features: AWS SMFS provides online data access for real-time, low-latency use cases. We do not know how this is done, but I would imagine that it uses DynamoDB under the hood.
- Features Catalogue: AWS SMFS provides the ability to build a catalogue of all the features you have to aid discovery, reuse, and maintainability. To build a catalogue, you first need to organise your features into Feature Groups and Feature Definitions. The best way to think about this is that a group is basically like a table and a definition is like a column. You then have a Record identifier (primary key) for each record (or row) in the group, and values for the Feature Definitions.
- Metadata: For each Feature Group you can store additional metadata such as a description and tags to enrich your catalogue. To me, this is the most important thing to invest time in, as you need to make sure you name and tag your feature groups consistently if other Data Scientists are to use and understand them.
- Shared account and access controls: AWS SMFS, like all AWS services, uses IAM for controlling access and KMS to support encryption. In fact, with AWS SMFS you must encrypt at rest (there is no option not to) and data access is primarily via APIs using HTTPS. AWS SMFS provides access controls down to the Feature Group level, and you can use Tags to limit access to Feature Groups used within production systems.
- Manual Updates: AWS SMFS provides three ways to ingest data: the Data API using the PutRecord method, an EMR Spark job, and Amazon SageMaker Data Wrangler. Amazon SageMaker Data Wrangler is probably going to be the most useful for a Data Scientist, as it allows you to write custom Pandas code. To do this you need to create a flow, import your data from a file (Parquet) or Athena, and then copy and paste your code into a Custom Transform. However, for very big data sets you’re going to have to use EMR.
- Regular Updates: There is currently no in-built capability in AWS SMFS to trigger or run transforms. Therefore, you will need to implement them yourself using other AWS services, as in the ‘Build Your Own’ architecture.
- Real-Time Updates: AWS SMFS has full support for real-time updates using the PutRecord API. However, the PutRecord API supports only one record at a time and is synchronous. This means each call needs to be extremely fast; it can be integrated with Kafka or Kinesis. However, you cannot apply any transforms to a message before or after it reaches AWS SMFS. A workaround is to route messages through a Lambda function first.
- ML Ops / Pipelines: In this architecture, AWS SMFS is usable outside of SageMaker Studio via APIs, so you can integrate it into your own MLOps processing and inference.
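The Lambda workaround for real-time updates described above can be sketched roughly as follows. This is an illustration under assumptions rather than a reference implementation: the message fields (`customer_id`, `amount_pence`, `event_time`), the `last_amount` feature and the function names are hypothetical. What it relies on from AWS is the shape of a Kinesis event delivered to Lambda (base64-encoded `data` under `Records[].kinesis`) and the `put_record` call of the `sagemaker-featurestore-runtime` client, which takes one record at a time as a list of `FeatureName` / `ValueAsString` pairs.

```python
import base64
import json


def transform(message):
    """Hypothetical transform applied before the record reaches the Feature Store."""
    return {
        "customer_id": message["customer_id"],          # record identifier feature
        "last_amount": message["amount_pence"] / 100,   # derived feature
        "event_time": message["event_time"],            # event time feature
    }


def to_feature_record(features):
    """Convert a flat dict into the Record list that PutRecord expects."""
    return [
        {"FeatureName": name, "ValueAsString": str(value)}
        for name, value in features.items()
    ]


def handle_kinesis_event(event, featurestore_client, feature_group_name):
    """Decode each Kinesis record, transform it, and write it with PutRecord.

    PutRecord accepts a single record per call, so records are written one by one.
    Returns the number of records written.
    """
    written = 0
    for record in event["Records"]:
        message = json.loads(base64.b64decode(record["kinesis"]["data"]))
        featurestore_client.put_record(
            FeatureGroupName=feature_group_name,
            Record=to_feature_record(transform(message)),
        )
        written += 1
    return written
```

In a real Lambda you would create the client once with `boto3.client("sagemaker-featurestore-runtime")` outside the handler and pass it in; injecting it also makes the function easy to test with a stub.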
In my opinion AWS SMFS is at the MVP stage, and I would expect to revisit it in a year’s time and see that it has matured. Some things I wish AWS would address in the next year are:
- Multi-Tenancy: At Inawisdom we have clients that run multi-tenant data lakes and/or solutions for other people. For these use cases, I would love to see row-level access control (as seen in Lake Formation) and row-level encryption with the ability to use different KMS keys per tenant.
- Discovery in Studio: Currently the main issue is discovery in the multi-account approach. I am yet to find a way to see a Feature Store in another account from Amazon SageMaker Studio. I can access it at an API level but not from the nice UI in Studio, and I feel this would impair Data Scientists. Therefore a “Shared Feature Stores” view would be amazing, although this is more a common complaint about using Studio in a multi-account setup than about AWS SMFS.
- Non-Structured Data: AWS SMFS does not currently really support images, audio files and videos as data types. You can work around this if you base64 encode the binary files. However, this leads to them not being easily viewable.
- Monitoring + Trigger: AWS SMFS does not have the ability to monitor your feature in terms of upstream changes or time-based refreshing of data. Therefore, you cannot trigger a pipeline to refresh the feature if the source data has changed, or trigger retraining of models that use your feature group. To do this currently you have to use Amazon SageMaker Model Monitor, EventBridge and Step Functions.
- Model Dependency: AWS SMFS does not deeply integrate with the Amazon SageMaker Model Registry. This means that to enrich your Catalogue with which models use which features, you would need to use tags in AWS SMFS.
- Ingestion: All three ingestion types seem a little burdensome if you already have a Pandas script and Parquet files in S3. I would love the ability to import or govern files in S3, and also the ability to query or load data from Redshift.
- Versioning: AWS SMFS does not currently have its own versioning ability, so you must use a combination of naming conventions and tagging to implement this. I think this is a little risky if you do not have all the controls in place, as someone could inadvertently update an online feature group and you could not revert it. So, make sure a Tag and an IAM policy are used to safeguard against human error.
- Encoding: One of the biggest pain points I have experienced in ML engineering is handling the encoding of features and the logic needed to do that. A good example of this is a StandardScaler in scikit-learn. You need to fit it on your training data and then apply it at inference. At inference, a feature may have a value that lies outside the range seen in the training data, so it must still be scaled using the mean and standard deviation computed when the scaler was fitted.
To do this currently I save a copy of the scaler from pre-processing or training to S3 and then reuse it at inference. AWS SMFS does not support storing this scaler or recreating it. AWS SMFS only stores data currently.
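The fit-once, reuse-at-inference pattern for the scaler can be sketched as below. To keep the example dependency-free it uses a small pure-Python stand-in for scikit-learn's StandardScaler (population standard deviation, as StandardScaler uses) and serialises the fitted parameters to JSON; the function names are hypothetical, and in practice you might instead serialise the fitted scikit-learn object itself and upload the file to S3 alongside the model.

```python
import json
import statistics


def fit_scaler(training_values):
    """Compute the scaling parameters from the training data only."""
    return {
        "mean": statistics.mean(training_values),
        "scale": statistics.pstdev(training_values),  # population std deviation
    }


def dumps_scaler(params):
    """Serialise the fitted parameters, e.g. for upload to S3 next to the model."""
    return json.dumps(params)


def loads_scaler(text):
    """Restore the fitted parameters at inference time."""
    return json.loads(text)


def transform(params, value):
    """Scale a value with the training-time statistics.

    Values outside the training range are still scaled with the same mean and
    standard deviation; the parameters are never refitted at inference.
    """
    scale = params["scale"] or 1.0  # guard against zero variance
    return (value - params["mean"]) / scale
```

The key design point is that `fit_scaler` runs once during pre-processing or training, and inference only ever calls `loads_scaler` and `transform`, so both stages are guaranteed to encode the feature identically.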
As you scale your Machine Learning initiatives, it’s important to consider how you will treat Features as strategic and valuable assets. Using a Feature Store helps you achieve this, and this blog has shown you some of the Feature Store options available on AWS for you to use.
Both the Build-Your-Own and Amazon SageMaker Feature Store approaches are very similar. I see Amazon SageMaker Feature Store as an MVP of the existing established pattern for implementing a Feature Store in a centralised account on AWS (in other words, the same solution that Build-Your-Own is). Importantly, however, AWS have sorted out the security and the cataloguing of features for you.
These are hard things to implement and get right that require lots of experience of doing Machine Learning at scale, experience that Inawisdom and AWS have. Therefore, if you need help with building your own Feature Store or you need some of the advanced capabilities I mention in my last blog, Inawisdom is here to help you.