
Out of the box transformations for common dataset formats (e.g. TF records, Torch datasets etc.)

Open nialloh23 opened this issue 4 years ago • 2 comments

Is your feature request related to a problem? Please describe.

Every time we load data from our online (e.g. Redis) and offline feature stores, we typically have to convert the raw data into a format that's usable by our models. At the moment the most common data formats we use are TF records, Torch Datasets, numpy arrays, and pandas DataFrames. This isn't overly complicated, but it ends up being a lot of duplicated code across our services.

Describe the solution you'd like

When fetching data from the feature stores, Feast would have a thin layer of post-processing transformations that would let us load the data in the format we require (e.g. TF records). The user would be able to specify the data format they would like the data returned in. Example:

# Proposed API: data_format is the new parameter requested in this issue.
dataset = client.get_historical_features(
    feature_refs=features,
    entity_rows=entity_df,
    data_format="tf_record",
)

This is something I've seen other feature stores bake in and thought it was very useful (example video from Hopsworks). One thing I really like about this is that it creates an even stronger boundary between our data processing and our model. By the time we go to use our features in our model, they are ready to go in the format we need.

Describe alternatives you've considered

  1. We keep doing these data format transformations every time we fetch our features (what we do now).
  2. We materialize duplicated versions of our features in the feature store in different formats.

Additional context

Ignore my ignorance if this isn't feasible with the current architecture or long term vision for the boundaries of what Feast should and shouldn't do. Just thought it would be helpful to share some context on a feature that would be super useful for us in this feature management layer.

nialloh23 · Oct 29 '21 08:10

Hey @nialloh23, thanks for the super detailed feature request! get_historical_features currently returns a RetrievalJob, which can then be converted into pandas, numpy, or Arrow pretty easily (e.g. with the to_df method). Similarly, get_online_features returns an OnlineResponse, which has a to_df and a to_dict method. Would that work for your use case?
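As a rough illustration of how the duplicated conversion code could be kept in one place on top of the existing methods, here is a minimal sketch that works on a plain dict of feature name to list of values, which is the shape to_dict is assumed to return here. The names to_rows, to_matrix, CONVERTERS, and convert are hypothetical helpers for this sketch, not part of Feast, and the driver_id and conv_rate features are made-up sample data.

```python
from typing import Any, Callable, Dict, List

# Assumed input shape: feature name -> list of values, one value per entity row.
FeatureDict = Dict[str, List[Any]]


def to_rows(features: FeatureDict) -> List[Dict[str, Any]]:
    """Turn a column-oriented dict into one dict per entity row."""
    names = list(features)
    columns = [features[name] for name in names]
    return [dict(zip(names, values)) for values in zip(*columns)]


def to_matrix(features: FeatureDict) -> List[List[Any]]:
    """Turn a column-oriented dict into a row-major list of lists."""
    columns = list(features.values())
    return [list(values) for values in zip(*columns)]


# Hypothetical registry: one place to add further formats later on.
CONVERTERS: Dict[str, Callable[[FeatureDict], Any]] = {
    "rows": to_rows,
    "matrix": to_matrix,
}


def convert(features: FeatureDict, data_format: str) -> Any:
    """Dispatch to the converter registered for data_format."""
    try:
        converter = CONVERTERS[data_format]
    except KeyError:
        raise ValueError(f"Unsupported data_format: {data_format!r}") from None
    return converter(features)


# Made-up sample data standing in for a fetched feature dict.
sample = {"driver_id": [1001, 1002], "conv_rate": [0.5, 0.75]}
rows = convert(sample, "rows")
matrix = convert(sample, "matrix")
```

A service would call convert once on whatever it fetched, and a converter for a framework-specific format could be registered in the same dict without touching the callers.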

felixwang9817 · Oct 30 '21 00:10

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] · Sep 22 '22 03:09