Offline stores should support multiple data sources
Offline stores currently only support their corresponding data source. For example:
    class BigQueryOfflineStore(OfflineStore):
        def pull_latest_from_table_or_query(...):
            assert isinstance(data_source, BigQuerySource)
            ...
Users have asked about allowing offline stores to support multiple different data sources. This issue tracks that user request.
@felixwang9817 do you have more details on this request? Specifically what problem is being solved?
@woop the specific user request is here.
some thoughts from the team after a discussion:
there are two orthogonal issues here:
1. supporting multiple kinds of data sources per offline store (e.g. Athena on S3 + Redshift in a single offline store)
2. supporting multiple kinds of data sources residing in different offline stores (e.g. some data in Snowflake, some data in Postgres)

in this particular case, the user request is of type (2)
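to make the first kind of request (several data source kinds in one offline store) concrete, here is a minimal sketch. every class below is a hypothetical stand-in, not Feast's real API; the only point is that the single-type `isinstance` assertion could become a check against a tuple of supported types:

```python
from dataclasses import dataclass


# Hypothetical stand-ins for Feast's data source classes; not the real API.
@dataclass
class DataSource:
    name: str
    table: str


class RedshiftSource(DataSource):
    pass


class AthenaSource(DataSource):
    pass


class BigQuerySource(DataSource):
    pass


class MultiSourceOfflineStore:
    """Hypothetical offline store that accepts more than one source type."""

    # Source types this store is assumed to know how to query.
    SUPPORTED_SOURCES = (RedshiftSource, AthenaSource)

    def pull_latest_from_table_or_query(self, data_source: DataSource) -> str:
        # isinstance accepts a tuple, so one check covers every supported type.
        if not isinstance(data_source, self.SUPPORTED_SOURCES):
            raise TypeError(
                f"{type(data_source).__name__} is not supported by this store"
            )
        # Placeholder: a real store would build a source-specific query here.
        return f"SELECT * FROM {data_source.table}"
```

a real implementation would also need per-source query generation; the placeholder query string above only marks where that would go.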
both requests are very reasonable, but also not very common, so we won't be prioritizing either of them - however, we do have some general thoughts on implementation, and community members should feel free to take a pass at this
for (2), i.e. supporting multiple kinds of data sources residing in different offline stores, we could extend feature_store.yaml to support multiple offline stores. one issue we foresee is joining data across offline stores, so we could limit historical retrieval to only retrieve features from the same offline store. in the future, Trino could be an option to query across offline stores
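a rough sketch of the "limit historical retrieval to one offline store" idea, assuming feature_store.yaml were extended with several named offline stores. the config shape, the store names, the feature view names, and `resolve_offline_store` are all hypothetical; Feast's current config has a single `offline_store` entry:

```python
from typing import Dict, List

# Hypothetical parsed form of an extended feature_store.yaml with several
# named offline stores. The "type" values are illustrative only.
OFFLINE_STORES: Dict[str, Dict[str, str]] = {
    "warehouse": {"type": "snowflake"},
    "app_db": {"type": "postgres"},
}

# Hypothetical mapping from feature view name to the store holding its data.
FEATURE_VIEW_STORE: Dict[str, str] = {
    "user_stats": "warehouse",
    "item_stats": "warehouse",
    "events": "app_db",
}


def resolve_offline_store(feature_refs: List[str]) -> str:
    """Return the one store backing all refs; raise if they span stores.

    Refs use the "feature_view:feature" form.
    """
    stores = {FEATURE_VIEW_STORE[ref.split(":")[0]] for ref in feature_refs}
    if len(stores) != 1:
        raise ValueError(
            f"historical retrieval must use exactly one offline store, "
            f"got {sorted(stores)}"
        )
    return stores.pop()
```

with this check, `resolve_offline_store(["user_stats:avg_spend", "item_stats:views"])` resolves to `"warehouse"`, while mixing in `"events:clicks"` raises instead of attempting a cross-store join.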
I have a request for type (2). We connect to up to 3 different data sources. Normally it's a users table, an items table, and an events table (containing the interactions between users and items). Most of the time these 3 tables live in the same data source type, but sometimes they are stored in different places (Postgres, BigQuery, Firebase, etc.)
@felixwang9817 @woop Created #3168. We would like to be able to query for features from multiple data sources at scale, and we would like to use Spark for that. Since the Spark offline store is not really a store but a computation engine that can connect to pretty much any source (you just need to provide a JDBC driver and connection string), it would fit this ideally. With that, Feast would have a general-purpose engine to get features from multiple different sources at scale. You wouldn't have to maintain a growing list of offline stores and data sources. The Feast team could focus resources on Spark and add more functionality going forward instead of continuing to add new offline stores and data sources. Let me know your opinions.
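For illustration, a minimal sketch of what reading one table through Spark's JDBC reader could look like. The helper names `jdbc_options` and `read_jdbc_table` are hypothetical and not part of Feast or #3168; the `url`, `dbtable`, and `driver` keys are Spark's standard JDBC options, and the connection values in the usage comment are placeholders:

```python
from typing import Any, Dict


def jdbc_options(url: str, table: str, driver: str) -> Dict[str, str]:
    """Build the options Spark's JDBC reader expects for one table."""
    return {"url": url, "dbtable": table, "driver": driver}


def read_jdbc_table(spark: Any, options: Dict[str, str]) -> Any:
    """Load one JDBC table as a Spark DataFrame.

    `spark` is assumed to be a pyspark.sql.SparkSession whose classpath
    already contains the JDBC driver named in `options`.
    """
    return spark.read.format("jdbc").options(**options).load()


# Usage (placeholder connection details, requires a running SparkSession):
# users = read_jdbc_table(
#     spark,
#     jdbc_options(
#         "jdbc:postgresql://localhost:5432/app",
#         "public.users",
#         "org.postgresql.Driver",
#     ),
# )
```

Each source would get its own options, and the resulting DataFrames could then be joined inside Spark, which is the cross-source step the proposal relies on.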
+1
@ckarwicki @dbg-raghulkrishna ultimately you need to get all the data into one place if you want get_historical_features() to work, as you need to bring together multiple feature views. For materialize(), this might work a bit easier, as each feature_view doesn't need to interact with the others.
Feast assumes that the features you are registering are, quote, "ready for modeling". Why silo those features in different stores and then rely on the feature store to unsilo them? Feast really doesn't have an opinion on your feature pipeline architecture.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
bump
Can we re-open the issue? It is very common to have data across multiple databases, especially if the company is not small and has more than one product.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.