Offline stores should support multiple data sources

Open felixwang9817 opened this issue 4 years ago • 11 comments

Offline stores currently only support their corresponding data source. For example:

class BigQueryOfflineStore(OfflineStore):
    def pull_latest_from_table_or_query(...):
        assert isinstance(data_source, BigQuerySource)
        ...

Users have asked about allowing offline stores to support multiple different data sources. This issue tracks that user request.
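For illustration only, here is a minimal sketch of what relaxing that assertion could look like: a store that dispatches on the data source type instead of asserting a single one. Every name below is a hypothetical stand-in defined in the snippet (nothing is imported from Feast), and the one-argument `pull_latest_from_table_or_query` is a simplification of the elided signature above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Type


@dataclass
class DataSource:
    """Hypothetical stand-in for a data source; not the Feast class."""
    table: str


class BigQuerySource(DataSource):
    """Stand-in only; shares a name with the Feast class but is unrelated."""


class FileSource(DataSource):
    """Stand-in only."""


class MultiSourceOfflineStore:
    """Hypothetical store that looks up a reader per data source type."""

    def __init__(self) -> None:
        self._readers: Dict[Type[DataSource], Callable[[DataSource], str]] = {}

    def register(
        self, source_type: Type[DataSource], reader: Callable[[DataSource], str]
    ) -> None:
        self._readers[source_type] = reader

    def pull_latest_from_table_or_query(self, data_source: DataSource) -> str:
        # Dispatch on the concrete type instead of asserting a single type.
        reader = self._readers.get(type(data_source))
        if reader is None:
            raise TypeError(
                f"unsupported data source: {type(data_source).__name__}"
            )
        return reader(data_source)


store = MultiSourceOfflineStore()
# The readers just build a query string here; real readers would run a job.
store.register(BigQuerySource, lambda s: f"SELECT * FROM `{s.table}`")
store.register(FileSource, lambda s: f"read_parquet('{s.table}')")
```

An unregistered source type still fails, but with an explicit `TypeError` rather than an assertion.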

felixwang9817 avatar Dec 08 '21 01:12 felixwang9817

@felixwang9817 do you have more details on this request? Specifically what problem is being solved?

woop avatar Dec 11 '21 22:12 woop

@woop the specific user request is here.

felixwang9817 avatar Dec 14 '21 20:12 felixwang9817

some thoughts from the team after a discussion:

there are two orthogonal issues here:

  1. supporting multiple kinds of data sources per offline store (e.g. Athena on S3 + Redshift in a single offline store)
  2. supporting multiple kinds of data sources residing in different offline stores (e.g. some data in Snowflake, some data in Postgres)

in this particular case, the user request is of type (2)

both requests are very reasonable, but also not very common, so we won't be prioritizing either of them - however, we do have some general thoughts on implementation, and community members should feel free to take a pass at this

for (2), i.e. supporting multiple kinds of data sources residing in different offline stores, we could extend feature_store.yaml to support multiple offline stores. one issue we foresee is joining data across offline stores, so we could limit historical retrieval to only retrieve features from the same offline store. in the future, Trino could be an option to query across offline stores
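a rough sketch of the "same offline store only" restriction for historical retrieval, as an assumption about how it might look rather than anything implemented in Feast: the `FEATURE_VIEW_STORES` mapping, the store names, and `resolve_offline_store` are all hypothetical, and feature references are assumed to use the `view:feature` form.

```python
from typing import Dict, List

# Hypothetical mapping from feature view name to the offline store holding it.
# In practice this would come from configuration, not a hard-coded dict.
FEATURE_VIEW_STORES: Dict[str, str] = {
    "user_stats": "snowflake",
    "item_stats": "snowflake",
    "click_events": "postgres",
}


def resolve_offline_store(feature_refs: List[str]) -> str:
    """Return the one store serving all refs; raise if they span stores."""
    stores = set()
    for ref in feature_refs:
        # Assumes refs look like "feature_view:feature_name".
        view_name = ref.split(":", 1)[0]
        stores.add(FEATURE_VIEW_STORES[view_name])
    if len(stores) != 1:
        raise ValueError(
            f"features must come from exactly one offline store, got {sorted(stores)}"
        )
    return stores.pop()
```

with this check, a request mixing `user_stats` and `click_events` features is rejected up front instead of attempting a cross-store join.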

felixwang9817 avatar Dec 17 '21 06:12 felixwang9817

I have a request for type (2). We connect to up to 3 different data sources. Normally it's a users table, an items table, and an events table (containing the interactions between users and items). These 3 tables most of the time live in the same datasource type but sometimes the tables are stored in different places (Postgres, BigQuery, Firebase, etc.)

shaibruhis avatar Apr 08 '22 05:04 shaibruhis

@felixwang9817 @woop Created #3168. We would like to be able to query for features from multiple data sources at scale, and we would like to use Spark for that. The Spark offline store is not really a store but a computation engine that can connect to pretty much any source (you just need to provide a JDBC driver and connection string), so it would fit this ideally. With that, Feast would have a general-purpose engine to get features from multiple different sources at scale. You wouldn't have to maintain a growing list of offline stores and data sources. The Feast team could focus resources on Spark and add more functionality going forward instead of continuing to add new offline stores and data sources. Let me know your opinions.

ckarwicki avatar Sep 06 '22 01:09 ckarwicki

+1

dbg-raghulkrishna avatar Sep 08 '22 17:09 dbg-raghulkrishna

@ckarwicki @dbg-raghulkrishna ultimately you need to get all the data into one place if you want get_historical_features() to work, since it has to bring together multiple feature views. For materialize(), this might work a bit more easily, as each feature_view doesn't need to interact with the others.

Feast assumes that the features you are registering are, quote, "ready for modeling" -- why silo those features in different stores and then rely on the feature store to unsilo them? Feast really doesn't have an opinion on your feature pipeline architecture.

sfc-gh-madkins avatar Dec 30 '22 02:12 sfc-gh-madkins

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 20 '23 16:05 stale[bot]

bump

sfc-gh-madkins avatar May 20 '23 16:05 sfc-gh-madkins

Can we re-open the issue? It is very common to have data across multiple databases, especially if the company is not small and has more than one product.

svenaoki avatar Jun 13 '23 11:06 svenaoki

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 15 '23 16:10 stale[bot]