BigQuery OOM for large datasets
## Expected Behavior
No failures when running historical retrieval with BigQuery
## Current Behavior
Users are experiencing out-of-memory (OOM) errors with BigQuery when running commands like:

```python
df = store.get_historical_features(
    entity_df="""
    SELECT
        user_id,
        driver_id,
        TIMESTAMP '2021-03-21 16:01:21' AS observed_at
    FROM production.table
    LIMIT 1000000
    """,
    feature_refs=features,
).to_df()
```
which leads to:

```
raise self._exception
google.api_core.exceptions.BadRequest: 400 Resources exceeded during query execution: The query could not be executed in the allotted memory. Peak usage: 134% of limit.
Top memory consumer(s):
sort operations used for analytic OVER() clauses: 100%
```
From the user:

> because the original dataset contains 7B+ unique identifiers (which means that some FeatureViews over it can have 30B+ rows)
## Steps to reproduce
It's necessary to run a `get_historical_features()` call that queries feature views with a large number of rows. It's not clear exactly how large the table must be before an OOM exception is thrown.
## Possible Solution
The original point-in-time join query for BigQuery was written in Feast 0.3 and used through 0.7. That query also suffered from OOM problems, but those problems were resolved. A possible reason the current query doesn't work is that there are slight differences in how it is executed compared to Feast 0.7:
- We are adding unique IDs to entity rows in the entity table. We can use `GENERATE_UUID()` instead of `ROW_NUMBER()`, since it doesn't require the whole dataset to be collected on a single node.
- We can also compute the unique IDs a single time as part of creating the entity table, instead of doing it as part of a `WITH` clause.
- The original BigQuery query did not join all feature views to the entity dataframe in a single query. It joined each feature view in a separate query that produced a temporary table. Once those tables have gone through point-in-time joins, their row counts are significantly reduced (to the size of the entity table), after which they can be joined together safely in a single query. We would just need to separate out this query and make sure that the for loop here creates new temp tables instead of keeping all data in memory.
- An easy optimization is to filter each feature view on the outer bounds of the entity dataframe's date range. We never need to query further back than the earliest timestamp in the entity dataframe (minus the TTL).
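As a rough sketch of the first and last suggestions above (all table and column names below, such as `entity_df`, `row_id`, and `observed_at`, are hypothetical, not Feast's actual schema), the entity table could be materialized once with `GENERATE_UUID()`, and each feature view scan could be bounded by the entity dataframe's timestamp range:

```python
def build_entity_table_sql(entity_sql: str) -> str:
    # GENERATE_UUID() tags each entity row without the global sort that
    # ROW_NUMBER() OVER () forces onto a single worker.
    return (
        "CREATE TEMP TABLE entity_df AS\n"
        "SELECT GENERATE_UUID() AS row_id, *\n"
        f"FROM ({entity_sql})"
    )


def build_feature_view_filter_sql(feature_table: str, ttl_seconds: int) -> str:
    # Only scan feature rows inside the entity dataframe's timestamp range,
    # extended backwards by the feature view's TTL.
    return (
        "SELECT *\n"
        f"FROM {feature_table}\n"
        "WHERE event_timestamp BETWEEN\n"
        "    TIMESTAMP_SUB((SELECT MIN(observed_at) FROM entity_df),\n"
        f"                  INTERVAL {ttl_seconds} SECOND)\n"
        "    AND (SELECT MAX(observed_at) FROM entity_df)"
    )
```

Because the UUIDs are written once into the temp table, later stages can join on `row_id` without repeating the ID computation in a `WITH` clause.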
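The per-feature-view temp table idea could look roughly like this (a sketch only; the `_pit_` naming scheme, `row_id`, and the join shape are assumptions, not Feast's actual implementation):

```python
def build_point_in_time_plan(feature_view_sqls: dict, entity_table: str) -> list:
    # One CREATE TEMP TABLE per feature view's point-in-time join, then a
    # final query that joins the already-reduced temp tables on row_id.
    statements = []
    joins = []
    for name, sql in feature_view_sqls.items():
        tmp = f"_pit_{name}"  # hypothetical temp table naming scheme
        statements.append(f"CREATE TEMP TABLE {tmp} AS {sql}")
        joins.append(f"LEFT JOIN {tmp} USING (row_id)")
    # Each temp table is at most entity-table-sized, so this final join is cheap.
    statements.append(f"SELECT * FROM {entity_table}\n" + "\n".join(joins))
    return statements
```

Each `CREATE TEMP TABLE` statement would run as its own BigQuery job, so no single query has to hold every feature view's rows in memory at once.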
Context: The reason we currently run the complete historical retrieval in a single query is that it doesn't require Feast users to have write permissions in BigQuery. Once we move to creating temp tables, we will have to ask users for write permissions.