
Enable invalidating cache table by data date

Open jc-harrison opened this issue 3 years ago • 0 comments

Sometimes it is necessary to re-ingest data for a historic date (e.g. if backfill CDR data are received for a date for which data were previously incomplete). In this situation, we want to invalidate cache tables derived from that day of data, as the results of the corresponding queries may now have changed. However, it would be preferable not to invalidate all cache tables: most of them probably did not consume data from the date in question, so most cached results will usually still be valid.

It would be useful to be able to invalidate cached results by data date (ideally this invalidation step would then be incorporated into FlowETL CDR ingestion DAGs). I think the most straightforward solution would be for EventTableSubset queries to include Tables for the relevant date partitions in their dependencies. One could then "invalidate the cache" for a date partition and cascade up to all cache tables depending on that date.
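To make the cascade concrete, here is a minimal, self-contained sketch of the idea. It is not FlowKit's cache API: the `CacheGraph` class, its methods, and the table names are all hypothetical, and the cache is modelled as an in-memory set rather than database tables. Date partitions are registered only as dependencies; invalidating one removes every cache entry that depends on it, directly or indirectly.

```python
from collections import defaultdict


class CacheGraph:
    """Hypothetical in-memory stand-in for a cache with dependency tracking."""

    def __init__(self):
        # Maps a node to the set of cache entries that depend on it directly.
        self._dependents = defaultdict(set)
        # Names of cache entries that are currently valid.
        self.cached = set()

    def add(self, node, depends_on=()):
        """Record a cache entry and the nodes (partitions or entries) it consumed."""
        self.cached.add(node)
        for dep in depends_on:
            self._dependents[dep].add(node)

    def invalidate(self, node):
        """Remove every cache entry reachable from `node`; return their names, sorted."""
        removed = []
        seen = set()
        stack = [node]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current in self.cached:
                self.cached.discard(current)
                removed.append(current)
            # Dependents of an invalidated node are invalid too.
            stack.extend(self._dependents.pop(current, ()))
        return sorted(removed)


graph = CacheGraph()
# Hypothetical names: two subset queries, each depending on date partitions.
graph.add(
    "subset_20160101_20160102",
    depends_on=["events.calls_20160101", "events.calls_20160102"],
)
graph.add("subset_20160103", depends_on=["events.calls_20160103"])
# Downstream queries depend on the subsets, not on the partitions directly.
graph.add("daily_location_a", depends_on=["subset_20160101_20160102"])
graph.add("daily_location_b", depends_on=["subset_20160103"])

# Re-ingesting 2016-01-02 invalidates only what was derived from that partition.
removed = graph.invalidate("events.calls_20160102")
print(removed)
print(sorted(graph.cached))
```

In this sketch the entries built from the 2016-01-03 partition survive, which is the behaviour asked for above: only results that consumed the re-ingested date are dropped.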

Something to watch out for is a query that consumes data from a date range containing a missing day, where data for the missing day are later ingested. In this situation the associated cache table would not have a dependency on the newly-added date partition, because the partition didn't exist when the query was run. I think the preferable solution here would be to ensure that the query ID for the relevant EventTableSubset query changes when the new date is added, so that the previous cache tables can be kept (and could be re-used if a user specifically re-runs the query excluding the previously-missing date) but are not re-used when re-running the query for the same date range. I expect this would fall out naturally from the solution described above, but it's something to look out for nonetheless.
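One way to picture the query ID behaviour is the sketch below. It is an illustration under stated assumptions, not how FlowKit computes query IDs: the `subset_query_id` helper, the hashing scheme, and the choice of an inclusive start and exclusive stop are all hypothetical. The ID is derived from the table name and the dates actually consumed, so it changes once a previously missing date becomes available.

```python
import hashlib
from datetime import date, timedelta


def dates_in_range(start, stop):
    """All dates from `start` (inclusive) to `stop` (exclusive); an assumption of this sketch."""
    return [start + timedelta(days=i) for i in range((stop - start).days)]


def subset_query_id(table, start, stop, available):
    """Hypothetical ID: hash of the table name and the dates the query would consume."""
    used = [d for d in dates_in_range(start, stop) if d in available]
    payload = "|".join([table, *(d.isoformat() for d in used)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


start, stop = date(2016, 1, 1), date(2016, 1, 4)

# 2016-01-02 is missing when the query is first run.
available = {date(2016, 1, 1), date(2016, 1, 3)}
id_before = subset_query_id("events.calls", start, stop, available)

# The missing day is ingested later; the same date range now yields a new ID.
available.add(date(2016, 1, 2))
id_after = subset_query_id("events.calls", start, stop, available)

# A query that explicitly excludes the newly added date maps back to the old ID.
id_excluding = subset_query_id(
    "events.calls", start, stop, available - {date(2016, 1, 2)}
)

print(id_before != id_after)
print(id_before == id_excluding)
```

Because the old ID is still reachable by excluding the new date, the earlier cache table can be kept and re-used for that specific query, while a re-run over the full range gets a fresh ID and therefore a fresh result.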

jc-harrison, Jul 11 '22 14:07