modern-data-warehouse-dataops
Parking Sensors: Pipeline Replayability?
One of the core ideas in this project is that pipelines should be replayable.
In the current pipeline setup it appears that a single pipeline controls the ingestion, standardization, and transformation of the data.
- In the worst case, if there is a bug in the standardization notebook, how does this pipeline facilitate replayability? Re-running the pipeline ingests new data and processes only that.
- If we do track historical data, how do we re-run the standardization and transform steps on all historical data, in the order it was received?
Maybe I'm reading too much into it, but shouldn't we be able to replay the standardization and transformation steps from scratch based solely on the data that already exists in the data lake (processing it in the order received), without ingesting new data?
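To make the question concrete, here is a minimal sketch of what I mean by "replay from the lake". It assumes a hypothetical layout where each ingestion batch lands under `raw/<ingestion_timestamp>/` — the function names and directory structure are illustrative, not the repo's actual code:

```python
import tempfile
from pathlib import Path

def replay_from_lake(lake_root, standardize, transform):
    """Re-run standardization and transformation over every raw batch
    already landed in the lake, oldest first, without ingesting new data.
    Assumes a hypothetical layout: <lake_root>/raw/<ingestion_ts>/*.json.
    """
    raw = Path(lake_root) / "raw"
    # Timestamp-named folders sort lexicographically into ingestion order.
    for batch in sorted(raw.iterdir(), key=lambda p: p.name):
        for part in sorted(batch.glob("*.json")):
            transform(standardize(part.read_text()))

# Minimal demo: two "ingestion" batches landed on different days.
with tempfile.TemporaryDirectory() as lake:
    for ts, payload in [("2021-01-01T00", "a"), ("2021-01-02T00", "b")]:
        batch_dir = Path(lake) / "raw" / ts
        batch_dir.mkdir(parents=True)
        (batch_dir / "part.json").write_text(payload)

    processed = []
    # Stand-in steps: a fixed standardization bug would be re-applied (or,
    # once fixed, corrected) across *all* historical batches on replay.
    replay_from_lake(lake, standardize=str.upper, transform=processed.append)
    print(processed)  # records processed in the order they were received
```

With something like this, fixing a bug in `standardize` and re-running would reprocess the entire history in received order, rather than only the newly ingested slice.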