
Parking Sensors: Pipeline Replayability?

Open ericmichael opened this issue 4 years ago • 0 comments

One of the core ideas in this project is that pipelines should be replayable.

In the current pipeline setup it appears that a single pipeline controls the ingestion, standardization, and transformation of the data.

  1. Suppose, in the worst case, there is a bug in the standardization notebook — how does this pipeline facilitate replayability? Re-running the pipeline ingests new data and processes only that new data.
  2. If we track historical data, how do we re-run the standardization and transformation steps on all historical data in the order it was received?

Maybe I'm reading too much into it, but shouldn't we be able to replay the standardization and transformation steps from scratch based solely on the data that already exists in the data lake (processing it in the order it was received), without ingesting new data?
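To make the ask concrete, here is a minimal sketch of what such a replay might look like: walk the raw zone's partitions in arrival order and re-run only the downstream steps. The partition layout (`ingest_date=`), the `standardize()`/`transform()` functions, and the path names are all assumptions for illustration, not the sample's actual API.

```python
from datetime import date, timedelta


def raw_partitions(start: date, end: date):
    """Yield raw-zone partition paths in arrival (ingest date) order.

    Assumes a hypothetical date-partitioned layout in the lake.
    """
    day = start
    while day <= end:
        yield f"lake/raw/parking_sensors/ingest_date={day.isoformat()}"
        day += timedelta(days=1)


def standardize(raw_path: str) -> str:
    # Placeholder for the standardization notebook's logic:
    # read raw, clean/conform, write to the standardized zone.
    return raw_path.replace("/raw/", "/standardized/")


def transform(std_path: str) -> str:
    # Placeholder for the transformation step:
    # read standardized, apply business logic, write to the curated zone.
    return std_path.replace("/standardized/", "/curated/")


def replay(start: date, end: date) -> list[str]:
    """Re-run standardization + transformation over existing raw data only,
    in the order it was received — no new ingestion."""
    return [transform(standardize(p)) for p in raw_partitions(start, end)]
```

The key property is that ingestion is decoupled from the downstream steps, so a bug fix in the standardization notebook can be replayed over all historical raw data without pulling anything new from the source.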

ericmichael avatar Mar 29 '21 21:03 ericmichael