
[Lake][ETL] DuckDB E2E - Ingestion -> Dashboards

Open · idiom-bytes opened this issue on Feb 27 '24

Motivation

We completed the tech spikes around our data infrastructure. As an outcome, we're going to implement our ETL pipeline in DuckDB in a way that preserves our key constraints:

  • We can test everything end-to-end (pytest, DuckDB, plots)
  • We can reduce requirements on servers/infra/ops by using an in-memory, embedded, on-disk DB (DuckDB; see the sketch after this list)
  • We can continue doing distributed computing (Ray)
  • The DB store can grow elastically with LVM, EFS, or Filestore
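
To make the embedded-DB constraint concrete, here is a minimal sketch (not pdr-backend code; the table and file names are made up) of how DuckDB runs in-process under pytest, either in memory or against a single on-disk file:

```python
# Minimal sketch: DuckDB is an embedded, in-process database, so an
# end-to-end test needs no server or ops setup. Table/file names are
# illustrative, not the actual pdr-backend schema.
import duckdb


def test_embedded_duckdb(tmp_path):
    # In-memory: nothing touches disk; ideal for fast pytest runs.
    con = duckdb.connect(":memory:")
    con.execute("CREATE TABLE t (ts BIGINT, v DOUBLE)")
    con.execute("INSERT INTO t VALUES (1, 0.5), (2, 0.7)")
    assert con.execute("SELECT count(*) FROM t").fetchone()[0] == 2

    # On-disk: same API, but the store persists in a single file that can
    # grow elastically with the underlying volume (LVM/EFS/Filestore).
    con2 = duckdb.connect(str(tmp_path / "lake.duckdb"))
    con2.execute("CREATE TABLE t AS SELECT * FROM range(3)")
    con2.close()
```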

Outline

Our first goal is to take the current ETL workflow and update it end-to-end.

[Screenshot from 2024-02-29 13-51-46]

Shelved Deliverables

CLOSED TICKET - Add ETL checkpoint to enforce SLAs and process data incrementally. #694

Reason: Instead, we're going to implement a build step that leverages a simple SQL strategy with temp tables, so we can enforce SLAs in a clean manner.
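
A minimal sketch of that build strategy, under assumed table names (raw_predictions, bronze_predictions are illustrative, not the actual lake schema): build the increment into a temp table, run the SLA checks there, and only then publish into the live table.

```python
# Sketch of the temp-table build step: nothing reaches the live table
# until the SLA checks pass on the temp (build) table.
import duckdb

con = duckdb.connect(":memory:")
# Toy schema standing in for the real lake tables.
con.execute("CREATE TABLE raw_predictions (timestamp BIGINT, payout DOUBLE)")
con.execute("CREATE TABLE bronze_predictions (timestamp BIGINT, payout DOUBLE)")
con.execute("INSERT INTO raw_predictions VALUES (1, 0.1), (2, 0.2)")

# 1. Build the new increment into a temp table.
con.execute("""
    CREATE TEMP TABLE _build_bronze_predictions AS
    SELECT timestamp, payout FROM raw_predictions
    WHERE timestamp > (SELECT coalesce(max(timestamp), 0) FROM bronze_predictions)
""")

# 2. Enforce the SLA on the build table (real checks might cover freshness,
#    duplicates, null rates, ...).
n_new = con.execute("SELECT count(*) FROM _build_bronze_predictions").fetchone()[0]
if n_new == 0:
    raise RuntimeError("SLA violated: no new rows since the last checkpoint")

# 3. Publish the validated increment, then discard the build table.
con.execute("INSERT INTO bronze_predictions SELECT * FROM _build_bronze_predictions")
con.execute("DROP TABLE _build_bronze_predictions")
```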

DoD

[First Deliverable - Update Ingestion + Load]

  • [x] Update GQL local/save-to-disk to use csv #681
  • [x] Fetch + insert new records into db #682

[Core System Updates]

  • [x] Improve ETL by using SQL build strategy & temp tables to enforce SLAs #771

[Update ETL Deliverables]

  • [x] DuckDB - Update the existing logic that queries data from parquet via polars to instead query data from PersistentDataStore using SQL.
  • [x] DuckDB - Provide a unifying etl_ view for ETL build steps. Downstream bronze and silver tables require data from both live_ and build_ tables. #810
  • [x] DuckDB - Update ETL "bronze step" to use duckdb/sql #738
  • [x] DuckDB - Clean up any remaining queries that fetch whole tables. No SELECT * FROM table_name anywhere. #809
  • [x] DuckDB - Create logic to support an etl_view that joins both live + temp tables, making it easier to query data as it's being built #810
  • [x] DuckDB - Port latest ETL "bronze slots" to use duckdb/sql #740
  • [x] DuckDB - Extend test coverage to simulate multiple runs, and cover more edge cases. #811
  • [x] DuckDB - Abstract the [Prod->Temp->View] pattern (see the sketch after this list) #881
  • [x] Accuracy - Update accuracy endpoint to use the ETL to get and store the data. #854
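
A sketch of the [Prod->Temp->View] pattern and the unifying etl_ view (#810, #881), again under assumed table names: downstream bronze/silver steps read a single view that unions the published live_ table with the in-flight build_ table, with explicit column lists per the no-SELECT-* cleanup (#809).

```python
# Sketch: one etl_ view over live_ + build_ tables, so downstream steps can
# query data while it is still being built. Names are illustrative.
import duckdb

con = duckdb.connect(":memory:")
con.execute("CREATE TABLE live_bronze_predictions (timestamp BIGINT, v DOUBLE)")
con.execute("CREATE TEMP TABLE build_bronze_predictions (timestamp BIGINT, v DOUBLE)")
con.execute("INSERT INTO live_bronze_predictions VALUES (1, 0.1)")
con.execute("INSERT INTO build_bronze_predictions VALUES (2, 0.2)")

# Explicit column lists, per the "no SELECT * FROM table_name" rule.
con.execute("""
    CREATE TEMP VIEW etl_bronze_predictions AS
    SELECT timestamp, v FROM live_bronze_predictions
    UNION ALL
    SELECT timestamp, v FROM build_bronze_predictions
""")

rows = con.execute(
    "SELECT timestamp, v FROM etl_bronze_predictions ORDER BY timestamp"
).fetchall()
assert rows == [(1, 0.1), (2, 0.2)]
```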

[ETL CLI Deliverables]

  • [x] Update ETL CLI and provide a clean interface that helps enforce SLAs. #703
  • [x] Create pdr analytics describe/query/validate/resume CLI commands #883

[Cleanup Deliverables]

  • [x] Clean up the DuckDB implementation: remove hanging dependencies, remove table.df caching, switch from parquet to CSV, and clean up tests. #737
  • [x] Clean up truevalue and predvalue #664
  • [x] Improve CSVDataStore and PersistentDataStore instantiation inside Table #773
  • [x] Cleanup Tables, query them directly through CSVDataStore + PersistentDataStore #772
  • [x] Remove reference to parquet_dir use lake_dir instead #770
  • [x] PDS thread tests are failing, although CI/CD passes. #941
  • [x] OHLCV + CSVDS have been reconciled #858
  • [x] Lake CLI update command is implemented #953
  • [x] Fix issue where CSV data exists but the raw table won't be recreated #1038
  • [x] Fix pdr-slots and query logic that's causing the pipeline to break #1036
  • [x] Time functions are returning local timestamps and breaking the pipeline #1070
  • [x] Use max(values) to get the max timestamp across all DuckDB tables (see the sketch after this list) #1049
  • [x] Expand validation tool to check/report duplicate rows #1058
  • [x] Update GQLDF to use the same end_ts for all subgraph queries. #1068
  • [x] Stabilize Lake Validation & CLI Commands #1036
  • [x] Fix drop logic to use the proper timestamp, and not drop the whole table #1079
  • [ ] Verify lake behavior is working as expected #1000
  • [ ] Readme and UX for working the lake is documented and works well end-to-end #1002
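
For the checkpoint item above (#1049), a hedged sketch of one way to compute the max timestamp across all DuckDB tables, assuming every lake table carries a timestamp column (the real implementation may differ):

```python
# Sketch: reduce per-table max(timestamp) values to one lake-wide checkpoint.
# Assumes every table has a BIGINT `timestamp` column; names are illustrative.
import duckdb

con = duckdb.connect(":memory:")
con.execute("CREATE TABLE pdr_predictions (timestamp BIGINT)")
con.execute("CREATE TABLE pdr_payouts (timestamp BIGINT)")
con.execute("INSERT INTO pdr_predictions VALUES (100), (200)")
con.execute("INSERT INTO pdr_payouts VALUES (150)")

# Enumerate user tables from DuckDB's catalog, then take the max of maxes.
table_names = [
    row[0]
    for row in con.execute("SELECT table_name FROM duckdb_tables()").fetchall()
]
per_table_max = [
    con.execute(f"SELECT max(timestamp) FROM {name}").fetchone()[0]
    for name in table_names
]
checkpoint = max(ts for ts in per_table_max if ts is not None)
assert checkpoint == 200
```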

[Post-DuckDB Merge] These are frozen. Do not start/complete until DuckDB review/work is complete.

  • [x] Streamlit - Integrate predictoor_income work and dashboards #612
  • [x] Update OHLCV Data Factory to use DataStores #769
  • [ ] Re-enable slots and subscription raw tables
  • [ ] Re-enable bronze_slots tables
  • [ ] Verify incremental table updates are working as expected #1001
  • [ ] DuckDB - Bronze slots table - #595
  • [ ] DuckDB - Implement "silver predictions" #665
  • [ ] DuckDB - Port latest ETL "silver predictions" to use duckdb/sql + close old PR #741
  • [ ] DuckDB - Silver predictions SQL PR #848
  • [ ] DuckDB - Use bronze-predictions for checkpoint #982
  • [ ] Data Store Objects - Rename functions to use SQL nomenclature: fill becomes insert, override becomes upsert (see the sketch after this list)
  • [ ] OHLCV + CSVDS will be updated after DuckDB has been updated #769
  • [ ] Fix the incremental pipeline by using bronze_predictions as the checkpoint #982
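
For the renaming item above (fill -> insert, override -> upsert), a minimal sketch of what upsert could map onto: DuckDB supports ON CONFLICT upserts against a PRIMARY KEY or UNIQUE column. The table here is hypothetical.

```python
# Sketch of the proposed SQL nomenclature: fill -> insert, override -> upsert.
import duckdb

con = duckdb.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v DOUBLE)")

# "insert" (was: fill) -- plain append.
con.execute("INSERT INTO t VALUES (1, 0.1)")

# "upsert" (was: override) -- insert, or update the existing row on key conflict.
con.execute("""
    INSERT INTO t VALUES (1, 0.9)
    ON CONFLICT (id) DO UPDATE SET v = excluded.v
""")
assert con.execute("SELECT v FROM t WHERE id = 1").fetchone()[0] == 0.9
```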

idiom-bytes · Feb 27 '24 15:02

I discovered some issues related to data fetching on the main branch. Because multiple things are being rewritten in the ETL flow, I'm leaving them here in case they got solved along the way; if not, this may be the place to solve them:

  1. Columns don't match: "timestamp" and "tx_id" - solved if the file is deleted. [Screenshot 2024-04-02 at 10.05.49.png]

  2. Error on saving data - solved if the gql command is rerun. [Screenshot 2024-04-02 at 10.12.45.png]

KatunaNorbert · Apr 02 '24 08:04

Simply integrate the accuracy code when it's ready to go...

Do not modify the lake when using old code... the idea was to just use the old code, which is what we'll be doing... https://github.com/oceanprotocol/pdr-backend/issues/1102 https://github.com/oceanprotocol/pdr-backend/pull/1103/

Please do not develop these further.

idiom-bytes · May 30 '24 20:05

Closing this ticket and moving the remaining items to another ticket so I can reconcile outstanding issues.

idiom-bytes · Jun 25 '24 14:06