persist: Write structured columnar data to S3
This PR adds a V0 implementation of writing structured columnar data to S3.
To start, it introduces two dyncfgs to control this:

- `persist_batch_columnar_format`: controls the format we'll use to write batches, either `Row`, which is the same as the current behavior, or `Both`, which writes both row data and the structured columnar format.
- `persist_part_decode_format`: controls how we decode parts; the current options are `Row { validate_structured: bool }`. We will always decode the `Row` data, but we will optionally also decode and validate the structured data.
Using these two dyncfgs, we update the Arrow/Parquet handling to optionally encode and decode two new columns, `k_s` and `v_s`, which are structured versions of the existing `k` and `v` columns.
If structured encoding is turned on, we reuse the columnar Part we already create for collecting statistics: instead of throwing it away, we write it to S3, so the write path shouldn't become any more expensive. If structured decoding is enabled, we will optionally decode a K and V from our structured data, check that they match what we decode from the Row data, and record the result of that check in a Prometheus metric for monitoring.
### Tests
I updated the testdrive mzcompose framework to support setting dyncfgs and added a new nightly test slug that runs all of testdrive with structured data enabled.
### Motivation
Progress towards https://github.com/MaterializeInc/materialize/issues/24830
### Tips for reviewer
This PR is split up into logical commits that can be reviewed independently:
- Adds dyncfgs and their plumbing
- Updates our Arrow and Parquet encoding to optionally support structured data
- Updates the read and write paths for a Part to use structured data, if the dyncfgs are enabled
- Lint fixes
### Checklist
- [ ] This PR has adequate test coverage / QA involvement has been duly considered.
- [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
- [ ] If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
- [x] This PR includes the following user-facing behavior changes:
    - Adds a feature-gated ability to write structured columnar data.
### Mitigations
Completing required mitigations increases Resilience Coverage.
- [x] (Required) Code Review 🔍 Detected
- [x] (Required) Feature Flag
- [x] (Required) Integration Test 🔍 Detected
- [x] (Required) Observability 🔍 Detected
- [ ] (Required) QA Review
- [ ] (Required) Run Nightly Tests
- [x] Unit Test 🔍 Detected
Risk Summary:
The risk associated with this pull request is high, with a score of 83, indicating a significant likelihood of introducing a bug. Historically, pull requests sharing these characteristics have been 127% more likely to result in bugs compared to the repository baseline. This risk is influenced by the average line count in files and the number of executable lines within files. Additionally, the pull request modifies 2 files that have recently seen a high number of bug fixes, which may contribute to the risk. The repository's observed bug trend is currently increasing, separate from the predicted bug trend.
Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.
Bug Hotspots:
| File | Percentile |
|---|---|
| ../src/cfg.rs | 91 |
| ../src/lib.rs | 91 |
Happy to see the testdrive run, but we should make sure to ping testing team, and it'd be pretty interesting to see how nightly behaves with this fully enabled before merging.
The testdrive run currently fails, actually: there is a panic inside of arrow2 that I need to dig into; I'm not sure if it's our usage or a bug in the crate. Regardless, I will also reach out to the testing team to get additional coverage here.
Moving this to draft while I wait for https://github.com/MaterializeInc/materialize/pull/27009 to merge.
I'd love to see timing coverage for the various encode/decode/validate things
Added! Currently the metrics are a little wonky because we track the amount of time it takes to encode an entire Part, but when decoding we track how long it takes to decode a single item. I can push the encoding measurement down to the individual row, let me know!
probably also some amount of counts of the major paths, so we can see the dyncfgs taking effect (and that we're not hitting some unexpected more-than-one-row-group case, causing us to fall back to the legacy behavior)
Added counts for encoding a Part, decoding a Row, validation, and number of row groups in Parquet.
would be cool if we could get some sense of the size overhead, but not sure if there's a good way to do that without encoding the parquet representation twice. probably not that interesting until we do some non-Plain compression anyway.
We can get this info from the `FileMetaData` returned by Parquet; added size metrics for all of our top-level columns.
Ugh, sorry to pile on one more thing, but I was thinking last night and realized we probably want to be able to tell from metadata which batches have the old format and which have the new format. Might be worth doing that from the start? I'd certainly do it before we turn it on more widely in staging, if not in this PR
(maybe ensure that each batch is all one or the other so we don't have to write it down per part?)
No worries! Do you mind if I make this change in a follow up PR? This one has already been open long enough that I would like to get it merged. Definitely won't turn this on in Staging (other than my own environment possibly) until making this change though!
@bkirwi @danhhz this should be ready for a re-review when you get the chance! Apologies for having force pushed and thus losing the previous review progress. The core logic didn't change, other than re-writing the Arrow/Parquet bits with arrow-rs and parquet-rs, and the majority (all?) of the new changes are around metrics and testing.
\o/
Merged to make this week's release, more than happy to follow up with any requested changes :)
None from me, lgtm!