materialize icon indicating copy to clipboard operation
materialize copied to clipboard

storage: factor StorageCollections out of StorageController

Open aljoscha opened this issue 1 year ago • 4 comments

[!NOTE]
The first commits are fixes that are required to pass CI, I have opened separate PRs for them. You should only have to look at the last commit.

Preparatory work for https://github.com/MaterializeInc/materialize/issues/24845, where we want to introduce more concurrency to the Coordinator and Controllers.

The considerations/design are described in doc/developer/design/20240117_decoupled_storage_controller.md

This is an intermediate step where we factor a StorageCollections out of StorageController, and let the StorageController use it's interface instead of holding collections state/sinces itself.

One of the next steps is to change usage sites of StorageController to use their own handle to a StorageCollections, bypassing the StorageController for query-processing (PEEKS, SUBSCRIBE, etc.) code paths. This will let us introduce more concurrency in the Coordinator and do less work in the main Coordinator loop.

The important parts in this change are:

  • StorageCollections::new and StorageController::new: these closely mirror each other.
  • StorageCollections::create_collections and StorageController::create_collections, ditto!
  • A lot of the rest is "boilerplate", passing through calls to the internal StorageCollections, and using StorageCollections from the controller instead of using owned state.
  • The last interesting thing to lock at is the new BackgroundTask: it takes over the work that persist_handles::PersistReadWorker was doing before plus it continually listens for upper changes and forwards the since frontier of collections/their since handles.

Checklist

  • [ ] This PR has adequate test coverage / QA involvement has been duly considered. (trigger-ci for additional test/nightly runs)
  • [ ] This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • [ ] If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • [ ] This PR includes the following user-facing behavior changes:

aljoscha avatar May 13 '24 13:05 aljoscha

@nrainer-materialize I kicked of a nightly run already because this is quite a deep change for how the StorageController works

aljoscha avatar May 14 '24 14:05 aljoscha

Risk Score:82 / 100 Bug Hotspots:4 Resilience Coverage:50%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • [x] (Required) Code Review 🔍 Detected
  • [ ] (Required) Feature Flag
  • [ ] (Required) Integration Test
  • [x] (Required) Observability 🔍 Detected
  • [x] (Required) QA Review 🔍 Detected
  • [ ] (Required) Run Nightly Tests
  • [ ] Unit Test
Risk Summary:

The pull request carries a high risk, with a score of 82, indicating a significant chance of introducing bugs. This assessment is based on predictors such as the average age of files, cognitive complexity within files, and changes to executable lines. Historically, pull requests with similar characteristics are 117% more likely to cause a bug compared to the repository baseline. Additionally, 4 files modified in this pull request have a recent history of frequent bug fixes, which contributes to the risk. The repository's observed bug trend is on an upward trajectory, although this is not directly tied to the risk score.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots: What's This?

File Percentile
../sequencer/inner.rs 99
../src/lib.rs 100
../src/controller.rs 94
../controller/instance.rs 91

shepherdlybot[bot] avatar May 14 '24 17:05 shepherdlybot[bot]

@nrainer-materialize there is a regression in the feature benchmark, but I don't see how the change could only affect full outer joins. It might be a flake?

Other than that, nightly seems good? I did restart RQG dbt3-joins workload once because it timed out.

aljoscha avatar May 15 '24 09:05 aljoscha

@nrainer-materialize there is a regression in the feature benchmark, but I don't see how the change could only affect full outer joins. It might be a flake?

Let's trigger a retry and see...

nrainer-materialize avatar May 15 '24 09:05 nrainer-materialize

@guswynn thanks for the very thorough review, I pushed a lot of individual fixup commits that address your comments!

@teskje also pushed commits for your comments. And also, thanks!

aljoscha avatar May 23 '24 13:05 aljoscha

fixup commits look good!

guswynn avatar May 23 '24 17:05 guswynn