Tablets 4.x
(Copied over from commit message) Introduces tablets support for version 4.x of the driver. Metadata about tablets will be kept in TabletMap that gets continuously updated through the tablets-routing-v1 extension. Each time the PreparedStatement or BoundStatement targets the wrong node and shard combination the server supporting tablets should respond with tablet metadata inside custom payload of its response. This metadata will be transparently processed and used for future queries.
Tablets metadata will by enabled by default. Until now driver was using TokenMaps to choose replicas and appropriate shards. Having a token was enough information to do that. Now driver will first attempt tablet-based lookup and only after failing to find corresponding tablet it will defer to TokenMap lookup. Since to find a correct tablet besides the token we need the keyspace and table names, many of the methods were extended to also accept those as parameters.
RequestHandlerTestHarness was adjusted to mock also MetadataManager.
Before it used to mock only session.getMetadata() call but the same can
be obtained by context.getMetadataManager().getMetadata(). Using the
second method was causing test failures.
@Bouncheck can you please describe the current status of it? Where we're in terms of code completion / testing / what's left / any estimation for completion?
Also, if review is already required or not and if any blockers exist at the moment.
Currently I've made almost complete switch to the solution that uses custom payloads instead of querying system.tablets. I've ran some tests on local 4 node cluster and so far there were no issues with reading the payloads and selecting the correct replica node. The only thing current code lacks is choosing a shard correctly. I think the code will be complete during this week.
@Bouncheck can you please update what’s the status and what’s this PR waiting for?
Currently it's waiting for review. I've done some manual testing, but haven't done any stress testing under heavy load.
Thanks @Bouncheck, can you please check the failing CI check? Also, please ping someone from the team to help you with a peer review.
Regarding stress testing, I don't know if we have a tool/app that utilizes this driver version, IIRC there was a problem to build c-s with this driver version. However, if it's not the case anymore or we can utilize newer c-s from the upstream that we can build with this driver, we can have a longevity test using SCT.
Will do. At first i thought 2024.1.2 failures were a one off, since it looked liked there's a lot of similar test errors relating to starting the cluster (and only for that scylla version), but it seems they reproduce after all.
@Bouncheck what's the status of this PR?
@Lorak-mmk can you please help reviewing this?
ping I see review comments from @Lorak-mmk from ~3 weeks ago that weren't discussed.
Yes, I've seen them and I'm reworking those parts. Should be ready soon
I'll push the code likely today - I want to run 1 more concurrency experiment.
@Bouncheck is it ready for re-review? If so, please request @avelanarius and @Lorak-mmk reviewing it ASAP.
@dani-tweig - this important PR is not tracked anywhere - there's no 'tablets' label for it, etc.
Pushed another version:
- Removed redundant/unused fields and collections from DefaultTablet
- Added NonNull annotation to pair structures
- Removed NodeShardPair altogether
- Added comment about thread safety to DefaultTabletMap
- Fixed comments about sweep during tabletAdd in DefaultTabletMap
- Removed unused elements from tests
- Removed
inferTable, addedgetRoutingTableinRequest; Provided overrides forBatchStatementandBoundStatement. Ultimately I did not addsetRoutingTableorroutingTablefields. I do not have a usecase for them right now and this saves me from adjusting statement builders and tests. - Removed unnecessary nullity checks and other misc. items pointed out
- Reduced 1 partitioner.hash(key) call in BasicLoadBalancingPolicy in case when falling back to TokenMap
- Switched to CqlIdentifiers instead of Strings when handling keyspace and table in tablet contexts
I see CI is failing, do you know why? Also, I remember you mentioned that Tablet tests were not being executed, because of some edge case with version comparing. Is hat fixed?
Last time I checked this test these same two methods (should_apply_node_filter, should_use_round_robin_on_local_dc_when_not_enough_routing_information) were failing and it was due to one node failing to start. This changes actual number of nodes seen in local dc and changes the output in second method to different than expected because round robin does not include that downed node.
The problem with tablets integration test was that the CI used to pull release candidate version for which I've noticed it was skipping this test. It's pulling 6.0.0 now which is the minimal required version and if you search the logs for DefaultMetadataTabletMapIT you can see that it indeed does run the test and launch 2 nodes it's configured for.