java-driver icon indicating copy to clipboard operation
java-driver copied to clipboard

Tablets 4.x

Open Bouncheck opened this issue 2 years ago • 7 comments

(Copied over from commit message) Introduces tablets support for version 4.x of the driver. Metadata about tablets will be kept in TabletMap that gets continuously updated through the tablets-routing-v1 extension. Each time the PreparedStatement or BoundStatement targets the wrong node and shard combination the server supporting tablets should respond with tablet metadata inside custom payload of its response. This metadata will be transparently processed and used for future queries.

Tablets metadata will by enabled by default. Until now driver was using TokenMaps to choose replicas and appropriate shards. Having a token was enough information to do that. Now driver will first attempt tablet-based lookup and only after failing to find corresponding tablet it will defer to TokenMap lookup. Since to find a correct tablet besides the token we need the keyspace and table names, many of the methods were extended to also accept those as parameters.

RequestHandlerTestHarness was adjusted to mock also MetadataManager. Before it used to mock only session.getMetadata() call but the same can be obtained by context.getMetadataManager().getMetadata(). Using the second method was causing test failures.

Bouncheck avatar Aug 21 '23 09:08 Bouncheck

@Bouncheck can you please describe the current status of it? Where we're in terms of code completion / testing / what's left / any estimation for completion?

Also, if review is already required or not and if any blockers exist at the moment.

roydahan avatar Mar 10 '24 13:03 roydahan

Currently I've made almost complete switch to the solution that uses custom payloads instead of querying system.tablets. I've ran some tests on local 4 node cluster and so far there were no issues with reading the payloads and selecting the correct replica node. The only thing current code lacks is choosing a shard correctly. I think the code will be complete during this week.

Bouncheck avatar Mar 11 '24 10:03 Bouncheck

@Bouncheck can you please update what’s the status and what’s this PR waiting for?

roydahan avatar Mar 28 '24 14:03 roydahan

Currently it's waiting for review. I've done some manual testing, but haven't done any stress testing under heavy load.

Bouncheck avatar Mar 30 '24 00:03 Bouncheck

Thanks @Bouncheck, can you please check the failing CI check? Also, please ping someone from the team to help you with a peer review.

Regarding stress testing, I don't know if we have a tool/app that utilizes this driver version, IIRC there was a problem to build c-s with this driver version. However, if it's not the case anymore or we can utilize newer c-s from the upstream that we can build with this driver, we can have a longevity test using SCT.

roydahan avatar Mar 31 '24 17:03 roydahan

Will do. At first i thought 2024.1.2 failures were a one off, since it looked liked there's a lot of similar test errors relating to starting the cluster (and only for that scylla version), but it seems they reproduce after all.

Bouncheck avatar Apr 02 '24 14:04 Bouncheck

@Bouncheck what's the status of this PR?

@Lorak-mmk can you please help reviewing this?

roydahan avatar Apr 25 '24 11:04 roydahan

ping I see review comments from @Lorak-mmk from ~3 weeks ago that weren't discussed.

bhalevy avatar May 21 '24 07:05 bhalevy

Yes, I've seen them and I'm reworking those parts. Should be ready soon

Bouncheck avatar May 21 '24 08:05 Bouncheck

I'll push the code likely today - I want to run 1 more concurrency experiment.

Bouncheck avatar May 28 '24 12:05 Bouncheck

@Bouncheck is it ready for re-review? If so, please request @avelanarius and @Lorak-mmk reviewing it ASAP.

roydahan avatar Jun 02 '24 15:06 roydahan

@dani-tweig - this important PR is not tracked anywhere - there's no 'tablets' label for it, etc.

mykaul avatar Jun 05 '24 08:06 mykaul

Pushed another version:

  • Removed redundant/unused fields and collections from DefaultTablet
  • Added NonNull annotation to pair structures
  • Removed NodeShardPair altogether
  • Added comment about thread safety to DefaultTabletMap
  • Fixed comments about sweep during tabletAdd in DefaultTabletMap
  • Removed unused elements from tests
  • Removed inferTable, added getRoutingTable in Request; Provided overrides for BatchStatement and BoundStatement. Ultimately I did not add setRoutingTable or routingTable fields. I do not have a usecase for them right now and this saves me from adjusting statement builders and tests.
  • Removed unnecessary nullity checks and other misc. items pointed out
  • Reduced 1 partitioner.hash(key) call in BasicLoadBalancingPolicy in case when falling back to TokenMap
  • Switched to CqlIdentifiers instead of Strings when handling keyspace and table in tablet contexts

Bouncheck avatar Jun 10 '24 08:06 Bouncheck

I see CI is failing, do you know why? Also, I remember you mentioned that Tablet tests were not being executed, because of some edge case with version comparing. Is hat fixed?

Lorak-mmk avatar Jun 10 '24 10:06 Lorak-mmk

Last time I checked this test these same two methods (should_apply_node_filter, should_use_round_robin_on_local_dc_when_not_enough_routing_information) were failing and it was due to one node failing to start. This changes actual number of nodes seen in local dc and changes the output in second method to different than expected because round robin does not include that downed node.

The problem with tablets integration test was that the CI used to pull release candidate version for which I've noticed it was skipping this test. It's pulling 6.0.0 now which is the minimal required version and if you search the logs for DefaultMetadataTabletMapIT you can see that it indeed does run the test and launch 2 nodes it's configured for.

Bouncheck avatar Jun 10 '24 10:06 Bouncheck