
backup: high request rate to Meta2 range during backup

Open stevendanna opened this issue 8 months ago • 8 comments

Describe the problem

In a large scale test cluster, the hourly backup appears to cause an increase in sql latencies on 1 node:

[image: per-node SQL latency graph]

This appears correlated with a node getting a large number of scan requests:

[image: per-node scan request rate]

We note that this node is also the leaseholder for range 1, which holds the range descriptors. The backup scans the range descriptors so that it can better align export requests to range boundaries:

https://github.com/cockroachdb/cockroach/blob/838bcb171017937a827d5aab1eba9fc4f31732c6/pkg/backup/backup_processor.go#L309-L344
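
To make that code path concrete, here is a simplified, self-contained sketch of the idea (the `rangeDesc` and `lookupRangeDescriptors` names are hypothetical stand-ins, not the actual backup_processor implementation): each span to back up is chopped at range boundaries so that a single export request never straddles two ranges, and it is that boundary lookup which turns into scans of Meta2.

```go
// Simplified illustration of why backups scan Meta2: each span to back up is
// partitioned at range boundaries so every export request targets one range.
// The descriptor lookup itself is what scans the Meta2 keyspace.
package main

import "fmt"

// rangeDesc is a hypothetical stand-in for a range descriptor: [StartKey, EndKey).
type rangeDesc struct{ StartKey, EndKey string }

// lookupRangeDescriptors is a hypothetical stand-in for the Meta2 scan that
// returns the descriptors overlapping [start, end).
func lookupRangeDescriptors(start, end string) []rangeDesc {
	return []rangeDesc{
		{StartKey: "a", EndKey: "g"},
		{StartKey: "g", EndKey: "p"},
		{StartKey: "p", EndKey: "z"},
	}
}

type span struct{ Key, EndKey string }

// partitionSpan splits the backup span [start, end) at range boundaries.
func partitionSpan(start, end string) []span {
	var out []span
	for _, rd := range lookupRangeDescriptors(start, end) {
		lo, hi := max(start, rd.StartKey), min(end, rd.EndKey)
		if lo < hi {
			out = append(out, span{Key: lo, EndKey: hi})
		}
	}
	return out
}

func main() {
	// One backup span becomes three range-aligned export requests.
	fmt.Println(partitionSpan("c", "t"))
}
```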

Jira issue: CRDB-51216

Epic CRDB-51482

stevendanna avatar Jun 04 '25 10:06 stevendanna

Hi @stevendanna, please add branch-* labels to identify which branch(es) this C-bug affects.

🦉 Hoot! I am Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] avatar Jun 04 '25 10:06 blathers-crl[bot]

cc @cockroachdb/disaster-recovery

blathers-crl[bot] avatar Jun 04 '25 10:06 blathers-crl[bot]

@jeffswenson asks a good question: why hasn't the Meta2 range split?

In the logs we see a steady stream of:

error during range split: unable to split [n179,s357,r1/146:‹/{Min-System/NodeL…}›] at key /Meta2/Table/114/4/‹3599815›/‹10›/‹-1197›/‹10›: ‹could not find valid split key›

stevendanna avatar Jun 04 '25 14:06 stevendanna

Possibly related: https://github.com/cockroachdb/cockroach/issues/119421

stevendanna avatar Jun 04 '25 15:06 stevendanna

We manually added a split at /Meta2 to allow load-based splitting to take over. After several hours of hourly backups, this has resulted in 3 splits in Meta2:

  range_id |                                          start_pretty                                           |                                           end_pretty
-----------+-------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------
         1 | /Min                                                                                            | /Meta2/Table/114/1/691272/2/-2161/3
   1351850 | /Meta2/Table/114/1/691272/2/-2161/3                                                             | /Meta2/Table/114/4/3477354/9/-497/5
   1360818 | /Meta2/Table/114/4/3477354/9/-497/5                                                             | /System/NodeLiveness
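
For reference, a split at a non-SQL key such as /Meta2 cannot be expressed with the usual `ALTER TABLE ... SPLIT AT`; one way to issue it programmatically is through the KV client. A hedged sketch, assuming the current shape of the `kv`, `keys`, and `hlc` packages (the exact `AdminSplit` signature may vary by release):

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/keys"
	"github.com/cockroachdb/cockroach/pkg/kv"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// splitAtMeta2 issues a manual split at the start of the Meta2 keyspace so
// that load-based splitting can then operate on the resulting ranges.
// hlc.MaxTimestamp marks the split as sticky, i.e. the merge queue will not
// automatically merge it away.
func splitAtMeta2(ctx context.Context, db *kv.DB) error {
	return db.AdminSplit(ctx, keys.Meta2Prefix, hlc.MaxTimestamp)
}
```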

While this does help, it still isn't enough to completely avoid the top-of-the-hour spikes in SQL latency on these ranges:

[image: SQL latency during the top-of-the-hour backup]

Perhaps we need some improvements to how the splitter tracks load under this type of bursty workload.
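
To illustrate why a bursty workload is hard for a load-based splitter (this is a toy model with made-up numbers, not CockroachDB's actual split decider): an exponentially decaying QPS estimate sees a short top-of-the-hour burst and then largely forgets it before the next hour, so the sustained-load signal needed to trigger a split never accumulates.

```go
// Toy model of a load-based split decider using an exponentially decaying
// QPS estimate. A short hourly burst decays away quickly, so the estimate
// spends almost no time above the split threshold.
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	halfLife       = 5 * time.Minute // how quickly old load is forgotten
	splitThreshold = 2500.0          // QPS above which we'd consider splitting
)

type decider struct {
	qps      float64   // decayed queries-per-second estimate
	lastTick time.Time // time of the last recorded sample
}

// record folds a sample of `count` requests over `dur` into the estimate.
func (d *decider) record(now time.Time, count float64, dur time.Duration) {
	if !d.lastTick.IsZero() {
		elapsed := now.Sub(d.lastTick).Seconds()
		d.qps *= math.Exp2(-elapsed / halfLife.Seconds()) // half-life decay
	}
	d.qps += count / dur.Seconds()
	d.lastTick = now
}

func main() {
	var d decider
	t := time.Date(2025, 6, 4, 10, 0, 0, 0, time.UTC)

	// A 1-minute burst of 10k QPS at the top of the hour...
	d.record(t, 600000, time.Minute)
	fmt.Printf("during burst: %.0f qps (split? %v)\n", d.qps, d.qps > splitThreshold)

	// ...has mostly decayed by the time the splitter looks again.
	d.record(t.Add(30*time.Minute), 0, time.Minute)
	fmt.Printf("30m later:    %.0f qps (split? %v)\n", d.qps, d.qps > splitThreshold)
}
```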

I'll also note that the range iterator we are using doesn't set an admission control header. Perhaps we need the ability for the caller to pass its own header so that uses like those in backup can be admission controlled.
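
As a sketch of what passing our own header could look like (hedged: it assumes the `kvpb.AdmissionHeader` and `admissionpb` packages keep their current shape, and wiring it into the range iterator is hypothetical), the caller would stamp the scan batch with a bulk/elastic priority before sending it:

```go
import (
	"github.com/cockroachdb/cockroach/pkg/kv/kvpb"
	"github.com/cockroachdb/cockroach/pkg/util/admission/admissionpb"
	"github.com/cockroachdb/cockroach/pkg/util/timeutil"
)

// bulkAdmissionHeader classifies a request as elastic/bulk work rather than
// foreground traffic, so the Meta2 scans issued by backup would queue behind
// latency-sensitive requests under admission control.
func bulkAdmissionHeader() kvpb.AdmissionHeader {
	return kvpb.AdmissionHeader{
		Priority:   int32(admissionpb.BulkNormalPri),
		CreateTime: timeutil.Now().UnixNano(),
		Source:     kvpb.AdmissionHeader_FROM_SQL,
	}
}

// Hypothetical usage: attach the header to the scan batch before sending it.
//   var ba kvpb.BatchRequest
//   ba.AdmissionHeader = bulkAdmissionHeader()
```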

stevendanna avatar Jun 05 '25 08:06 stevendanna

I think there are broadly four things we can try here:

  1. Adjust the priority of the requests so that they enter elastic admission control.
  2. Interleave range lookups with sst export requests to spread them out over time. We would also need to introduce a slight skew in processor startup to avoid the initial burst arriving all at once (see the sketch after this list).
  3. Send larger scans to minimize overhead per RPC.
  4. Somehow ensure that the number of Meta2 ranges scales with the number of nodes in the system (maybe one for every 10 nodes or something like that).
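
A sketch of what (2) could look like in the processor; the helper names are hypothetical stand-ins, and the point is the startup jitter plus per-range pacing rather than the exact API:

```go
// Sketch of option (2): jitter processor startup and interleave the Meta2
// range lookups with the export work instead of front-loading all lookups.
package main

import (
	"context"
	"math/rand"
	"time"
)

// Hypothetical stand-ins for the real processor plumbing.
func lookupNextRangeBoundary(ctx context.Context) (more bool) { return false }
func exportCurrentRange(ctx context.Context) error            { return nil }

const maxStartupJitter = 15 * time.Second

func runBackupProcessor(ctx context.Context) error {
	// Skew processor startup so hundreds of processors don't all hit the
	// Meta2 leaseholder at exactly the top of the hour.
	jitter := time.Duration(rand.Int63n(int64(maxStartupJitter)))
	select {
	case <-time.After(jitter):
	case <-ctx.Done():
		return ctx.Err()
	}

	// Interleave: one small Meta2 lookup, then the (much slower) export for
	// that range, rather than resolving every boundary up front.
	for {
		more := lookupNextRangeBoundary(ctx)
		if err := exportCurrentRange(ctx); err != nil {
			return err
		}
		if !more {
			return nil
		}
	}
}

func main() {
	_ = runBackupProcessor(context.Background())
}
```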

jeffswenson avatar Jun 05 '25 14:06 jeffswenson

@msbutler @stevendanna - is this work owned by the backup team or by KV? Which release have you planned to schedule this for?

cc: @alicia-l2

dshjoshi avatar Jun 18 '25 15:06 dshjoshi

As I mentioned to you in chat, this is tentatively KV, but DR might pick it up. We hope to get this in for 25.4. The Jira label has already been applied.

msbutler avatar Jun 18 '25 15:06 msbutler

@msbutler Thanks! I noticed the Jira label of 25.4 was somehow not applied, so I went ahead and applied it.

dshjoshi avatar Jun 20 '25 15:06 dshjoshi

Dupe of https://github.com/cockroachdb/cockroach/issues/148447. Closing in favor of the open issue on the KV team.

rimadeodhar avatar Sep 23 '25 18:09 rimadeodhar