Rolling upgrade in a multi-region cluster

Open jseldess opened this issue 5 years ago • 7 comments

Jesse Seldess commented:

In Slack, @holtrdan asked for the best practice when upgrading a multi-region cluster. Which of these two approaches is preferable?

  1. Upgrade all nodes in a region before moving on to the other regions.
  2. Distribute the upgrade across regions evenly (node in region 1 > node in region 2 > node in region 3 > node in region 1 > node in region 2 > etc.)
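
To make the two orderings concrete, here is a minimal sketch of the node sequence each option would produce. The cluster layout, node names, and function names are hypothetical and only for illustration; they are not part of any CockroachDB tooling.

```python
from itertools import zip_longest

# Hypothetical layout: region name -> node names (illustrative only).
CLUSTER = {
    "region1": ["r1-n1", "r1-n2", "r1-n3"],
    "region2": ["r2-n1", "r2-n2", "r2-n3"],
    "region3": ["r3-n1", "r3-n2", "r3-n3"],
}


def region_by_region(cluster):
    """Option 1: finish every node in a region before moving to the next."""
    return [node for nodes in cluster.values() for node in nodes]


def interleaved(cluster):
    """Option 2: take one node from each region in turn."""
    order = []
    # zip_longest pads shorter regions with None, which is skipped below.
    for batch in zip_longest(*cluster.values()):
        order.extend(node for node in batch if node is not None)
    return order


if __name__ == "__main__":
    print("Option 1:", region_by_region(CLUSTER))
    print("Option 2:", interleaved(CLUSTER))
```

Both functions visit every node exactly once; they differ only in how long each region runs a mix of old and new versions.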

@BramGruneir suggested option 1. We should validate and add a note to our upgrade docs, e.g., here.

We should also define how application traffic factors into this. For option 1, for example, @dbist mentioned that a customer he's working with has no application traffic in a region while upgrading that region.

Jira Issue: DOC-591

Epic DOC-11047

jseldess avatar Jul 14 '20 14:07 jseldess

@taroface and @johnrk for triage and prioritization.

Also cc @joshimhoff: How do we handle CC multi-region cluster upgrades?

@bdarnell, any opinions?

jseldess avatar Jul 14 '20 16:07 jseldess

I would also tend to go region-by-region, although I don't think it makes a lot of difference and I'm mainly basing this on the fact that orchestration tooling is more likely to facilitate region-by-region upgrades instead of spreading it out more evenly.

> For option 1, for example, @dbist mentioned that a customer he's working with has no application traffic in a region while upgrading that region.

If you can drain all traffic from a region while upgrading, that's probably a good idea. But if you're using geo-partitioning, that won't really be an option since that region will still need to serve traffic for ranges that are pinned there.

bdarnell avatar Jul 14 '20 18:07 bdarnell

> If you can drain all traffic from a region while upgrading, that's probably a good idea. But if you're using geo-partitioning, that won't really be an option since that region will still need to serve traffic for ranges that are pinned there.

That's a good point. This particular customer does not use geo-partitioning because some of their clusters are on the core version.

dbist avatar Jul 14 '20 18:07 dbist

> I would also tend to go region-by-region, although I don't think it makes a lot of difference and I'm mainly basing this on the fact that orchestration tooling is more likely to facilitate region-by-region upgrades instead of spreading it out more evenly.

We go one node at a time, starting with region 1, then on to region 2, etc.

> If you can drain all traffic from a region while upgrading, that's probably a good idea. But if you're using geo-partitioning, that won't really be an option since that region will still need to serve traffic for ranges that are pinned there.

We don't do this on CC. We upgrade one node at a time; other nodes in the region keep serving.
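
As a rough sketch of that procedure (not the actual CC tooling), the loop below upgrades one node at a time, region by region, without draining the region. `upgrade_node` and `is_healthy` are hypothetical callables supplied by the caller, and waiting for a health check before moving on is an assumption added for illustration; the thread does not describe how readiness is verified.

```python
import time


def rolling_upgrade(cluster, upgrade_node, is_healthy,
                    poll_seconds=5.0, max_polls=60):
    """Upgrade nodes one at a time, finishing each region before the next.

    cluster:      dict mapping region name -> list of node names
    upgrade_node: hypothetical callable that upgrades and restarts one node
    is_healthy:   hypothetical callable returning True once a node is serving
    """
    upgraded = []
    for region, nodes in cluster.items():
        for node in nodes:
            # Only this node is taken down; the rest of the region keeps serving.
            upgrade_node(node)
            for _ in range(max_polls):
                if is_healthy(node):
                    break
                time.sleep(poll_seconds)
            else:
                # Stop rather than take a second node down (assumed policy).
                raise RuntimeError(
                    f"{node} in {region} did not become healthy; stopping"
                )
            upgraded.append(node)
    return upgraded
```

Stopping on the first unhealthy node keeps at most one node out of service at any time, which matches the one-node-at-a-time approach described above.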

joshimhoff avatar Jul 14 '20 18:07 joshimhoff

Relates to #5780.

taroface avatar Jul 14 '20 18:07 taroface

linville (mdlinville) commented: It sounds like the recommendation here is not to drain traffic from the region, but to go region by region and, within a region, node by node. If it’s working for CC, it seems like a safe recommendation. Is that correct? If so, I can get this recommendation into the docs.

exalate-issue-sync[bot] avatar Mar 10 '23 23:03 exalate-issue-sync[bot]

linville (mdlinville) commented: Bram Gruneir, coming back to this to see whether the situation is still the same as in the description, whether it has been validated, etc. Any pointers?

exalate-issue-sync[bot] avatar Jun 22 '23 21:06 exalate-issue-sync[bot]