
[Standard] Stabilize node distribution standard

Open cah-hbaum opened this issue 1 year ago • 6 comments

Follow-up for https://github.com/SovereignCloudStack/standards/pull/524. The goal is to set the Node distribution standard to Stable after all discussion topics have been debated and decided and the necessary changes derived from these discussions have been integrated into the Standard and its test.

The following topics need to be discussed:

  • [ ] How is node distribution handled on installations with shared control-plane nodes (Kamaji, Gardener, etc.)? - see e.g. https://github.com/SovereignCloudStack/standards/pull/524#pullrequestreview-2122476212
  • [ ] What should be done about control planes with e.g. 3 nodes containing 3 etcd members that are distributed over only 2 physical machines, and about similar scenarios such as control planes with e.g. 3 control-plane nodes and 2 etcd nodes - see e.g. https://github.com/SovereignCloudStack/standards/pull/524#discussion_r1642411303
  • [ ] Where do we draw the line between Node distribution and concepts like High Availability or Redundancy? Should this standard only be a precursor to a "High Availability" standard? (more information under #579)
  • [ ] Should information about external etcd (https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/ha-topology/#external-etcd-topology) be integrated here? (see https://github.com/SovereignCloudStack/standards/pull/524#discussion_r1642540079)

cah-hbaum avatar Jun 17 '24 11:06 cah-hbaum

Topic 1: How is node distribution handled on installations with shared control-plane nodes?

e.g. Kamaji, Gardener, etc

This question was answered in Container Call 2024-06-27:

  • Standard Kamaji case: dedicated control-plane components with a shared etcd, everything hosted in K8s (no dedicated nodes); etcd is deployed with anti-affinity (kube-scheduler tries to spread it across nodes). The relation of the nodes to each other is unknown to K8s.
  • Gardener:
    • Non-HA: single-replica control plane (dedicated, but hosted in a shared seed cluster).
    • HA: multiple replicas, hosted in the seed cluster but with awareness to tolerate zone or node failures: https://gardener.cloud/docs/guides/high-availability/control-plane/#node-failure-tolerance

For example, regiocloud supports the Node Failure Tolerance case but not the Zone Failure Tolerance.
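
For reference, if I read the linked Gardener guide correctly, the failure-tolerance mode is selected in the Shoot spec roughly like this (a sketch, not verified against a live Gardener installation; the shoot name is made up):

apiVersion: core.gardener.cloud/v1beta1
kind: Shoot
metadata:
  name: example-shoot   # hypothetical name
spec:
  controlPlane:
    highAvailability:
      failureTolerance:
        type: node      # 'node' = Node Failure Tolerance, 'zone' = Zone Failure Tolerance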

cah-hbaum avatar Jun 24 '24 09:06 cah-hbaum

Topic 2: Differentiation between Node distribution and things like High Availability, Redundancy, etc.

I think that to discuss this topic properly, most of the wording/concepts need to be established first. I'm going to try to find multiple (and, where possible, different) sources for the various terms and link them here.


High Availability

The main goal of HA is to avoid downtime, which is the period of time when a system, service, application, cloud service, or feature is either unavailable or not functioning properly. (https://www.f5.com/glossary/high-availability)

High availability means that an IT system, component, or application can operate at a high level, continuously, without intervention, for a given time period. ... (https://www.cisco.com/c/en/us/solutions/hybrid-work/what-is-high-availability.html)

High availability means that we eliminate single points of failure so that should one of those components go down, the application or system can continue running as intended. In other words, there will be minimal system downtime — or, in a perfect world, zero downtime — as a result of that failure. (https://www.mongodb.com/resources/basics/high-availability)

So things termed High Availability generally try to avoid downtime of their services, with the goal of zero downtime, which is usually not achievable. This can also be seen in this section: ... In fact, this concept is often expressed using a standard known as "five nines," meaning that 99.999% of the time, systems work as expected. This is the (ambitious) desired availability standard that most of us are aiming for. ... (https://www.mongodb.com/resources/basics/high-availability). To achieve these goals, services, hardware or networks are usually provided in a redundant setup, which allows automatic fail-over if instances go down.
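
(For scale: 99.999% availability allows at most 0.001% downtime, i.e. roughly 365.25 × 24 × 60 minutes × 0.00001 ≈ 5.3 minutes of downtime per year.)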


Redundancy

In engineering and systems theory, redundancy is the intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system... (https://en.wikipedia.org/wiki/Redundancy_(engineering))

In cloud computing, redundancy refers to the duplication of certain components or functions of a system with the intention of increasing its reliability and availability. (https://www.economize.cloud/glossary/redundancy)

HINT: WILL BE CONTINUED LATER

cah-hbaum avatar Jun 25 '24 11:06 cah-hbaum

I brought this issue up in today's Team Container Call and edited the above sections accordingly. As part of #649 we will also get access to Gardener and soon Kamaji clusters.

One thing I want to make you aware of @cah-hbaum: in the call, it was pointed out that the term shared control-plane isn't correct. The control-plane isn't shared; instead, the control-plane nodes are shared, and thus we should always say shared control-plane node.

(I edited above texts accordingly as well to refer to shared control-plane nodes.)

martinmo avatar Jun 27 '24 09:06 martinmo

Another potential problem with the topology.scs.community/host-id label:

The concept of using the "host-id" may not play nice with VM live migrations.

I do not have any operational experience with e.g. OpenStack live migrations (who triggers them, when, ...?), but I guess that any provider-initiated live migration (which might be standard practice within zones, I guess) would invalidate any scheduling decision that Kubernetes made based on the "host-id" label. As Kubernetes does not reevaluate scheduling decisions, pods may end up on the same host anyway (if the label even gets updated). That in turn may be worked around by using the Kubernetes descheduler project.

If I did not miss anything, I guess there are roughly the following options:

  • Rule out live migrations
  • Remove the "host-id" label requirement
  • Specify how certain scenarios should play out in the standard (e. g. requiring descheduler)
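
To illustrate the kind of scheduling decision I mean: a minimal sketch of a workload using the label as anti-affinity topology key (all names are made up):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example
    spec:
      affinity:
        podAntiAffinity:
          # evaluated only at scheduling time; a later change of the
          # host-id label (e.g. after a live migration) is not re-checked
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app: example
            topologyKey: topology.scs.community/host-id
      containers:
      - name: pause
        image: registry.k8s.io/pause:3.9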

joshmue avatar Aug 08 '24 14:08 joshmue

@joshmue Thanks for bringing this to our attention!

So let me try to get this straight.

  • We want environments to use some kind of anti-affinity for their control-plane nodes.
  • We need some kind of transparency so we can check for compliance.
  • Our idea with the host-id label somehow doesn't play well with live migrations.

I think I still don't quite understand what happens in case of a live migration. I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live". But would k8s even notice anything about that? What would the process look like?

I think I also don't quite understand how the node distribution is implemented. I suppose two levels of anti-affinity would be required:

  1. for the VMs to be scheduled on different hosts
  2. for the control-plane nodes (or, rather, pods?) to be scheduled on different VMs

How does the host-id label play into this process?

mbuechse avatar Aug 16 '24 11:08 mbuechse

Still, "I do not have any operational experience with e. g. Openstack live migrations", but AFAIK:

I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live".

Yes (not only Control Plane nodes, though).

But would k8s even notice anything about that? What would the process look like?

Exactly that is the problem: Kubernetes would not (per-se) notice anything about that and the process would be undefined.

How does the host-id label play into this process?

Generally, not well, as relying on it for Pod scheduling (instead of e.g. topology.kubernetes.io/zone) may undermine the whole point of anti-affinity for HA - if live migrations do happen as I imagine them.

joshmue avatar Aug 19 '24 09:08 joshmue

Please keep in mind that I am researching this topic from scratch, but after some digging I was able to find some useful information here: https://trilio.io/kubernetes-disaster-recovery/kubernetes-on-openstack/.

I will post some questions once I know a little bit more about the topic.

piobig2871 avatar Oct 25 '24 12:10 piobig2871

@joshmue Thanks for bringing this to our attention!

So let me try to get this straight.

* We want environments to use some kind of anti-affinity for their control-plane nodes.

* We need some kind of transparency so we can check for compliance.

* Our idea with the `host-id` label somehow doesn't play well with live migrations.

I think I still don't quite understand what happens in case of a live migration. I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live". But would k8s even notice anything about that? What would the process look like?

I think I also don't quite understand how the node distribution is implemented. I suppose two levels of anti-affinity would be required:

1. for the VMs to be scheduled on different hosts

2. for the control-plane nodes (or, rather, pods?) to be scheduled on different VMs

Placing at most one control-plane node on each physical machine helps the cluster tolerate faults: the anti-affinity policies separate key components across various machines, which in effect reduces the impact of a single failure. Also, around such distribution requirements, certain checks have to be put in place to ensure that the node distribution standard is met.

While the host-id label is helpful for distinguishing physical hosts, it can pose difficulties with live migrations, mainly because it does not automatically reflect a node's relocation.

Instead of a host-based label, we could use a label that designates a 'logical group' or 'cluster zone' as a software construct, which can remain stable across migrations within the respective cluster. Such a label would be less strict and would play better with live migration and eviction (a rough sketch follows below).

  1. https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
  2. https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
  3. https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/

This is what the theory says, at least.
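
A rough sketch of what I mean, assuming the CSP sets topology.kubernetes.io/zone (or a similar logical-group label) on every node; all names are made up:

apiVersion: v1
kind: Pod
metadata:
  name: spread-example
  labels:
    app: spread-example
spec:
  # spread over the logical zone label instead of a per-host label;
  # the zone label is expected to stay stable across live migrations
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: spread-example
  containers:
  - name: pause
    image: registry.k8s.io/pause:3.9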

EDIT: I have found a problem with what I have written here, because I hadn't taken into consideration that our K8s is standing on OpenStack instances with separate hardware nodes.

piobig2871 avatar Nov 05 '24 11:11 piobig2871

Topic 1

I have been able to install a tenant control plane using Kamaji, but there are several steps that have to be done before that can happen.

  1. Create Kind cluster with kind create cluster --name kamaji
  2. Install cert-manager:
helm repo add bitnami https://charts.bitnami.com/bitnami
helm upgrade --install cert-manager bitnami/cert-manager \
    --namespace certmanager-system \
    --create-namespace \
    --set "installCRDs=true"
  3. Install MetalLB:
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.13.7/config/manifests/metallb-native.yaml

This installation is performed using a manifest; I am leaving a link here to get to know the documentation.
  4. Now what we have to do is create the IP address pool that is required to get real IPs. Since we are running on kind, I needed to extract the gateway IP of the kind network that I am running on:

GW_IP=$(docker network inspect -f '{{range .IPAM.Config}}{{.Gateway}}{{end}}' kind)
NET_IP=$(echo ${GW_IP} | sed -E 's|^([0-9]+\.[0-9]+)\..*$|\1|g')
  5. Right now we can create the kind-ip-pool by applying this script:
cat <<EOF | sed -E "s|172.19|${NET_IP}|g" | kubectl apply -f -
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: kind-ip-pool
  namespace: metallb-system
spec:
  addresses:
  - 172.19.255.200-172.19.255.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: empty
  namespace: metallb-system
EOF
  6. After this initial setup I was able to install Kamaji:
helm repo add clastix https://clastix.github.io/charts
helm upgrade --install kamaji clastix/kamaji --namespace kamaji-system --create-namespace --set 'resources=null'
  7. And create the tenant control plane with kubectl apply -f https://raw.githubusercontent.com/clastix/kamaji/master/config/samples/kamaji_v1alpha1_tenantcontrolplane.yaml
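
To verify the result, I checked the created resources roughly like this (a sketch; if I remember correctly, the sample TenantControlPlane lands in the default namespace and an admin kubeconfig Secret is created next to it):

# list the tenant control planes and their status
kubectl get tenantcontrolplane -A

# the tenant control-plane components run as ordinary pods/services in the
# management cluster, so their placement follows normal kube-scheduler rules
kubectl get pods,svc -n default

Which is also the relevant bit for Topic 1: the distribution of such a shared control plane is decided by the scheduling (anti-affinity) of those pods in the management cluster, not by dedicated nodes.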

piobig2871 avatar Nov 11 '24 11:11 piobig2871

So let me try to get this straight.

* We want environments to use some kind of anti-affinity for their control-plane nodes.

* We need some kind of transparency so we can check for compliance.

* Our idea with the `host-id` label somehow doesn't play well with live migrations.

I think I still don't quite understand what happens in case of a live migration. I assume that the control-plane nodes are running on virtual machines managed by OpenStack, and such a virtual machine could be migrated "live". But would k8s even notice anything about that? What would the process look like?

I think we are trying to solve for a special case here. Live migrations don't happen all that often. In a standard OpenStack setup, you would achieve control-plane node VMs not ending up on the same hypervisor host by anti-affinity rules. The good news is that those same rules are evaluated by the scheduler (placement service) when choosing a new host on live migration. So unless something really strange happens, the guarantees after live migrations are the same as they were before. The host-id labels would be wrong now, which is somewhat ugly, but they still correctly indicate that we're not on the same hypervisor host. Now what could happen is that the initial node distribution ended up on different hypervisor hosts just by coincidence (and not systematically by anti-affinity), so live migration could change that. In that case, statistics would make this setup break also in the initial setup sooner or later, so this would not go undetected.

I plead for ignoring live migration.

I think I also don't quite understand how the node distribution is implemented. I suppose two levels of anti-affinity would be required:

1. for the VMs to be scheduled on different hosts

2. for the control-plane nodes (or, rather, pods?) to be scheduled on different VMs

The control plane node is a VM. So there is only one dimension.
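
(For illustration, a minimal sketch of that single dimension on the IaaS side; names, flavor and image are placeholders, and 'soft-anti-affinity' exists as a best-effort variant of the policy:)

# create a server group with a hard anti-affinity policy
openstack server group create --policy anti-affinity k8s-control-plane

# schedule each control-plane VM into that group; the placement service then
# refuses to put two of them on the same hypervisor host, also when choosing
# a target host for a live migration
openstack server create --flavor m1.large --image ubuntu-22.04 \
  --hint group=<server-group-uuid> cp-node-1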

garloff avatar Nov 11 '24 22:11 garloff

In a standard OpenStack setup, you would achieve control-plane node VMs not ending up on the same hypervisor host by anti-affinity rules.

When speaking about usual workload K8s nodes, this would mean that there can only be either...

  • active Openstack anti-affinities; Maximum number of K8s nodes is limited by number of hypervisors
  • inactive Openstack anti-affinities; K8s is left making potentially wrong scheduling choices based on potentially outdated "host-id" labels

Right?

joshmue avatar Nov 12 '24 12:11 joshmue

In a standard OpenStack setup, you would achieve control-plane node VMs not ending up on the same hypervisor host by anti-affinity rules.

When speaking about usual workload K8s nodes, this would mean that there can only be either...

* active Openstack anti-affinities; Maximum number of K8s nodes is limited by number of hypervisors

With active anti-affinity, no two Kubernetes nodes can run on a single hypervisor. So yes, that's right. Once you reach this limit, you either need to add more hypervisors or re-evaluate the anti-affinity rule.

* inactive Openstack anti-affinities; K8s is left making potentially wrong scheduling choices based on potentially outdated "host-id" labels

Kubernetes may make scheduling decisions based on outdated host-id labels if any VMs are migrated. This could lead to clusters where multiple nodes end up on the same hypervisor, potentially creating unexpected single points of failure.

piobig2871 avatar Nov 14 '24 15:11 piobig2871

:+1:

What I'm trying to get at is this:

If there are active OpenStack anti-affinities, there is no use-case for a "host-id" node label to begin with. If node and hypervisor have a 1:1 (or 0:1) relationship, K8s pod anti-affinities can just target kubernetes.io/hostname.

If there are no OpenStack anti-affinities, ...

Kubernetes may make scheduling decisions based on outdated host-id labels if any VMs are migrated. This could lead to clusters where multiple nodes end up on the same hypervisor, potentially creating unexpected single points of failure.

In conclusion, given live migrations may happen occasionally, I do not see any use case for this label.

joshmue avatar Nov 15 '24 08:11 joshmue

The labels are not meant to influence scheduling in any way. They are meant to make scheduling transparent to the end user.

mbuechse avatar Nov 15 '24 08:11 mbuechse

The labels are not meant to influence scheduling in any way. They are meant to make scheduling transparent to the end user.

Sure?

The current standard says:

Worker node distribution MUST be indicated to the user through some kind of labeling in order to enable (anti)-affinity for workloads over "failure zones".

...and then goes on to describe topology.kubernetes.io/zone, topology.kubernetes.io/region and topology.scs.community/host-id in this context.

This concerns K8s scheduling, not OpenStack scheduling, of course.

joshmue avatar Nov 15 '24 10:11 joshmue

The relevant point (and the one that describes the labels) is

To provide metadata about the node distribution, which also enables testing of this standard, providers MUST label their K8s nodes with the labels listed below.

mbuechse avatar Nov 15 '24 10:11 mbuechse

I must know, because I worked with Hannes on that, and we added this mostly because we needed the labels for the compliance test.

mbuechse avatar Nov 15 '24 10:11 mbuechse

So the standard basically is intended to say:

  • We REQUIRE some sort of labeling in order to enable (anti)-affinity for workloads over "failure zones". We will not standardize them, though.
  • On an unrelated note, we REQUIRE labels which are usually used for anti-affinity (the ones defined by upstream, anyway), but they should not be used for anti-affinity.

?

joshmue avatar Nov 15 '24 10:11 joshmue

I'm not competent to speak on scheduling. It's well possible that these labels are ALSO used for scheduling. In the case of region and availability zone, this is probably true. Question is: how does host-id play into this?

mbuechse avatar Nov 15 '24 10:11 mbuechse

I think that I see where you're coming from, having a focus on compliance testing. Do you see my point that it requires a great deal of imagination to interpret the standard like it was intended from a general POV?

Question is: how does host-id play into this?

I do not think it should, because of the reasons above. I also guess compliance tests should reflect the requirements of a standard, and AFAIU the standard does not forbid placing multiple nodes on the same host. ~~unless~~ If a CSP considers a host to be a "failure zone", they could also put the host-id into topology.kubernetes.io/zone - and then also have problems with live migration and the K8s recommendation of...

It should be safe to assume that topology labels do not change. Even though labels are strictly mutable, consumers of them can assume that a given node is not going to be moved between zones without being destroyed and recreated.

https://kubernetes.io/docs/reference/labels-annotations-taints/#topologykubernetesiozone

joshmue avatar Nov 15 '24 11:11 joshmue

Standard says:

The control plane nodes MUST be distributed over multiple physical machines.

So we need to be able to validate that.

It also says

At least one control plane instance MUST be run in each "failure zone"

But you could have only one failure zone. Then, still, the control plane nodes must be distributed over multiple physical hosts.

The host-id field is not necessarily meant for scheduling (particularly for the control plane, where the user cannot schedule anything, right)?

Does that make sense?

mbuechse avatar Nov 15 '24 11:11 mbuechse

BTW, I'm open to improving the wording to avoid any misunderstanding here. At this point, though, we first have to agree on what's reasonable at all.

mbuechse avatar Nov 15 '24 11:11 mbuechse

The control plane nodes MUST be distributed over multiple physical machines.

Did not see that, actually!

Still, let's go through some cases of what "failure zone" may mean:

  • zone equals one of many co-located buildings
  • zone equals one of many rooms within a building
  • zone equals one of many racks within a room
  • zone equals one of many machines within a rack

If topology.kubernetes.io/zone is defined as any of these things, it can be used to test the standard and the above requirement is satisfied (in a world where a single VM is always local to one hypervisor at any point in time).

Theoretically, one may define "failure zone" as something like:

  • zone equals one of many isolation groups within a machine

But the standard already implicitly says that the smallest imaginable unit is a single ~~unit~~ machine.

Zones could be set from things like single machines or racks up to whole datacenters or even regions

EDIT: But yes, introducing this specific requirement may be a bit confusing, given that the other wording refers to logical failure zones. And compliance with it can only be checked by having a "host-id" label with some strict definition - or (better) by defining that topology.kubernetes.io/zone must correspond to at least a physical machine.

joshmue avatar Nov 15 '24 12:11 joshmue

But you could have only one failure zone.

I see that this is not explicitly forbidden in the standard, but all the text hints towards it being forbidden, so I assumed it:

It is therefore necessary for important data or services to not be present just on one failure zone

At least one control plane instance MUST be run in each "failure zone"

Since some providers only have small environments to work with and therefore couldn't comply with this standard, it will be treated as a RECOMMENDED standard, where providers can OPT OUT.

joshmue avatar Nov 15 '24 12:11 joshmue

Theoretically, one may define "failure zone" as something like:

* zone equals one of many isolation groups within a machine

Like a network?

But you could have only one failure zone.

I see that this is not explicitly forbidden in the standard, but all the texts hints towards it being forbidden, so I assumed it:

It is therefore necessary for important data or services to not be present just on one failure zone

I have thought about it as: we have 1 failure zone per control plane, and the workers may be distributed over different machines, physical or virtual.

At least one control plane instance MUST be run in each "failure zone"

As mentioned here.

piobig2871 avatar Nov 15 '24 13:11 piobig2871

Well. It seems that the concepts of failure zone and physical host are a bit at odds.

From the Kubernetes POV two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean?

If that's all true, then I'm wondering why the hosts have been introduced in the first place. There must have been discussions about that in Team Container with intelligent and experienced people involved.

mbuechse avatar Nov 15 '24 21:11 mbuechse

Theoretically, one may define "failure zone" as something like:

* zone equals one of many isolation groups within a machine

Like a network?

I just wanted to give an example of a theoretically viable, yet hypothetical runtime unit within a single machine.

Well. It seems that the concepts of failure zone and physical host are a bit at odds.

From the Kubernetes POV two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean?

Yes. CSPs with hosts as failure zones would still have problems with live migrations and the assumption that topology labels do not change, but by removing the "host-id" requirement, this problem should be exclusive to such small/tiny providers.

On another note, the recommendation here...

At least one control plane instance MUST be run in each "failure zone", more are RECOMMENDED in each "failure zone" to provide fault-tolerance for each zone.

does not seem to take etcd quorum and/or etcd scaling sweet spots into account ( https://etcd.io/docs/v3.5/faq/ ). But it does not strictly mandate questionable design choices (only slightly hints at them), so I will not go into too much detail, here.
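
(For reference: etcd needs a quorum of floor(n/2)+1 members, so 3 members tolerate 1 failure and 5 tolerate 2, while an even member count only raises the quorum without adding failure tolerance; that is why the FAQ recommends odd cluster sizes, usually 3 or 5.)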

joshmue avatar Nov 18 '24 09:11 joshmue

Well. It seems that the concepts of failure zone and physical host are a bit at odds.

From the Kubernetes POV two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean?

If that's all true, then I'm wondering why the hosts have been introduced in the first place. There must have been discussions about that in Team Container with intelligent and experienced people involved.

You raised an important point about the potential misalignment between the concepts of failure zones and physical hosts. AFAIU from Kubernetes' perspective, failure zones are abstract constructs defined to ensure redundancy and fault isolation. The actual granularity of these zones (e.g., a rack, a data center, or even an individual physical host) depends on the cloud service provider's (CSP's) design.

Kubernetes treats all nodes within a failure zone as equally vulnerable because the assumption is that a failure impacting one could potentially affect all others in the same zone. This approach is why zones matter more than individual hosts when scheduling workloads. For smaller CSPs, defining each host or rack as its own failure zone might be a practical approach to increase redundancy, especially when physical resources are limited. It aligns with your suggestion to mandate multiple zones while dropping specific focus on physical hosts.

At least one control plane instance MUST be run in each "failure zone", more are RECOMMENDED in each "failure zone" to provide fault-tolerance for each zone.

does not seem to take etcd quorum and/or etcd scaling sweet spots into account ( https://etcd.io/docs/v3.5/faq/ ). But it does not strictly mandate questionable design choices (only slightly hints at them), so I will not go into too much detail, here.

Etcd’s own documentation highlights the challenges of maintaining quorum and scalability in distributed systems, particularly as the cluster size increases beyond the optimal sweet spot of 3-5 nodes.

Right now I am wondering: what alternative strategies could be employed to balance the need for fault tolerance across failure zones while adhering to etcd's quorum and scaling best practices?

piobig2871 avatar Nov 18 '24 15:11 piobig2871

Well. It seems that the concepts of failure zone and physical host are a bit at odds.

From the Kubernetes POV two physical hosts within the same failure zone seem to be considered not much better than just one host. In other words, they just don't care that much about hosts. Failure zones can be defined by the CSP in any way they deem appropriate, so smaller CSPs could indeed say each host is a failure zone or each rack is a failure zone. It would probably be better to have multiple zones that are just hosts or racks than to have only one zone. Therefore, we could mandate to have multiple zones and then drop the whole part about the physical hosts (including the host-id label). Is that what you mean?

If that's all true, then I'm wondering why the hosts have been introduced in the first place. There must have been discussions about that in Team Container with intelligent and experienced people involved.

We have an availability zone standard (0121), you probably know it better than me. Many providers do not have several AZs, either because they are too small or because they use shared-nothing architectures with several regions rather than several AZs.

I would highly discourage now disconnecting the notion of infra-layer availability zones from "Failure Zones" in Kubernetes. A recipe for confusion.

Single hosts can fail for a variety of reasons, e.g. broken RAM or a broken PSU or a broken network port, or even just a regular maintenance operation (hypervisor or firmware upgrade). In a data center, these events happen much more often than the outage of a complete room/zone/AZ. We want to avoid one host taking down several control plane nodes in the cluster; that is the whole point of having several nodes in the first place. Yes, multi-AZ is nicer, but that is a luxury that we don't always have. Having multiple physical hosts is much better than not. If we can not succeed with an upstream host-id label, we have a difficult time testing this from within the cluster. We can still easily test this if we have access to the IaaS layer that hosts the cluster, of course. Not ideal, but no reason to drop the requirement, IMVHO.
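
(For example, assuming OpenStack underneath: the compute API exposes a project-scoped hash of the hypervisor host for every server, so this can be compared across the control-plane VMs without admin rights. A rough sketch, where the server names are placeholders and the exact field name may vary between client versions:)

# equal hostId values => the VMs share a hypervisor host
openstack server show cp-node-1 -c hostId -f value
openstack server show cp-node-2 -c hostId -f value
openstack server show cp-node-3 -c hostId -f value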

garloff avatar Nov 18 '24 15:11 garloff

Single hosts can fail for a variety of reasons, e.g. broken RAM or a broken PSU or a broken network port, or even just a regular maintenance operation (hypervisor or firmware upgrade). In a data center, these events happen much more often than the outage of a complete room/zone/AZ. We want to avoid one host taking down several control plane nodes in the cluster; that is the whole point of having several nodes in the first place. Yes, multi-AZ is nicer, but that is a luxury that we don't always have. Having multiple physical hosts is much better than not. If we can not succeed with an upstream host-id label, we have a difficult time testing this from within the cluster. We can still easily test this if we have access to the IaaS layer that hosts the cluster, of course. Not ideal, but no reason to drop the requirement, IMVHO.

With that comment, can we assume that the Node distribution and High Availability topics will be separated for the purposes of the standard? Would separate standards be clearer than creating corner cases within one?

piobig2871 avatar Nov 19 '24 11:11 piobig2871