PEM agent elasticity
**Describe the bug**
I was trying to assess the elasticity of PEM agents with the following scenario, to see how adding/removing PEM pods in a cluster affects the functionality of Pixie.
**To Reproduce**
Steps to reproduce the behavior:
- Deploy PEM on a single node (using a node label, e.g. node.workload=pixie)
- Add a new node (with the same label) so that the DaemonSet controller launches a new PEM pod on it (see the sketch below)
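A minimal kubectl sketch of these two steps, assuming the PEM DaemonSet selects nodes via the node.workload=pixie label and that Pixie/Vizier is deployed in the default pl namespace (both assumptions; adjust to your setup):

```sh
# Label the first node so the PEM DaemonSet schedules a pod on it
# (assumes the DaemonSet's nodeSelector matches node.workload=pixie).
kubectl label nodes <node-1> node.workload=pixie

# Later, label a second node; the DaemonSet controller should launch
# a new PEM pod there automatically.
kubectl label nodes <node-2> node.workload=pixie

# Watch the PEM pods come up (namespace and pod naming are assumptions
# based on a default Pixie/Vizier deployment).
kubectl get pods -n pl -o wide -w | grep vizier-pem
```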
With the above two steps, here is the expected behavior:

**Expected behavior**
- The new agent is registered and detected by Pixie
- The new node and its pods, including the PEM pod, are instrumented and shown in the cluster script results (see the example checks below).
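One way to check both expectations, assuming the standard px CLI with its bundled px/agent_status and px/cluster scripts:

```sh
# List registered agents and their state; the new node's PEM should
# appear here shortly after the node is labeled.
px run px/agent_status

# Run the cluster script; the new node and all of its pods (including
# the PEM pod) should show up in the results.
px run px/cluster
```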
**Observed behavior**
- The new agent is registered just fine and can be seen in the control plane
- The new node shows up in the cluster script results, and so do all the pods on this new node except the PEM pod itself.
Update: After manually restarting the metadata Pod, the newly added PEM pod shows up in the cluster script results.
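For reference, a sketch of that manual restart, assuming the metadata service runs as the vizier-metadata StatefulSet in the pl namespace (names are assumptions based on a default Vizier deployment):

```sh
# Deleting the metadata pod makes the StatefulSet recreate it, which
# refreshes its view of the PEM pods (pod name is an assumption).
kubectl delete pod -n pl vizier-metadata-0
```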
As for the reverse:

**To Reproduce**
- Remove a node from the instrumented nodes (e.g. by removing the label, as shown below)
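A sketch of this step, again assuming the PEM DaemonSet selects on the node.workload=pixie label:

```sh
# Removing the label (note the trailing '-') takes the node out of the
# DaemonSet's selection, so its PEM pod is terminated.
kubectl label nodes <node-2> node.workload-
```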
**Expected behavior**
- The node is removed from the instrumented nodes
- The agent in the PEM pod on the removed node is deregistered and deleted by MDS
**Observed behavior**
- The node and the PEM pod are removed just fine
- The agent remains registered, is deemed UNHEALTHY, and hence the cluster script runs into a 400 or 500 error
Update: After manually restarting the metadata Pod, the newly removed agent is deregistered and the cluster script runs fine.
Is it possible for this whole cycle to be handled automatically by MDS?
Hey @MrAta ! All of the expected behavior should already be handled by MDS. I gave it a try on a smaller 2-node cluster (scaling down the cluster to 1 node, then back up to 2), and saw the expected results for both cases.
I wonder if this is related to a scalability issue, for example if there are many more nodes or K8s resources involved. We can try to address this in some of our upcoming scalability work.