PEM agent elasticity
**Describe the bug**
I was trying to assess the elasticity of PEM agents with the following scenario, to see how adding/removing PEM pods in a cluster affects the functionality of Pixie.
**To Reproduce**
Steps to reproduce the behavior:
- Deploy PEM on a single node (using a node label, e.g. node.workload=pixie)
- Add a new node (with the same label) so that the DaemonSet controller launches a new PEM pod on it (see the sketch below)
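A minimal kubectl sketch of these two steps, assuming the PEM DaemonSet selects nodes via the node.workload=pixie label and that Pixie/Vizier is deployed in the default pl namespace (both assumptions; adjust to your setup):

```sh
# Label the first node so the PEM DaemonSet schedules a pod on it
# (assumes the DaemonSet's nodeSelector matches node.workload=pixie).
kubectl label nodes <node-1> node.workload=pixie

# Later, label a second node; the DaemonSet controller should launch
# a new PEM pod there automatically.
kubectl label nodes <node-2> node.workload=pixie

# Watch the PEM pods come up (namespace and pod naming are assumptions
# based on a default Pixie/Vizier deployment).
kubectl get pods -n pl -o wide -w | grep vizier-pem
```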
With the above two steps, here is the expected behavior:

**Expected behavior**
- The new agent is registered and detected by Pixie
- The new node and its pods, including the PEM pod, are instrumented and shown in the cluster script results (see the example checks below).
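One way to check both expectations, assuming the standard px CLI with its bundled px/agent_status and px/cluster scripts:

```sh
# List registered agents and their state; the new node's PEM should
# appear here shortly after the node is labeled.
px run px/agent_status

# Run the cluster script; the new node and all of its pods (including
# the PEM pod) should show up in the results.
px run px/cluster
```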
**Observed behavior**
- The new agent is registered just fine and can be seen in the control plane
- The new node shows up in the cluster script results, and so do all the pods on this new node except the PEM pod itself.
Update: After manually restarting the metadata Pod, the newly added PEM pod shows up in the cluster script results.
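For reference, a sketch of that manual restart, assuming the metadata service runs as the vizier-metadata StatefulSet in the pl namespace (names are assumptions based on a default Vizier deployment):

```sh
# Deleting the metadata pod makes the StatefulSet recreate it, which
# refreshes its view of the PEM pods (pod name is an assumption).
kubectl delete pod -n pl vizier-metadata-0
```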
As for the reverse:

**To Reproduce**
- Remove a node from the instrumented nodes (e.g. by removing the label, as shown below)
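A sketch of this step, again assuming the PEM DaemonSet selects on the node.workload=pixie label:

```sh
# Removing the label (note the trailing '-') takes the node out of the
# DaemonSet's selection, so its PEM pod is terminated.
kubectl label nodes <node-2> node.workload-
```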
**Expected behavior**
- The node is removed from the instrumented nodes
- The agent in the PEM pod on the removed node is deregistered and deleted by MDS
**Observed behavior**
- The node and the PEM pod are removed just fine
- The agent remains registered, is deemed UNHEALTHY, and hence the cluster script runs into a 400 or 500 error
Update: After manually restarting the metadata Pod, the newly removed agent is deregistered and the cluster script runs fine.
Is it possible for this whole cycle to be handled automatically by MDS?
Hey @MrAta ! All of the expected behavior should already be handled by MDS. I gave it a try on a smaller 2-node cluster (scaling down the cluster to 1 node, then back up to 2), and saw the expected results for both cases.
I wonder if this is related to a scalability issue, for example if there are many more nodes or K8s resources involved. We can try to address this in some of our upcoming scalability work.