
test: Add cilium monitor support

Open jrajahalme opened this issue 4 years ago • 5 comments

Add cilium monitor for each action.Run() to gain some ground truth for analyzing flow validation failures, where some flows seem to be missing from Hubble.

Start cilium monitors on each Cilium node right before each action run, and stop the monitors right after. Print the monitor output if any warnings or fails were logged during the action execution.

The monitor output is currently very verbose, as it includes all flows from the node for the duration of the test. We will need to add filtering going forward, but there are caveats:

  • filtering on IP could hide flows that have been incorrectly masqueraded
  • filtering on the port number would hide DNS flows, as well as NodePort and similar flows
  • there is background health-check traffic that adds to the verbosity, but health checks sometimes use the same port as the tests, so it is hard to filter it all out
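The port-filtering caveats above can be illustrated with a minimal Go sketch (hypothetical helper and sample data, not cilium-cli code): keeping only monitor lines that mention the test port drops the DNS lookup that preceded the connection, while a health check reusing the same port slips through the filter.

```go
package main

import (
	"fmt"
	"strings"
)

// keepPort is a hypothetical sketch of the naive filter warned about
// above: keep only monitor lines mentioning the given destination port.
// The trailing space avoids matching ":80" inside ":8080".
func keepPort(lines []string, port int) []string {
	needle := fmt.Sprintf(":%d ", port)
	var kept []string
	for _, l := range lines {
		if strings.Contains(l, needle) {
			kept = append(kept, l)
		}
	}
	return kept
}

func main() {
	// Made-up monitor output for illustration.
	monitor := []string{
		"10.0.0.1:34567 -> 10.0.0.2:80 tcp SYN", // test traffic
		"10.0.0.1:51234 -> 10.0.0.3:53 udp",     // DNS lookup: hidden by the filter
		"10.0.0.4:41000 -> 10.0.0.5:80 health",  // health check on the same port: not filtered out
	}
	for _, l := range keepPort(monitor, 80) {
		fmt.Println(l)
	}
}
```

The DNS flow disappears even though it is often exactly what a flow-validation failure needs, and the health-check noise stays because it shares the port with the test.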

jrajahalme avatar Jun 12 '21 03:06 jrajahalme

@ti-mo Thanks for the review. I farmed the changes out to smaller PRs before noticing your review; I'll pick up the applicable comments from above and apply them where needed.

We have some test flakes where the command succeeds, but flow validation fails to find the TCP SYN, for example. This may be a case of Hubble missing flows, or something else. Monitoring is added to obtain some ground truth for triaging such flow validation failures. Hopefully we do not need this long term, but we need some tooling now to understand what is going on.

jrajahalme avatar Jun 14 '21 21:06 jrajahalme

Hopefully we do not need this long term, but we need some tooling now to understand what is going on.

I understand, but could we at least find a solution that doesn't increase the complexity of the framework to enable a 'temporary' hack? (I'm quoting 'temporary' because when it comes to code, everything is permanent until deprecated :slightly_smiling_face:) This feels like quite an impactful change to me, although smaller PRs might make it more palatable. We should direct efforts towards troubleshooting and improving this in Hubble, without creating technical debt here.

ti-mo avatar Jun 15 '21 09:06 ti-mo

We should direct efforts towards troubleshooting and improving this in Hubble, without creating technical debt here.

I fully agree with Timo on this. Perhaps one way to go about this would be to add a test in Cilium that asserts whether Hubble and cilium monitor present exactly the same data or have some behavioural disparity. I believe deprecation of cilium monitor has been spoken of, and perhaps such a test could be used as a vehicle to drive cilium monitor down the road of deprecation. Attempting to test that behaviour here is far too indirect.

In the meantime, it might make sense to add a few simpler things here that would enable us to better understand the behaviour:

  • add flags to set utils.WaitParameters.{Timeout,RetryTimeout} https://github.com/cilium/cilium-cli/blob/39ac4f2103062dca5e5becc6ae13bf022846bfc6/connectivity/check/action.go#L521-L522
  • instrument utils.WaitObserver with a retry counter that can be observed by the user
  • add action retry logic and a flag to set maximum number of action retries

errordeveloper avatar Jun 16 '21 08:06 errordeveloper

@errordeveloper https://github.com/cilium/cilium-cli/pull/319 attempts to stream the logs in from Hubble instead of repeatedly fetching them from a ring buffer that will eventually get overwritten. The current approach is not reliable, so the issues we're seeing are somewhat expected under the current circumstances. :) Will try to get this done this week.
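The ring-buffer problem described above can be sketched with a toy buffer (this is not Hubble's actual implementation): once writes wrap around, the oldest flows are overwritten, so a reader that polls the buffer after the fact can miss exactly the flow it is validating, while a streaming consumer would have seen it as it arrived.

```go
package main

import "fmt"

// ring is a toy fixed-size ring buffer standing in for a flow buffer.
// Once full, each write overwrites the oldest entry.
type ring struct {
	buf  []string
	next int
	full bool
}

func newRing(n int) *ring { return &ring{buf: make([]string, n)} }

func (r *ring) add(flow string) {
	r.buf[r.next] = flow
	r.next = (r.next + 1) % len(r.buf)
	if r.next == 0 {
		r.full = true
	}
}

// dump returns the surviving flows in arrival order, as a late
// "fetch the buffer" reader would see them.
func (r *ring) dump() []string {
	if !r.full {
		return append([]string(nil), r.buf[:r.next]...)
	}
	return append(append([]string(nil), r.buf[r.next:]...), r.buf[:r.next]...)
}

func main() {
	r := newRing(3)
	for _, f := range []string{"SYN", "SYN-ACK", "ACK", "FIN"} {
		r.add(f)
	}
	// The initial SYN has been overwritten, so polling the buffer now
	// cannot find it: the same shape as the flow-validation flakes.
	fmt.Println(r.dump())
}
```

This is why streaming (as the PR above proposes) is structurally more reliable than repeated fetches: correctness no longer depends on winning a race against buffer wrap-around.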

ti-mo avatar Jun 16 '21 09:06 ti-mo

@ti-mo oh! I was assuming flows were being streamed...

errordeveloper avatar Jun 16 '21 09:06 errordeveloper