prometheus-engine

Target status reporting

Open TheSpiritXIII opened this issue 3 years ago • 5 comments

This PR adds target polling and aggregation into the pod monitoring status subresource.

By default, polling happens every 10 seconds with 4 concurrent jobs, but this can be configured. We also record metrics for the total time taken so we can adjust these settings appropriately.
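As a rough illustration of the bounded-concurrency polling described above, here is a minimal sketch in Go. It is not the code from this PR: `pollAll`, the target names, and the `poll` callback are all hypothetical, and a buffered channel is used as a semaphore to cap the number of in-flight polls.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pollAll calls poll once per target, running at most `concurrency` calls at
// a time, and returns the per-target results plus the total wall-clock time.
// All names here are hypothetical and not taken from prometheus-engine.
// concurrency must be at least 1, otherwise the semaphore send blocks forever.
func pollAll(targets []string, concurrency int, poll func(string) string) (map[string]string, time.Duration) {
	start := time.Now()
	results := make(map[string]string, len(targets))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, concurrency) // counting semaphore

	for _, t := range targets {
		wg.Add(1)
		sem <- struct{}{} // blocks while `concurrency` polls are in flight
		go func(t string) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when this poll finishes
			r := poll(t)
			mu.Lock()
			results[t] = r
			mu.Unlock()
		}(t)
	}
	wg.Wait()
	return results, time.Since(start)
}

func main() {
	targets := []string{"collector-a", "collector-b", "collector-c"}
	res, elapsed := pollAll(targets, 4, func(t string) string { return "up" })
	fmt.Println(len(res), "targets polled in", elapsed)
}
```

The returned duration is the kind of value that could be exported as a "total time taken" metric and used to tune the interval and concurrency.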

Performance

Example application GKE performance tests:

3 nodes, 2 pods each, 6 pods total -> ~25 ms
20 nodes, 32 pods each, 640 pods total -> ~180 ms
50 nodes, 2 pods each, 100 pods total -> ~1200 ms

Polling time scales well as the number of pods increases, but less well as the number of nodes increases.

This is because there are exactly 2 collector pods per node and we hit each collector, so more nodes means more collectors.
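To make the node-scaling argument concrete, the sketch below derives the collector count and a rough per-collector cost from the test figures above. It assumes the stated 2 collectors per node and that the measured time is dominated by collector requests; the function names are hypothetical, and the division ignores concurrency, so treat the result as a comparison aid only.

```go
package main

import "fmt"

// collectorsPerNode follows the statement in this thread that there are
// exactly 2 collector pods per node.
const collectorsPerNode = 2

// collectors returns how many collectors must be polled for a node count.
func collectors(nodes int) int {
	return nodes * collectorsPerNode
}

// msPerCollector divides the measured total poll time by the collector count.
// This is a rough figure: it ignores concurrency and fixed overhead.
func msPerCollector(totalMs float64, nodes int) float64 {
	return totalMs / float64(collectors(nodes))
}

func main() {
	// Node counts and total times taken from the performance tests above.
	runs := []struct {
		nodes   int
		totalMs float64
	}{
		{3, 25},
		{20, 180},
		{50, 1200},
	}
	for _, r := range runs {
		fmt.Printf("%d nodes -> %d collectors, ~%.1f ms per collector\n",
			r.nodes, collectors(r.nodes), msPerCollector(r.totalMs, r.nodes))
	}
}
```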

TheSpiritXIII avatar Aug 08 '22 20:08 TheSpiritXIII

"This is because there are exactly 2 collector pods per node and we hit each collector, so more nodes means more collectors." -- do we have to hit both? IIRC they are replicas so we could just hit one, in which case we might ~double node scalability. Thoughts?

lyanco avatar Aug 08 '22 20:08 lyanco

Does this also support polling for ClusterPodMonitoring? Or is that later?

lyanco avatar Aug 08 '22 20:08 lyanco

do we have to hit both?

I'll double-check this, but you're right. We do not. If they're replicas, there may be duplicate data in the current implementation.
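If the collectors do turn out to report the same targets, one way to avoid duplicates is to merge results by a target key. The following is a minimal sketch under that assumption; `TargetResult` and `dedupe` are hypothetical and simplified, not the types used in prometheus-engine.

```go
package main

import "fmt"

// TargetResult is a hypothetical, simplified view of one scrape target as
// reported by a collector.
type TargetResult struct {
	Job      string
	Instance string
	Health   string
}

// dedupe merges results from several collectors, keeping the first result
// seen for each (Job, Instance) pair and preserving first-seen order.
func dedupe(perCollector ...[]TargetResult) []TargetResult {
	type key struct{ job, instance string }
	seen := make(map[key]bool)
	var out []TargetResult
	for _, results := range perCollector {
		for _, r := range results {
			k := key{r.Job, r.Instance}
			if seen[k] {
				continue // already reported by an earlier collector
			}
			seen[k] = true
			out = append(out, r)
		}
	}
	return out
}

func main() {
	a := []TargetResult{{"app", "10.0.0.1:8080", "up"}, {"app", "10.0.0.2:8080", "down"}}
	b := []TargetResult{{"app", "10.0.0.1:8080", "up"}}
	fmt.Println(len(dedupe(a, b)), "unique targets")
}
```

Keeping the first result is an arbitrary choice here; if replicas can disagree on health, the merge rule would need to be decided explicitly.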

Does this also support polling for ClusterPodMonitoring?

Yes, with full unit test and integration test suites included for both PodMonitoring and ClusterPodMonitoring.

TheSpiritXIII avatar Aug 08 '22 20:08 TheSpiritXIII

I imagine with very large clusters, there may be a significant delay (~1m) after applying the CR before the status is correct. This could lead to people thinking their CR wasn't picked up correctly if they check this too soon after applying. Is there already, or could we add, a "last completed" time to this?

lyanco avatar Aug 08 '22 20:08 lyanco

I imagine with very large clusters, there may be a significant delay (~1m) after applying the CR before the status is correct.

To clarify: polling is triggered every 10 seconds, but if the previous poll is still in progress, that "tick" is skipped. So in this case we can expect an update roughly each minute. Let me know if you'd like tweaks to this.

Is there already, or could we add, a "last completed" time to this?

Correct, we have a field named LastUpdateTime.

TheSpiritXIII avatar Aug 08 '22 21:08 TheSpiritXIII