chaos-controller

Limit Safemode List Digest

Open Azoam opened this issue 3 years ago • 1 comment

What does this PR do?

  • [ ] Adds new functionality
  • [ ] Alters existing functionality
  • [X] Fixes a bug
  • [ ] Improves documentation or testing

Please briefly describe your changes as well as the motivation behind them:

  • We found that the safemode feature, which checks cluster information to determine blast radius, was using a lot of memory and causing pods to be OOM-killed. This PR is a potential fix: it limits how many objects are digested at a time, so that memory should no longer fill up.

Regarding the branch name containing etcd: the original idea was to use etcd for the object count lookup, but this fell through, and we now use the limiting functionality of the client API.

Code Quality Checklist

  • [X] The documentation is up to date.
  • [X] My code is sufficiently commented and passes continuous integration checks.
  • [X] I have signed my commit (see Contributing Docs).

Testing

  • When this bug was first observed, the injector pods were OOMing or timing out. I tested this branch in an environment with 10k+ pods, and no OOMing or timing out occurred.

Azoam avatar Jul 21 '22 18:07 Azoam

I had hoped there was a way to query the API for just the total count, not a full list, but it seems that's lacking

From my understanding and research, there is a way to query for it, but it requires specific configuration of the Kubernetes environment, which not everyone may have. Those who do have it configured may also have it locked behind administrator permissions. So it seems like a safe bet to use the generic client, which should be generally accessible.

Azoam avatar Aug 02 '22 17:08 Azoam