envoy icon indicating copy to clipboard operation
envoy copied to clipboard

SlowStart: reject small aggression configs

Open adisuissa opened this issue 1 year ago • 4 comments

Commit Message: SlowStart: reject small aggression configs Additional Description: The SlowStartConfig has an aggression field, that if set to too small may cause the weights to be very small. In this PR we limit the aggression value to be GTE than 1e-30, which I think should be small enough for a reasonable value.

Risk Level: low Testing: Added integration tests, and fuzz test cases. Docs Changes: N/A Release Notes: Added. Platform Specific Features: N/A

Fixes fuzz issues: 66847 and 66837

adisuissa avatar Feb 21 '24 20:02 adisuissa

Assigning @nezdolik as slow-start domain expert. /assign @nezdolik

adisuissa avatar Feb 21 '24 20:02 adisuissa

We do have a slow_start_min_weight_percent_ mechanism in place to protect against generating small weights in slow start and consequent endpoint starvation, e.g.:

double EdfLoadBalancerBase::applySlowStartFactor(double host_weight, const Host& host) const {
 ...
      return host_weight * std::max(applyAggressionFactor(time_factor, aggression),
                                    slow_start_min_weight_percent_);

slow_start_min_weight_percent_ defaults to 0.1, so unless user defines a smaller value for that param, any aggression value that would scale time factor to a value < 0.1, will not have effect and formula would fall back to:

host_weight * slow_start_min_weight_percent_

Thanks for pointing this out, I wasn't aware of that knob. I think there might be a problem here if one sets it to 0 or a very low-value, no?

adisuissa avatar Feb 22 '24 21:02 adisuissa

Right, if someone sets very low value, it could lead to endpoint starvation. Do we need to have programmatic knob to protect against such cases? Since user controls this configuration, we could instead update the docs with recommendations?

nezdolik avatar Feb 22 '24 22:02 nezdolik

Right, if someone sets very low value, it could lead to endpoint starvation. Do we need to have programmatic knob to protect against such cases? Since user controls this configuration, we could instead update the docs with recommendations?

Good question. For the failing fuzz tests, I can limit the values in the fuzzer itself, and by doing so avoid running into the bad state. If you think there are good lower-bounds for everyone to use, then we can add them in the validation (adding lower bounds for both aggression and the min_weight_percent config fields).

adisuissa avatar Feb 23 '24 14:02 adisuissa

@adisuissa agree that we should limit the values in fuzzer to fix the fuzz issue. For the lower bounds of min weight, 0.1 is a safe bet (and we use it as default value for min_weight_percent), but for figuring out more precise bounds (and bounds for aggression in particular) we will need to run an edf scheduler+slow start simulation. I can take a stab at simulation next week.

nezdolik avatar Feb 28 '24 22:02 nezdolik

Closing this PR, as #33036 solves the issue for the fuzzer (although future proofing of the config values should probably be done later).

adisuissa avatar Mar 25 '24 17:03 adisuissa