New feature: Allow specifying failure thresholds as percentages
Currently the Uptimer configuration only allows absolute numbers for failure thresholds. However, the time a CF upgrade deployment takes can vary significantly. Rolling out a new stemcell version takes a long time because all VMs have to be recreated, whereas a single BOSH release update may only affect a few VMs. Since Uptimer keeps measuring for as long as the deployment runs, the total number of attempts varies just as much, so absolute thresholds like "allow only 10 push failures" don't always make sense. Instead, we want to specify percentages that take the total number of attempts into account.
Current example configuration:
APP_STATS_THRESHOLD: 5
Current example output:
[UPTIMER] 2024/05/22 08:18:28 FAILED (Stats availability): 6 failed attempts to retrieve stats for app exceeded the threshold of 5 allowed failures (Total attempts: 183, pass rate 96.72%)
Configuration with percentages:
APP_STATS_THRESHOLD_PERCENT: 95
New result:
[UPTIMER] 2024/05/22 08:18:28 SUCCESS (Stats availability): 6 failed attempts to retrieve stats for app did not fall below the threshold of 95% (Total attempts: 183, pass rate 96.72%)
Implementation idea:
- Enhance the periodic struct with a percentage threshold parameter: https://github.com/cloudfoundry/uptimer/blob/36ffbdf10ca4aed122e96dcbd444f29ed3c184e3/measurement/periodic.go#L11
- In the Summary function, check the calculated percentage against the configured value: https://github.com/cloudfoundry/uptimer/blob/36ffbdf10ca4aed122e96dcbd444f29ed3c184e3/measurement/periodic.go#L116
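The check described above could look roughly like the following. This is a minimal sketch, not the actual Uptimer code: the result type and the passRate and meetsThreshold helpers are hypothetical names, and treating zero attempts as a 100% pass rate is an assumption.

```go
package main

import "fmt"

// result is a hypothetical stand-in for the counters a periodic
// measurement keeps; it is not the real Uptimer struct.
type result struct {
	total  int // total number of attempts
	failed int // number of failed attempts
}

// passRate returns the percentage of successful attempts (0 to 100).
func passRate(r result) float64 {
	if r.total == 0 {
		// Assumption: with no attempts, nothing failed.
		return 100
	}
	return float64(r.total-r.failed) / float64(r.total) * 100
}

// meetsThreshold reports whether the pass rate is at least minPassPercent.
func meetsThreshold(r result, minPassPercent float64) bool {
	return passRate(r) >= minPassPercent
}

func main() {
	// Numbers taken from the example output: 6 failures in 183 attempts.
	r := result{total: 183, failed: 6}
	fmt.Printf("pass rate %.2f%%, meets 95%%: %t\n", passRate(r), meetsThreshold(r, 95))
}
```

With the numbers from the example output, 6 failures in 183 attempts give a pass rate of 96.72%, which is above a 95% threshold even though it exceeds an absolute limit of 5 failures.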
I'd like to hear your opinions on how we should change the configuration API for the new feature. Currently we have:
{
"while": [
{
"command": "bosh",
"command_args": [
"--tty",
"-n",
"deploy",
"/tmp/tmp.GGbZ75Hb74",
"-d",
"cf"
]
}
],
"cf": {
"api": "api.cf.trelawney.env.wg-ard.ci.cloudfoundry.org",
"app_domain": "cf.trelawney.env.wg-ard.ci.cloudfoundry.org",
"admin_user": "admin",
"admin_password": "<redacted>",
"tcp_domain": " ",
"available_port": -1,
"tcp_port": -1,
"use_single_app_instance": false
},
"allowed_failures": {
"app_pushability": 10,
"app_stats": 10,
"http_availability": 0,
"tcp_availability": 0,
"recent_logs": 10000,
"streaming_logs": 10000,
"app_syslog_availability": 0
},
"optional_tests": {
"run_app_syslog_availability": false,
"run_tcp_availability": false
}
}
Option 1:
Add new parameters with a _percent suffix in allowed_failures. The given percentage is the minimum pass rate:
"allowed_failures": {
"app_pushability_percent": "99.5"
}
Setting both app_pushability and app_pushability_percent would result in a configuration validation error. However, the name allowed_failures is somewhat misleading here, because the value is a required pass rate rather than a number of failures.
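The mutual-exclusion rule could be validated along these lines. This is only a sketch under assumptions: the thresholds struct, its field names, and the validate function are hypothetical, and the percentage is modelled as a number here although the examples in this issue write it as a string.

```go
package main

import (
	"errors"
	"fmt"
)

// thresholds is a hypothetical config struct. Pointers are used so that
// "not set" can be told apart from an explicit zero.
type thresholds struct {
	AppPushability        *int
	AppPushabilityPercent *float64
}

var errBothSet = errors.New("app_pushability and app_pushability_percent are mutually exclusive")

// validate rejects configurations that set both the absolute and the
// percentage threshold, or a percentage outside 0 to 100.
func validate(t thresholds) error {
	if t.AppPushability != nil && t.AppPushabilityPercent != nil {
		return errBothSet
	}
	if p := t.AppPushabilityPercent; p != nil && (*p < 0 || *p > 100) {
		return fmt.Errorf("app_pushability_percent must be between 0 and 100, got %v", *p)
	}
	return nil
}

func main() {
	absolute, percent := 10, 99.5
	fmt.Println(validate(thresholds{AppPushability: &absolute, AppPushabilityPercent: &percent}))
	fmt.Println(validate(thresholds{AppPushabilityPercent: &percent}))
}
```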
Option 2:
Add new parameters with a _percent suffix in allowed_failures. The given percentage is the maximum failure rate:
"allowed_failures": {
"app_pushability_percent": "0.5"
}
This is semantically more correct, but availability is usually expressed as a "9x.x" percentage.
Option 3:
New configuration block:
"required_pass_rate": {
"app_pushability_percent": "99.5"
}
A little more complex, but easier to read.
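For illustration, reading such a block could be sketched as follows. The config type and the parsePassRates function are hypothetical and not part of Uptimer; the sketch assumes the percentages stay string-encoded as in the example above and converts them with strconv.ParseFloat.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// config is a hypothetical type covering only the proposed block.
type config struct {
	RequiredPassRate map[string]string `json:"required_pass_rate"`
}

// parsePassRates decodes the block and converts each string-encoded
// percentage into a float64, keyed by the parameter name.
func parsePassRates(raw []byte) (map[string]float64, error) {
	var c config
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, err
	}
	rates := make(map[string]float64, len(c.RequiredPassRate))
	for name, s := range c.RequiredPassRate {
		v, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return nil, fmt.Errorf("%s: %w", name, err)
		}
		rates[name] = v
	}
	return rates, nil
}

func main() {
	raw := []byte(`{"required_pass_rate": {"app_pushability_percent": "99.5"}}`)
	rates, err := parsePassRates(raw)
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Println(rates["app_pushability_percent"])
}
```

If the values were plain JSON numbers instead of strings, the map could be declared as map[string]float64 and the conversion step would disappear.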
I don't like Option 1. I think it would be confusing to read "allowed failures" and see 99.5%.