Make parse errors of command-line knobs stop the program rather than emit a warning
FDB binaries currently emit a warning to stderr as well as the logs when they encounter command-line options for
- knobs which don't exist, and
- knob values which fail to parse.

They then proceed as if the command-line argument had never been provided.
FDB servers are often managed by automation, and no one sees warnings in the logs or stderr unless they go looking for them. This means that a typo in the name of a knob will result in a server running without the knob setting its operator meant to set. Similarly, an attempt to specify a memory limit as a knob value using unit suffixes, e.g. '4GiB', will also fail, but the server will proceed with the default value of that knob, contrary to the operator's intention.
I believe the safer option is to fail hard when we don't understand command-line arguments, rather than emit a warning and proceed. Running with unexpected knob values can lead to failures which are difficult to diagnose, whereas failing immediately at program start is straightforward to remedy.
New behavior:

```
$ bin/fdbserver --knob_commit_batches_mem_bytes_hard_limit=4GiB
ERROR: Invalid value '4GiB' for knob option 'commit_batches_mem_bytes_hard_limit'
Error: Option set with an invalid value
```
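For context, here is a minimal C++ sketch of the shape of this change. This is not FDB's actual option-parsing code; the knob table, helper names, and exit code are invented for illustration, and only the error-message format is taken from the transcript above.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>

// Stand-in knob table; real FDB knobs live in typed Knobs classes.
static std::map<std::string, long long> g_knobs = {
    { "commit_batches_mem_bytes_hard_limit", 8LL << 30 },
};

// Strict integer parse: rejects trailing junk, so a value like '4GiB' fails.
static bool parseInt64(const std::string& s, long long* out) {
    char* end = nullptr;
    errno = 0;
    long long v = strtoll(s.c_str(), &end, 10);
    if (errno != 0 || end == s.c_str() || *end != '\0')
        return false;
    *out = v;
    return true;
}

// Old behavior: warn on stderr/logs and keep the knob's default value.
// New behavior (this PR): report the error and stop the program.
static void setKnobOrDie(const std::string& name, const std::string& value) {
    auto it = g_knobs.find(name);
    if (it == g_knobs.end()) {
        fprintf(stderr, "ERROR: Unknown knob option '%s'\n", name.c_str());
        exit(1);
    }
    long long parsed;
    if (!parseInt64(value, &parsed)) {
        fprintf(stderr, "ERROR: Invalid value '%s' for knob option '%s'\n",
                value.c_str(), name.c_str());
        exit(1);
    }
    it->second = parsed;
}
```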
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
- [ ] The PR has a description, explaining both the problem and the solution.
- [ ] The description mentions which forms of testing were done and the testing seems reasonable.
- [ ] Every function/class/actor that was touched is reasonably well documented.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
- [ ] This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or `main` if this is the youngest branch)
- [ ] There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)
The choice to warn and not fail was intentional to avoid situations where an invalid knob specification brings down all of the processes in a cluster and causes an availability loss. For example, if we remove a knob that clusters were specifying, they will fail after an upgrade. Similarly, if we are deploying knobs across a fleet that is running different versions, older ones may fail if the knob is only present in new versions.
There are some existing systems for managing FDB clusters that depend on this property of knobs, so if we decided to change this we would have to do so in coordination with the folks managing those systems.
My preference would be that we leave the existing behavior and just make sure that a knob that doesn't get applied is properly surfaced through our monitoring channels, so that we would see the issue and could rectify it.
Result of foundationdb-pr on Linux CentOS 7
- Commit ID: fdb5059f6faf0b6cfd99dcd4f46f5bedab56c225
- Duration 1:02:57
- Result: :white_check_mark: SUCCEEDED
- Error: N/A
- Build Logs (available for 30 days)
- Build Artifact (available for 30 days)
Doxense CI Report for Windows 10
- Commit ID: fdb5059f6faf0b6cfd99dcd4f46f5bedab56c225
- Result: :heavy_check_mark: SUCCEEDED
- Build Logs (available for 30 days)
> The choice to warn and not fail was intentional to avoid situations where an invalid knob specification brings down all of the processes in a cluster and causes an availability loss. For example, if we remove a knob that clusters were specifying, they will fail after an upgrade. Similarly, if we are deploying knobs across a fleet that is running different versions, older ones may fail if the knob is only present in new versions.
Thanks for the explanation. Can we revisit this choice? The choice to warn has led to an actual cluster-wide unavailability.
There are cases where this leniency makes sense, like the ones you've brought up, and cases where we'd rather have gotten an early warning at process startup time than proceed with the wrong knob value and crash later.
> There are some existing systems for managing FDB clusters that depend on this property of knobs, so if we decided to change this we would have to do so in coordination with the folks managing those systems.
Yes, agreed.
> My preference would be that we leave the existing behavior and just make sure that a knob that doesn't get applied is properly surfaced through our monitoring channels, so that we would see the issue and could rectify it.
The problem is that warnings aren't, in fact, getting properly surfaced and rectified. We'd have to add a specific filter for this warning or else face warning fatigue. Actually, that's what this PR accomplishes: it turns this particular warning into an error that we'll notice, hopefully before we deploy a change to the entire cluster.
Alternatively, we could add an additional flag so that the current behavior remains the default, but these warnings can be turned into fatal errors as desired. I wanted to at least have a discussion, though, before splitting the baby.
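If we went the configurable route, the shape might be something like the sketch below. The `--strict-knobs` flag name and its plumbing are hypothetical, not an existing fdbserver option; the point is just that one choke point decides whether a knob problem is fatal.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical: strictKnobs would be set from an opt-in flag such as
// --strict-knobs (invented name); the default (false) preserves today's
// warn-and-continue behavior.
static void onKnobParseError(bool strictKnobs, const std::string& name,
                             const std::string& value) {
    fprintf(stderr, "%s: Invalid value '%s' for knob option '%s'\n",
            strictKnobs ? "ERROR" : "WARNING", value.c_str(), name.c_str());
    if (strictKnobs)
        exit(1); // fail fast, as this PR proposes
    // otherwise fall through: the knob keeps its compiled-in default
}
```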
Result of foundationdb-pr-cluster-tests on Linux CentOS 7
- Commit ID: fdb5059f6faf0b6cfd99dcd4f46f5bedab56c225
- Duration 2:46:48
- Result: :x: FAILED
- Error: `Error while executing command: if $(grep -q -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log); then echo "TESTS FAILED SEE THESE LOGS:"; echo ; grep -l -- "--- FAIL:" ${CODEBUILD_SRC_DIR}/fdb-kubernetes-tests/logs/*.log; exit 1; fi`. Reason: exit status 1
- Build Logs (available for 30 days)
- Build Artifact (available for 30 days)
> Thanks for the explanation. Can we revisit this choice? The choice to warn has led to an actual cluster-wide unavailability.
The other choice has also led to cluster unavailabilities, for what it's worth. If I understand correctly, the bug that led to a problem here wasn't that we treated this as a warning but rather that we incorrectly parsed the argument. It turns out that in this case we would have known about the problem, because old versions rejected the argument and we ran the knob on old versions; but interestingly, if someone had deployed that broken knob on old versions and we had treated it as a hard error, we also would have lost availability there. The main saving grace in that case is that if you rolled out the knob through some test clusters first, you could detect the problem earlier, and the feedback would also be more immediate.
> There are cases where this leniency makes sense, like the ones you've brought up, and cases where we'd rather have gotten an early warning at process startup time than proceed with the wrong knob value and crash later.
We should have visibility into these issues, and the fact that we aren't getting that is a problem. There are likely ways we can make this better (e.g. marking a cluster with invalid knobs unhealthy, which I think we do get notified on).
> Actually, that's what this PR accomplishes
It does do that, but it does more as well. If we could just surface this error without bringing down a cluster, then it would be a no-brainer. That said, I'm not against the idea of making this behavior configurable.