OF to treat dangling relations as if they do not exist?
When an app is removed with --force, we end up with a dangling relation, so any subsequent attempt to interact with the relation results in tracebacks in the debug-log and the unit is in error:
- subprocess.CalledProcessError: Command '('/var/lib/juju/tools/unit-alertmanager-k8s-1/network-get', 'alerting', '-r', '4', '--format=json')' returned non-zero exit status 1.
- ops.model.ModelError: b'ERROR relation 4 not found (not found)\n'
- ops.model.ModelError: b'ERROR permission denied\n'
Charms could start adding e.g. with contextlib.suppress(ModelError): all over the place, but perhaps OF could/should treat dangling relations as if they don't exist?
Example: https://github.com/canonical/alertmanager-k8s-operator/issues/65
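For reference, the per-call workaround mentioned above might look roughly like the sketch below. To keep it runnable without ops installed, `ModelError` here is a local stand-in for `ops.model.ModelError`, and `get_bind_address` is a hypothetical wrapper that fails the way `network-get` does for a missing relation; neither is the real library code.

```python
import contextlib
from typing import Optional, Set


class ModelError(Exception):
    """Local stand-in for ops.model.ModelError (assumption, not the real class)."""


def get_bind_address(relation_id: int, live_relation_ids: Set[int]) -> str:
    """Hypothetical hook-tool wrapper that fails for a relation Juju no longer knows."""
    if relation_id not in live_relation_ids:
        raise ModelError(f"ERROR relation {relation_id} not found (not found)")
    return "10.0.0.1"  # placeholder address for the sketch


def bind_address_or_none(relation_id: int, live_relation_ids: Set[int]) -> Optional[str]:
    """Return the bind address, or None if the relation is dangling."""
    address = None
    # Swallow the error instead of letting the hook fail and the unit go into error.
    with contextlib.suppress(ModelError):
        address = get_bind_address(relation_id, live_relation_ids)
    return address
```

The downside is visible even in this small form: every call site needs its own suppress block and its own fallback value, which is why handling it once in the framework is tempting.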
Hrm.
--force is a destructive command that could get a model into a bad state, so I'm not sure that it is correct to handle this automatically in the framework.
But we also don't want to leave folks with a broken cloud.
@jameinel is there a good way to resolve this with Juju CLI commands? If there's a good way to clean up the dangling relation with the CLI tools, I think that I'd rather keep the current ops behavior, perhaps with some additions to the text of the error message, explaining how to resolve it.
So the relations should get cleaned up on Juju's side (so relation-list shouldn't expose them). But certainly things like deferred events are tracking what relation ids they were originally fired on.
If relation-ids is reporting 4 then I would consider that a Juju bug. But if you '--force' removal, the whole point of that is we can't guarantee clean shutdown (the user is asking to not wait for charms to exit cleanly). It may be that juju will try to fire a relation-departed but the relation-id is well and truly gone. We should look at it on our end.
(Normally we wouldn't destroy the unit until associated things have acknowledged that it is going away, but again '--force' is asking to ignore things going into error state, etc.)
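To make the distinction above concrete, here is a rough sketch of comparing the relation ids that something like a deferred event is still tracking against what `relation-ids` currently reports. It assumes the `--format=json` output is a list of `endpoint:id` strings such as `["alerting:4"]`; the function names are made up for illustration.

```python
import json
from typing import List, Set


def parse_relation_ids(raw_json: str) -> Set[int]:
    """Parse assumed `relation-ids <endpoint> --format=json` output into integer ids."""
    return {int(item.split(":")[-1]) for item in json.loads(raw_json)}


def stale_relation_ids(tracked: List[int], raw_json: str) -> List[int]:
    """Return tracked ids (e.g. from deferred events) that Juju no longer lists."""
    live = parse_relation_ids(raw_json)
    return [relation_id for relation_id in tracked if relation_id not in live]
```

If id 4 shows up in the stale list, the leftover reference is on the charm side (for example a deferred event); if `relation-ids` itself still reports 4, that would point to the Juju bug described above.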
Thx for the comment, @jameinel!
.defer is a place where I'd be much more comfortable squishing potential errors like this, provided that we responsibly log the squish.
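As a sketch of what "squish and log" could mean when deferred events are re-emitted: the code below is not how ops implements deferral. `RelationNotFoundError` and `reemit_deferred` are hypothetical names, and each "restorer" is just a callable standing in for restoring one deferred event.

```python
import logging
from typing import Callable, List

logger = logging.getLogger(__name__)


class RelationNotFoundError(Exception):
    """Hypothetical error for a deferred event whose relation no longer exists."""


def reemit_deferred(restorers: List[Callable[[], str]]) -> List[str]:
    """Restore each deferred event, dropping and logging those with a missing relation."""
    restored = []
    for restore in restorers:
        try:
            restored.append(restore())
        except RelationNotFoundError as exc:
            # Squish the error, but leave a trace in the log so it can be diagnosed.
            logger.warning("Dropping deferred event: %s", exc)
    return restored
```

Only the one specific error is caught, so any other failure while restoring an event would still surface as usual.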
I don't think we're going to provide these guarantees with --force, and in addition, we're trying to wean people off --force as much as possible -- closing.
Just clarifying in case this was closed for the wrong reason:
- If a charm is in error state, it cannot be upgraded, and it cannot be removed without force.
- When forcefully removing the charm that is in error, as a consequence, any related healthy charm may go into error state because of the dangling relation (this issue).
Oh, thanks @sed-i -- I'd misread part of the original issue. Yeah, point 2 seems like a problem. Reopening.
This needs further investigation, for example, with Leon's repro case, to see what's going on on the Juju side. John's message makes it sound like Juju should be cleaning up the relation data in this case, but maybe it isn't, or something else is going on.
@sed-i Out of interest, are you working around this issue in the meantime?
Afaik, we did not end up adding workarounds directly for this. But I do not recall encountering this issue recently.
I haven't been able to reproduce this. For example:
- Integrate two charms.
- Cause one to go into error state (I've just had a NameError in an event handler, but I assume the exact error doesn't matter).
- Remove the error'd application with --force.
- The relation doesn't show up at all in relation-ids. The remaining charm behaves in the same way as if I'd never done the integration in the first place.
I've tried Juju 3.1 (LXD localhost), 3.3 (microk8s), and 3.4 (microk8s). I've tried with and without data in the relation databags (on both sides, in both app and unit), and a few different causes of error state, but it's all the same as the steps above.
@sed-i any thoughts on anything I could do differently to try to reproduce? Anything I'm missing about the problem case?
If not, then I think we might need to assume that this has been resolved on the Juju (force) side and close this, and if someone does notice it in the future, try to jump in very quickly to get more details/reproduction steps.
I'll close this for now, but if you do know of some way I could reproduce this (other than what I've described trying above) or if anyone does run into this again, please do re-open.