Should relation data be visible to charm code from within a `relation-broken` hook?
Currently, Harness first emits `relation-broken` and only then invalidates the data.
This means that relation data is still accessible to charm code while inside the `relation-broken` hook. Is that intentional?
Based on the event page it sounds like charm code shouldn't be able to see any data when inside the broken hook.
https://github.com/canonical/operator/blob/4d38ef7ce2fe6dfc4034db5957d9e526ce0cf3d9/ops/testing.py#L697-L702
Juju actually allows the charm to access (stale?) relation data in relation-broken. However...
I've just tested this using two charms related to one another, each having a relation-broken hook that accesses relation data (from "this" app or the "remote" app). The Harness actually handles this quite differently from real Juju in a number of cases.
Below are the results. The "setter" charm is the charm in the relation that's setting relation data via `event.relation.data[self.app]["key"] = value`, and the "getter" charm is the charm that's reading relation data.
| charm  | which app | Juju result    | Harness result            |
|---|---|---|---|
| setter | this      | `{...data...}` | `{...data...}`            |
| setter | remote    | `{}`           | `RuntimeError`            |
| getter | this      | `{}`           | `RelationDataAccessError` |
| getter | remote    | `{...data...}` | `RuntimeError`            |

- Both `RuntimeError`s have the message "remote-side relation data cannot be accessed during a relation-broken event".
- The `RelationDataAccessError` message is "unit/0 is not leader and cannot read its own application databag", and in `__repr__` this gets caught and converted to the string `<n/a>`.
Raw data from the test:
To test this (unit tests and real Juju), I hacked my database and webapp charms that I've been using for secrets to add logging in the relation-broken hooks. The database charm is the "setter" and sets relation data, and the webapp charm is the "getter" as it reads relation data. `self_data` means data accessed via `event.relation.data[self.app]`, and `event_data` means remote data accessed via `event.relation.data[event.app]`.
```text
# real Juju
unit-database-0: 15:25:26 INFO unit.database/0.juju-log db:0: database _on_db_relation_broken: <ops.model.Relation db:0> self_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}
unit-database-0: 15:25:26 INFO unit.database/0.juju-log db:0: database _on_db_relation_broken: <ops.model.Relation db:0> event_data={}
unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> self_data={}
unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> event_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}

# database unit tests
database _on_db_relation_broken: <ops.model.Relation db:0> self_data={'db_password_id': 'secret:20a43c1e-41b3-49b2-ba42-6e11ec2cfffb'}
database: exception when getting event data: remote-side relation data cannot be accessed during a relation-broken event

# webapp unit tests
webapp _on_db_relation_broken: <ops.model.Relation db:0> self_data=<n/a> # masks: exception when getting self data: webapp/0 is not leader and cannot read its own application databag
webapp: exception when getting event data: remote-side relation data cannot be accessed during a relation-broken event
```
It seems odd that the Harness deviates so much from real Juju, which just allows reads in each case, even if the data is not useful/stale.
I presume we intentionally raise more errors than Juju in tests to try to catch problems early -- for example, the charm probably shouldn't be accessing remote relation data during relation-broken (but Juju lets you). And I'm not sure about the `RelationDataAccessError` for the case when the "getter" charm tries to read its own data -- that doesn't seem correct.
As to the original issue, it seems like the data is unable to be accessed (in 3 out of 4 cases!). @sed-i, can you post the actual code you were working with when you ran into this? Were you fetching relation data? Was it in the "getter" or "setter" charm? And was it via self.app (this) or event.app (remote)?
In any case, we should decide whether we want to mimic real Juju more closely. Or we should raise an error consistently if you access relation data in relation-broken in all cases. Is there ever a valid use case for doing that? @jameinel, thoughts?
> can you post the actual code you were working with when you ran into this?
Here's the unit test that expects the charm lib/Juju to take care of cleaning up relation data.
Specifically, the unit test was expecting that whatever custom events fire as a result of `self.harness.remove_relation(rel_id)` would not see any relation data. I think in this test it's the remote data that is expected to go away.
> Is there ever a valid use case for doing that?
The pattern we've been taking in o11y charms, more often than not, is that relation data represents the most up-to-date state.
After a relation-broken there is no other event that "re-runs the charm" with the updated relation data; relation-broken is the last chance to act on a change. This way, deep charm code doesn't need to know which event it's in (no need for "if the event is relation-broken, then update everything ignoring the data").
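As a concrete illustration of this holistic style, here's a minimal plain-Python sketch (all names hypothetical, not real ops API): the handler rebuilds everything from whatever relation data is currently visible, so it never needs to know which event triggered it.

```python
# Hypothetical sketch of the holistic pattern: rebuild the full
# config from whatever relation data is currently visible, so the
# same code works regardless of which event is being handled.

def build_targets(relations):
    """Rebuild the complete target list from current relation data."""
    targets = []
    for rel in relations:
        for unit_data in rel["units"].values():
            if "address" in unit_data:
                targets.append(unit_data["address"])
    return sorted(targets)

# While the relation exists, two remote units are visible:
rels = [{"units": {"bar/0": {"address": "10.0.0.1"},
                   "bar/1": {"address": "10.0.0.2"}}}]
print(build_targets(rels))  # ['10.0.0.1', '10.0.0.2']

# If the data were already gone by relation-broken, the same code
# would naturally produce an empty config:
print(build_targets([]))  # []
```

This only works as a catch-all if relation-broken actually runs against the post-break view of the data, which is exactly what's at issue in this thread.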
I'm probably missing something obvious, but I don't quite understand -- doesn't the above table show that relation-broken on real Juju still includes the previous data? So wouldn't expecting the Harness to do something different mean the unit test will behave differently in unit tests compared to under real Juju?
I'm not sure I understand the table correctly: On relation broken, the remaining app can read the data that was set by the departed app?
Yeah, that's right -- that's the last row in the table. It shows the "getter" app (i.e., the other charm from the one that set the data) being able to read data that the remote app (the "setter") set. Under real Juju it can read this data, under the Harness you currently get a RuntimeError("remote-side relation data cannot be accessed during a relation-broken event"), which doesn't seem to match reality. (@jameinel any idea why the Harness tries to be different/stricter than reality here?)
You can see this from the following log line:
unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> event_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}
The webapp charm is the "getter" in this case, and it was able to read that data -- during relation-broken -- that the database charm had set.
During relation-broken you should be able to read the data from the unit that was on the other end; if we aren't doing that in Harness, we should allow it. (My only caveat is that really relation-departed is where you should be handling the information; by the time you get to relation-broken it's because the relation really is gone, and you're supposed to be finalizing things, not doing more configuration based on the last thing that the remote unit shared with you.)
Note while you can read, you shouldn't be allowed to set data at that point.
I also feel that 'relation-broken' really shouldn't be "just another relation event like all the other ones". Deleting a file on disk is completely different to modifying it, or creating it. So the original argument that "I want a catch all behavior" doesn't really fit for me.
If I understand correctly, this means that the following pattern is wrong:

```python
def _on_relation_departed(self, _):  # or broken
    self._update_config()  # regenerate everything from current rel data
```

and instead we should do something like:

```python
def _on_relation_departed(self, event):  # or broken
    self._update_config(excluding=event.relation.data)
```

Is that correct?
In other words, from within relation-departed/broken:
- is `event.relation` included in `self.model.relations`?
- is `event.relation.data` included in `self.model.relations[x].data`?
Hi everyone, chiming in on this; is what @sed-i proposed the pattern we're supposed to follow?
@PietroPasotti just gave me an idea: if we always defer a relation-broken event, then by the next hook (update-status at the latest) there will be no data left in relation data, so charm code could operate on the entire relation data, i.e. without needing to work with the delta that a relation-broken implies.
This is not a great pattern, but it conveys well our dissonance about relation-broken.
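A plain-Python simulation of that deferral idea (no real ops objects; all names hypothetical): don't act on relation-broken immediately, and by the next hook the databags are already empty, so a single holistic render handles it with no special-casing.

```python
# Hypothetical simulation of "defer relation-broken, act next hook".

def render_config(databags):
    """Rebuild config purely from the current databags."""
    return {"alertmanagers": sorted(databags)}

deferred = []

def on_relation_broken(notification):
    deferred.append(notification)  # defer: act on the next hook instead

def on_update_status(databags):
    deferred.clear()  # the deferred event is re-emitted now
    return render_config(databags)

bags = {"am/0": {"address": "10.0.0.1"}}
on_relation_broken("broken:am")
bags.clear()  # by the next hook, Juju has removed the data
print(on_update_status(bags))  # {'alertmanagers': []}
```

The cost, of course, is acting one hook late, which is part of why this isn't a great pattern.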
FYI in some cases (but not all) accessing the remote application data in a relation broken event causes an error
See: https://bugs.launchpad.net/juju/+bug/1960934 https://github.com/canonical/operator/blob/734e12dcfde93d7081aed5573e011128d98fd84a/ops/model.py#L1341-L1349 https://github.com/canonical/mysql-router-k8s-operator/issues/73
What the Kubeflow team has seen with istio-pilot is like what @carlcsaposs-canonical reports. In live Juju when handling a relation-broken event:
1. sometimes `event.app = DEPARTING_APP` (and I think `relation.app` is the same thing? probably an alias?)
2. sometimes `event.app = None`
Up until today, we had only seen case (1), and whenever we saw it, we also knew that the departing application's data was still in the relation data bag. We handled this by popping the departing data before using the data bag.
```python
# (simplified version - differs slightly from the link)
if isinstance(event, RelationBrokenEvent):
    relation_data.pop((event.relation, event.app))
```
Now that sometimes we see `event.app = None`, I wonder if we should instead do something like:
```python
if isinstance(event, RelationBrokenEvent):
    try:
        relation_data.pop((event.relation, event.app))
    except KeyError:
        log_and_pass  # ?
```
The one question I have is whether, when `event.app` is `None`, we are guaranteed that the departing application's data has been removed from the databag. If not, that will cause trouble, as we can't pop it.
@jameinel Per the above and per https://bugs.launchpad.net/juju/+bug/1960934, it seems Juju is sometimes setting `JUJU_REMOTE_APP` but sometimes not setting it. Do you think that could be fixed on the Juju side? Or I guess we could change Juju to always not set it, but that might be too breaking.
https://bugs.launchpad.net/juju/+bug/2024583
Also, I think we usually don't want to see the stale data even on relation-departed:
- apps foo and bar are related.
- bar scaled down from 2 to 1.
- foo received relation-departed, but not broken.
- foo finishes running with the old data view which included the departing unit, and will only be able to act on the new data view (without bar/2) on update-status.
Real world example:
- Alertmanager is scaled from 2 to 1
- Prometheus needs to regenerate its yaml config file with only one target under the alertmanagers section.
This issue is related to a frequent point of friction in charming: reconciling holistic vs. delta approaches.
- Relation events give us deltas: unit/x joined or unit/x departed, and now we need to append/pop a section to/from an existing config file.
- Idempotency and robustness call for holistic: for example, on config-changed after an upgrade we need to iterate over all relations to construct a full config from scratch.
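The friction can be shown in a small sketch (plain Python, hypothetical names): the delta handler mutates existing state in response to one event, while the holistic one rebuilds from scratch and is therefore safe to call from any event.

```python
# Delta approach: mutate existing config in response to one event.
def on_unit_departed(config, unit):
    config["targets"].remove(unit)  # needs to know exactly what changed

# Holistic approach: rebuild from the full current state.
def rebuild(current_units):
    return {"targets": sorted(current_units)}

config = {"targets": ["am/0", "am/1"]}
on_unit_departed(config, "am/1")
assert config == rebuild(["am/0"])  # both converge here

# But only the holistic form can run from any event (e.g.
# config-changed after an upgrade), where no delta is available:
print(rebuild(["am/0"]))  # {'targets': ['am/0']}
```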
Need to consider further what to do here. Possibly related to https://github.com/canonical/operator/issues/940 work.
Related: https://github.com/canonical/operator/issues/940#issuecomment-1623559826 (difference in usage between local and remote app data during relation-broken)
This is fixed in ops 2.10, isn't it?
> This is fixed in ops 2.10, isn't it?
I believe the relation data is still accessible, and I think it should be, for the local app/unit databags.
@tonyandrewmeyer is going to investigate this further and then we'll make a decision here.
There are three issues covered in this ticket:
**`JUJU_REMOTE_APP` not being set**
The Juju team say that this should always be set, so that's how ops behaves. It does seem like that may not always be true, which the Juju team would consider a bug. That needs to get fixed in Juju, rather than ops trying to work around it. The bug is marked as incomplete because it couldn't be reliably reproduced, but there are more recent comments that indicate that maybe it can be now.
As far as I know, Charm-Tech has no "bug slots", so we can't get movement on this. However, I assume the charming teams do have "bug slots", and I'd suggest that one/more of them use some of those to get that ticket moved back into an active status and resolved.
**Whether a relation should show in `relation-ids` when broken**
Resolved in this Juju bug (and the earlier ops change that has the same behaviour), for better or worse.
**Harness matching Juju behaviour**
Note that Scenario doesn't add any additional checks for getting or setting, apart from a Juju version check (>2.7), and a check that probably ought to be an assert (it should never be reached). So the behaviour there should be the same as ops provides with a real Juju backend.
When reading, ops itself checks if `Relation.app` is `None` (it's typed to never be) and raises if so. I feel we should remove this check: according to Juju (see the first issue) this shouldn't happen, and it seems cleaner to just let Juju provide an error if there should be one (but again, that shouldn't be the case). That would mean that instead of relying on this for the check of whether it's a peer relation, we would just assume that if `app` is `None`, then it's not a peer relation (this seems better than having ops raise an `AttributeError` if `app` isn't set).
When reading, ops also checks if the relation app name matches the local app name and, if so (and the unit isn't the leader), raises. I feel we should remove this check too: it again seems cleaner for Juju to provide the error, like:
```shell
$ juju exec --unit provider-charm/6 -- relation-get -r 22 - dummy --app
ERROR permission denied
```
We could keep this backwards compatible by checking for "permission denied" and raising `RelationDataAccessError` in that case rather than a generic `ModelError`. We'd have to move the check into Harness in this case, though, and Scenario would also need to learn to do this check. On balance, I think it would have been the better choice originally, but it doesn't have enough benefit to do now.
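A minimal sketch of that backwards-compatible mapping (the class and function names here only mirror ops; this is not the actual ops code): let Juju produce the error, then translate its "permission denied" message into the more specific exception type.

```python
# Hypothetical sketch: translate Juju's "permission denied" error
# into the more specific exception type for backwards compatibility.

class ModelError(Exception):
    """Generic error from a hook tool (name mirrors ops)."""

class RelationDataAccessError(ModelError):
    """More specific error kept for backwards compatibility."""

def relation_get(run_hook_tool):
    try:
        return run_hook_tool()
    except ModelError as e:
        if "permission denied" in str(e):
            raise RelationDataAccessError(str(e)) from e
        raise

def denied():
    # Stand-in for a hook-tool call that Juju rejects.
    raise ModelError("ERROR permission denied")

try:
    relation_get(denied)
except RelationDataAccessError as e:
    print("translated:", e)  # translated: ERROR permission denied
```

The extra subprocess round-trip only happens on the failure path, which should be rare.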
When writing, ops again duplicates the Juju logic (can't write to a different unit, can't write to a remote app databag, can't write to the local app databag unless leader). I again feel that we would have been better just letting this logic live in Juju (a failure would mean an extra subprocess call, but failures ought to be rare and a subprocess call is inexpensive compared to most charming work), but that ripping it out now doesn't really offer a significant benefit.
When reading, Harness only checks that the relation is in the relations that have been added. Outside of events (and `__init__`, as of the next ops release), some of the run-time checks are also disabled.
When writing, Harness has the extra "is this relation-broken" check. We should drop that - there's already separate validation that a unit can't write to the app databag for a different app, and Juju allows writing to the local app databag (if leader) in relation-broken, even though that's a strange thing to do.
> As far as I know, Charm-Tech has no "bug slots", so we can't get movement on this. However, I assume the charming teams do have "bug slots", and I'd suggest that one/more of them use some of those to get that ticket moved back into an active status and resolved.
Sinan commented in Matrix that Charm-Tech could actually push this and that he'd have the Juju team look into it 🎉
> When reading, ops itself checks if `Relation.app` is `None` (it's typed to never be) and raises if so. I feel we should remove this check: according to Juju (see the first issue) this shouldn't happen, and it seems cleaner to just let Juju provide an error if there should be one (but again, that shouldn't be the case). That would mean that instead of relying on this for the check of whether it's a peer relation, we would just assume that if `app` is `None`, then it's not a peer relation (this seems better than having ops raise an `AttributeError` if `app` isn't set).
I'm wondering if this is actually worth doing or not (and there's been a little push-back from posting about it in Charm Development in Matrix).
A lot of cases already work even if `JUJU_REMOTE_APP` is missing. If the relation is a peer relation, then the app is always set, because there isn't a 'remote' app. If the relation is from `self.model.relations`, then the app typically comes from `relation-list`; similarly with `self.model.get_relation` - we have quite a lot of fallbacks.
It seems like changing the behaviour here risks introducing a subtle behaviour change that will cause existing charms problems, and it's all to patch over a Juju bug that we hope will get fixed anyway. So leaving in the existing bandage is probably best.
(We still want to fix the Harness `relation-broken` check.)