
Should relation data be visible to charm code from within a `relation-broken` hook?

sed-i opened this issue 3 years ago · 20 comments

Currently, Harness first emits relation-broken and only then invalidates the relation data.

This means that relation data is still accessible to charm code from within the relation-broken hook. Is that intentional?

Based on the event documentation page, it sounds like charm code shouldn't be able to see any relation data from within the broken hook.

https://github.com/canonical/operator/blob/4d38ef7ce2fe6dfc4034db5957d9e526ce0cf3d9/ops/testing.py#L697-L702

sed-i avatar Jan 12 '23 15:01 sed-i

Juju actually allows the charm to access (stale?) relation data in relation-broken. However...

I've just tested this using two charms related to one another, each having a relation-broken hook and accessing relation data (from "this" or the "remote" app). The Harness actually behaves fairly differently from real Juju in a number of cases.

Below are the results. The "setter" charm is the charm in the relation that's setting relation data via event.relation.data[self.app]["key"] = value, and the "getter" charm is the charm that's reading relation data.

| charm | which app | Juju result | Harness result |
|--------|-----------|--------------|-------------------------|
| setter | this | {...data...} | {...data...} |
| setter | remote | {} | RuntimeError |
| getter | this | {} | RelationDataAccessError |
| getter | remote | {...data...} | RuntimeError |
  • Both RuntimeErrors have the message "remote-side relation data cannot be accessed during a relation-broken event".
  • The RelationDataAccessError message is "unit/0 is not leader and cannot read its own application databag", and in __repr__ this gets caught and converted to the string <n/a>.
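
The four outcomes can be captured in a small, plain-Python model (an illustration only, with string results standing in for the real databags and ops exceptions):

```python
# Model of the table above: what reading relation data yields during
# relation-broken, per backend, per charm role, per databag accessed.

def read_result(backend: str, charm: str, which_app: str) -> str:
    """backend: "juju" or "harness"; charm: "setter" or "getter";
    which_app: "this" (self.app) or "remote" (event.app)."""
    if backend == "juju":
        # Real Juju allows every read: stale data for the databag that
        # was actually written to, an empty databag otherwise.
        if (charm, which_app) in {("setter", "this"), ("getter", "remote")}:
            return "data"
        return "empty"
    # Harness: only the setter reading its own databag succeeds.
    if (charm, which_app) == ("setter", "this"):
        return "data"
    if which_app == "remote":
        return "RuntimeError"          # remote-side data inaccessible
    return "RelationDataAccessError"   # non-leader reading own app databag

assert read_result("juju", "getter", "remote") == "data"
assert read_result("harness", "getter", "remote") == "RuntimeError"
```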
Raw data from the tests:

To test this (unit tests and real Juju), I hacked my database and webapp charms that I've been using for secrets to add logging into the relation-broken hooks. The database charm is the "setter" and it sets relation data, and the webapp charm is the "getter" as it reads relation data. self_data means data accessed via event.relation.data[self.app] and event_data means remote data accessed via event.relation.data[event.app].

# real Juju

unit-database-0: 15:25:26 INFO unit.database/0.juju-log db:0: database _on_db_relation_broken: <ops.model.Relation db:0> self_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}
unit-database-0: 15:25:26 INFO unit.database/0.juju-log db:0: database _on_db_relation_broken: <ops.model.Relation db:0> event_data={}

unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> self_data={}
unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> event_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}

# database unit tests

database _on_db_relation_broken: <ops.model.Relation db:0> self_data={'db_password_id': 'secret:20a43c1e-41b3-49b2-ba42-6e11ec2cfffb'}
database: exception when getting event data: remote-side relation data cannot be accessed during a relation-broken event

# webapp unit tests

webapp _on_db_relation_broken: <ops.model.Relation db:0> self_data=<n/a>  # masks: exception when getting self data: webapp/0 is not leader and cannot read its own application databag
webapp: exception when getting event data: remote-side relation data cannot be accessed during a relation-broken event

It seems odd that the Harness deviates so much from real Juju, which just allows reads in each case, even if the data is not useful/stale.

I presume we intentionally raise more errors than Juju in tests to try to catch problems early -- for example, the charm probably shouldn't be accessing remote relation data during relation-broken (but Juju lets you). And I'm not sure about the RelationDataAccessError for the case when the "getter" charm tries to read its own data -- that doesn't seem correct.

As to the original issue, it seems like the data is unable to be accessed (in 3 out of 4 cases!). @sed-i, can you post the actual code you were working with when you ran into this? Were you fetching relation data? Was it in the "getter" or "setter" charm? And was it via self.app (this) or event.app (remote)?

In any case, we should decide whether we want to mimic real Juju more closely. Or we should raise an error consistently if you access relation data in relation-broken in all cases. Is there ever a valid use case for doing that? @jameinel, thoughts?

benhoyt avatar Jan 13 '23 03:01 benhoyt

can you post the actual code you were working with when you ran into this?

Here's the unit test that expects the charm lib/Juju to take care of cleaning up relation data. Specifically, the test was expecting that whatever custom events fire as a result of self.harness.remove_relation(rel_id) would not see any relation data. I think in this test it's the remote data that is expected to go away.

Is there ever a valid use case for doing that?

The pattern we have taken in o11y charms, more often than not, is that relation data represents the most up-to-date state. After relation-broken there is no other event that "reruns the charm" with the updated relation data, so relation-broken is the last chance to act on a change. This way, deep charm code doesn't need to know which event it's in (no need for "if the event is relation-broken, then update everything ignoring this relation's data").
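
That holistic pattern can be sketched in plain Python (the dict structures below are simplified stand-ins for ops relation objects, not the real API):

```python
# Holistic reconcile: rebuild the full config from whatever relation
# data currently exists, so deep charm code never needs to know which
# event triggered it.

def build_alertmanager_targets(relations):
    """relations maps relation id -> remote app databag (a plain dict)."""
    return sorted(
        data["url"] for data in relations.values() if "url" in data
    )

# Before the break, two targets; once relation-broken has cleared the
# databag, the same code naturally drops the departed one:
before = {"am:0": {"url": "http://am-0"}, "am:1": {"url": "http://am-1"}}
after = {"am:0": {"url": "http://am-0"}}

assert build_alertmanager_targets(before) == ["http://am-0", "http://am-1"]
assert build_alertmanager_targets(after) == ["http://am-0"]
```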

sed-i avatar Jan 13 '23 06:01 sed-i

I'm probably missing something obvious, but I don't quite understand -- doesn't the above table show that relation-broken on real Juju still includes the previous data? So wouldn't expecting the Harness to do something different mean the unit test will behave differently in unit tests compared to under real Juju?

benhoyt avatar Jan 17 '23 04:01 benhoyt

I'm not sure I understand the table correctly: On relation broken, the remaining app can read the data that was set by the departed app?

sed-i avatar Jan 17 '23 06:01 sed-i

Yeah, that's right -- that's the last row in the table. It shows the "getter" app (i.e., the other charm from the one that set the data) being able to read data that the remote app (the "setter") set. Under real Juju it can read this data, under the Harness you currently get a RuntimeError("remote-side relation data cannot be accessed during a relation-broken event"), which doesn't seem to match reality. (@jameinel any idea why the Harness tries to be different/stricter than reality here?)

You can see this from the following log line:

unit-webapp-0: 15:25:26 INFO unit.webapp/0.juju-log db:0: webapp _on_db_relation_broken: <ops.model.Relation db:0> event_data={'db_password_id': 'secret:cf0c1lrs26oc7aah2260'}

The webapp charm is the "getter" in this case, and it was able to read that data -- during relation-broken -- that the database charm had set.

benhoyt avatar Jan 18 '23 01:01 benhoyt

During relation-broken you should be able to read the data from the unit that was on the other end; if we aren't doing that in Harness, we should allow it. (My only caveat is that relation-departed is really where you should be handling the information; by the time you get to relation-broken, the relation really is gone and you're supposed to be finalizing things, not doing more configuration based on the last thing the remote unit shared with you.)

Note while you can read, you shouldn't be allowed to set data at that point.

I also feel that relation-broken really shouldn't be "just another relation event like all the other ones". Deleting a file on disk is completely different from modifying it or creating it. So the original argument of "I want a catch-all behavior" doesn't really fit for me.

John =:->


jameinel avatar Jan 18 '23 02:01 jameinel

IIUC, this means that the following pattern is wrong:

def _on_relation_departed(self, _):  # or broken
    self._update_config()  # regenerate everything from current rel data

and instead we should do something like:

def _on_relation_departed(self, event):  # or broken
    self._update_config(excluding=event.relation.data)

Is that correct?

In other words, from within relation-departed/broken:

  • is event.relation included in self.model.relations?
  • is event.relation.data included in self.model.relations[x].data?
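
Assuming the broken relation does still appear with its stale data, the exclusion variant can be modelled like this (plain dicts standing in for self.model.relations; not the ops API):

```python
def build_config(relations, excluding=None):
    """Merge databags from all relations except the one going away.

    relations: dict of relation id -> databag; excluding: relation id
    of the relation currently being departed/broken, if any.
    """
    merged = {}
    for rel_id, data in sorted(relations.items()):
        if rel_id == excluding:
            continue  # skip the stale data of the departing relation
        merged.update(data)
    return merged

relations = {"db:0": {"host": "a"}, "db:1": {"host": "b"}}
# During relation-broken for db:1, exclude it explicitly:
assert build_config(relations, excluding="db:1") == {"host": "a"}
```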

sed-i avatar Jan 18 '23 19:01 sed-i

Hi everyone, chiming in on this; is what @sed-i proposed the pattern we're supposed to follow?

lucabello avatar Apr 18 '23 14:04 lucabello

@PietroPasotti just gave me an idea: if we always defer a relation-broken event, then by the next hook (update-status at the latest) there will be no data left in the relation, so charm code could operate on the entire relation data, i.e. without needing to work with the delta that relation-broken implies.

This is not a great pattern, but it conveys our dissonance about relation-broken well.

sed-i avatar Jun 02 '23 14:06 sed-i

FYI, in some cases (but not all), accessing the remote application data in a relation-broken event causes an error.

See: https://bugs.launchpad.net/juju/+bug/1960934 https://github.com/canonical/operator/blob/734e12dcfde93d7081aed5573e011128d98fd84a/ops/model.py#L1341-L1349 https://github.com/canonical/mysql-router-k8s-operator/issues/73

carlcsaposs-canonical avatar Jun 02 '23 18:06 carlcsaposs-canonical

What the Kubeflow team has seen with istio-pilot is like what @carlcsaposs-canonical reports. In live Juju when handling a relation-broken event:

  1. sometimes event.app=DEPARTING_APP (and I think relation.app is the same thing? probably an alias?)
  2. sometimes event.app=None

Up until today, we had only seen case (1) and whenever we saw it, we also knew that the departing application's data is still in the relation data bag. We handled this by popping the departing data before using the data bag.

# (simplified version - differs slightly from the link)
if isinstance(event, RelationBrokenEvent):
    relation_data.pop((event.relation, event.app))

Now that sometimes we see event.app=None, I wonder if we should instead do something like:

if isinstance(event, RelationBrokenEvent):
    try:
        relation_data.pop((event.relation, event.app))
    except KeyError:
        pass  # log and move on?

The one question I have is whether, when event.app is None, we are guaranteed that the departing application's data has been removed from the databag. If not, that will cause trouble, as we can't pop it.
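
A self-contained model of that defensive pop (plain dicts keyed by (relation id, app name) standing in for the charm's relation-data mapping; event.app may be None, as observed):

```python
import logging

logger = logging.getLogger(__name__)

def drop_departing(relation_data, relation_id, event_app):
    """Pop the departing app's databag, tolerating event.app being None."""
    try:
        relation_data.pop((relation_id, event_app))
    except KeyError:
        # Either the databag was never there or event.app was None; log
        # and carry on, trusting Juju already removed the departing data.
        logger.warning("no databag for %r on %s", event_app, relation_id)

data = {("db:0", "database"): {"key": "value"}}
drop_departing(data, "db:0", "database")
assert data == {}
drop_departing(data, "db:0", None)  # KeyError is swallowed
```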

ca-scribner avatar Jun 14 '23 19:06 ca-scribner

@jameinel Per the above and per https://bugs.launchpad.net/juju/+bug/1960934, it seems Juju is sometimes setting JUJU_REMOTE_APP but sometimes not setting it. Do you think that could be fixed on the Juju side? Or I guess we could change Juju to always not set it, but that might be too breaking.

benhoyt avatar Jun 15 '23 01:06 benhoyt

https://bugs.launchpad.net/juju/+bug/2024583

sed-i avatar Jun 21 '23 13:06 sed-i

Also, I think we usually don't want to see the stale data even on relation-departed:

  1. apps foo and bar are related.
  2. bar scaled down from 2 to 1.
  3. foo received relation-departed, but not broken.
  4. foo finishes running with the old data view which included the departing unit, and will only be able to act on the new data view (without bar/2) on update-status.

Real world example:

  1. Alertmanager is scaled from 2 to 1
  2. Prometheus needs to regenerate its yaml config file with only one target under the alertmanagers section.

sed-i avatar Jul 05 '23 17:07 sed-i

This issue is related to a frequent point of friction in charming: reconciling the holistic vs. delta approaches.

  • Relation events give us deltas: unit/x joined or unit/x departed, and now we need to append/pop a section to/from an existing config file.
  • Idempotency and robustness call for the holistic approach: for example, on config-changed after an upgrade we need to iterate over all relations to construct a full config from scratch.

sed-i avatar Jul 06 '23 16:07 sed-i

Need to consider further what to do here. Possibly related to https://github.com/canonical/operator/issues/940 work.

benhoyt avatar Sep 29 '23 02:09 benhoyt

Related: https://github.com/canonical/operator/issues/940#issuecomment-1623559826 (difference in usage between local and remote app data during relation-broken)

carlcsaposs-canonical avatar Oct 02 '23 09:10 carlcsaposs-canonical

This is fixed in ops 2.10, isn't it?

PietroPasotti avatar Feb 09 '24 08:02 PietroPasotti

This is fixed in ops 2.10, isn't it?

I believe the relation data is still accessible—and I think it should be, for the local app/unit databags

carlcsaposs-canonical avatar Feb 09 '24 09:02 carlcsaposs-canonical

@tonyandrewmeyer is going to investigate this further and then we'll make a decision here.

benhoyt avatar Mar 14 '24 03:03 benhoyt

There are three issues covered in this ticket:

JUJU_REMOTE_APP not being set

The Juju team say that this should always be set, so that's how ops behaves. It does seem like that may not always be true, which the Juju team would consider a bug. That needs to get fixed in Juju, rather than ops trying to work around it. The bug is marked as incomplete because it couldn't be reliably reproduced, but there are more recent comments that indicate that maybe it can be now.

As far as I know, Charm-Tech has no "bug slots", so we can't get movement on this. However, I assume the charming teams do have "bug slots", and I'd suggest that one/more of them use some of those to get that ticket moved back into an active status and resolved.

Whether a relation should show in relation-ids when broken

Resolved in this Juju bug (and the earlier ops change that has the same behaviour), for better or worse.

Harness matching Juju behaviour

Note that Scenario doesn't add any additional checks for getting or setting, apart from a Juju version check (>2.7), and a check that probably ought to be an assert (it should never be reached). So the behaviour there should be the same as ops provides with a real Juju backend.

When reading, ops itself checks if Relation.app is None (it's typed to never be) and raises if so. I feel we should remove this check: according to Juju (see the first issue) this shouldn't happen, and it seems cleaner to just let Juju provide an error if there should be one (but again, that shouldn't be the case). That would mean that instead of relying on this for the check if it's a peer relation, we would just assume that if app is None, then it's not a peer relation (this seems better than having ops raise an AttributeError if app isn't set).

When reading, ops also checks if the relation app name matches the local app name and if so (and the unit isn't the leader) raises. I feel we should remove this check too: it again seems cleaner for Juju to provide the error, like:

$ juju exec --unit provider-charm/6 -- relation-get -r 22 - dummy --app
ERROR permission denied

We could keep this backwards compatible by checking for "permission denied" and raising RelationDataAccessError in that case rather than a generic ModelError. We'd have to move the check into Harness in this case, though, and Scenario would also need to learn to do this check. On balance, I think it would have been the better choice originally, but doesn't have enough benefit to do now.
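
A hypothetical sketch of that translation (relation_get here is a stand-in callable, not the real ops backend, and the exception classes are simplified):

```python
# Let Juju enforce permissions, then translate its "permission denied"
# into the existing ops exception for backwards compatibility.

class ModelError(Exception):
    pass

class RelationDataAccessError(ModelError):
    pass

def read_app_databag(relation_get, relation_id, app):
    """relation_get: a callable mimicking the relation-get hook tool."""
    try:
        return relation_get(relation_id, app)
    except ModelError as e:
        if "permission denied" in str(e):
            raise RelationDataAccessError(
                f"cannot read application databag for {app}"
            ) from e
        raise

def fake_backend(relation_id, app):
    # Mimic: juju relation-get -r 22 - dummy --app  ->  permission denied
    raise ModelError("ERROR permission denied")

try:
    read_app_databag(fake_backend, 22, "provider-charm")
except RelationDataAccessError:
    pass  # translated as intended
```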

When writing, ops again duplicates the Juju logic (can't write to a different unit, can't write to a remote app databag, can't write to the local app databag unless leader). I again feel that we would have been better just letting this logic live in Juju (a failure would mean an extra subprocess call, but failures ought to be rare and a subprocess call is inexpensive compared to most charming work), but that ripping it out now doesn't really offer a significant benefit.

When reading, Harness only checks that the relation is in the relations that have been added. Outside of events (and __init__, as of next ops) some of the run-time checks are also disabled.

When writing, Harness has the extra "is this relation-broken" check. We should drop that - there's already separate validation that a unit can't write to the app databag for a different app, and Juju allows writing to the local app databag (if leader) in relation-broken, even though that's a strange thing to do.

tonyandrewmeyer avatar Jun 04 '25 02:06 tonyandrewmeyer

As far as I know, Charm-Tech has no "bug slots", so we can't get movement on this. However, I assume the charming teams do have "bug slots", and I'd suggest that one/more of them use some of those to get that ticket moved back into an active status and resolved.

Sinan commented in Matrix that Charm-Tech could actually push this and that he'd have the Juju team look into it 🎉

When reading, ops itself checks if Relation.app is None (it's typed to never be) and raises if so. I feel we should remove this check: according to Juju (see the first issue) this shouldn't happen, and it seems cleaner to just let Juju provide an error if there should be one (but again, that shouldn't be the case). That would mean that instead of relying on this for the check if it's a peer relation, we would just assume that if app is None, then it's not a peer relation (this seems better than having ops raise an AttributeError if app isn't set).

I'm wondering if this is actually worth doing or not (and there's been a little push-back from posting about it in Charm Development in Matrix).

A lot of cases already work even if JUJU_REMOTE_APP is missing. If the relation is a peer, then the app is always set because there isn't a 'remote' app. If the relation is from self.model.relations then the app typically comes from relation-list, similarly with self.model.get_relation - we have quite a lot of fallbacks.

It seems like changing the behaviour here risks introducing a subtle behaviour change that will cause existing charms problems, and it's all to patch over a Juju bug that we hope will get fixed anyway. So leaving in the existing bandage is probably best.

(We still want to fix the Harness relation-broken check).

tonyandrewmeyer avatar Jun 04 '25 22:06 tonyandrewmeyer