iris hexdigest of cubes with lazy data

📰 Custom Issue

The iris.common.metadata.hexdigest was made part of the public API in #4020. This is a useful feature, but may have confusing interaction with cubes with lazy data - a different hash will be returned for exactly the same data, depending on whether it has been realised or not. It may be desirable to have an option in the call to hexdigest that is independent of the state of the underlying data.

Feb 23 '21 15:02 jamesp

I think it might be incorrect / unhelpful for hexdigest to automagically realise lazy data that's given to it.

Personally, I think the onus of responsibility should be explicitly with the caller.

To start the conversation...

How about offering something like hexdigest(item, native=True) by default, which will calculate the non-cryptographic hash of the object passed to it "as is" i.e., if the object is lazy, then the hash is of that lazy object, and if the object is real, then the hash is of that real object.

If native=False, then hexdigest will realise any lazy data associated with the object (if it is indeed lazy) and return the hash of that real data. Note that, the actual item will still remain lazy outside hexdigest, however the lazy compute will be performed, which may ultimately be very expensive... but that's what the caller explicitly wanted.

Note that, some clarity is required here.

If the item passed is a richer data type, like a Cube or a DimCoord that has lazy payload, then hexdigest shouldn't have the smarts to determine whether it can identify and realise any lazy internals before computing the hash - I'd consider that completely out of scope here. Is that unreasonable? What do you think?

@jamesp What did you imagine when you proposed this?

Context: The original use case where hexdigest is used internally within iris is to compute the hash of attributes dictionaries on a Cube or coordinate. The hash is used to compare two dictionaries for equivalence... the reason why a hash is computed is because this is a cheaper way to handle comparing dictionaries that may contain numpy arrays or other compound data structures; the win is that hashing doesn't need to understand the internals of the dictionaries to do its hash. Oh, and the hash compute is very fast.

Feb 25 '21 00:02 bjlittle

I think it might be incorrect / unhelpful for hexdigest to automagically realise lazy data that's given to it.

Agree. I don't know if you've defined a set of rules for actions where iris should realise data, but this definitely doesn't feel like it fits in that category.

In your context the use-case is for checking equivalence, which is likely the use case for most users. The question is, should hexdigest(object) === hexdigest(object.realise_data())? The I don't think it's necessarily a bad thing if this isn't the case, but it should be explicit that this method can't be relied on for equivalence when working with lazy data. Many users won't even know they are working with lazy data (and there is probably an argument that those users likely won't also be using hexdigest). The native=True switch seems like a good option.

Feb 25 '21 09:02 jamesp

We should at least highlight the lazy vs real issue in the doc-string so that users are aware, for sure :+1:

I'm thinking what we actually need here is something like hexdigest(item, native=True) and hexdigest_compare(item, other), where hexdigest_compare is explicitly used for comparing hashes between two objects, returning bool, and ensures that if either is real or lazy, makes both real and then compares, otherwise compares real with real, and lazy with lazy.

What about that? Seems obvious and clean...

Feb 25 '21 10:02 bjlittle

Also, I think hexdigest and hexdigest_compare potentially cover all use case permutations i.e., the corner case is where the user wants to have a hash failure that highlights when a real and lazy object are compared e.g., hexdigest(a) == hexdigest(b) is False because a is real and b is a lazy version of a, whereas hexdigest_compare(a, b) is True... if you know what I mean.

Feb 25 '21 11:02 bjlittle

yep, sounds like a good setup 👍

Feb 25 '21 13:02 jamesp

@jamesp Fancy taking it on? :wink:

If you don't I will... just thought I'd give you first dibs :smile:

It's fine either way, don't feel obliged :+1:

Feb 25 '21 14:02 bjlittle

I will! Probably not this week tho... :)

Feb 25 '21 15:02 jamesp

Thanks! Nice one fella :partying_face:

There isn't a rush on this, but it's nice to be ahead of the curve for once... having the functionality before someone asks.

Thanks for raising the issue. Awesome stuff!

Feb 25 '21 15:02 bjlittle

In order to maintain a backlog of relevant issues, we automatically label them as stale after 500 days of inactivity.

If this issue is still important to you, then please comment on this issue and the stale label will be removed.

Otherwise this issue will be automatically closed in 28 days time.

Feb 18 '24 00:02 github-actions[bot]

This stale issue has been automatically closed due to a lack of community activity.

If you still care about this issue, then please either:

Re-open this issue, if you have sufficient permissions, or
Add a comment stating that this is still relevant and someone will re-open it on your behalf.

Mar 17 '24 00:03 github-actions[bot]