attrs icon indicating copy to clipboard operation
attrs copied to clipboard

Add a more idiomatic way to define hashing

Open hynek opened this issue 8 years ago • 15 comments

The current way of using True/False/None is confusing even to me.

I propose we use an Enum (probably just class on legacy Python) a la:

class Hashing(Enum):
    SMART = "smart"  # None → new default
    BY_ID = "by_id" # False
    WRITE_FOR_ALL = "write_for_all" # True
    WRITE_FOR_NONE = "write_for_none" # True, but the default on attr.ibs is False

The last two names are terrible of course.


I probably could be talked into depending on https://pypi.org/project/enum34/ on legacy Python.

Opinions?

P.S. Now while typing this out I realized that this could solve a long-standing complaint both I and many users have: assuming you want an attrs-generated method based on a small subset of attr.ibs, you have to write a lot of hash|init|repr=False. Ditching bools for enums would allow for that. 🤔 But let’s talk about that in a different ticket.

hynek avatar Jun 02 '17 09:06 hynek

We should definitely do something to make the hashing behavior clearer.

I'm currently working on a PR to "fix" slots hashing behavior, and attr.s(hash=False) is really throwing me off.

The docs say nothing about what should happen when hash=False.

My personal understanding would be for attrs to do exactly nothing, we're saying do not touch __hash__. However this is complicated. I will demonstrate.

Assume a very simple class, with hashing disabled:

@attr.s(hash=False)
class A:
    a = attr.ib()

The equivalent of this class is: https://gist.github.com/Tinche/d8c1dadfaa3aa4618b736a61e752b3e1

Try pasting them both. The attrs one will be hashable, due to falling back to __object__.__hash__. The normal one will not be.

This is because in the normal class, the interpreter will see the __cmp__ and insert __hash__ = None. In the attrs class, the __cmp__ method isn't there when the class is defined; we put it into the class after the fact. So the interpreter won't touch __hash__ and it will fall through to object.

This is the root cause of why our slots behavior is different.

Tinche avatar Jun 05 '17 14:06 Tinche

Yeah so that’s settled.

Ideas for better names for WRITE_FOR_ALL/NONE? @glyph? @njsmith?

P.S. the replacement/slots thingie explains a lot of my own confusion.

hynek avatar Jun 06 '17 11:06 hynek

What functionalities are we aiming for exactly?

I'm guessing:

  • make hashable by generating a hash function based on attributes
  • make hashable, but attributes are opt-in instead of opt-out
  • make hashable by using identity-based hashing
  • make unhashable
  • delegate to an existing self.__hash__, whatever that may be

(edit) Also:

  • smart

Tinche avatar Jun 06 '17 20:06 Tinche

I don't really understand the original post, but regarding the cases that need to be supported: I think it helps narrow things down to remember that for any given __eq__ there are exactly two valid ways of defining __hash__. Either it must include exactly the same things that go in the __eq__ calculation, or else you have to set __hash__ = None. This is just a fact about the language; if you violate it then dicts stop working.

So if attrs is generating its own __eq__, then the only meaningful decision is "do you want this object to be hashable?", and right now the three answers are "yes" (True), "no" (False), and "guess" (None, meaning yes for immutable objects and no otherwise). (Except apparently the "no" case is buggy? I'm pretty sure that the intention was for cmp=True, hash=False to make an unhashable class.) And then on a per attribute level you can also say yes/no/guess, but I can't think of any time that "guess" is wrong; the other options just give you interesting ways to break your class.

And if attrs isn't generating its own __eq__, then not sure why it would ever want to generate a __hash__. If the user is taking responsibility for __eq__ then we should stay out of the way and give them the regular python behavior imo.

njsmith avatar Jun 06 '17 21:06 njsmith

That said, I can see the case for having nicer shorthands for controlling which attribs go into the __eq__ calculation, like toggling it between opt-in and opt-out. But there's no reason to give __hash__ a separate set of toggles here; it should follow what __eq__ does.

njsmith avatar Jun 06 '17 22:06 njsmith

Actually, you specifically asked for being able to fall back to hashing by ID – that’s why it’s in there. :) And I think that’s a great and useful feature, so I’m not trying to push the blame in your direction – just reminding. :)

The point of this ticket isn’t really to invent new behaviors, it’s just making the existing ones more understandable. Especially because one of the happened by accident (see also #205).

But maybe…how about this:

class Hashing(Enum):
    SMART = "smart"  # None → new default
    UNHASHABLE = "unhashable"  # same as None for unfrozen
    BY_ID = "by_id" # False
    BY_ATTRS = "opt_out" # True
    BY_ATTRS_OPT_IN = "opt_in" # True, but the default on attr.ibs is False

hynek avatar Jun 07 '17 05:06 hynek

Falling back to hashing by id is definitely useful! If, and only if we are also falling back to doing __eq__ by id. That falls under my line about "If the user is taking responsibility for __eq__ then we should stay out of the way and give them the regular python behavior"

njsmith avatar Jun 07 '17 08:06 njsmith

Yeah so the fallback kind of happened by accident (cf. Tin’s explanations) but this time we have the chance to fix it properly including the deprecation year if we switch to enums.

hynek avatar Jun 07 '17 09:06 hynek

The thing where cmp=True, hash=False ends up with a generated __eq__ combined with __hash__-by-id just sounds like a straightforward bug to me. It just makes it so you can have two objects that compare equal but if you put them both into the same set then it's literally random whether you end up with 1 object added or 2 objects added.

Beyond that I'm mostly just kinda lost about what all these enums are supposed to be or what the overall goal is here. I think I showed up in the middle of the conversation :-)

njsmith avatar Jun 07 '17 09:06 njsmith

So I guess we have to break bw-compat again and set __hash__ to None by hand if __eq__ is set? :|

hynek avatar Jun 12 '17 10:06 hynek

So I guess we have to break bw-compat again and set hash to None by hand if eq is set? :|

Woah there, Tex. Ease your finger up off that trigger :-).

In this case I'm not convinced a breaking change is necessary, because there's no alternate behavior to suggest: a warning would be sufficient. The user has explicitly asked for nonsense, and if the nonsense appears to work today then it's fine to deprecate it for a while before raising an error.

glyph avatar Jun 12 '17 17:06 glyph

Well the problem (which Tin explained elsewhere, I'm gonna add here for posterity) is: normal Python class behavior is setting dunder hash to None, if there's an Dundee eq. If not, it inherits object's by-id semantics. Now our eq is added after python makes this decision (except when using slots) so object's hash remains.

If slots is true it behaves differently because we build the class differently and it becomes unhashable.

IOW

  • we have conflicting behavior between slotted and not,
  • and hash=False doesn't actually mean what it looks like (Don't do anything about hash.) because of a weird side-effect.

Dunno. :|

hynek avatar Jun 12 '17 19:06 hynek

Now our eq is added after python makes this decision (except when using slots) so object's hash remains

Oh, that's deeply unfortunate. I was wondering if the modifying-the-class-after-the-fact thing would come to bite us at some point; looks like it has. Perhaps we should be throwing away the first type and replacing it with a fresh one? (Although that's probably some even worse / more horrible compatibility break than this...)

glyph avatar Jun 12 '17 22:06 glyph

and hash=False doesn't actually mean what it looks like (Don't do anything about hash.) because of a weird side-effect.

Focusing first on getting the exact right behavior with the new, good enum would be the right way to prioritize this. If we need another compat break (and gosh I hope not) then that's a separate thing.

glyph avatar Jun 12 '17 22:06 glyph

JFTR, I started to work on this and ran into another “interesting” fact: the behavior Nathaniel described is specific to Python 3. Python 2 will happily hash by ID if there's a __eq__.

hynek avatar Oct 06 '17 14:10 hynek