OpenImageIO feat(ustring): ustring hash collision protection

The gist is that the ustring::strhash(str) function is modified to strip out the MSB from Strutil::strhash. The rep entry is filed in the ustring table based on this hash. So effectively, the computed hash is 63 bits, not 64.

But rep->hashed field consists of the lower 63 bits being the computed hash, and the MSB indicates whether this is the 2nd (or more) entry in the table that had the same 63 bit hash.

ustring::hash() then is modified as follows: If the MSB is 0, the computed hash is the hash. If the MSB is 1, though, we DON'T use that hash, and instead we use the pointer to the unique characters, but with the MSB set (that's an invalid address by itself). Note that the computed hashes never have MSB set, and the char*+MSB always have MSB set, so therefore ustring::hash() will never have the same value for two different ustrings.

But -- please note! -- that ustring::strhash(str) and ustring(str).hash() will only match (and also be the same value on every execution) if the ustring is the first to receive that hash, which should be approximately always. Probably always, in practice.

But in the very improbable case of a hash collision, one of them (the second to be turned into a ustring) will be using the alternate hash based on the character address, which is both not the same as ustring::strhash(chars), nor is it expected to be the same constant on every program execution.

Jul 21 '24 18:07 lgritz

Comments @cmstein @sfriedmapixar @etheory ?

Aug 07 '24 04:08 lgritz

Maybe @chellmuth too?

Aug 07 '24 04:08 lgritz

I'm OK with any of these options as long as collisions are handled. I do agree with @sfriedmapixar about the simplicity of debugging if the top bit is 1 for a hash and 0 for a string though.

Also remember that on every current system, the bottom few bits are zero due to alignment, and the top few are zero due to memory allocation guarantees: https://stackoverflow.com/questions/46509152/why-in-x86-64-the-virtual-address-are-4-bits-shorter-than-physical-48-bits-vs which means you can often rely on the top 12 to 16bits also being zero. Keep that in mind when you do any of these tricks. We use these bits in glimpse for storage of data, so it's quite useful.

Aug 08 '24 05:08 etheory

Something that has been nagging at my mind... I know that the MSB will always be 0 for a real address on x86, but do we feel confident that this will also be true for ARM, RISC V, or others that may come along?

I also considered using the LSB for this, also safe because the allocations are always aligned, the characters will never start on an odd address, so if we "|1" the hashes to set the low bit, then that's also a simple way to distinguish pointers from hashes and will be true of any future architecture.

Thoughts?

Aug 08 '24 06:08 lgritz

The reason I really like empty string to be all 0 bits (and thus thought that must mean that it's pointers, not hashes, with the extra bit set) is so we can memset to 0 a region and know that we've made valid (empty) ustrings or ustringhashes.

Aug 08 '24 06:08 lgritz

I also considered using the LSB for this, also safe because the allocations are always aligned, the characters will never start on an odd address, so if we "|1" the hashes to set the low bit, then that's also a simple way to distinguish pointers from hashes and will be true of any future architecture.

Setting the LSB hash will reduce the hash quality in a risky way. Suppose for instance someone did: my_array[hash%my_array.length()] where the length is 8 (it could be any size, but this is simple to show). The hash%8 will result in just the three lowest signed bits being used. But note that your proposal has the LSB at 1, so now we just have 2 bits in play and so only half the array will be referenceable.

Aug 08 '24 22:08 ThiagoIze

Right, gotcha. That's a good reason to stick with MSB.

It's ok, we'll have asserts that no real address has the high bit set. We'll have faith that we'll catch it right away if anybody ports to a hardware platform that might use that part of the address space.

Aug 08 '24 23:08 lgritz

OK, in my tree (not yet pushed here) I recoded it all to make the MSB 1 for "original" hashes, and 0 for rehashes (which are replaced by the char ptr as the return value for ustring.hash()). I believe it's all working, but there is a tradeoff involved that I'm not entirely happy with. Let me review the issues:

In the table (invisible to the user), each entry has a hashed field, and the chars themselves. The MSB of the hashed field reveals if this is the first entry with that hash (used to be 0, now is 1 in my new draft), or if it is a subsequently encountered string that had the same 31 bit hash. The public hash() function returns rep->hashed if it's a first string entry having that hash, or the char address if it's a duplicate hash.

A property we've always had that I think is very important to retain is that both ustring and ustringhash are represented by a true 0 value for empty strings. In other words, it's special -- the ptr is null and the hash should be 0. This is what lets us zero out a block of memory and know that it's initializing any ustring or ustringhash therein to be valid, and represent an empty string.

Now then, the way I did it first in this PR trivially preserved that property, because MSB was 1 only for duplicate hashes, so the empty string is canonical and hashes to 0. The alternate way, where it would "want" to be a MSB 1 for canonical hashes, requires special handling of the empty string case in a variety of places. I think this is a little confusing and error-prone, and certainly has a runtime cost for the extra testing and branching.

@ssteinbach and others pointed out that it's nice for the pointers to have the MSB=0 rather than =1, because then you can just cast it to a pointer to get to the characters, especially helpful when debugging.

But here's where I want to push back a little: Is it, really? The only case where you'd get a hash that's a pointer is for a 2nd string that had the same hash. This should be exceptionally rare, so it's not something that would come up... almost never... while in the debugger looking at a ustringhash. Usually you'd just see the hash, so there's not a direct way to get to the chars, so it doesn't matter if MSB is 0 or 1 for that case. (And if we set MSB on the pointer, that just means to get to the chars, you'd need to mask it... awkward, but how often does that really come up? You could also just CALL -- yes, in the debugger -- ustringhash.c_str() to get to the chars.)

So here's the crux of it:

MSB=1 for duplicate hash:

PRO: no special handling of empty string case to make the hash 0
CON: "pointer substituted for duplicate hash" needs to have MSB masked off to get to the chars, which is painful in the debugger but almost never happens.

MSB=0 for duplicate hash:

PRO: duplicate hashes trivially castble to pointers, easy in the debugger, but this almost never happens.
CON: extra operations and potentially a branch to ensure ustringhash("") == 0, in several places that actually execute frequently.

Thoughts? Are people willing to pay (slightly) in runtime perf to have this property of trivial cast of duplicate hashes to pointers in the debuger, that you'll almost never encounter in real life?

Aug 09 '24 22:08 lgritz

You know what, add a static function to do the cast in the debugger and you solve all the issues. Should be straight forward enough. Then you can just call the correct function in the debugger directly. Just document it clearly and that should resolve the issue no?

Aug 10 '24 05:08 etheory

You know what, add a static function to do the cast in the debugger and you solve all the issues.

Isn't that just the existing ustringhash.c_str()?

Aug 10 '24 14:08 lgritz

You know what, add a static function to do the cast in the debugger and you solve all the issues.

Isn't that just the existing ustringhash.c_str()?

If you only have the pointer for some reason, you cannot call the function on it (though I guess you could cast it....). My point is that if you provide a free function to convert the bits for display/conversion, then you can just call it at any time in a debugger in any situation. Whether you have the object reference or not. If you can just cast the address and call the function, then sure, it's fine as is.

Aug 12 '24 01:08 etheory

If you have the pointer, you already have the pointer.

If you have a hash value h as a ustringhash, you get to the chars with h.c_str().

If you have a hash value v as a uint64, you get to the chars with ustringhash(v).c_str().

Aug 12 '24 05:08 lgritz

Does my last answer assuage concerns, or do we need an additional utility function?

Aug 15 '24 17:08 lgritz

Yep all good thanks @lgritz - sorry for the slow turn around, I was on leave.

Aug 17 '24 14:08 etheory

@sfriedmapixar you had asked about the stab I took at swapping the bit to the other way as you originally suggested. If you're curious, it's here: https://github.com/lgritz/OpenImageIO/tree/lg-ustringhash2 (just one additional commit on top of what I have here in this PR)

I like this PR here better, as it avoids some extra logic (both potentially perf impacting, but maybe more importantly, error prone and easy to forget) that is necessary to special case empty string handing.

Let me know if you feel really strongly about doing it the other way. If you're ok with this, though, I'd like to merge it.

Aug 25 '24 20:08 lgritz