InternedStrings.jl icon indicating copy to clipboard operation
InternedStrings.jl copied to clipboard

Consider pros and cons of alternative data structure for pool

Open xiaodaigh opened this issue 8 years ago • 6 comments

Currently pool is a WeakRefDict. Wondering if it can be made into something else to allow for more operations to be more efficient e.g. sorting/grouping. A B-tree perhaps?

xiaodaigh avatar Jan 22 '18 04:01 xiaodaigh

You don't perform operations on the bool. It's not user-facing, and it is not safe for users to touch.

It needs exactly one operation: intern!, which isn't really related to sorting or grouping. I guess that operation could be called getkey!. It needs to check for the set containing an element equal the input, and if so return that element, if not add the input, and return it.

The pros and cons of alternative data structures are not around making any other operations efficient. So here they are:

Custom Hashfunction hashset/dict

Pro:

Could be faster than current implementation

Con:

Work to implement basically from ground up

TreeSet/dict

Pro

Comparisons could be faster than current hash function

Con

Gets slower the more elements it contains. Technically so does Hashmap, but this is worse. Slowness gets worse if strings share common prefixes.

Trie

All pros and cons of TreeSet/dict, but additionally

Pro

Uses less memory in storage

Con

Big Con: I can't workout away to actually pull this off that actually results in being able to return strong references to its contents, since it gains those memory savings by discarding the original strings and only storing parts. I guess maybe if we had a RopeString class still, but then the strings themselves would be slow ...

There are a few other finite-automata-ish string storing structures with the same kinda problems as tries but perhaps more so. Since a set of strings can be reduced down into definite a finite-automata that only accepts strings from that set. But I think that line of reasoning is a deadend of slowness

oxinabox avatar Jan 22 '18 05:01 oxinabox

Yeah, but if the pool is stored in a data structure such that it's easy to obtain each element's rank within the pool then one can sort an InternedStrings vector quite fast using counting sorting. Suppose there are only 3 possible elements "a", "b", "c" and so their rank is 1,2,3. We can scan through the vector and keep a count of the rank, and then use the counts to sort the vector using the same algorithm as counting sort. I don't think this is currently possible.

Nice properties for the pool data structure are easy to look up, easy to insert, easy to delete, and easy to find rank

xiaodaigh avatar Jan 22 '18 06:01 xiaodaigh

Rather than talking about properties of a pool do you want to take a step outwards and talk about properties of the InternedString

Right now, the property of the interned string is:

  • very fast equality checking
  • use less memory

You are suggesting adding:

  • very fast < / > checking

Right?

oxinabox avatar Jan 22 '18 07:01 oxinabox

Yeah

xiaodaigh avatar Jan 22 '18 10:01 xiaodaigh

Due to the removal of the InternedString type this is not directly possible anymore, While there is still some value in the consideration, and perhaps we could definate some kinda of order(a,b)= getind(intern(a)) < getind(intern(b)) type deal. It is looking less likely.

I'm so dubious about the using of Tries for performance or for memory decreasing, for most use cases. But like maybe it could be a thing. It might actually be faster than hashing I guess.

oxinabox avatar May 05 '18 15:05 oxinabox

I used to use B+ trees for interning (this was for huge tables, millions of entries, the B+ trees were buffered in shared memory, with TB of disk). It worked very well for us (and this was for NLP work), a nice thing was being able to get ordered output fast (at least, by Unicode codepoints), made finding prefixes trivial.

ScottPJones avatar May 09 '18 12:05 ScottPJones