beanmachine icon indicating copy to clipboard operation
beanmachine copied to clipboard

An efficient persistent immutable map

Open wtaha opened this issue 4 years ago • 3 comments

Summary: I've often said that a big part of writing a compiler is implementing special-purpose hash tables; fortunately this is the first one I've had to make for Beanstalk.

I wish to do some experiments with building a map from rewritten AST nodes back to the original AST nodes; it will be helpful to have an immutable persistent dictionary for these experiments. Immutable because adding a new entry produces a new dictionary without mutating the old one. Persistent because for large dictionaries, the new dictionary can share most of the state of the old one, rather than doing an O(n) copy of the whole thing.

In this diff I've built an insert-only Hash Array Mapped Trie or HAMTrie for short, based on the data structure described in Phil Bagwell's paper "Ideal Hash Trees".

A trie is an n-ary tree typically used to store things like word lists. Each edge represents a portion of the word. For example we might have a 26-ary tree where the root has an edge "I" pointing to a node with edges "S" and "T", representing the words IS and IT. In a HAMTrie each node has up to 32 outgoing edges representing 5 bits of the hash value of the key.

Since most of those 32 potential edges are going to be nil most of the time, we'd like nodes to have the property that empty entries are "free". I therefore first implement Map32, a sparse map from the numbers 0-31 to Any; the map stores an n-ary tuple of values where n is the number of valid entries, and an integer bitmap indicates which entries are valid. We then just need some simple bit counting to turn integer keys into the actual offset into the values tuple.

Once we have that map, implementing the HAMTrie on top is relatively straightforward. Each node in the HAMTrie has a Map32 indexed by five bits of the key's hash. The Map32 entries are:

  • a (key, value) pair, if this is the only key in the map with these 5 bits of hash
  • a [(key, value), ...] list in the unlikely event that there are two or more unequal keys agreeing on all hash bits; hopefully this never happens.
  • a child HAMTrie, if there are two keys that agree on these five hash bits, but disagree on some other hash bits

If hashes are well distributed then we should have O(lg n) insertion and lookup time. Inserting a new item to produce a new dictionary should allocate O(lg n) new storage and re-use the rest; we only rebuilt the "spine" of the tree every time.

Differential Revision: D33521060

wtaha avatar Jan 12 '22 19:01 wtaha

This pull request was exported from Phabricator. Differential Revision: D33521060

facebook-github-bot avatar Jan 12 '22 19:01 facebook-github-bot

(Not sure if relevant/useful) there are some existing libraries implementing such maps, see https://github.com/sfermigier/awesome-functional-python#immutable--persistent-data-structures

sdementen avatar Jan 24 '22 22:01 sdementen

Hi @wtaha!

Thank you for your pull request.

We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

facebook-github-bot avatar Apr 20 '23 13:04 facebook-github-bot