Bloom filters?
Bloom filters might be able to speed things up substantially, but this needs thorough investigation.
I think Hadoop has 3 implementations of Bloom filters. But otherwise, where do you want to plug them into Hank?
My initial thinking is to put a small Bloom filter in each Cueball file that can be loaded on startup. Then, when serving a request, we can check the filter first and decide whether we need to do any disk access at all.
I'm also wondering if it makes sense to have one small Bloom filter for each Cueball block, rather than one big filter for all the blocks. There might be a benefit in hashing only the portion of each key that is not already used for partitioning and block positioning.
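For illustration, here is a minimal sketch of the "check the filter first" idea. It is not Hank or Cueball code: `SimpleBloomFilter`, `FakeStore`, the filter size, the hash count, and the hash functions are all assumptions made for the example, and an in-memory map stands in for the on-disk data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

class BloomGateSketch {

    /** Minimal Bloom filter; probe positions come from double hashing (h1 + i * h2). */
    static final class SimpleBloomFilter {
        private final BitSet bits;
        private final int numBits;
        private final int numHashes;

        SimpleBloomFilter(int numBits, int numHashes) {
            this.bits = new BitSet(numBits);
            this.numBits = numBits;
            this.numHashes = numHashes;
        }

        void add(byte[] key) {
            int h1 = fnv1a(key);
            int h2 = Arrays.hashCode(key) | 1; // force odd so the stride is never zero
            for (int i = 0; i < numHashes; i++) {
                bits.set(Math.floorMod(h1 + i * h2, numBits));
            }
        }

        /** False means the key was never added; true means it may have been. */
        boolean mightContain(byte[] key) {
            int h1 = fnv1a(key);
            int h2 = Arrays.hashCode(key) | 1;
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                    return false;
                }
            }
            return true;
        }

        /** 32-bit FNV-1a, chosen only because it is short; any decent hash would do. */
        private static int fnv1a(byte[] data) {
            int hash = 0x811C9DC5;
            for (byte b : data) {
                hash ^= (b & 0xff);
                hash *= 16777619;
            }
            return hash;
        }
    }

    /** Hypothetical stand-in for a file on disk; counts simulated disk reads. */
    static final class FakeStore {
        private final Map<String, String> onDisk = new HashMap<>();
        private final SimpleBloomFilter filter = new SimpleBloomFilter(1 << 16, 4);
        int diskReads = 0;

        void put(String key, String value) {
            onDisk.put(key, value);
            filter.add(key.getBytes(StandardCharsets.UTF_8));
        }

        String get(String key) {
            if (!filter.mightContain(key.getBytes(StandardCharsets.UTF_8))) {
                return null; // definitely absent, so skip the disk entirely
            }
            diskReads++; // possibly present (or a false positive): pay for the read
            return onDisk.get(key);
        }
    }
}
```

A per-block variant would keep one such filter per block and feed it only the key bytes not already consumed by partitioning and block positioning; which bytes those are depends on the Cueball layout and is not shown here.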
On Sat, Apr 2, 2011 at 8:43 AM, gsharma <[email protected]> wrote:
> I think Hadoop has 3 implementations of bloom filters. But where are you wanting to plug them in hank?
I might be able to take this up in a little bit and investigate the performance of the two scenarios:
- small Bloom filter for each Cueball block
- single Bloom filter for all blocks
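As a starting point for that comparison, here is a rough sizing sketch based on the standard Bloom filter formulas m = -n ln(p) / (ln 2)^2 for the bit count and k = (m / n) ln 2 for the hash count. The key count, block count, and target false-positive rate are made-up numbers, not measurements from Hank.

```java
final class BloomSizingSketch {

    /** Optimal number of bits for n keys at false-positive rate p. */
    static long optimalBits(long n, double p) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-n * Math.log(p) / (ln2 * ln2));
    }

    /** Optimal number of hash functions for m bits and n keys (at least 1). */
    static int optimalHashes(long m, long n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2)));
    }

    public static void main(String[] args) {
        long totalKeys = 10_000_000L; // assumed key count
        long blocks = 10_000L;        // assumed number of blocks
        double p = 0.01;              // assumed target false-positive rate

        long singleBits = optimalBits(totalKeys, p);
        long perBlockBits = optimalBits(totalKeys / blocks, p);

        System.out.printf("single filter:     %d bits, %d hashes%n",
                singleBits, optimalHashes(singleBits, totalKeys));
        System.out.printf("per-block filters: %d bits each, %d bits total%n",
                perBlockBits, perBlockBits * blocks);
    }
}
```

Because the optimal bit count grows linearly with the number of keys, both layouts need about the same total memory for the same false-positive rate when keys are spread evenly across blocks. Under that assumption, the comparison would mostly come down to hashing cost and memory locality rather than filter size.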