Efficient range operations over multiple Set objects
I am building an application that generates new FST objects every hour. I would like to be able to do efficient range operations over a group of these FST files. Conceptually, something like combining the __getitem__ implementation with the OpBuilder. Basically, returning a KeyStreamIterator for multiple Set objects' range operation (pseudo code:
s1 = Set.from_iter(["bar", "baz", "foo", "moo"])
s2 = Set.from_iter(["bing", "jar, "foo", "moo"])
s2 = Set.from_iter(["bap", "bonk, "foo", "moo"])
(s1 + s2 + s3)["b":"f"]
The existing API seems cumbersome for doing operations across multiple. You have to pick a (possibly arbitrary) instance, then call .union / .intersection etc with the rest of the objects. Would it be possible to have some sort of MultiSet class that could perform efficient calls across multiple Set objects?
I can take a crack at implementing it; it appears the underlying Rust API would support this.
it appears the underlying Rust API would support this
It does indeed! Ping me if anyone needs help with this.
@BurntSushi : can you give me an example of doing this in Rust? I can probably figure out how to translate it to Python/cffi. I'm wondering if we'd to add something to the C layer for this.
@davidblewett I believe this is what you're after? https://docs.rs/fst/0.3.0/fst/#example-searching-multiple-sets-efficiently
(That example does a little more than what you're asking, by applying a regex search. But you can just remove that bit.)
@BurntSushi : I was looking at that, but the regex bit did indeed throw me off. I was looking more closely at OpBuilder; it appears to aggregate calls across multiple FST objects, but I don't see one for range. The range operations appear to use a StreamBuilder, which appears to only be offered by a single Set object?
For example:
let mut stream = set.range().ge("b").lt("e").into_stream();
I don't see a corresponding .range() method for OpBuilder?
@davidblewett You have to apply the range to each set, e.g.,
let (set1, set2, set3) = ...;
let mut stream = OpBuilder::new()
.add(set1.range().ge("b").lt("e"))
.add(set2.range().ge("b").lt("e"))
.add(set3.range().ge("b").lt("e"))
.union();
Got it. Will fiddle with it.
@jbaiter : it seems the C interface could be simplified a lot if instead of functions specifically for set::Difference / set::Intersection / set::Union , they were defined to take set::Stream? I'm going to have to add a version of the fst_set_opbuilder_* / free /next functions anyway for that type.
Are there some sort of safety guarantees in retaining the more specific versions?
@BurntSushi : I tried in jbaiter/python-rust-fst#10 , but hit a probably dumb Rust mistake on my part.
@BurntSushi would you mind taking a look at https://github.com/jbaiter/python-rust-fst/pull/10 ? I'm not sure how to fix that compiler error.
@davidblewett Ah yup! Apologies. I missed your previous ping. :-)