python-rust-fst icon indicating copy to clipboard operation
python-rust-fst copied to clipboard

Efficient range operations over multiple Set objects

Open davidblewett opened this issue 8 years ago • 12 comments

I am building an application that generates new FST objects every hour. I would like to be able to do efficient range operations over a group of these FST files. Conceptually, something like combining the __getitem__ implementation with the OpBuilder. Basically, returning a KeyStreamIterator for multiple Set objects' range operation (pseudo code:

 s1 = Set.from_iter(["bar", "baz", "foo", "moo"])
 s2 = Set.from_iter(["bing", "jar, "foo", "moo"])
 s2 = Set.from_iter(["bap", "bonk, "foo", "moo"])
(s1 + s2 + s3)["b":"f"]

The existing API seems cumbersome for doing operations across multiple. You have to pick a (possibly arbitrary) instance, then call .union / .intersection etc with the rest of the objects. Would it be possible to have some sort of MultiSet class that could perform efficient calls across multiple Set objects?

I can take a crack at implementing it; it appears the underlying Rust API would support this.

davidblewett avatar Jan 09 '18 23:01 davidblewett

it appears the underlying Rust API would support this

It does indeed! Ping me if anyone needs help with this.

BurntSushi avatar Jan 09 '18 23:01 BurntSushi

@BurntSushi : can you give me an example of doing this in Rust? I can probably figure out how to translate it to Python/cffi. I'm wondering if we'd to add something to the C layer for this.

davidblewett avatar Jan 09 '18 23:01 davidblewett

@davidblewett I believe this is what you're after? https://docs.rs/fst/0.3.0/fst/#example-searching-multiple-sets-efficiently

BurntSushi avatar Jan 10 '18 00:01 BurntSushi

(That example does a little more than what you're asking, by applying a regex search. But you can just remove that bit.)

BurntSushi avatar Jan 10 '18 00:01 BurntSushi

@BurntSushi : I was looking at that, but the regex bit did indeed throw me off. I was looking more closely at OpBuilder; it appears to aggregate calls across multiple FST objects, but I don't see one for range. The range operations appear to use a StreamBuilder, which appears to only be offered by a single Set object?

davidblewett avatar Jan 10 '18 00:01 davidblewett

For example: let mut stream = set.range().ge("b").lt("e").into_stream();

I don't see a corresponding .range() method for OpBuilder?

davidblewett avatar Jan 10 '18 00:01 davidblewett

@davidblewett You have to apply the range to each set, e.g.,

let (set1, set2, set3) = ...;
let mut stream = OpBuilder::new()
    .add(set1.range().ge("b").lt("e"))
    .add(set2.range().ge("b").lt("e"))
    .add(set3.range().ge("b").lt("e"))
    .union();

BurntSushi avatar Jan 10 '18 00:01 BurntSushi

Got it. Will fiddle with it.

davidblewett avatar Jan 10 '18 00:01 davidblewett

@jbaiter : it seems the C interface could be simplified a lot if instead of functions specifically for set::Difference / set::Intersection / set::Union , they were defined to take set::Stream? I'm going to have to add a version of the fst_set_opbuilder_* / free /next functions anyway for that type.

Are there some sort of safety guarantees in retaining the more specific versions?

davidblewett avatar Jan 10 '18 01:01 davidblewett

@BurntSushi : I tried in jbaiter/python-rust-fst#10 , but hit a probably dumb Rust mistake on my part.

davidblewett avatar Jan 10 '18 15:01 davidblewett

@BurntSushi would you mind taking a look at https://github.com/jbaiter/python-rust-fst/pull/10 ? I'm not sure how to fix that compiler error.

davidblewett avatar Jan 17 '18 23:01 davidblewett

@davidblewett Ah yup! Apologies. I missed your previous ping. :-)

BurntSushi avatar Jan 18 '18 00:01 BurntSushi