Feature request: API / Rust bindings
Hi Giulio,
Thanks again for all the work on Fulgor. I really appreciate all the progressive innovations that have gone into it.
It's really a great tool, both algorithm- and data-structure-wise. Unfortunately, in almost all (biology-centric) applications where we want to use it, we have to modify the source code to make it work for our purposes, because its output and query parameters do not match our needs. As the program changes quite a lot (e.g., it will soon migrate to the new SSHash), this way of using it is not really sustainable.
Is there a chance that Rust bindings (or, in the worst case, C++ bindings) could be provided for the main query functionality? Essentially, we are interested in the following three functions (two high-priority ones and one low-priority one):
1. Open an index: this would simply open an index from disk, or fail if it's the wrong version of the index (e.g., from an older Fulgor) or if it's corrupted. It could also return a list of reference names in the indexed order.
2. Query a sequence: for a given string, it would return an array of numbers, with the number of matching k-mers in the first, second, third, ... reference. Important: we are, in principle, interested in all this information, not just some pre-filtered output. If some pre-filtering must be done for some reason (e.g., thresholding), it should allow two options: a minimum number of matching k-mers or a minimum proportion of matching k-mers (without any "advanced" filtering functionality to make it "smart").
3. Get matching bit-vectors: returning the bit vectors of the matches in a given reference (expected to be used for a small subset of references, e.g., only the best-matching references).
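To make the query function and its two thresholding modes concrete, here is a minimal Rust sketch over a toy in-memory index. Every name in it (`ToyIndex`, `Threshold`, `query_counts`, `filter_refs`) is hypothetical and chosen for illustration; none of it is Fulgor's API, and the hash map only stands in for the real data structure. It assumes ASCII sequences and counts every k-mer position of the query, so a k-mer repeated in the query is counted once per occurrence.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical filter offering only the two requested modes.
pub enum Threshold {
    /// Keep references with at least this many matching k-mers.
    MinCount(u32),
    /// Keep references where matching / total query k-mers >= this value.
    MinProportion(f64),
}

/// Toy stand-in for an opened index: k-mer -> ids of references containing it.
pub struct ToyIndex {
    k: usize,
    ref_names: Vec<String>,
    kmer_to_refs: HashMap<String, HashSet<usize>>,
}

/// All overlapping k-mers of an ASCII sequence (none if it is shorter than k).
fn kmers(seq: &str, k: usize) -> impl Iterator<Item = &str> {
    let n = if k == 0 { 0 } else { (seq.len() + 1).saturating_sub(k) };
    (0..n).map(move |i| &seq[i..i + k])
}

impl ToyIndex {
    /// Builds the toy index from (name, sequence) pairs, kept in input order.
    pub fn build(k: usize, refs: &[(&str, &str)]) -> Self {
        let mut kmer_to_refs: HashMap<String, HashSet<usize>> = HashMap::new();
        for (id, (_, seq)) in refs.iter().enumerate() {
            for kmer in kmers(seq, k) {
                kmer_to_refs.entry(kmer.to_string()).or_default().insert(id);
            }
        }
        let ref_names = refs.iter().map(|(name, _)| name.to_string()).collect();
        ToyIndex { k, ref_names, kmer_to_refs }
    }

    /// Reference names in indexed order.
    pub fn ref_names(&self) -> &[String] {
        &self.ref_names
    }

    /// counts[i] = number of k-mers of `query` that occur in reference i.
    pub fn query_counts(&self, query: &str) -> Vec<u32> {
        let mut counts = vec![0u32; self.ref_names.len()];
        for kmer in kmers(query, self.k) {
            if let Some(refs) = self.kmer_to_refs.get(kmer) {
                for &id in refs {
                    counts[id] += 1;
                }
            }
        }
        counts
    }
}

/// Applies a threshold to unfiltered counts; returns the (reference id, count)
/// pairs that pass. `total_kmers` is the number of k-mers in the query.
pub fn filter_refs(counts: &[u32], total_kmers: usize, t: &Threshold) -> Vec<(usize, u32)> {
    counts
        .iter()
        .copied()
        .enumerate()
        .filter(|&(_, c)| match t {
            Threshold::MinCount(min) => c >= *min,
            Threshold::MinProportion(p) => {
                total_kmers > 0 && f64::from(c) / total_kmers as f64 >= *p
            }
        })
        .collect()
}
```

In this sketch the query always returns the full count array, and filtering is a separate optional step. For example, with toy references `ACGTAC` and `CGTTTT`, k = 3, and the query `ACGTA`, the counts are `[3, 1]`; `MinCount(2)` then keeps only the first reference.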
In my opinion, 1 and 2 might not be that difficult for you. It would massively increase the applicability of Fulgor: we could then really use it as an essential building block in many of our applications.
(Also cc @rob-p @Francii-B)
Hi @karel-brinda and thanks for your kind words on the Fulgor index!
I agree that the codebase of Fulgor is changing frequently (for the better, though!) and this might result in some incompatibilities for the pipelines that rely on it. However, it is a work in progress and we appreciate your suggestions a lot.
I haven't learnt Rust yet, so I don't think Rust bindings are going to appear any time soon. (Of course, if you know someone who might be interested in developing them, it'd be fantastic! cc @rob-p) So, would you like to use Fulgor from Rust itself? On the other hand, I'm not sure what a "C++ binding" is in this context, as Fulgor is written in C++. Maybe you meant "Python" bindings?
I understand functionalities 1 and 2. About 2: the number of matching kmers in ref i will be the number of kmers of the query sequence that belong to ref i. This has nothing to do with the abundance of a given kmer in a reference, which is information that Fulgor does not store yet. Right? Then this information can actually already be retrieved from the output of the kmer-conservation tool. But probably writing the info directly is better for users :) cc @Alessio-Campa
I don't currently understand functionality 3. Could you please define what a "matching bitvector" is?
Thanks a lot!
My interpretation of 3 (tell me if I'm misinterpreting, @karel-brinda) is a bit vector of length m - k + 1 (where m is the query length) that says which k-mers of a query matched in the index and which did not. This could mean matched at all, or perhaps a bit vector for each of the (co-)best references, showing which particular k-mers matched that reference.
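As a small illustration of this interpretation, the Rust sketch below computes such a presence vector against a plain set of k-mers. The names `kmer_set` and `presence_bits` are hypothetical, the set stands in for whatever the index would actually consult, and a `Vec<bool>` is used instead of a packed bit vector purely for readability. Passing the k-mers of a single reference gives the per-reference variant; passing the union over all references gives the "matched at all" variant.

```rust
use std::collections::HashSet;

/// Distinct k-mers of an ASCII sequence (empty if it is shorter than k).
pub fn kmer_set(seq: &str, k: usize) -> HashSet<&str> {
    if k == 0 || seq.len() < k {
        return HashSet::new();
    }
    (0..=seq.len() - k).map(|i| &seq[i..i + k]).collect()
}

/// Bit i is true when the query k-mer starting at position i is in `present`.
/// The result has length m - k + 1 for a query of length m >= k.
pub fn presence_bits(query: &str, k: usize, present: &HashSet<&str>) -> Vec<bool> {
    if k == 0 || query.len() < k {
        return Vec::new();
    }
    (0..=query.len() - k)
        .map(|i| present.contains(&query[i..i + k]))
        .collect()
}
```

For a toy reference `ACGTAC` with k = 3 and the query `ACGTTAC` (m = 7), the result has 7 - 3 + 1 = 5 entries: true, true, false, false, true.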
Regarding bindings: if we had a plain C API for these 3 pieces of functionality, then making both Rust and Python bindings would be rather trivial. Any thoughts on what would be required there (to have a C API for this)?
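To give a feel for the shape such a C API might take, here is a hedged sketch written in Rust with C-ABI functions: an opaque handle, a caller-provided output buffer, and integer status codes. All names (`ToyIndex`, `toy_index_new`, `toy_index_num_refs`, `toy_index_query`, `toy_index_free`) and the error codes are invented for illustration and are not an existing or planned Fulgor interface. The toy constructor takes in-memory sequences, whereas a real one would presumably take a file path and check the index version.

```rust
use std::collections::HashSet;
use std::ffi::{c_char, c_int, CStr};

/// Opaque to C callers: they only ever hold a pointer to this struct.
pub struct ToyIndex {
    k: usize,
    refs: Vec<HashSet<Vec<u8>>>, // one k-mer set per reference
}

// Note: exporting real symbols would also need a `no_mangle` attribute;
// it is omitted here because the sketch is only called from Rust.

/// Builds a toy index from `n` NUL-terminated sequences; null on bad input.
pub unsafe extern "C" fn toy_index_new(
    seqs: *const *const c_char,
    n: usize,
    k: usize,
) -> *mut ToyIndex {
    if seqs.is_null() || k == 0 {
        return std::ptr::null_mut();
    }
    let mut refs: Vec<HashSet<Vec<u8>>> = Vec::with_capacity(n);
    for i in 0..n {
        let p = unsafe { *seqs.add(i) };
        if p.is_null() {
            return std::ptr::null_mut();
        }
        let bytes = unsafe { CStr::from_ptr(p) }.to_bytes();
        refs.push(bytes.windows(k).map(|w| w.to_vec()).collect());
    }
    Box::into_raw(Box::new(ToyIndex { k, refs }))
}

/// Number of references, so the caller can size the output buffer.
pub unsafe extern "C" fn toy_index_num_refs(idx: *const ToyIndex) -> usize {
    if idx.is_null() { 0 } else { unsafe { (*idx).refs.len() } }
}

/// Writes one match count per reference into `out`.
/// Returns 0 on success, -1 on a null pointer, -2 if `out_len` is too small.
pub unsafe extern "C" fn toy_index_query(
    idx: *const ToyIndex,
    query: *const c_char,
    out: *mut u32,
    out_len: usize,
) -> c_int {
    if idx.is_null() || query.is_null() || out.is_null() {
        return -1;
    }
    let idx = unsafe { &*idx };
    if out_len < idx.refs.len() {
        return -2;
    }
    let q = unsafe { CStr::from_ptr(query) }.to_bytes();
    let out = unsafe { std::slice::from_raw_parts_mut(out, idx.refs.len()) };
    for (slot, set) in out.iter_mut().zip(&idx.refs) {
        *slot = q.windows(idx.k).filter(|w| set.contains(*w)).count() as u32;
    }
    0
}

/// Releases an index created by `toy_index_new`; null is ignored.
pub unsafe extern "C" fn toy_index_free(idx: *mut ToyIndex) {
    if !idx.is_null() {
        drop(unsafe { Box::from_raw(idx) });
    }
}
```

The point of this design is that nothing C++-specific crosses the boundary: only pointers, integers, and NUL-terminated strings. That is what would let Rust (via `extern "C"` declarations) and Python (via `ctypes` or `cffi`) call the same functions without any knowledge of the internals.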
> a bit vector of length m - k + 1 (where m is the query length) that says which k-mers of a query matched in the index and which did not.
Ah, then it is already implicitly coded in the output of the kmer-conservation query. So we are talking about different encoding formats.