
Feature request: API / Rust bindings

Open karel-brinda opened this issue 4 months ago • 4 comments

Hi Giulio,

Thanks again for all the work on Fulgor. I really appreciate all the progressive innovations that have gone into it.

It's really a great tool, both algorithm- and data-structure-wise – however, unfortunately, in almost all (biology-centric) applications where we want to use it, we have to make modifications to the source code to make it work for our purposes, because its output and query parameters do not correspond to our needs. As the program changes quite a lot (e.g., it will soon migrate to the new SSHash), this way of using it is not really sustainable.

Is there a chance that Rust bindings (or, in the worst case, C++ bindings) could be provided for the main query functionality? Essentially, we are interested in the following three functions (two high-priority ones and one low-priority one):

  1. open an index – this would simply open an index from disk, or fail if it's the wrong version of the index (e.g., from an older Fulgor) or if it's corrupted; this could also return a list of reference names in the indexed order

  2. query sequence – for a given string, it would return an array of numbers, with the number of matching k-mers in the first, second, third, ... reference. Important – we are, in principle, interested in all this information, not just the somehow pre-filtered output. And if some pre-filtering must be done for some reason (e.g., thresholding), it should allow two options: a minimum number of matching k-mers or a minimum proportion of matching k-mers (without any "advanced" filtering functionality to make it "smart").

  3. get matching bit-vectors – returning the bit vectors of the matches in a given reference (expected to be used for a small subset of references, e.g., only the best-matching references)
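To make the two filtering options in item 2 concrete, here is a minimal sketch in Python. It is not Fulgor code: the function name `filter_counts` and its parameters are hypothetical, and it assumes the query has already produced one k-mer count per reference, in indexed order.

```python
from typing import Dict, List, Optional


def filter_counts(
    counts: List[int],
    num_query_kmers: int,
    min_count: Optional[int] = None,
    min_fraction: Optional[float] = None,
) -> Dict[int, int]:
    """Return {reference_id: count} for references that pass the thresholds.

    Hypothetical helper, not part of Fulgor. With both thresholds left as
    None, every reference is reported, including those with zero matches.
    """
    kept = {}
    for ref_id, count in enumerate(counts):
        # Option 1: minimum number of matching k-mers.
        if min_count is not None and count < min_count:
            continue
        # Option 2: minimum proportion of the query's k-mers that match.
        if min_fraction is not None:
            if num_query_kmers == 0 or count / num_query_kmers < min_fraction:
                continue
        kept[ref_id] = count
    return kept


# Made-up counts for a query with 10 k-mers against 4 references.
counts = [7, 0, 3, 10]
print(filter_counts(counts, 10))                    # unfiltered
print(filter_counts(counts, 10, min_count=5))       # absolute threshold
print(filter_counts(counts, 10, min_fraction=0.5))  # relative threshold
```

Keeping the unfiltered array as the default and applying thresholds as a separate, optional step matches the request that all information be available without any "smart" filtering.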

In my opinion, 1.+2. might not be that difficult for you. It would massively increase the applicability of Fulgor – we could then really use it as an essential building block in many of our applications.

(Also cc @rob-p @Francii-B)

karel-brinda avatar Sep 26 '25 21:09 karel-brinda

Hi @karel-brinda and thanks for your kind words on the Fulgor index!

I agree that the codebase of Fulgor is changing frequently (for the better, though!) and this might result in some incompatibilities for the pipelines that rely on it. However, it is a work in progress and we appreciate your suggestions a lot.

I haven't learnt Rust yet, so I don't think Rust bindings are going to appear any time soon. (Of course, if you know someone who might be interested in developing them, that would be fantastic! cc @rob-p) Would you like to use Fulgor from Rust itself? On the other hand, I'm not sure what a "C++ binding" is in this context, as Fulgor is written in C++. Maybe you meant Python bindings?

I understand functionalities 1 and 2. About 2: the number of matching k-mers in ref i will be the number of k-mers of the query sequence that belong to ref i. This has nothing to do with the abundance of a given k-mer in a reference, which is information that Fulgor does not store yet. Right? If so, this information can already be retrieved from the output of the kmer-conservation tool. But writing the info directly is probably better for users :) cc @Alessio-Campa
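The presence-versus-abundance distinction can be illustrated with a toy Python model. This is a sketch under stated assumptions, not how Fulgor works internally: references are plain strings, only the forward strand is considered, each query position is counted separately, and the names `kmers` and `count_matches` are hypothetical.

```python
from typing import List


def kmers(seq: str, k: int) -> List[str]:
    """All k-mers of seq, in order of position (empty if len(seq) < k)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_matches(references: List[str], query: str, k: int) -> List[int]:
    """For each reference, count the query k-mers that occur in it.

    Each reference is reduced to a *set* of k-mers, so how many times a
    k-mer occurs inside a reference (its abundance) plays no role.
    """
    ref_kmer_sets = [set(kmers(ref, k)) for ref in references]
    counts = [0] * len(references)
    for kmer in kmers(query, k):
        for ref_id, kmer_set in enumerate(ref_kmer_sets):
            if kmer in kmer_set:
                counts[ref_id] += 1
    return counts


references = ["AAAAAAC", "ACGTACG"]
# The query k-mers for k = 3 are AAA, AAC, ACG.
print(count_matches(references, "AAACG", 3))  # [2, 1]
```

`AAA` occurs four times in the first reference, yet it contributes only one match, because the count is taken over the k-mers of the query.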

I don't currently understand functionality 3. Could you please define what a "matching bitvector" is?

Thanks a lot!

jermp avatar Sep 27 '25 14:09 jermp

My interpretation of 3) (tell me if I'm misinterpreting, @karel-brinda) is a bit vector of length m - k + 1 (where m is the query length) that says which k-mers of a query matched in the index and which did not. Perhaps it marks k-mers that matched at all, or perhaps there is one bit vector for each of the (co-)best references, showing which particular k-mers matched that reference.
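A minimal Python sketch of this interpretation follows. The function name `match_bitvector` is hypothetical and the k-mer set is a stand-in for a real index lookup; passing the union of all references' k-mers gives the "matched at all" variant, while passing a single reference's k-mers gives the per-reference variant.

```python
from typing import List, Set


def match_bitvector(kmer_set: Set[str], query: str, k: int) -> List[int]:
    """Bit i is 1 if the query k-mer starting at position i is in kmer_set.

    The result has length m - k + 1, where m = len(query), or length 0
    if the query is shorter than k.
    """
    return [
        1 if query[i:i + k] in kmer_set else 0
        for i in range(len(query) - k + 1)
    ]


# Toy k-mer set standing in for one reference (or for the whole index).
reference_kmers = {"ACG", "CGT"}
# The query k-mers for k = 3 are ACG, CGT, GTT.
print(match_bitvector(reference_kmers, "ACGTT", 3))  # [1, 1, 0]
```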

rob-p avatar Sep 27 '25 14:09 rob-p

Regarding bindings: if we had a plain C API for these 3 pieces of functionality, then making both Rust and Python bindings would be rather trivial. Any thoughts on what would be required there (to have a C API for this)?

rob-p avatar Sep 27 '25 15:09 rob-p

> a bit vector of length m - k + 1 (where m is the query length) that says which k-mers of a query matched in the index and which did not.

Ah, then it is already implicitly coded in the output of the kmer-conservation query. So we are talking about different encoding formats.

jermp avatar Sep 27 '25 16:09 jermp