Feature request: Output format specification + #matched kmer matrix

Open karel-brinda opened this issue 4 months ago • 2 comments

As a light version of #70, it would also be extremely useful if we, as users, could just match our queries against the indexes and get the following output:

query	length	ref1	ref2	ref3
q1	3876	3301	207	1029
q2	100	0	0	0
q3	1	1	0	1

I.e., query name, its length, and the number of matching k-mers to each ref.
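
To make the requested layout concrete, here is a minimal Python sketch that renders such a matrix as TSV. It is only an illustration: `format_matrix` and its input shape are hypothetical, not part of Fulgor, and the per-reference counts are assumed to have been computed already.

```python
def format_matrix(ref_names, rows):
    """Render (query name, query length, per-reference counts) rows as TSV.

    Hypothetical helper for illustration only; not a Fulgor API.
    """
    # Header: query name, query length, then one column per reference.
    lines = ["\t".join(["query", "length", *ref_names])]
    for name, length, counts in rows:
        # Every row must carry exactly one count per reference.
        if len(counts) != len(ref_names):
            raise ValueError(
                f"{name}: expected {len(ref_names)} counts, got {len(counts)}"
            )
        lines.append("\t".join([name, str(length), *map(str, counts)]))
    return "\n".join(lines) + "\n"


# The example table from this issue.
example = format_matrix(
    ["ref1", "ref2", "ref3"],
    [
        ("q1", 3876, [3301, 207, 1029]),
        ("q2", 100, [0, 0, 0]),
        ("q3", 1, [1, 0, 1]),
    ],
)
```

Printing `example` reproduces the table above, one tab-separated row per query.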

This would be done only if requested via something like --format matrix, so it wouldn't impact benchmarking for papers, etc.

(Side note: I'm aware that for very large databases this is not a great format. However, many applications use Fulgor on small instances only, and there such an output would be super relevant.)

karel-brinda, Sep 26 '25 21:09

Hi @karel-brinda,

We (cc @jermp @Alessio-Campa) have been discussing output formats, and there's also a larger effort afoot (that I'm trying to start at least) to discuss best practices for formats in bioinformatics software. Anyway, I just want to put a pin in this. I think having a descriptive output format would be great. However, in this case, I'd favor a concise, efficient (binary?), self-describing output format (along with some bundled metadata, e.g. fulgor version, runtime, num threads, reference database signature, etc.), with an option to easily convert to e.g. a TSV format like the one you describe above.

rob-p, Sep 27 '25 00:09

Hi, yes, this can definitely be done with little effort (noting, however, that the teaching period is starting again tomorrow). As Rob mentioned, I would provide this in a more succinct format, as I expect that most references will get count 0 for typical query loads. It would also be nice to have a binary output format for all tools, including this one.
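
One way to picture the succinct variant is a sparse row that stores only the references with a non-zero count, and is expanded to the dense matrix row on conversion to TSV. The Python sketch below assumes a hypothetical layout of (reference index, count) pairs; it is not an actual or planned Fulgor format.

```python
def to_sparse(counts):
    """Keep only (reference index, count) pairs whose count is non-zero."""
    return [(ref_id, c) for ref_id, c in enumerate(counts) if c != 0]


def to_dense(num_refs, hits):
    """Expand sparse (reference index, count) pairs into a dense row."""
    dense = [0] * num_refs
    for ref_id, count in hits:
        dense[ref_id] = count
    return dense


# Rows q2 and q3 from the example table in this issue.
q2_sparse = to_sparse([0, 0, 0])  # no reference matched, so nothing is stored
q3_sparse = to_sparse([1, 0, 1])  # only ref1 and ref3 are stored
q3_dense = to_dense(3, q3_sparse)
```

With this layout, a query that matches nothing costs no per-reference storage, and the dense row can still be recovered as long as the number of references is known.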

jermp, Sep 27 '25 14:09