Feature request: Output format specification + #matched kmer matrix
As a light version of #70, it would also be extremely useful if we, as users, could just match our queries against the indexes and get the following output:
query  length  ref1  ref2  ref3
q1     3876    3301  207   1029
q2     100     0     0     0
q3     1       1     0     1
I.e., query name, its length, and the number of matching k-mers to each ref.
This would be done only if requested via something like `--format matrix`, so it wouldn't impact benchmarking for papers, etc.
(Side note: I'm aware that for very large databases this is not a great format. However, many applications use Fulgor on small instances only, and there such an output would be super relevant.)
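To make the requested layout concrete, here is a minimal Python sketch of how such a matrix could be written as TSV. It is only an illustration of the format, not Fulgor code: `format_matrix` and its arguments are hypothetical names, and the counts are the example values from the table above.

```python
def format_matrix(ref_names, rows):
    """Render per-query k-mer match counts as a TSV matrix.

    rows: iterable of (query_name, query_length, counts), where counts[i]
    is the number of matching k-mers for ref_names[i].
    (Hypothetical helper, for illustration only.)
    """
    lines = ["\t".join(["query", "length", *ref_names])]
    for name, length, counts in rows:
        if len(counts) != len(ref_names):
            raise ValueError(f"{name}: expected {len(ref_names)} counts, got {len(counts)}")
        lines.append("\t".join([name, str(length), *map(str, counts)]))
    return "\n".join(lines) + "\n"


refs = ["ref1", "ref2", "ref3"]
rows = [
    ("q1", 3876, [3301, 207, 1029]),
    ("q2", 100, [0, 0, 0]),
    ("q3", 1, [1, 0, 1]),
]
print(format_matrix(refs, rows), end="")
```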
Hi @karel-brinda,
We (cc @jermp @Alessio-Campa) have been discussing output formats, and there's also a larger effort afoot (that I'm trying to start, at least) to discuss best practices for formats in bioinformatics software. Anyway, I just want to put a pin in this. I think having a descriptive output format would be great. However, in this case, I'd favor a concise, efficient (binary?), self-describing output format (along with some bundled metadata, e.g., Fulgor version, runtime, number of threads, reference database signature, etc.), with an option to easily convert to, e.g., a TSV format like the one you describe above.
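As a rough illustration of what "self-describing binary with bundled metadata" could mean, here is a small Python sketch. Everything in it is an assumption made for the example: the `KMAT` magic bytes, the JSON header, the little-endian `uint32` record layout, and the metadata keys are all hypothetical and do not describe any existing or planned Fulgor format.

```python
import io
import json
import struct

MAGIC = b"KMAT"  # hypothetical magic bytes, not an actual Fulgor format


def write_results(stream, metadata, records):
    """Write a JSON metadata header followed by binary per-query records.

    records: iterable of (query_id, hits), where hits is a list of
    (ref_id, count) pairs. All integers are stored as little-endian uint32.
    """
    header = json.dumps(metadata, sort_keys=True).encode("utf-8")
    stream.write(MAGIC)
    stream.write(struct.pack("<I", len(header)))
    stream.write(header)
    for query_id, hits in records:
        stream.write(struct.pack("<II", query_id, len(hits)))
        for ref_id, count in hits:
            stream.write(struct.pack("<II", ref_id, count))


def read_results(stream):
    """Read back the metadata and records written by write_results."""
    if stream.read(4) != MAGIC:
        raise ValueError("unexpected magic bytes")
    (header_len,) = struct.unpack("<I", stream.read(4))
    metadata = json.loads(stream.read(header_len).decode("utf-8"))
    records = []
    while True:
        chunk = stream.read(8)
        if not chunk:  # end of stream
            break
        query_id, num_hits = struct.unpack("<II", chunk)
        hits = [struct.unpack("<II", stream.read(8)) for _ in range(num_hits)]
        records.append((query_id, hits))
    return metadata, records


# Placeholder metadata values, chosen only for the example.
meta = {"tool_version": "0.0.0-example", "num_threads": 8, "num_refs": 3}
buf = io.BytesIO()
write_results(buf, meta, [(0, [(0, 3301), (1, 207), (2, 1029)]), (1, [])])
buf.seek(0)
print(read_results(buf))
```

Because the header carries the metadata, a separate converter could read such a file and emit the TSV matrix without any side information.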
Hi, yes, this can definitely be done with little effort (noting, however, that the teaching period is starting again tomorrow). As Rob mentioned, I would provide this in a more succinct format, as I expect that most references will get count 0 for typical query loads. It would also be nice to have a binary output format for all tools, including this one.
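The point about zero counts can be sketched as follows: store only the non-zero entries per query and expand them to a dense row on demand (for instance when converting to TSV). This is a minimal Python illustration with hypothetical helper names, not a description of how Fulgor would implement it.

```python
def to_sparse(counts):
    """Keep only (ref_id, count) pairs with a non-zero count."""
    return [(ref_id, c) for ref_id, c in enumerate(counts) if c > 0]


def to_dense(hits, num_refs):
    """Expand (ref_id, count) pairs back into a full row of num_refs counts."""
    counts = [0] * num_refs
    for ref_id, c in hits:
        counts[ref_id] = c
    return counts


# A query with no matches costs nothing beyond its own record.
print(to_sparse([0, 0, 0]))
# A partially matching query stores only the references it hits.
print(to_sparse([1, 0, 1]))
print(to_dense(to_sparse([1, 0, 1]), num_refs=3))
```

With many references and few hits per query, the sparse form grows with the number of hits rather than with the size of the database.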