
Add a --header option when --outfmt 6 is used

Open terrycojones opened this issue 7 years ago • 12 comments

It would be useful if there were a --header option (or similar name) that could be used with --outfmt 6. That would allow people to post-process DIAMOND format 6 output that contains any subset of the available fields, in any order. It would also document what is in the output, for when you return to an output file and wonder exactly what the fields are or what order they're in. It would enable someone (me, perhaps) to write general classes (i.e., independent of the fields used or their order) that process DIAMOND output, using tools like pandas to slurp in an entire output file with one call to read_table (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html#pandas.read_table).

terrycojones avatar Nov 23 '18 13:11 terrycojones

The --header option has been included in the latest release. It prints a description of the columns and also the diamond version and invocation. Feel free to check it before I make it official.

bbuchfink avatar Dec 05 '18 20:12 bbuchfink

Thank you for adding --header. That was my first question when I opened the default tab-delimited output: what are these 12 columns? My only feedback would be that the # Fields: line that is added lists the headers as comma-separated. The data is tab-separated, so I had to do a simple search/replace prior to pasting the headers into the Excel file I was using to browse the results.

Be sure to update the PDF manual to mention --header.

alchemistmatt avatar Mar 25 '19 21:03 alchemistmatt

Hi @bbuchfink, thanks from me too - I think I missed this originally. @alchemistmatt I don't understand your request exactly: you say the data is TAB-separated but you want the field names to be comma-separated? Maybe you meant you want them both TAB-separated? @bbuchfink, I don't think you should also print the DIAMOND version and invocation, because the --header option is also there to make the output more machine readable and that info just gets in the way. A medium-sophisticated user will probably want to use --header to produce something that can be read by a tool that handles TSV, without having to filter out additional lines. You could print those lines to stderr, but then they'd always show up unless the user knew how to throw stderr away independently.

So I'd either not print those lines, or else have a --quiet option that causes DIAMOND to never print any additional information other than what's asked for, or add an --invocation option that causes it to print invocation info on stdout.

Just another happy DIAMOND user! :-)

terrycojones avatar Mar 25 '19 21:03 terrycojones

I'm saying it's odd that the header names are comma separated while the data itself is tab-separated. If it was me, instead of this

# DIAMOND v0.9.24. http://github.com/bbuchfink/diamond
# Invocation: diamond.exe blastp -d H_sapiens_Uniprot_trembl_2015-10-14 -q TestPeptides.fasta -o matches.txt --header
# Fields: Query ID, Subject ID, Percentage of identical matches, Alignment length, Number of mismatches, Number of gap openings, Start of alignment in query, End of alignment in query, Start of alignment in subject, End of alignment in subject, Expected value, Bit score
sp|P54578|UBP14_HUMAN	tr|D3DUG9|D3DUG9_HUMAN	100.0	29	0	0	1	29	375	403	3.1e-12	68.2
sp|P54578|UBP14_HUMAN	tr|A6NJA2|A6NJA2_HUMAN	100.0	29	0	0	1	29	355	383	3.1e-12	68.2
sp|P54578|UBP14_HUMAN	tr|B2RD79|B2RD79_HUMAN	96.6	29	1	0	1	29	401	429	1.2e-11	66.2

I'd do this

# DIAMOND v0.9.24. http://github.com/bbuchfink/diamond
# Invocation: diamond.exe blastp -d H_sapiens_Uniprot_trembl_2015-10-14 -q TestPeptides.fasta -o matches.txt --header
Query ID	 Subject ID	 Percentage of identical matches	 Alignment length	 Number of mismatches	 Number of gap openings	 Start of alignment in query	 End of alignment in query	 Start of alignment in subject	 End of alignment in subject	 Expected value	 Bit score
sp|P54578|UBP14_HUMAN	tr|D3DUG9|D3DUG9_HUMAN	100.0	29	0	0	1	29	375	403	3.1e-12	68.2
sp|P54578|UBP14_HUMAN	tr|A6NJA2|A6NJA2_HUMAN	100.0	29	0	0	1	29	355	383	3.1e-12	68.2

Also, as @terrycojones says, adding the extra # lines could be problematic. However, from an automation standpoint, it's easy enough to use text parsing commands to remove the first two lines, e.g. tail -n +3 matches.txt or egrep -v "^# .+" matches.txt
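For anyone who would rather script the search/replace than do it by hand, here is a minimal Python sketch that rewrites the comma-separated # Fields: line as a tab-separated header row and passes every other line through unchanged. The function names are made up for illustration, and it assumes that no field description itself contains a comma, which holds for the twelve default fields shown above but has not been checked against every field DIAMOND can emit.

```python
def fields_line_to_tsv(line):
    """Turn '# Fields: A, B, C' into 'A<TAB>B<TAB>C'."""
    prefix = "# Fields:"
    if not line.startswith(prefix):
        raise ValueError("not a '# Fields:' line")
    # Assumption: descriptions never contain a comma themselves.
    names = [name.strip() for name in line[len(prefix):].split(",")]
    return "\t".join(names)


def retab_header(lines):
    """Yield lines unchanged, except that the '# Fields:' line
    becomes a tab-separated header row."""
    for line in lines:
        text = line.rstrip("\n")
        if text.startswith("# Fields:"):
            yield fields_line_to_tsv(text)
        else:
            yield text


# Lines copied from the output quoted in this thread.
example = [
    "# DIAMOND v0.9.24. http://github.com/bbuchfink/diamond",
    "# Fields: Query ID, Subject ID, Percentage of identical matches, "
    "Alignment length, Number of mismatches, Number of gap openings, "
    "Start of alignment in query, End of alignment in query, "
    "Start of alignment in subject, End of alignment in subject, "
    "Expected value, Bit score",
    "sp|P54578|UBP14_HUMAN\ttr|D3DUG9|D3DUG9_HUMAN\t100.0\t29\t0\t0\t1\t29\t375\t403\t3.1e-12\t68.2",
]

converted = list(retab_header(example))
header_cells = converted[1].split("\t")
```

The header row then has one cell per data column, so it can be pasted into a spreadsheet alongside the data.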

alchemistmatt avatar Mar 25 '19 22:03 alchemistmatt

@alchemistmatt - OK, that's what I thought. Sorry, I didn't read it properly.

The # lines can be filtered out, but it would be better not to have to. If DIAMOND is producing (say) 100M lines of output, then # lines force the introduction of another process and make the kernel read/write all that data into/out of an (in-memory) pipe, which really means double work (reading from DIAMOND and writing into the tail or egrep, and then reading from that process and writing into the subsequent one). It's making the kernel do a ton of extra I/O that @bbuchfink could eliminate for us :-)
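When the consumer is your own script rather than a fixed downstream tool, one way to avoid the extra process is to drop the # lines inside the consumer while it reads. A rough Python sketch follows; data_rows is a hypothetical helper name, and it assumes that no real data line begins with # (a query ID starting with # would be silently skipped).

```python
import csv
import io


def data_rows(stream):
    """Yield tab-split rows from DIAMOND tabular output,
    skipping '#' comment lines without an extra process."""
    # Assumption: only header/comment lines start with '#'.
    non_comment = (line for line in stream if not line.startswith("#"))
    # QUOTE_NONE: treat quote characters in IDs as ordinary text.
    yield from csv.reader(non_comment, delimiter="\t", quoting=csv.QUOTE_NONE)


# Sample text copied from the output quoted earlier in this thread
# (the '# Fields:' line is shortened here for readability).
sample = io.StringIO(
    "# DIAMOND v0.9.24. http://github.com/bbuchfink/diamond\n"
    "# Invocation: diamond.exe blastp -d H_sapiens_Uniprot_trembl_2015-10-14 "
    "-q TestPeptides.fasta -o matches.txt --header\n"
    "# Fields: Query ID, Subject ID\n"
    "sp|P54578|UBP14_HUMAN\ttr|D3DUG9|D3DUG9_HUMAN\t100.0\t29\t0\t0\t1\t29\t375\t403\t3.1e-12\t68.2\n"
    "sp|P54578|UBP14_HUMAN\ttr|A6NJA2|A6NJA2_HUMAN\t100.0\t29\t0\t0\t1\t29\t355\t383\t3.1e-12\t68.2\n"
)

rows = list(data_rows(sample))
```

In real use the stream would be an open file or sys.stdin instead of the in-memory sample. This does not help when the next stage is a tool you cannot modify, which is the case the paragraph above is about.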

terrycojones avatar Mar 25 '19 22:03 terrycojones

Ok, I see your point about the header format. Changing things that break compatibility with older versions is also problematic, so I will probably add something like --header 2 to choose between header formats.

bbuchfink avatar Mar 26 '19 11:03 bbuchfink

Sounds good!

terrycojones avatar Mar 26 '19 12:03 terrycojones

Did the --header option ever get added? I'm trying to use it now, and it does not seem to be working. When I add --header to the command line I get the following error. I also don't see it in the PDF manual.

Error: Invalid option: header

ewj34 avatar Aug 28 '20 17:08 ewj34

The option does work for me. Check your version using diamond version and upgrade if necessary.

bbuchfink avatar Aug 30 '20 09:08 bbuchfink

I believe this functionality was added with commit ef28703e39d9d7f0e896b65f30c9d678323564c7. Download the latest release from https://github.com/bbuchfink/diamond/releases

alchemistmatt avatar Aug 30 '20 15:08 alchemistmatt

Just wanted to say that as a user, I'd love to have the --header 2 option adding the column names and nothing else.

(and thank you for providing Diamond to the community!)

vinisalazar avatar Dec 07 '21 00:12 vinisalazar

Sorry that this completely got lost! Since v2.1.0 you can now use --header simple to just print the field keywords as headers.
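With a single header row of field keywords, the field-order-independent reading asked for at the top of this thread becomes straightforward. The sketch below uses only the Python standard library. It assumes that --header simple writes exactly one tab-separated line of keywords before the data (not verified here against actual DIAMOND output), and the sample text is a shortened, hand-made illustration built from values quoted earlier in the thread. read_hits is a hypothetical helper name.

```python
import csv
import io

# Illustrative sample only: assumed shape of output with a keyword header,
# restricted to four of the standard outfmt 6 field keywords.
sample = (
    "qseqid\tsseqid\tpident\tbitscore\n"
    "sp|P54578|UBP14_HUMAN\ttr|D3DUG9|D3DUG9_HUMAN\t100.0\t68.2\n"
    "sp|P54578|UBP14_HUMAN\ttr|B2RD79|B2RD79_HUMAN\t96.6\t66.2\n"
)


def read_hits(stream):
    """Yield one dict per alignment, keyed by the keywords in the header row,
    so callers never depend on column order or on which fields were chosen."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


hits = list(read_hits(io.StringIO(sample)))
# Values arrive as strings; convert explicitly where numbers are needed.
best = max(hits, key=lambda hit: float(hit["bitscore"]))
```

pandas.read_table, mentioned in the original request, should be able to load the same kind of file in one call, since it treats the first line as column names and splits on tabs by default.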

bbuchfink avatar Mar 10 '23 11:03 bbuchfink