Add a --header option when --outfmt 6 is used
It would be useful if there were a --header option (or similar name) that could be used with --outfmt 6. That would allow people to post-process DIAMOND format 6 output containing any subset of the available fields, in any order. It would also document what is in the output, for when you return to an output file and wonder exactly what the fields are or what order they're in. It would enable someone (me, perhaps) to write general classes (i.e., independent of the fields used or their order) that process DIAMOND output, using tools like pandas to slurp in an entire output file with one call to read_table (see http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_table.html#pandas.read_table).
The --header option has been included in the latest release. It prints a description of the columns and also the diamond version and invocation. Feel free to check it before I make it official.
Thank you for adding --header
That was my first question when I opened the default tab-delimited output: what are these 12 columns?
My only feedback would be that the # Fields: line that is added lists the headers as comma-separated. The data is tab-separated, so I had to do a simple search/replace prior to pasting the headers into the Excel file I was using to browse the results.
Be sure to update the PDF manual to mention --header
Hi @bbuchfink Thanks from me too - I think I missed this originally. @alchemistmatt I don't understand your request exactly, you say the data is TAB-separated but you want the field names to be comma-separated? Maybe you meant you want them both TAB-separated? @bbuchfink, I don't think you should also print the DIAMOND version and invocation because the --header option is also making the output more machine readable and that info just gets in the way. Probably a medium-sophisticated user will want to use --header to produce something that can be read by a tool that can handle TSV, without having to filter out additional lines. You could print those lines to stderr but then they'd always show up unless the user knew how to independently throw stderr away.
So I'd either not print those lines, or else have a --quiet option that causes DIAMOND to never print any additional information other than what's asked for, or add an --invocation option that causes it to print invocation info on stdout.
Just another happy DIAMOND user! :-)
I'm saying it's odd that the header names are comma-separated while the data itself is tab-separated. If it were me, instead of this
# DIAMOND v0.9.24. http://github.com/bbuchfink/diamond
# Invocation: diamond.exe blastp -d H_sapiens_Uniprot_trembl_2015-10-14 -q TestPeptides.fasta -o matches.txt --header
# Fields: Query ID, Subject ID, Percentage of identical matches, Alignment length, Number of mismatches, Number of gap openings, Start of alignment in query, End of alignment in query, Start of alignment in subject, End of alignment in subject, Expected value, Bit score
sp|P54578|UBP14_HUMAN tr|D3DUG9|D3DUG9_HUMAN 100.0 29 0 0 1 29 375 403 3.1e-12 68.2
sp|P54578|UBP14_HUMAN tr|A6NJA2|A6NJA2_HUMAN 100.0 29 0 0 1 29 355 383 3.1e-12 68.2
sp|P54578|UBP14_HUMAN tr|B2RD79|B2RD79_HUMAN 96.6 29 1 0 1 29 401 429 1.2e-11 66.2
I'd do this
# DIAMOND v0.9.24. http://github.com/bbuchfink/diamond
# Invocation: diamond.exe blastp -d H_sapiens_Uniprot_trembl_2015-10-14 -q TestPeptides.fasta -o matches.txt --header
Query ID Subject ID Percentage of identical matches Alignment length Number of mismatches Number of gap openings Start of alignment in query End of alignment in query Start of alignment in subject End of alignment in subject Expected value Bit score
sp|P54578|UBP14_HUMAN tr|D3DUG9|D3DUG9_HUMAN 100.0 29 0 0 1 29 375 403 3.1e-12 68.2
sp|P54578|UBP14_HUMAN tr|A6NJA2|A6NJA2_HUMAN 100.0 29 0 0 1 29 355 383 3.1e-12 68.2
Also, as @terrycojones says, adding the extra # lines could be problematic. However, from an automation standpoint, it's easy enough to use text parsing commands to remove the first two lines, e.g.
tail -n +3 matches.txt
or
egrep -v "^# .+" matches.txt
@alchemistmatt - OK, that's what I thought. Sorry, I didn't read it properly.
The # lines can be filtered out, but it would be better not to have to. If DIAMOND is producing (say) 100M lines of output, then # lines force the introduction of another process and make the kernel read/write all that data into/out of an (in-memory) pipe, which really means double work (reading from DIAMOND and writing into the tail or egrep, and then reading from that process and writing into the subsequent one). It's making the kernel do a ton of extra I/O that @bbuchfink could eliminate for us :-)
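For anyone who would rather not pipe through tail or egrep, here is a minimal Python sketch that handles the # lines in the same process that reads the file. It assumes the layout shown earlier in this thread (a "# Fields: " line with comma-separated names, followed by tab-separated data rows) and that no data row starts with "#". The function name read_diamond_with_header is made up for illustration and is not part of DIAMOND.

```python
import io

FIELDS_PREFIX = "# Fields: "


def read_diamond_with_header(handle):
    """Yield one dict per alignment, keyed by the names on the '# Fields:' line.

    Hypothetical helper: assumes the header layout quoted in this thread and
    that no data row begins with '#'.
    """
    names = None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.startswith(FIELDS_PREFIX):
            # Header names are comma-separated, unlike the data rows.
            names = line[len(FIELDS_PREFIX):].split(", ")
        elif not line or line.startswith("#"):
            # Version and invocation lines (and blank lines) are skipped.
            continue
        elif names is None:
            raise ValueError("data row seen before any '# Fields:' line")
        else:
            yield dict(zip(names, line.split("\t")))


# Sample input rebuilt from the output quoted above, with tabs restored.
field_names = [
    "Query ID", "Subject ID", "Percentage of identical matches",
    "Alignment length", "Number of mismatches", "Number of gap openings",
    "Start of alignment in query", "End of alignment in query",
    "Start of alignment in subject", "End of alignment in subject",
    "Expected value", "Bit score",
]
sample = (
    "# DIAMOND v0.9.24. http://github.com/bbuchfink/diamond\n"
    "# Invocation: diamond.exe blastp -d H_sapiens_Uniprot_trembl_2015-10-14 "
    "-q TestPeptides.fasta -o matches.txt --header\n"
    "# Fields: " + ", ".join(field_names) + "\n"
    + "\t".join([
        "sp|P54578|UBP14_HUMAN", "tr|D3DUG9|D3DUG9_HUMAN", "100.0", "29",
        "0", "0", "1", "29", "375", "403", "3.1e-12", "68.2",
    ]) + "\n"
)

rows = list(read_diamond_with_header(io.StringIO(sample)))
print(rows[0]["Subject ID"], rows[0]["Bit score"])
```

Values come back as strings, so convert them (e.g. float(row["Bit score"])) as needed. This still reads every line once, but it avoids the extra process and pipe.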
Ok, I see your point about the header format. Changing things that break compatibility with older versions is also problematic, so I will probably add something like --header 2 to choose between header formats.
Sounds good!
Did the --header option ever get added? I'm trying to use it now, and it does not seem to be working. When I add --header to the command line I get the following error. I also don't see it in the PDF manual.
Error: Invalid option: header
The option does work for me. Check your version using diamond version and upgrade if necessary.
I believe this functionality was added with commit ef28703e39d9d7f0e896b65f30c9d678323564c7. Download the latest release from https://github.com/bbuchfink/diamond/releases
Just wanted to say that as a user, I'd love to have the --header 2 option adding the column names and nothing else.
(and thank you for providing Diamond to the community!)
Sorry that this completely got lost! Since v2.1.0 you can now use --header simple to just print the field keywords as headers.
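With a keyword-only header, reading the file independently of which fields were requested, or in what order, becomes straightforward. The sketch below uses Python's csv.DictReader. It assumes the header is a single tab-separated line of field keywords (check the output of your own DIAMOND version), and the sample uses three of the standard format 6 keywords in a non-default order. The function name read_simple_header is hypothetical.

```python
import csv
import io


def read_simple_header(handle):
    """Return a list of dicts keyed by the field keywords on the first line.

    Hypothetical helper: assumes the first line is a tab-separated list of
    field keywords and every following line is a tab-separated data row.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Made-up layout for illustration: a subset of fields in a non-default order,
# using values from the example rows earlier in this thread.
sample = (
    "sseqid\tqseqid\tbitscore\n"
    "tr|D3DUG9|D3DUG9_HUMAN\tsp|P54578|UBP14_HUMAN\t68.2\n"
    "tr|B2RD79|B2RD79_HUMAN\tsp|P54578|UBP14_HUMAN\t66.2\n"
)

hits = read_simple_header(io.StringIO(sample))
best = max(hits, key=lambda hit: float(hit["bitscore"]))
print(best["sseqid"])
```

Because rows are looked up by keyword rather than by position, the same code works whatever fields were passed to --outfmt 6, as long as the ones it needs are present.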