--func_annot
The wiki page on profiling shows the output as:
sample01 sample04 sample05 sample08
g00001 1 0 1 0
g00002 0 1 1 1
g00003 0 0 0 1
g00003 1 1 1 1
...but I get UniRef90 IDs for each pangenome instead of g[0-9]{5} (panphlan 3.1).
Which version of UniRef90 are the IDs from? I tried using map_eggnog_uniref90.txt.gz from the HUMAnN3 utility mapping file collection (UniRef 201901), and <5% of my panphlan output UniRef IDs overlap with any IDs in the mapping file, suggesting that the panphlan UniRef IDs are from a different (older?) version of UniRef.
I didn't see anything in the wiki about which (biobakery) files are actually available to use with --func_annot. Can I use the HUMAnN3 utility mapping files?
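For reference, the overlap check described above can be sketched roughly as follows. This is only an illustration, not part of PanPhlAn or HUMAnN: the helper names (`open_text`, `uniref_ids`, `overlap_fraction`) are made up, and the exact layout of both files is an assumption. To avoid depending on that layout, the sketch simply collects every tab-separated field that starts with `UniRef90_`.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def uniref_ids(path, prefix="UniRef90_"):
    """Collect every tab-separated field that starts with `prefix`.

    Assumption: IDs appear as whole fields; the column they sit in
    does not matter for this check.
    """
    ids = set()
    with open_text(path) as handle:
        for line in handle:
            for field in line.rstrip("\n").split("\t"):
                field = field.strip()
                if field.startswith(prefix):
                    ids.add(field)
    return ids


def overlap_fraction(query_ids, reference_ids):
    """Fraction of query IDs that also occur in the reference set."""
    if not query_ids:
        return 0.0
    return len(query_ids & reference_ids) / len(query_ids)
```

With hypothetical paths, `overlap_fraction(uniref_ids("panphlan_matrix.tsv"), uniref_ids("map_eggnog_uniref90.txt.gz"))` gives the share of matrix IDs found in the mapping file.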
Hello Nick !
The wiki page with g[0-9]{5} aims to give an example of what the output looks like. Sorry if it's confusing.
Both PanPhlAn 3 and HUMAnN 3 should use the same UniRef90 collection, but HUMAnN covers everything, while the PanPhlAn annotation files provided with the pangenome are species-specific and often contain uncharacterized (or poorly characterized) proteins (details can be found in this preprint).
The aim of --func_annot is simply to add an extra column to the output presence/absence matrix from a user-provided mapping file. It can be the species annotation file provided with the downloaded pangenome or a custom user file.
Hope this helps. Btw, I advise you to raise this kind of concern on the bioBakery help forum.
The wiki page with g[0-9]{5} aims to give an example of what the output looks like. Sorry if it's confusing.
Thanks for the clarification.
Both PanPhlAn 3 and HUMAnN 3 should use the same UniRef90 collection
That is UniRef 2019-01, correct?
covers everything, while the PanPhlAn annotation files provided with the pangenome are species-specific and often contain uncharacterized (or poorly characterized) proteins
So then the UniRef IDs in the PanPhlAn output should be a subset of all UniRef IDs. This doesn't really explain the low % mapping of IDs to the HUMAnN3 mapping files.
The aim of --func_annot is simply to add an extra column to the output presence/absence matrix from a user-provided mapping file. It can be the species annotation file provided with the downloaded pangenome or a custom user file.
That's good to know. What is the format?
Btw, I advise you to raise this kind of concern on the bioBakery help forum.
You are right. This is more of a usage/docs question than a bug/issue. Do you also want bug reports on the bioBakery help forum?
Yes, both used ChocoPhlAn (our internal pipeline), which is based on UniRef 2019-01.
It is indeed strange that such a low percentage of the PanPhlAn UniRef90 IDs map. I'll check that whenever I find the time. Have you tried mapping UniRef50 instead?
--func_annot should be the path to a TSV file mapping UniRef90 IDs to whatever you want. If several columns are available, you can specify the one you want with the --field argument. Basically, it is like the provided annotation file, panphlan_[species_name]_annot.tsv.
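To illustrate the idea, here is a minimal sketch of how a mapping TSV could be used to append an annotation column to a presence/absence matrix. It is not PanPhlAn's actual implementation: the function names are made up, the `field` parameter is only loosely modelled on --field (treating it as a 0-based column index is an assumption), and the sketch assumes the UniRef90 ID sits in the first column of both files.

```python
import csv
import io


def load_mapping(handle, field=1):
    """Read a TSV mapping: column 0 is the ID, column `field` the annotation.

    Assumption: `field` is a 0-based column index. Rows too short to
    contain that column are skipped.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) > field and row[0]:
            mapping[row[0]] = row[field]
    return mapping


def annotate(matrix_handle, mapping, missing=""):
    """Yield matrix rows with one extra 'annotation' column appended."""
    reader = csv.reader(matrix_handle, delimiter="\t")
    header = next(reader, None)
    if header is None:
        return  # empty matrix, nothing to annotate
    yield header + ["annotation"]
    for row in reader:
        if not row:
            continue  # skip blank lines
        yield row + [mapping.get(row[0], missing)]


# Toy data, made up for illustration only.
mapping_tsv = "UniRef90_A\tGO:0000001\tkinase\nUniRef90_B\tGO:0000002\ttransporter\n"
matrix_tsv = "\tsample01\tsample04\nUniRef90_A\t1\t0\nUniRef90_X\t0\t1\n"

mapping = load_mapping(io.StringIO(mapping_tsv), field=2)
rows = list(annotate(io.StringIO(matrix_tsv), mapping))
for row in rows:
    print("\t".join(row))
```

IDs that are absent from the mapping file (here `UniRef90_X`) simply get an empty annotation, which also makes unmapped IDs easy to count.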
Yes, the best would be bug reports and code-related stuff on GitHub, and usage/general discussions on the forum, where there should be more people interacting and checking. On top of that, it will be more convenient when questions concern several tools at the same time.
It is indeed strange that such a low percentage of the PanPhlAn UniRef90 IDs map. I'll check that whenever I find the time. Have you tried mapping UniRef50 instead?
Any updates on this? Were you able to reproduce the low UniRef90 ID mapping rate?
Have you tried mapping UniRef50 instead?
Where are the docs on using UniRef50 instead of UniRef90? In the wiki, I only see info on using UniRef90.
Hello Nick,
Sorry, I've been busy with other projects over the past month and I haven't checked that yet. I'll let you know when I have some news.