Question about avp prepare output
Hi @GDKO,
it's me again @bheimbu. I have a question concerning the output of avp prepare…
query name donor ingroup AI HGTindex query hits number AHS outg_pct
229769_8676 Uniref90|UniRef90_A0A4R7CTD7:14:38.4:8.48e-75:250 Uniref90|UniRef90_E7CIQ5:1:45.4:4.62e-100:311 -58.17194306956131 -61.0 500 26007.388157305424 99
229769_3470 :::: Uniref90|UniRef90_A0A834V977:15:42.8:9.47e-248:726 -460.51701859880916 -726.0 500 -41372.008650311625 0
229769_4622 Uniref90|UniRef90_A0A2S9JLU1:1:49.8:1.58e-130:395 Uniref90|UniRef90_E7CIQ5:252:43.2:2.87e-103:321 62.766684693571875 74.0 500 42835.27786623576 99
229769_7930 Uniref90|UniRef90_L8GME8:7:50.8:5.29e-130:387 Uniref90|UniRef90_A0A8J2WLW8:39:37.2:4.09e-64:218 151.7133428617915 169.0 500 2411.325978706731 94
229769_4131 Uniref90|UniRef90_UPI00093AA699:24:63.7:1.35e-210:615 Uniref90|UniRef90_A0AA38IBE4:2:77.3:2.96e-260:730 -1.3500311979441904e-10 -115.0 500 140346.83844536747 94
229769_22899 Uniref90|UniRef90_A0A7C3TB98:3:41.8:1.05e-38:149 Uniref90|UniRef90_A0A9Q0N8B2:240:33.5:5.51e-18:90.5 47.70947631887631 58.5 500 2152.0361571608355 99
Why is 229769_3470 excluded from downstream analyses? It's actually a well-annotated contig including several GH2s etc., see below.
Cheers @bheimbu
ps: This also happens for other species I'm studying.
Hi @bheimbu,
In the case of 229769_3470 there are no hits from donor species and only hits from ingroup taxa. So all the metrics will be negative.
Just to be sure is 229769_3470 a contig or a protein?
Cheers, Georgios
Cheers, Georgios
Hi,
actually it's a protein sequence looking like this…
>229769_3470
IGIVFTLYLVSICSADYLSLDGNEWTANSYNKAISVKGVTVPGSIYSDLRTAGILKNLYAFKNDINYRWVSYDNWTYERTFVDSTLLNAQTVNLVANGIDTVSTVYINDKIVGKTNNQFIRYKFDVKNVLKNGENNIRIAFESAPIYARIQSEDFKKKYNYTVDCNDKVYRGECHFNFIRKMASSFAWDWGPAFPTQGIWKSIGIEAKDKPIIRDITVETIPDDKYIDWTLNFRLFMESAPNQEFDAIFNISLEGYPLINLEKKVRTDSNGNCVAMFTTPTKLIPIIPWFPNGVGNNLQTLYRLRVELSFKDSKEISIQTKKIGFRTIELVQEPIKPEGLTFYFKVNGLPFYAKGTNWIPANVLMEDLTPQYIRHLLGSVKQSNMNMMRVWGGGVYESDLFYEIADEMGIMIWQDMMYACALYPATKQFLEIVEPEVTQQIRRLQHHPSIAIWSGNNENEGAVYVFPNLTLHKRDYVELYINHIMKIILQEDTTRPFVPSSLSNGDEEKKEGWISKDSQDERYGDVHYYNYDAPLWDWKIYPNGKFTSEFGYNSYSSLETFLEVLEQKDLTFPLSDNMEWRQHHPGGTQSMFKEIGISITYPSNGGVQRFNDLIYLSQIMQAMAMKTEVEHYRRNRDIDPKTGKGNTMGALYWQLNDIWQGSTWASIEYGGKWKALQSYAIQFFDNLLVSPYEDTNQTLKVSLVRDDYFGELEFVLSLKVYKWNENKPIHEINEKVKTNSFSAKVVYIKSIPELLTESKCLDRTECVLSFEVKNVETNNFMLLTEPKNSKLINPNIKVVEVKEVGMTPMGKIFEITLSADAIAPFVLLDFKLDSGLKGQFIENGFFIFGGLKRVRLQTKATTTPQQIKDNLIFKTVTDVI
Cheers
Hi @bheimbu,
In that case, based on your groups.yaml file all hits belong to Ingroup taxa.
Cheers, Georgios
But why is it not listed in the classification_results.txt file then, see …
query name donor ingroup AI HGTindex query hits number AHS outg_pct
229769_8676 Uniref90|UniRef90_A0A4R7CTD7:14:38.4:8.48e-75:250 Uniref90|UniRef90_E7CIQ5:1:45.4:4.62e-100:311 -58.17194306956131 -61.0 500 26007.388157305424 99
229769_3470 :::: Uniref90|UniRef90_A0A834V977:15:42.8:9.47e-248:726 -460.51701859880916 -726.0 500 -41372.008650311625 0
229769_4622 Uniref90|UniRef90_A0A2S9JLU1:1:49.8:1.58e-130:395 Uniref90|UniRef90_E7CIQ5:252:43.2:2.87e-103:321 62.766684693571875 74.0 500 42835.27786623576 99
229769_7930 Uniref90|UniRef90_L8GME8:7:50.8:5.29e-130:387 Uniref90|UniRef90_A0A8J2WLW8:39:37.2:4.09e-64:218 151.7133428617915 169.0 500 2411.325978706731 94
229769_4131 Uniref90|UniRef90_UPI00093AA699:24:63.7:1.35e-210:615 Uniref90|UniRef90_A0AA38IBE4:2:77.3:2.96e-260:730 -1.3500311979441904e-10 -115.0 500 140346.83844536747 94
229769_22899 Uniref90|UniRef90_A0A7C3TB98:3:41.8:1.05e-38:149 Uniref90|UniRef90_A0A9Q0N8B2:240:33.5:5.51e-18:90.5 47.70947631887631 58.5 500 2152.0361571608355 99
…these are six protein sequences, but the classification_results.txt file gives only 5 classifications…
HGT_complex : 1
NM_Eukaryota_complex : 0
Eukaryota : 1
Viridiplantae : 0
Fungi : 0
Oomycota : 0
Prokaryota_complex : 0
Bacteria : 0
Archaea : 0
Viroids : 0
Viruses : 0
Complex : 0
Ingroup : 3
Unknown : 0
Cheers
Hi @bheimbu ,
Ok now I understand.
Within config.yaml these options control which proteins to consider for downstream analyses
ai_cutoff: 0
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "ai or ahs" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"
Cheers, Georgios
Ok, so I have to adjust the config.yaml file? Because mine looks like this…
## Algorithm options
# prepare
ai_cutoff: 30
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "(ai or ahs) and outg_pct" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"
Cheers
Bastian
Hi @bheimbu,
If you want to select everything the following changes will do the trick since ai cannot go below -460.5
ai_cutoff: -500
selection: "ai"
Cheers, Georgios
Ok, I'll try.
Thx Bastian
Hi,
even though I've adjusted my config.yaml file, it still gives me…
query name donor ingroup AI HGTindex query hits number AHS outg_pct
229769_3470 :::: Uniref90|UniRef90_A0A834V977:15:42.8:9.47e-248:726 -460.51701859880916 -726.0 500 -41372.008650311625 0
229769_8676 Uniref90|UniRef90_A0A4R7CTD7:14:38.4:8.48e-75:250 Uniref90|UniRef90_E7CIQ5:1:45.4:4.62e-100:311 -58.17194306956131 -61.0 500 26007.388157305424 99
229769_7930 Uniref90|UniRef90_L8GME8:7:50.8:5.29e-130:387 Uniref90|UniRef90_A0A8J2WLW8:39:37.2:4.09e-64:218 151.7133428617915 169.0 500 2411.325978706731 94
229769_4131 Uniref90|UniRef90_UPI00093AA699:24:63.7:1.35e-210:615 Uniref90|UniRef90_A0AA38IBE4:2:77.3:2.96e-260:730 -1.3500311979441904e-10 -115.0 500 140346.83844536747 94
229769_22899 Uniref90|UniRef90_A0A7C3TB98:3:41.8:1.05e-38:149 Uniref90|UniRef90_A0A9Q0N8B2:240:33.5:5.51e-18:90.5 47.70947631887631 58.5 500 2152.0361571608355 99
229769_4622 Uniref90|UniRef90_A0A2S9JLU1:1:49.8:1.58e-130:395 Uniref90|UniRef90_E7CIQ5:252:43.2:2.87e-103:321 62.766684693571875 74.0 500 42835.27786623576 99
…and here is my config.yaml file
---
max_threads: 20
# DB path
blast_db_path: /usr/users/bheimbu/bin/scratch/orthologs_oribatida/avp/databases/uniref90.fasta.dmnd
fasta_path: /usr/users/bheimbu/bin/scratch/orthologs_oribatida/avp/databases/uniref90.fasta.fixed.gz # diamond: change to the local fasta path for sp, ur90, or custom database
mode: ur90 # use blast for blast database, use sp for swissprot database, ur90 for uniref90 or custom database
data_type: AA # data type DNA, AA
## Algorithm options
# prepare
ai_cutoff: -500
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "ai" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"
percent_identity: 100 # select hits equal or below this number
cutoffextend: 20 # when ingroup hit is found, we take this hit + n hits
trimal: true
min_num_hits: 4 # select queries with at least that many blast hits
percentage_similar_hits: 0.7 # group queries based on this
# detect, classify, evaluate
fastml: true # Use fasttree instead of IQTree
node_support: 0 # nodes below that number will collapse
complex_per_ingroup: 20 # if D/(D+I) smaller than this then node is considered Ingroup
complex_per_donor: 80 # if D/(D+I) greater than this then node is considered Donor
complex_per_node: 90 # if node contains percent number of this category, it is assigned
# Program specific options
mafft_options: '--anysymbol --auto'
trimal_options: '-automated1'
#IQ-Tree
iqmodel: '-mset WAG,LG,JTT -AICc -mrate E,I,G,R'
ufbootstrap: 1000
iq_threads: 20
Cheers Bastian
Hi @bheimbu,
But why is it not listed in the classification_results.txt file
This was because it wasn't selected for downstream analyses based on your previous criteria
even though I've adjusted my
config.yaml file, it still gives me
Why will that change? As stated before, all the hits for 229769_3470 are from the ingroup taxa.
Can you explain in a bit more detail what is the issue?
Cheers, Georgios
Hi,
I have many protein sequences from different genomes, in the end I want to summarize everything in a table. And if some proteins are not selected for analysis, I have to count them by hand, more or less.
What I dont get is the difference between saying " all the hits for 229769_3470 are from the ingroup taxa" and the later assignment of Ingroup in classification_results.txt -- isn't that technically speaking the same?!
Cheers Bastian
Hi @bheimbu,
I have many protein sequences from different genomes, in the end I want to summarize everything in a table. And if some proteins are not selected for analysis, I have to count them by hand, more or less.
Ok didn't the new run fix this? Did you run classify after?
What I dont get is the difference between saying " all the hits for 229769_3470 are from the ingroup taxa" and the later assignment of Ingroup in classification_results.txt -- isn't that technically speaking the same?!
It is; most of the time people are interested in HGTs rather than non-HGTs and filtering low probable ones with the parameters speeds up the runtime. That is why they are not considered for downstream analyses, including avp classify .
If that is useful though, I can modify the script to take into account the proteins that were not selected.
Hope that helps, Georgios
I did run everything (i.e. all steps there are in avp including classify, and even hgt_local_score), it didn't change anything.
For me, it would be helpful indeed. But I can also do it by hand.
I did run everything (i.e. all steps there are in avp including classify, and even hgt_local_score), it didn't change anything.
Strange, can you send me the folder (zip) to my personal email?
For me, it would be helpful indeed. But I can also do it by hand.
Ok! I will add this functionality
You mean the two folders, one for…
# prepare
ai_cutoff: 30
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "(ai or ahs) and outg_pct" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"
…and the other one for…
# prepare
ai_cutoff: -500
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "ai" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"
?
Just for the second one
sent it via email.
thx Bastian
Hi,
did you find the problem?
Cheers Bastian
Hi @bheimbu,
I have released a new version.
You can now provide the [ai.out] to avp classify with -a and it will add the number of excluded queries to the Ingroup classification.
Cheers, Georgios
Many thx.
Cheers
After downloading the latest release,
I get the following error related to avp prepare…
$ ../avp prepare \
> -a output/"$species"/"$species".out_ai.out \
> -o output/"$species" \
> -f input/"$species".fa \
> -b output/"$species"/"$species".out \
> -x groups.yaml \
> -c config.yaml
[+] Setting up
Traceback (most recent call last):
File "/scratch1/users/bheimbu/orthologs_oribatida/avp/08082024/../avp", line 7, in <module>
main()
File "/scratch1/users/bheimbu/orthologs_oribatida/avp/depot/interface.py", line 29, in main
prepare.main()
File "/scratch1/users/bheimbu/orthologs_oribatida/avp/depot/prepare.py", line 105, in main
number_hits_noingroup = config_opts["number_hits_noingroup"]
~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'number_hits_noingroup'
Cheers Bastian
Hi @bheimbu,
Sorry about that! A new parameter was added in the version 1.0.6 to the config file.
Added parameter
number_hits_noingroupto define the max number of hits that are selected for a query when there is no ingroup hit (previously fixed at 50).
Just add number_hits_noingroup: 50 after cutoffextend: 20.
Cheers, Georgios
Ok,
now it runs.
Thx Bastian