AvP icon indicating copy to clipboard operation
AvP copied to clipboard

Question about avp prepare output

Open bheimbu opened this issue 1 year ago • 23 comments

Hi @GDKO,

it's me again @bheimbu. I have a question concerning the output of avp prepare

query name	donor	ingroup	AI	HGTindex	query hits number	AHS	outg_pct
229769_8676	Uniref90|UniRef90_A0A4R7CTD7:14:38.4:8.48e-75:250	Uniref90|UniRef90_E7CIQ5:1:45.4:4.62e-100:311	-58.17194306956131	-61.0	500	26007.388157305424	99
229769_3470	::::	Uniref90|UniRef90_A0A834V977:15:42.8:9.47e-248:726	-460.51701859880916	-726.0	500	-41372.008650311625	0
229769_4622	Uniref90|UniRef90_A0A2S9JLU1:1:49.8:1.58e-130:395	Uniref90|UniRef90_E7CIQ5:252:43.2:2.87e-103:321	62.766684693571875	74.0	500	42835.27786623576	99
229769_7930	Uniref90|UniRef90_L8GME8:7:50.8:5.29e-130:387	Uniref90|UniRef90_A0A8J2WLW8:39:37.2:4.09e-64:218	151.7133428617915	169.0	500	2411.325978706731	94
229769_4131	Uniref90|UniRef90_UPI00093AA699:24:63.7:1.35e-210:615	Uniref90|UniRef90_A0AA38IBE4:2:77.3:2.96e-260:730	-1.3500311979441904e-10	-115.0	500	140346.83844536747	94
229769_22899	Uniref90|UniRef90_A0A7C3TB98:3:41.8:1.05e-38:149	Uniref90|UniRef90_A0A9Q0N8B2:240:33.5:5.51e-18:90.5	47.70947631887631	58.5	500	2152.0361571608355	99

Why is 229769_3470 excluded from downstream analyses? It's actually a well-annotated contig including several GH2s etc., see below.

Bildschirmfoto vom 2024-09-11 13-15-29

Cheers @bheimbu

ps: This also happens for other species I'm studying.

bheimbu avatar Sep 11 '24 11:09 bheimbu

Hi @bheimbu,

In the case of 229769_3470 there are no hits from donor species and only hits from ingroup taxa. So all the metrics will be negative.

Just to be sure is 229769_3470 a contig or a protein?

Cheers, Georgios

Cheers, Georgios

GDKO avatar Sep 12 '24 09:09 GDKO

Hi,

actually it's a protein sequence looking like this…

>229769_3470
IGIVFTLYLVSICSADYLSLDGNEWTANSYNKAISVKGVTVPGSIYSDLRTAGILKNLYAFKNDINYRWVSYDNWTYERTFVDSTLLNAQTVNLVANGIDTVSTVYINDKIVGKTNNQFIRYKFDVKNVLKNGENNIRIAFESAPIYARIQSEDFKKKYNYTVDCNDKVYRGECHFNFIRKMASSFAWDWGPAFPTQGIWKSIGIEAKDKPIIRDITVETIPDDKYIDWTLNFRLFMESAPNQEFDAIFNISLEGYPLINLEKKVRTDSNGNCVAMFTTPTKLIPIIPWFPNGVGNNLQTLYRLRVELSFKDSKEISIQTKKIGFRTIELVQEPIKPEGLTFYFKVNGLPFYAKGTNWIPANVLMEDLTPQYIRHLLGSVKQSNMNMMRVWGGGVYESDLFYEIADEMGIMIWQDMMYACALYPATKQFLEIVEPEVTQQIRRLQHHPSIAIWSGNNENEGAVYVFPNLTLHKRDYVELYINHIMKIILQEDTTRPFVPSSLSNGDEEKKEGWISKDSQDERYGDVHYYNYDAPLWDWKIYPNGKFTSEFGYNSYSSLETFLEVLEQKDLTFPLSDNMEWRQHHPGGTQSMFKEIGISITYPSNGGVQRFNDLIYLSQIMQAMAMKTEVEHYRRNRDIDPKTGKGNTMGALYWQLNDIWQGSTWASIEYGGKWKALQSYAIQFFDNLLVSPYEDTNQTLKVSLVRDDYFGELEFVLSLKVYKWNENKPIHEINEKVKTNSFSAKVVYIKSIPELLTESKCLDRTECVLSFEVKNVETNNFMLLTEPKNSKLINPNIKVVEVKEVGMTPMGKIFEITLSADAIAPFVLLDFKLDSGLKGQFIENGFFIFGGLKRVRLQTKATTTPQQIKDNLIFKTVTDVI

Cheers

bheimbu avatar Sep 12 '24 09:09 bheimbu

Hi @bheimbu,

In that case, based on your groups.yaml file all hits belong to Ingroup taxa.

Cheers, Georgios

GDKO avatar Sep 12 '24 09:09 GDKO

But why is it not listed in the classification_results.txt file then, see …

query name	donor	ingroup	AI	HGTindex	query hits number	AHS	outg_pct
229769_8676	Uniref90|UniRef90_A0A4R7CTD7:14:38.4:8.48e-75:250	Uniref90|UniRef90_E7CIQ5:1:45.4:4.62e-100:311	-58.17194306956131	-61.0	500	26007.388157305424	99
229769_3470	::::	Uniref90|UniRef90_A0A834V977:15:42.8:9.47e-248:726	-460.51701859880916	-726.0	500	-41372.008650311625	0
229769_4622	Uniref90|UniRef90_A0A2S9JLU1:1:49.8:1.58e-130:395	Uniref90|UniRef90_E7CIQ5:252:43.2:2.87e-103:321	62.766684693571875	74.0	500	42835.27786623576	99
229769_7930	Uniref90|UniRef90_L8GME8:7:50.8:5.29e-130:387	Uniref90|UniRef90_A0A8J2WLW8:39:37.2:4.09e-64:218	151.7133428617915	169.0	500	2411.325978706731	94
229769_4131	Uniref90|UniRef90_UPI00093AA699:24:63.7:1.35e-210:615	Uniref90|UniRef90_A0AA38IBE4:2:77.3:2.96e-260:730	-1.3500311979441904e-10	-115.0	500	140346.83844536747	94
229769_22899	Uniref90|UniRef90_A0A7C3TB98:3:41.8:1.05e-38:149	Uniref90|UniRef90_A0A9Q0N8B2:240:33.5:5.51e-18:90.5	47.70947631887631	58.5	500	2152.0361571608355	99

…these are six protein sequences, but the classification_results.txt file gives only 5 classifications…

HGT_complex		: 1
NM_Eukaryota_complex	: 0
Eukaryota		: 1
Viridiplantae		: 0
Fungi			: 0
Oomycota		: 0
Prokaryota_complex	: 0
Bacteria		: 0
Archaea			: 0
Viroids			: 0
Viruses			: 0
Complex			: 0
Ingroup			: 3
Unknown			: 0

Cheers

bheimbu avatar Sep 12 '24 09:09 bheimbu

Hi @bheimbu ,

Ok now I understand.

Within config.yaml these options control which proteins to consider for downstream analyses

ai_cutoff: 0
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "ai or ahs" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"

Cheers, Georgios

GDKO avatar Sep 12 '24 09:09 GDKO

Ok, so I have to adjust the config.yaml file? Because mine looks like this…

## Algorithm options
# prepare
ai_cutoff: 30
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "(ai or ahs) and outg_pct" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"

Cheers

Bastian

bheimbu avatar Sep 12 '24 09:09 bheimbu

Hi @bheimbu,

If you want to select everything the following changes will do the trick since ai cannot go below -460.5

ai_cutoff: -500
selection: "ai"

Cheers, Georgios

GDKO avatar Sep 12 '24 09:09 GDKO

Ok, I'll try.

Thx Bastian

bheimbu avatar Sep 12 '24 09:09 bheimbu

Hi,

even though I've adjusted my config.yaml file, it still gives me…

query name	donor	ingroup	AI	HGTindex	query hits number	AHS	outg_pct
229769_3470	::::	Uniref90|UniRef90_A0A834V977:15:42.8:9.47e-248:726	-460.51701859880916	-726.0	500	-41372.008650311625	0
229769_8676	Uniref90|UniRef90_A0A4R7CTD7:14:38.4:8.48e-75:250	Uniref90|UniRef90_E7CIQ5:1:45.4:4.62e-100:311	-58.17194306956131	-61.0	500	26007.388157305424	99
229769_7930	Uniref90|UniRef90_L8GME8:7:50.8:5.29e-130:387	Uniref90|UniRef90_A0A8J2WLW8:39:37.2:4.09e-64:218	151.7133428617915	169.0	500	2411.325978706731	94
229769_4131	Uniref90|UniRef90_UPI00093AA699:24:63.7:1.35e-210:615	Uniref90|UniRef90_A0AA38IBE4:2:77.3:2.96e-260:730	-1.3500311979441904e-10	-115.0	500	140346.83844536747	94
229769_22899	Uniref90|UniRef90_A0A7C3TB98:3:41.8:1.05e-38:149	Uniref90|UniRef90_A0A9Q0N8B2:240:33.5:5.51e-18:90.5	47.70947631887631	58.5	500	2152.0361571608355	99
229769_4622	Uniref90|UniRef90_A0A2S9JLU1:1:49.8:1.58e-130:395	Uniref90|UniRef90_E7CIQ5:252:43.2:2.87e-103:321	62.766684693571875	74.0	500	42835.27786623576	99

…and here is my config.yaml file

---
max_threads: 20

# DB path
blast_db_path: /usr/users/bheimbu/bin/scratch/orthologs_oribatida/avp/databases/uniref90.fasta.dmnd
fasta_path: /usr/users/bheimbu/bin/scratch/orthologs_oribatida/avp/databases/uniref90.fasta.fixed.gz  # diamond: change to the local fasta path for sp, ur90, or custom database
mode: ur90    # use blast for blast database, use sp for swissprot database, ur90 for uniref90 or custom database
data_type: AA    # data type DNA, AA

## Algorithm options
# prepare
ai_cutoff: -500
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "ai" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"
percent_identity: 100   # select hits equal or below this number
cutoffextend: 20    # when ingroup hit is found, we take this hit + n hits
trimal: true
min_num_hits: 4   # select queries with at least that many blast hits
percentage_similar_hits: 0.7  # group queries based on this
# detect, classify, evaluate
fastml: true  # Use fasttree instead of IQTree
node_support: 0  # nodes below that number will collapse
complex_per_ingroup: 20   # if D/(D+I) smaller than this then node is considered Ingroup
complex_per_donor: 80   # if D/(D+I) greater than this then node is considered Donor
complex_per_node: 90  # if node contains percent number of this category, it is assigned

# Program specific options
mafft_options: '--anysymbol --auto'
trimal_options: '-automated1'

#IQ-Tree
iqmodel: '-mset WAG,LG,JTT -AICc -mrate E,I,G,R'
ufbootstrap: 1000
iq_threads: 20

Cheers Bastian

bheimbu avatar Sep 12 '24 11:09 bheimbu

Hi @bheimbu,

But why is it not listed in the classification_results.txt file

This was because it wasn't selected for downstream analyses based on your previous criteria

even though I've adjusted my config.yaml file, it still gives me

Why will that change? As stated before, all the hits for 229769_3470 are from the ingroup taxa.

Can you explain in a bit more detail what is the issue?

Cheers, Georgios

GDKO avatar Sep 12 '24 11:09 GDKO

Hi,

I have many protein sequences from different genomes, in the end I want to summarize everything in a table. And if some proteins are not selected for analysis, I have to count them by hand, more or less.

What I dont get is the difference between saying " all the hits for 229769_3470 are from the ingroup taxa" and the later assignment of Ingroup in classification_results.txt -- isn't that technically speaking the same?!

Cheers Bastian

bheimbu avatar Sep 12 '24 11:09 bheimbu

Hi @bheimbu,

I have many protein sequences from different genomes, in the end I want to summarize everything in a table. And if some proteins are not selected for analysis, I have to count them by hand, more or less.

Ok didn't the new run fix this? Did you run classify after?

What I dont get is the difference between saying " all the hits for 229769_3470 are from the ingroup taxa" and the later assignment of Ingroup in classification_results.txt -- isn't that technically speaking the same?!

It is; most of the time people are interested in HGTs rather than non-HGTs and filtering low probable ones with the parameters speeds up the runtime. That is why they are not considered for downstream analyses, including avp classify .

If that is useful though, I can modify the script to take into account the proteins that were not selected.

Hope that helps, Georgios

GDKO avatar Sep 12 '24 12:09 GDKO

I did run everything (i.e. all steps there are in avp including classify, and even hgt_local_score), it didn't change anything.

For me, it would be helpful indeed. But I can also do it by hand.

bheimbu avatar Sep 12 '24 12:09 bheimbu

I did run everything (i.e. all steps there are in avp including classify, and even hgt_local_score), it didn't change anything.

Strange, can you send me the folder (zip) to my personal email?

For me, it would be helpful indeed. But I can also do it by hand.

Ok! I will add this functionality

GDKO avatar Sep 12 '24 12:09 GDKO

You mean the two folders, one for…

# prepare
ai_cutoff: 30
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "(ai or ahs) and outg_pct" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"

…and the other one for…

# prepare
ai_cutoff: -500
ahs_cutoff: 0
outg_pct_cutoff: 80
selection: "ai" # select sequences based on which metrics, another example "(ai or ahs) and outg_pct"

?

bheimbu avatar Sep 12 '24 12:09 bheimbu

Just for the second one

GDKO avatar Sep 12 '24 12:09 GDKO

sent it via email.

thx Bastian

bheimbu avatar Sep 12 '24 12:09 bheimbu

Hi,

did you find the problem?

Cheers Bastian

bheimbu avatar Sep 13 '24 07:09 bheimbu

Hi @bheimbu,

I have released a new version.

You can now provide the [ai.out] to avp classify with -a and it will add the number of excluded queries to the Ingroup classification.

Cheers, Georgios

GDKO avatar Sep 13 '24 08:09 GDKO

Many thx.

Cheers

bheimbu avatar Sep 13 '24 08:09 bheimbu

After downloading the latest release,

I get the following error related to avp prepare

$ ../avp prepare \
> -a output/"$species"/"$species".out_ai.out \
> -o output/"$species" \
> -f input/"$species".fa \
> -b output/"$species"/"$species".out \
> -x groups.yaml \
> -c config.yaml
[+] Setting up
Traceback (most recent call last):
  File "/scratch1/users/bheimbu/orthologs_oribatida/avp/08082024/../avp", line 7, in <module>
    main()
  File "/scratch1/users/bheimbu/orthologs_oribatida/avp/depot/interface.py", line 29, in main
    prepare.main()
  File "/scratch1/users/bheimbu/orthologs_oribatida/avp/depot/prepare.py", line 105, in main
    number_hits_noingroup = config_opts["number_hits_noingroup"]
                            ~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'number_hits_noingroup'

Cheers Bastian

bheimbu avatar Sep 13 '24 08:09 bheimbu

Hi @bheimbu,

Sorry about that! A new parameter was added in the version 1.0.6 to the config file.

Added parameter number_hits_noingroup to define the max number of hits that are selected for a query when there is no ingroup hit (previously fixed at 50).

Just add number_hits_noingroup: 50 after cutoffextend: 20.

Cheers, Georgios

GDKO avatar Sep 13 '24 08:09 GDKO

Ok,

now it runs.

Thx Bastian

bheimbu avatar Sep 13 '24 08:09 bheimbu