AvP icon indicating copy to clipboard operation
AvP copied to clipboard

Error parsing GFF3 file for AvP hgt local score analysis

Open aldendirks opened this issue 2 years ago • 9 comments

Your software has worked really well for me! I'm now trying to run the HGT local score script but am getting errors parsing the GFF3 file.

Traceback (most recent call last):
  File "/nfs/turbo/lsa-tyjames/mycology/Alden/software/AvP/aux_scripts/hgt_local_score.py", line 169, in <module>
    main()
  File "/nfs/turbo/lsa-tyjames/mycology/Alden/software/AvP/aux_scripts/hgt_local_score.py", line 86, in main
    gene_scaf = gene_location[gene]
                ~~~~~~~~~~~~~^^^^^^
KeyError: 'Hydnotrya_cerebriformis_MICH67763_scaffold_3601_FUN_011721'

I wasn't sure if the GFF3 file should be the full annotations (gene, mRNA, exon, and CDS) or just the lines that are annotated as "gene" or just those annotated as "mRNA". I've tried all the different combos and it gives the same error. When I grep that ID it is found as a match in the file. My GFF3 has 9 columns in the same format specified so I'm not sure why it can't find the right line.

aldendirks avatar Nov 04 '23 17:11 aldendirks

It ended up working with just the mRNA lines of the GFF3 file. There was a -T1 appended to the IDs from Funannotate so I had to remove those then it worked. Thanks!

aldendirks avatar Nov 04 '23 17:11 aldendirks

Hi,

get the same error:

$ ../../aux_scripts/hgt_local_score.py  -f NANCO@999999009@3_mRNA.gff -a 2000bp_og_core.out_ai.out_NANCO -t fasttree_general_results_NANCO.txt -m 0
Traceback (most recent call last):
  File "/scratch1/users/bheimbu/avp/oribatida/hgt_local_score/../../aux_scripts/hgt_local_score.py", line 169, in <module>
    main()
  File "/scratch1/users/bheimbu/avp/oribatida/hgt_local_score/../../aux_scripts/hgt_local_score.py", line 86, in main
    gene_scaf = gene_location[gene]
                ~~~~~~~~~~~~~^^^^^^
KeyError: 'scaffold7841_size5308'

I have no clue how to fix this, names in all files are the same like scaffold7841_size5308.

My gff file looks like this...

scaffold100001_size446	MetaEuk	mRNA	133	432	134	-	.	ID=mRNA1;Parent=gene1
scaffold100004_size446	MetaEuk	mRNA	6	446	140	+	.	ID=mRNA2;Parent=gene2
scaffold100006_size446	MetaEuk	mRNA	191	442	74	-	.	ID=mRNA3;Parent=gene3
scaffold10000_size4538	MetaEuk	mRNA	2369	3908	412	+	.	ID=mRNA4;Parent=gene4
scaffold100010_size446	MetaEuk	mRNA	97	267	63	-	.	ID=mRNA5;Parent=gene5
scaffold100017_size446	MetaEuk	mRNA	4	162	67	-	.	ID=mRNA6;Parent=gene6
scaffold100018_size446	MetaEuk	mRNA	282	431	59	-	.	ID=mRNA7;Parent=gene7
scaffold10001_size4538	MetaEuk	mRNA	3920	4036	54	-	.	ID=mRNA8;Parent=gene8
scaffold100023_size446	MetaEuk	mRNA	2	160	78	+	.	ID=mRNA9;Parent=gene9
scaffold100029_size446	MetaEuk	mRNA	3	386	94	-	.	ID=mRNA10;Parent=gene10
scaffold10002_size4538	MetaEuk	mRNA	618	1145	143	+	.	ID=mRNA11;Parent=gene11
scaffold10002_size4538	MetaEuk	mRNA	2501	3688	456	-	.	ID=mRNA12;Parent=gene12

fasttree_general_results.txt like this...

NO	2000bp_og_core/mafftgroups/gp1.fa	/scratch1/users/bheimbu/avp/oribatida/2000bp_og_core/fasttree/gp1.fa.fasttree	scaffold7428_size4981
HGT	2000bp_og_core/mafftgroups/gp17.fa	/scratch1/users/bheimbu/avp/oribatida/2000bp_og_core/fasttree/gp17.fa.fasttree	scaffold7841_size5308
COMPLEX	2000bp_og_core/mafftgroups/gp33.fa	/scratch1/users/bheimbu/avp/oribatida/2000bp_og_core/fasttree/gp33.fa.fasttree	scaffold5080_size6865

and *out_ai.out file like this...

query name	donor	ingroup	AI	HGTindex	query hits number	AHS	outg_pct
scaffold7428_size4981	Uniref90|UniRef90_UPI00236585EA:16:65.6:0.0:930	Uniref90|UniRef90_A0AA38IBE4:2:73.1:0.0:1045	0.0	-115.0	500	214760.3805765278	93
scaffold15819_size3298	Uniref90|UniRef90_A0A918QQU7:11:65.7:0.0:937	Uniref90|UniRef90_A0AA38IBE4:2:72.6:0.0:1039	0.0	-102.0	500	216133.42592125913	93
scaffold7841_size5308	Uniref90|UniRef90_A0A919J7Y8:1:66.0:9.64e-45:154	Uniref90|UniRef90_A0A914DH25:302:56.6:2.39e-33:129	26.236393373249513	25.0	500	18459.662785971963	99
scaffold7428_size4981	Uniref90|UniRef90_UPI00236585EA:16:65.6:0.0:930	Uniref90|UniRef90_A0AA38IBE4:2:73.1:0.0:1045	0.0	-115.0	500	214718.58133127296	93
scaffold12196_size3834	Uniref90|UniRef90_A0A932I9T7:207:40.1:5.21e-100:317	Uniref90|UniRef90_UPI000719BA09:4:49.0:3.26e-123:380	-53.42830979924969	-63.0	500	-30065.837801365476	2
scaffold9970_size4547	::::	Uniref90|UniRef90_UPI0008114012:4:43.3:1.44e-138:420	-317.3920997195904	-420.0	500	-17870.776102456446	0

Cheers Bastian

bheimbu avatar Apr 11 '24 09:04 bheimbu

Hi @bheimbu,

Your scaffold names are also your gene names. So there is some issue there with how you generated your input.

Cheers, Georgios

GDKO avatar Apr 11 '24 09:04 GDKO

Hi @GDKO,

please can you check my attach files, if there are correct? However, I still get the same error (KeyError: 'A7LXT0_NANCO_999999009_3_scaffold7841_size5308' ), so obviously not.

2000bp_og_core.out_ai.out_NANCO.txt fasttree_general_results_NANCO.txt NANCO@999999009@3_mRNA.gff.txt

I didn't do the annotation, got the annotations from someone else.

Cheers Bastian

bheimbu avatar Apr 11 '24 09:04 bheimbu

Hi @bheimbu,

Your gene names in 2000bp_og_core.out_ai.out_NANCO.txt and fasttree_general_results_NANCO.txt are the same. However, your gene names in the gff file are not the same as the ones you used for running AvP. In the gff file they are named mRNA1, mRNA2 ... etc.

Cheers, Georgios

GDKO avatar Apr 11 '24 09:04 GDKO

Hi @GDKO

that is, I should rerun the annotation or how to proceed?

To give you a bit more background info: I've used fDOG to search for orthologs of bacterial and fungal origin in oribatid genomes. The found orthologs were then used to search for HGTs with avp. That's why my names are so differently between gff and other files.

Cheers Bastian

bheimbu avatar Apr 11 '24 10:04 bheimbu

Hi @bheimbu,

The hgt_local_score was designed to check the locality of the genes detected as potential HGTs from AvP in the genome assembly. The idea is that if a gene is an HGT then it will be in a region surrounded by non-HGT genes. There can be a case of course that a bigger region was inserted into the genome. So this score can identify regions which are either contamination in the genome assembly or HGT-rich regions.

I think you have misunderstood the purpose of this score. To run this program you need the genes and the gff of one assembly. For example, you have data from one oribatid genome. You run the AvP pipeline to identify potential HGT candidates from your gene set and then you run hgt_local_score to see in which regions in your genome those genes fall.

Cheers, Georgios

GDKO avatar Apr 15 '24 08:04 GDKO

Hi @GDKO

I got the idea right, trust me. But I only want to know whether certain genes (found by fDOG) represent true HGTs or not. It's actually just a matter of how my data (i.e. gff files and output from fDOG) assigns names, right? So what I have to do is to manipulate my data to get it in the right format.

My question would be: Do you have an example data set to see how the files should look like?

Cheers Bastian

bheimbu avatar Apr 15 '24 09:04 bheimbu

See here https://github.com/GDKO/AvP/wiki/tutorial

GDKO avatar Apr 15 '24 09:04 GDKO

Added a script to rename MetaEuk output files.

GDKO avatar Jul 22 '24 09:07 GDKO