Update fails : no ID
Are you using the latest release? I am using version 1.8.7
Describe the bug Funannotate update fails at the end while comparing annotations. 2 models have no ID. I ran train and predict, both ran successfully and the output is in a directory called 'fun'
What command did you issue? funannotate update -i fun --cpus 8
Logfiles I had a very similar issue to issue #506, but I couldn't find the solution there. Here's the check-version, log files and tbl file. I have two loci without any ID: FUN_004038, FUN_012373
Here's what the error looked like:
[Jul 31 04:10 PM]: Writing 12,932 loci to TBL format: dropped 0 overlapping, 2 too short, and 0 frameshift gene models [Jul 31 04:10 PM]: Converting to Genbank format [Jul 31 04:14 PM]: Collecting final annotation files [Jul 31 04:15 PM]: Parsing GenBank files...comparing annotation [Jul 31 04:15 PM]: putative transcript from ncbi:FUN_004038-T1 has no ID (ncbi:FUN_004038-T1 None ncbi:FUN_004038-T1) [Jul 31 04:15 PM]: putative transcript from ncbi:FUN_012373-T1 has no ID (ncbi:FUN_012373-T1 None ncbi:FUN_012373-T1) Traceback (most recent call last): File "/local/cluster/funannotate/bin/funannotate", line 10, in sys.exit(main()) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/funannotate.py", line 705, in main mod.main(arguments) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/update.py", line 2432, in main compareAnnotations2(GBK, final_gbk, Changes, args=args) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/update.py", line 1424, in compareAnnotations2 newGenes[gene[2]]['CDS'], hitInfo['CDS']) File "/local/cluster/funannotate-1.8.7/lib/python3.7/site-packages/funannotate/update.py", line 1559, in pairwiseAED splitAED = [pAED[i:i+len(query)] for i in range(0, len(pAED), len(query))] ValueError: range() arg 3 must not be zero
fun_check_version.txt Rhizopogon_salebrosus.tbl.txt update_log_7_30.e9819512.txt
I don't see anything wrong in the tbl file for these two models, other than they are partial. In this function the script is actually parsing the genbank file -- can you post what the gene models look like in genbank format? (you can also just post the genbank file and I can look).
I'm having difficulties uploading the genbank file so I will email you it for now, but here's the gene models from that file ///////////////////////////////////////////////// first model
gene 173086..>173862
/locus_tag="FUN_004038"
mRNA join(173086..173143,173244..173498,173554..173670,
173722..>173862)
/locus_tag="FUN_004038"
/product="hypothetical protein"
CDS join(173315..173498,173554..173670,173722..>173867)
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:FUN_004038-T1"
/translation="MDITQITQNSQSAQSSQSAKLNKCAAQRRYRKKEGQGFLQLRQA
LKEVSKDPRSKQDTLRRASSELRRLADENKLLLQAIIEAWNNTGEFDKDDHLRPASDE
DPLLCMEWWSHDTTKMADITRSDMARHCVLSASVLPTINMFTSPEXX"
assembly_gap 173863..174515
/estimated_length=653
/gap_type="within scaffold"
/linkage_evidence="paired-ends"
ORIGIN .(etc)
//////////////////////////////////////////////////////////////////////// next model
LOCUS scaffold_1074 2585 bp DNA linear PLN 31-JUL-2021
DEFINITION Rhizopogon salebrosus.
ACCESSION
VERSION
KEYWORDS .
SOURCE Rhizopogon salebrosus
ORGANISM Rhizopogon salebrosus
Eukaryota; Fungi; Dikarya; Basidiomycota; Agaricomycotina;
Agaricomycetes; Agaricomycetidae; Boletales; Suillineae;
Rhizopogonaceae; Rhizopogon.
REFERENCE 1 (bases 1 to 2585)
AUTHORS Palmer,J.M.
TITLE Direct Submission
JOURNAL Submitted (31-JUL-2021) CFMR, USDA Forest Service, 1 Gifford
Pinchot Drive, Madison, WI 53726, USA
COMMENT 'Annotated using 1.8.7'.
FEATURES Location/Qualifiers
source 1..2585
/organism="Rhizopogon salebrosus"
/mol_type="genomic DNA"
/db_xref="taxon:176626"
gene 1..1042
/locus_tag="FUN_012372"
mRNA join(1..234,281..337,395..419,478..606,664..786,834..1042)
/locus_tag="FUN_012372"
/product="hypothetical protein"
CDS join(63..234,281..337,395..419,478..606,664..786,834..924)
/locus_tag="FUN_012372"
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:FUN_012372-T1"
/translation="MFTLIMTNNPEIQERAQSEIDKVVGFNRLPNFNDLPALPYVGAL
IREMKRWHPVGPLGMAHSNMEDDFYKGYFIPKGASIVPNIWAMAHNPKKYPEPECFRP
DRFLKPDGTLNDDTLPWIFGFGRRRCPGTHVADASLWCAITCILATFTIKKLVGQEPE
IKWASGLSSYPLPFPCQFVPRKARDAQALTNLVQNSLL"
gene 1156..>1683
/locus_tag="FUN_012373"
mRNA join(1156..1231,1285..1421,1479..>1683)
/locus_tag="FUN_012373"
/product="hypothetical protein"
CDS join(1208..1231,1285..1421,1479..>1685)
/codon_start=1
/product="hypothetical protein"
/protein_id="ncbi:FUN_012373-T1"
/translation="MVMLVHAEHPPSPFIPRDQPEVVVSKHVKRHNDSVASKRSPTVG
ALDDHDGVYLKPNPKLCVTFVGMGSQHPAMGRQLADKYPSFLASIKENDRILMEVYGQ
PSLLDRTGLFVPGAKCDLSP"
assembly_gap 1684..1698
/estimated_length=15
/gap_type="within scaffold"
/linkage_evidence="paired-ends"
ORIGIN
oops didnt mean to close it :)
Okay, I see the problem now. Neither of those two models have locus_tags in the CDS feature. Both of which seem to be because they abut an assembly gap and for whatever reason NCBI's tbl2asn didn't write a locus_tag for the feature. I'm not sure why.... The code complains that the feature is not valid because it does not have a locus_tag -- without a locus_tag it is unable to assign the CDS to a gene model. NCBI GenBank format is odd in that there is actually nothing required to link the gene, mRNA, and CDS -- to make it worse, sometimes tbl2asn doesn't write features in the same order.
When I take a closer look at the tbl file you sent I think I see the problem. The CDS feature is actually outside of the gene and mRNA feature, for example:
173086 >173862 gene
locus_tag FUN_004038
173086 173143 mRNA
173244 173498
173554 173670
173722 >173862
product hypothetical protein
transcript_id gnl|ncbi|FUN_004038-T1_mrna
protein_id gnl|ncbi|FUN_004038-T1
173315 173498 CDS
173554 173670
173722 >173867
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|FUN_004038-T1_mrna
protein_id gnl|ncbi|FUN_004038-T1
So I think this means there must be a problem with the adjustment of this gene model in relation to the sequencing gap. The end coordinate of the CDS model should likely read >173862 or the gene and mRNA features should also extend to 173867.
The same is true of the other model:
1156 >1683 gene
locus_tag FUN_012373
1156 1231 mRNA
1285 1421
1479 >1683
product hypothetical protein
transcript_id gnl|ncbi|FUN_012373-T1_mrna
protein_id gnl|ncbi|FUN_012373-T1
1208 1231 CDS
1285 1421
1479 >1685
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|FUN_012373-T1_mrna
protein_id gnl|ncbi|FUN_012373-T1
I need to see if I can find where this might be happening, ie if it is something that funannotate is doing or if it just comes from PASA directly and is parsed incorrectly.
@skylerhar1 can you pull out the tbl format for these gene model from funannotate predict, ie in the predict_results/*.tbl file? That will help us determine if PASA modified these to be problematic of if they were that way from predict. Thanks.
Here it is! (converted to a txt file for github) Rhizopogon_salebrosus_fun_predict.txt
Okay, so PASA is making these changes as the model is different -- and then introducing these "errors". I should be able to find a way to fix them in the update script.
173315 >173808 gene
locus_tag FUN_004038
173315 173498 mRNA
173554 173670
173722 >173808
product hypothetical protein
transcript_id gnl|ncbi|FUN_004038-T1_mrna
protein_id gnl|ncbi|FUN_004038-T1
173315 173498 CDS
173554 173670
173722 >173808
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|FUN_004038-T1_mrna
protein_id gnl|ncbi|FUN_004038-T1
1208 >1421 gene
locus_tag FUN_012373
1208 1231 mRNA
1285 >1421
product hypothetical protein
transcript_id gnl|ncbi|FUN_012373-T1_mrna
protein_id gnl|ncbi|FUN_012373-T1
1208 1231 CDS
1285 >1421
codon_start 1
product hypothetical protein
transcript_id gnl|ncbi|FUN_012373-T1_mrna
protein_id gnl|ncbi|FUN_012373-T1
sounds good, let me know if there are any updates
Hi @skylerhar1 sorry for the delay. I'm unable to test this locally because I don't have a dataset that exhibits this behavior. So I pushed some updated code to a new branch and I will need your help to test.
You can install the updated code with this command from your funannotate environment:
python -m pip install git+https://github.com/nextgenusfs/funannotate.git@pasa_update_fix
And then if you could re-run your update command and see what the behavior is, ie does it error out? And then furthermore if you could get me these two gene models FUN_004038 and FUN_012373 I need to see if their coordinates are now fixed in relation to the sequence gap adjacent to each model.
Should I delete the update_results and update_misc files from the directory and then rerun update, or should I leave them in.
On Sat, Sep 11, 2021 at 11:39 AM Jon Palmer @.***> wrote:
[This email originated from outside of OSU. Use caution with links and attachments.]
Hi @skylerhar1 https://github.com/skylerhar1 sorry for the delay. I'm unable to test this locally because I don't have a dataset that exhibits this behavior. So I pushed some updated code to a new branch and I will need your help to test.
You can install the updated code with this command from your fun annotate environment:
python -m pip install @.***_update_fix
And then if you could re-run your update command and see what the behavior is, ie does it error out? And then furthermore if you could get me these two gene models FUN_004038 and FUN_012373 I need to see if their coordinates are now fixed in relation to the sequence gap adjacent to each model.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/nextgenusfs/funannotate/issues/624#issuecomment-917454214, or unsubscribe https://github.com/notifications/unsubscribe-auth/AQD3U45T4O6IWTO7LFAXUQDUBOO6PANCNFSM5BUPP7ZQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
-- Sincerely, Skyler Har
If you could rename the update_misc and update_results folder that would be good. Then it will run from beginning and we can compare results from old versus new.
I'm a little confused, the coordinates are now different by quite a bit with these gene models. You sure this is the same input from funannotate predict? It seems like PASA this time either didn't call gene models in these regions or they were substantially changed?
Thats my bad, I will update you with the correct tbl results soon
Sorry about that, I accidentally used data from a different trial run. Heres the correct tbl file:
173086 >173862 gene locus_tag FUN_004038 173086 173143 mRNA 173244 173498 173554 173670 173722 >173862 product hypothetical protein transcript_id gnl|ncbi|FUN_004038-T1_mrna protein_id gnl|ncbi|FUN_004038-T1 173315 173498 CDS 173554 173670 173722 >173086 codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_004038-T1_mrna protein_id gnl|ncbi|FUN_004038-T1
1156 >1683 gene locus_tag FUN_012373 1156 1231 mRNA 1285 1421 1479 >1683 product hypothetical protein transcript_id gnl|ncbi|FUN_012373-T1_mrna protein_id gnl|ncbi|FUN_012373-T1 1208 1231 CDS 1285 1421 1479 >1156 codon_start 1 product hypothetical protein transcript_id gnl|ncbi|FUN_012373-T1_mrna protein_id gnl|ncbi|FUN_012373-T1
additionally it seems those two gene models were special because while there were 44 gene models that needed correcting, these two had special errors: FUN_000096 Feature begins or ends in gap starting at 214116 FUN_000427 Feature begins or ends in gap starting at 191884 FUN_001052 Feature begins or ends in gap starting at 541846 FUN_001326 Feature begins or ends in gap starting at 278278 FUN_001490 Feature begins or ends in gap starting at 278911 FUN_002558 Feature begins or ends in gap starting at 288775 FUN_003011 Feature begins or ends in gap starting at 131505 FUN_003085 Feature begins or ends in gap starting at 64330 FUN_003209 Feature begins or ends in gap starting at 160428 FUN_003251 Feature begins or ends in gap starting at 33414 FUN_003713 Feature begins or ends in gap starting at 55898 FUN_003721 Feature begins or ends in gap starting at 83050 FUN_004038 Location: Mixed strands in SeqLoc [(lcl|scaffold_33:173315-173498, 173554-173670, c173722-<173086)] FUN_004054 Feature begins or ends in gap starting at 40758 FUN_004391 Feature begins or ends in gap starting at 96178 FUN_005024 Feature begins or ends in gap starting at 22438 FUN_005046 Feature begins or ends in gap starting at 80164 FUN_005685 Feature begins or ends in gap starting at 47866 FUN_006193 Feature begins or ends in gap starting at 35322 FUN_006237 Feature begins or ends in gap starting at 95310 FUN_006594 Feature begins or ends in gap starting at 11874 FUN_006755 Feature begins or ends in gap starting at 29046 FUN_007256 Feature begins or ends in gap starting at 61500 FUN_007718 Feature begins or ends in gap starting at 41587 FUN_008205 Feature begins or ends in gap starting at 6217 FUN_008884 Feature begins or ends in gap starting at 17110 FUN_008915 Feature begins or ends in gap starting at 7657 FUN_008932 Feature begins or ends in gap starting at 4013 FUN_009350 Feature begins or ends in gap starting at 6175 FUN_009476 Feature begins or ends in gap starting at 19797 FUN_009798 Feature begins or ends in gap starting at 15724 FUN_010163 Feature begins or ends in gap starting at 15695 FUN_010720 Feature begins or ends in gap starting at 8141 FUN_011473 Feature begins or ends in gap starting at 5125 FUN_011964 Feature begins or ends in gap starting at 2283 FUN_012373 Location: Mixed strands in SeqLoc [(lcl|scaffold_1074:1208-1231, 1285-1421, c1479-<1156)] FUN_012865 Feature begins or ends in gap starting at 581860 FUN_012869 Feature begins or ends in gap starting at 163870