Error parsing XML GO terms: None is not a valid term

Open NicMAlexandre opened this issue 3 years ago • 29 comments

Hello,

I am using funannotate 1.8.13 and am trying to integrate my interproscan results with my gff file.

funannotate annotate --gff ../GFFfiles/S1_mod.gff --fasta ../Genomefiles/S1.fasta \
    -s "S1" --iprscan S1.xml --eggnog S1.emapper.annotations --cpus 6 --tmpdir tmp -o S1_FUN

Everything runs smoothly until the iprscan step, where I get the following error:

Error parsing XML GO terms: None is not a valid term

NicMAlexandre avatar Oct 29 '22 18:10 NicMAlexandre

Here is the head of the XML file

<?xml version="1.0" encoding="UTF-8"?><protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5" interproscan-version="5.59-91.0">
  <protein>
    <sequence md5="8a46573d58fd2e105b5d8227492f1680">MNLDRLRKRVRHYIDQQQYQSALFWADKVSSLSHEDPQDIYWLAQCLYLTAQYHRASHALRSRKLDKLYGACQYLAARCHYAAKEYQQALDILDMEEAASKRLLDKNVKEDNGSRETVKEWEMSPASINSSICLLRGKIYDAMDNRPLATSSYKEALKLDVYCFEAFDLLTSHHMLTAQEEKDFLDSLPLSQQCTEEEVELLRFLFENKLKKYNKPSEMVVPDIVNGLQDNLDVVVSLAERHYYNCDFKMCYSLTSMVMVKDPFHANCLPVHIGTLVELSKANELFYLSHKLVDLYPSNPVSWFAVGCYYLMVGHKNEHARRYLSKATTLERTYGPAWIAYGHSFAVESEHDQAMAAYFTAAQLMKGCHLPMLYIGLEYGLTNNPKLAERFFSQALSIAPEDPFVIHEVAVVAFQNGDYKTAEKLFLDAMDKIKAIGNEVTVDKWEPLLNNLGHVCRKLKKYDQALEYHRQALVLIPQNASTYSAIGYVHSLMGDFESAIDYFHTALGLKRDDTFSVTMLGHCIEMYISDSDAYIGTDIKDKVRKTLGTPALMKMLNTSTEANESRAQPVEEVNVCLETPSFNADKQTDAFQRFLLECDMHENDMMLETSMSDTST</sequence>
    <xref id="anno1.g19174.t1" name="anno1.g19174.t1 gene=g_18579 seq_id=Chr9 type=cds"/>
    <matches>
      <hmmer2-match evalue="1.3E-26" score="104.5">
        <signature ac="SM00028" name="tpr_5">
          <entry ac="IPR019734" desc="Tetratricopeptide repeat" name="TPR_repeat" type="REPEAT">
            <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0005515" name="protein binding"/>
            <pathway-xref db="MetaCyc" id="PWY-8238" name="24-epi-campesterol, fucosterol, and clionasterol biosynthesis (diatoms)"/>

NicMAlexandre avatar Oct 30 '22 20:10 NicMAlexandre

I tried regenerating the file with funannotate iprscan and I'm getting the same error.

funannotate iprscan -i ../Proteinfiles/XXX -m local -c 16 -o XXX.xml --iprscan_path /global/scratch/users/XXX/interproscan-5.59-91.0/interproscan.sh

NicMAlexandre avatar Oct 31 '22 18:10 NicMAlexandre

Is any intermediate file made for the iprscan result in the annotate_misc folder?

I'll test again on our IPR results, but I've never seen that error, so I need to see where it is coming from.

hyphaltip avatar Nov 07 '22 14:11 hyphaltip

Hi, I am having the same issue here.

I ran InterProScan on my own with interproscan-5.59-91.0.

BaiweiLo avatar Nov 09 '22 10:11 BaiweiLo

It is probably a change in XML format. Can you generate a complete xml file from something like 10 proteins so we can test and fix?

nextgenusfs avatar Nov 09 '22 15:11 nextgenusfs

I generated this file with the test FASTA provided by InterProScan: test_proteins_redundant.fasta.txt

In case the above does not reproduce the error, these are the first ten sequences from my annotation: test.xml.txt

I hope this helps, thank you very much!!

BaiweiLo avatar Nov 09 '22 16:11 BaiweiLo

Confirmed I can reproduce the error:

(py3-funannotate) jon@Jons-MacBook-Pro:~/Downloads$ python -m funannotate.aux_scripts.iprscan2annotations test_proteins_redundant.fasta.txt test.parsed.annotations.txt
Error parsing XML GO terms: None is not a valid term

How did you run InterProScan for the test proteins? Was it using funannotate iprscan?

nextgenusfs avatar Nov 09 '22 16:11 nextgenusfs

I generated it with the following command: interproscan-5.59-91.0/interproscan.sh -i $protein.fa

I am using funannotate v1.8.11. When I tried to run funannotate iprscan, it output empty XML files.

BaiweiLo avatar Nov 09 '22 16:11 BaiweiLo

What happens if you run it with the proper flags? Probably just the goterms is the problem.

interproscan-5.59-91.0/interproscan.sh -i $protein.fa -f XML -goterms -pa

nextgenusfs avatar Nov 09 '22 16:11 nextgenusfs

So funannotate iprscan is just a convenience wrapper -- it requires that you modify your interproscan.properties file to be single threaded, so the wrapper will split the input fasta into multiple chunks and launch parallel interproscan processes on those chunks -- this is much faster than running it with the same resources using a single process with multiple threads/cpus/etc.

Running it should be something like this:

funannotate iprscan -i funannotate_output_folder -m local --iprscan_path /full/path/to/your/interproscan.sh --cpus 12

nextgenusfs avatar Nov 09 '22 17:11 nextgenusfs
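
To illustrate the chunking idea described above, here is a minimal sketch of splitting FASTA records into roughly equal chunks that could each be handed to a separate single-threaded process. This is not funannotate's actual implementation; the function names `read_fasta` and `split_records` are hypothetical and exist only for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new record starts; emit the previous one if there was one.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def split_records(records, n_chunks):
    """Distribute records round-robin into at most n_chunks non-empty lists."""
    chunks = [[] for _ in range(n_chunks)]
    for i, rec in enumerate(records):
        chunks[i % n_chunks].append(rec)
    return [chunk for chunk in chunks if chunk]
```

Each chunk would then be written to its own FASTA file and processed independently, and the per-chunk results merged at the end.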

I can check the intermediate files in a few days, as I'm traveling abroad, but I ran funannotate iprscan exactly as you suggest here and got the same errors. Are you saying interproscan.sh should be run manually outside of funannotate without the -goterms argument?

NicMAlexandre avatar Nov 09 '22 21:11 NicMAlexandre

I also ran interproscan.sh manually as you suggest with the same error. All proper flags were used.

NicMAlexandre avatar Nov 09 '22 21:11 NicMAlexandre

It's possible that the XML tags have changed in newer versions of interproscan -- I haven't had time to test. I have a day job and a family, so it's not easy to find time... But it's trying to parse GO terms from the results, so if you did not run interproscan with the -goterms flag that might be one reason it fails. If that is the case, the parser needs to be updated so it silently skips these tags.

nextgenusfs avatar Nov 09 '22 21:11 nextgenusfs

No worries at all, I totally understand that :) What is the latest version of interproscan you have used that should work?

NicMAlexandre avatar Nov 09 '22 22:11 NicMAlexandre

I'm using 5.52-86.0 and I know that works. I don't upgrade it that often I guess.... maybe once per year. But it shouldn't be hard to update the parser for the apparent change in the GO terms in the XML file; we just need to figure out what the changes are.

So it seems unrelated to the -goterms option: with 5.52-86.0, if I run the test proteins with and without -goterms and then parse the output, I get no errors:

$ interproscan.sh -i test_proteins.fasta -d . -f XML -pa
$ python -m funannotate.aux_scripts.iprscan2annotations test_proteins.fasta.xml test_proteins.nogos.txt
$ cat test_proteins.nogos.txt
A0B6J9  db_xref InterPro:IPR002915
A0B6J9  db_xref InterPro:IPR013785
A0B6J9  db_xref InterPro:IPR010210
A0B6J9  db_xref InterPro:IPR041720
A0B6J9  go_function     lyase activity|0016829||IEA
A0B6J9  go_function     catalytic activity|0003824||IEA
A0B6J9  go_process      aromatic amino acid family biosynthetic process|0009073||IEA
A0B6J9  go_function     hydro-lyase activity|0016836||IEA
A0B6J9  go_function     fructose-bisphosphate aldolase activity|0004332||IEA
A2YIW7  db_xref InterPro:IPR013766
A2YIW7  db_xref InterPro:IPR005746
A2YIW7  db_xref InterPro:IPR017937
A2YIW7  db_xref InterPro:IPR036249
A2YIW7  go_process      glycerol ether metabolic process|0006662||IEA
A2YIW7  go_function     protein-disulfide reductase activity|0015035||IEA
Q97R95  db_xref InterPro:IPR001057
Q97R95  db_xref InterPro:IPR002478
Q97R95  db_xref InterPro:IPR011529
Q97R95  db_xref InterPro:IPR005715
Q97R95  db_xref InterPro:IPR036393
Q97R95  db_xref InterPro:IPR001048
Q97R95  db_xref InterPro:IPR036974
Q97R95  db_xref InterPro:IPR019797
Q97R95  db_xref InterPro:IPR041739
Q97R95  db_xref InterPro:IPR015947
Q97R95  go_function     RNA binding|0003723||IEA
Q97R95  go_process      proline biosynthetic process|0006561||IEA
Q97R95  go_function     glutamate 5-kinase activity|0004349||IEA
Q97R95  go_component    cytoplasm|0005737||IEA
A2VDN9  db_xref InterPro:IPR001313
A2VDN9  db_xref InterPro:IPR011989
A2VDN9  db_xref InterPro:IPR012959
A2VDN9  db_xref InterPro:IPR040059
A2VDN9  db_xref InterPro:IPR033133
A2VDN9  db_xref InterPro:IPR016024
A2VDN9  go_function     RNA binding|0003723||IEA
P22298  db_xref InterPro:IPR008197
P22298  db_xref InterPro:IPR036645
P22298  go_component    extracellular region|0005576||IEA
P22298  go_function     peptidase inhibitor activity|0030414||IEA
P02939  db_xref InterPro:IPR006817
P02939  db_xref InterPro:IPR016367
P02939  go_component    outer membrane|0019867||IEA

nextgenusfs avatar Nov 09 '22 22:11 nextgenusfs

Amazing thank you so much!

NicMAlexandre avatar Nov 10 '22 00:11 NicMAlexandre

There are XML-to-other-format converters in the iprscan package. I wonder if it would be cleaner to run one of those to generate TSV from the XML, so that it will always be internally consistent? I also know that reading and reformatting the XML takes a ton of RAM in the annotate step because of the XML workaround tweaks we had to do in the past.

hyphaltip avatar Nov 10 '22 02:11 hyphaltip

I'd actually prefer to use JSON format.... but we can't really use anything in the interproscan package as it's not an explicit dependency and I don't want to make it one. We just need to figure out the most robust way to parse it reliably.

nextgenusfs avatar Nov 10 '22 02:11 nextgenusfs

But that is something to try here: if we convert to JSON, do we end up with the same results, or are the different versions yielding different key-value pairs?

nextgenusfs avatar Nov 10 '22 03:11 nextgenusfs

Ah I see, so are you saying I should run this with interproscan.sh manually and generate multiple format types, then just do a conversion from the json format to xml?

NicMAlexandre avatar Nov 10 '22 04:11 NicMAlexandre

Currently it only supports InterPro XML as input, so you have to start there.

hyphaltip avatar Nov 25 '22 19:11 hyphaltip

Cool, thanks! That makes sense.

NicMAlexandre avatar Nov 25 '22 23:11 NicMAlexandre

Okay, I'm hitting this problem after an upgrade of iprscan on our system... I can see that older versions of iprscan produced this type of XML. This was the old format, which included the category info:

version: <?xml version="1.0" encoding="UTF-8"?><protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5" interproscan-version="5.55-88.0">

    <xref id="NW764_002744-T1" name="NW764_002744-T1 NW764_002744"/>
    <matches>
      <hmmer3-match evalue="4.3E-34" score="118.0">
        <signature ac="PF07690" desc="Major Facilitator Superfamily" name="MFS_1">
          <entry ac="IPR011701" desc="Major facilitator superfamily" name="MFS" type="FAMILY">
            <go-xref category="BIOLOGICAL_PROCESS" db="GO" id="GO:0055085" name="transmembrane transport"/>
            <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0022857" name="transmembrane transporter activity"/>

Now the GO-associated info is much less descriptive, so to get the category and name we will have to do a GO lookup?

version: <?xml version="1.0" encoding="UTF-8"?><protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5" interproscan-version="5.60-92.0">

        <go-xref db="GO" id="GO:0005737"/>
        <go-xref db="GO" id="GO:0110165"/>
        <go-xref db="GO" id="GO:0009987"/>
        <go-xref db="GO" id="GO:0070727"/>
        <go-xref db="GO" id="GO:0050789"/>

hyphaltip avatar Feb 09 '23 21:02 hyphaltip
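
For readers comparing the two formats, a minimal sketch of reading `go-xref` elements so that a missing `category` or `name` attribute comes back as `None` instead of raising an error. It uses only the Python standard library and the namespace shown in the XML headers above. The function name `collect_go_xrefs` is hypothetical, and this is not the parser that funannotate ships.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute of <protein-matches> shown above.
NS = "{http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5}"


def collect_go_xrefs(xml_text):
    """Return {protein_id: [(go_id, category_or_None, name_or_None), ...]}."""
    root = ET.fromstring(xml_text)
    results = {}
    for protein in root.iter(NS + "protein"):
        xref = protein.find(NS + "xref")
        if xref is None:
            continue
        seen = []
        for go in protein.iter(NS + "go-xref"):
            # .get() returns None when the attribute is absent (newer format).
            entry = (go.get("id"), go.get("category"), go.get("name"))
            if entry[0] and entry not in seen:
                seen.append(entry)
        results[xref.get("id")] = seen
    return results
```

Entries whose category or name is `None` are the ones that would need a separate GO lookup before they can be written out as go_function, go_process, or go_component annotations.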

Always awesome with the format changes.... GO lookup is straightforward as the go.obo is in the database already. This is not the exact code you need, but here is how it's done on the fly for generating the tbl format: https://github.com/nextgenusfs/funannotate/blob/master/funannotate/library.py#L2937-L2959

nextgenusfs avatar Feb 09 '23 21:02 nextgenusfs
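
As a rough illustration of such a lookup (not the code in library.py linked above), the sketch below reads the `[Term]` stanzas of an OBO file into a dictionary and uses it to fill in a missing category and name for a GO ID. The function names `parse_obo_terms` and `describe` are hypothetical; the only assumption about the file is the standard OBO `id:`, `name:`, and `namespace:` tags.

```python
def parse_obo_terms(lines):
    """Map GO id -> (name, namespace) from the [Term] stanzas of an OBO file."""
    terms = {}
    current = None  # dict for the [Term] stanza being read, otherwise None
    for raw in lines:
        line = raw.strip()
        if line.startswith("["):
            # A new stanza starts; store the previous term if there was one.
            if current and "id" in current:
                terms[current["id"]] = (current.get("name"), current.get("namespace"))
            current = {} if line == "[Term]" else None
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("id", "name", "namespace"):
                current.setdefault(key, value)
    if current and "id" in current:
        terms[current["id"]] = (current.get("name"), current.get("namespace"))
    return terms


def describe(go_id, category, name, terms):
    """Fill in a missing category or name from the OBO lookup when possible."""
    obo_name, obo_ns = terms.get(go_id, (None, None))
    if category is None and obo_ns:
        category = obo_ns.upper()  # e.g. cellular_component -> CELLULAR_COMPONENT
    return go_id, category, name or obo_name
```

In use, one would call `parse_obo_terms(open(path_to_go_obo))` once and then `describe(...)` for each bare `go-xref`; IDs absent from the OBO file simply stay without a category or name.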

I already started a working solution pulling in the obo file, so I hope to have this finished and tested tomorrow.

hyphaltip avatar Feb 10 '23 01:02 hyphaltip

Upon more searching, it seems we have a combination of the <go-xref db="GO" id="GO:0050789"/> and <go-xref category="MOLECULAR_FUNCTION" db="GO" id="GO:0022857" name="transmembrane transporter activity"/> forms, so we will need to see if both should be attributed?

hyphaltip avatar Feb 10 '23 06:02 hyphaltip

I think this has solved it in my tests: https://github.com/nextgenusfs/funannotate/tree/iprscanjson - I will do a bit more testing and run it on a pangenome annotation I am finishing up. If this succeeds, I'll merge it into the master branch.

hyphaltip avatar Feb 10 '23 18:02 hyphaltip

It works for me installed from GitHub. Have you done that? We haven’t made a new release.

On Mon, Feb 20, 2023 at 11:09 AM Hassan Tarabai @.***> wrote:

Hello,

Any update on the issue? I am running InterProScan-5.60-92.0 with a new funannotate installation and still have this error.

hyphaltip avatar Feb 23 '23 02:02 hyphaltip

I've seen several error reports affected by this error in Galaxy, cool to see a fix in #865! Is there a new release including this fix planned anytime soon?

abretaud avatar Mar 02 '23 15:03 abretaud