GToTree icon indicating copy to clipboard operation
GToTree copied to clipboard

Universal ToL and Eukaryotes

Open molly-kholodova opened this issue 1 year ago • 12 comments

Hi Mike,

I'm a PhD student in Laura Hug's lab currently working on a side project related to constructing universal ToLs. I was following your ToL example using the universal marker gene set and noticed a big discrepancy when I try to run the same code.

I downloaded your code and original files (https://figshare.com/articles/dataset/GToTree_ToL_example_data/19372322) and ran it myself, so it should be using the exact same accessions that you fetched in 2018 (1698 genomes). However, the resulting tree has more organisms that were excluded on the basis of having too few hits (Genomes_removed_for_too_few_hits.tsv). Your run had 7, mine had 17. ToL_test_MC_Aug2024.zip

Taking a closer look at the NCBI_genomes_summary_info.tsv file, if I look up a eukaryotic accession from the Genomes_removed_for_too_few_hits.tsv file, it actually has numbers >1 in many of the columns. From what I can tell, it treats multiple hits for a gene as no hits. When I ran it again using -G 0 instead of -G 0.4 (I think this is the minimum genes for inclusion?) it was able to include the genomes again in the tree.

I understand you have a disclaimer about working with eukaryotes in GToTree in terms of completion and redundancy, but currently even reference eukaryotic genomes are unable to be placed, which is problematic for creating universal trees.

As a side note, an additional issue to consider when working with eukaryotic genomes is the possibility that mitochondrial or chloroplast genes could be mistaken for marker genes, potentially leading to misplacement or conflicting data. In cases of multiple hits, how does GToTree decide which one to use? If it is based on best match/lowest E value, then it would prioritize mitochondrial and chloroplast genes over eukaryotic versions. I ran into this a few times in my own project where it would place a big group of plants and photosynthetic eukaryotes next to the cyanobacteria.

I am implementing a solution right now by building a separate HMM for eukaryotic ribosomal genes and running that separately for eukaryotic genomes. This does require sorting the genomes beforehand, but perhaps you could incorporate it into GToTree as an option for dealing with genomes with multiple SCG hits.

Let me know if you want more info on anything I've mentioned here or would like to discuss further.

molly-kholodova avatar Sep 17 '24 09:09 molly-kholodova

Hey there, Molly :)

Thanks for writing in!

So the main thing you hit on (regarding how GToTree deals with multi-hits) has changed since i put the original paper out, so that is likely the primary cause of the discrepancy you saw.

Looking through the release notes, it seems it was Jan of 2021 with v1.1.1 when i implemented this change. Since then, if there are multiple hits to a target, GToTree takes the more conservative route by default and excludes that target-gene overall since it wasn't found in exactly 1 copy (therefore potentially leading to more dropped genomes for not hitting the minimum proportion of target genes, -G, that you adjusted). And i changed this default behavior for the very reasons you mention 👍 The option to run it the old way exists by turning on "best-hit mode" by adding the -B flag.

On to the rest of your message:

I am implementing a solution right now by building a separate HMM for eukaryotic ribosomal genes and running that separately for eukaryotic genomes. This does require sorting the genomes beforehand, but perhaps you could incorporate it into GToTree as an option for dealing with genomes with multiple SCG hits.

YES! I would love to implement your work to improve things on this front :)

The first easier step I can imagine is to just add the HMMs as a dedicated eukarya set if folks are making a tree with just eukarya.

And then i think a second step, which would take a bit more work, would be handling when working across domains and having some upfront way of determining which inputs are euks and should be treated differently like you're saying.

I don't think i'd want to employ it always when there are multiple hits to something, because that can be the case in non-euks too. But maybe i'm misunderstanding or mis-thinking something at the moment. But yea, i'd love to talk more! Maybe let's email and then can coordinate a time to hop on a call? You can reach me at [email protected]

And thanks again for writing in about this!

AstrobioMike avatar Sep 17 '24 13:09 AstrobioMike

Hi Mike,

Thanks for your quick response, and the note about best-hit mode, that clears things up a lot!

Yes, I'd be very happy to chat about this topic, will reach out shortly :)

molly-kholodova avatar Sep 17 '24 17:09 molly-kholodova

Hi both,

I have been playing around with GToTree and have encountered similar issues to those described by @molly-kholodova. Reading your discussion has been informative - thank you!

I am implementing a solution right now by building a separate HMM for eukaryotic ribosomal genes and running that separately for eukaryotic genomes. This does require sorting the genomes beforehand, but perhaps you could incorporate it into GToTree as an option for dealing with genomes with multiple SCG hits.

This sounds like a great solution - would you be willing to share your approach to building these HMMs, and/or the HMMs themselves?

Thanks!

ezherman avatar Nov 22 '24 10:11 ezherman

Hi Ezra,

Sure thing! I've uploaded them to a repository [https://github.com/molly-kholodova/HMMs].

In terms of constructing them:

  • I went on the InterPro site and looked up (Search -> by Text) each gene by the PFAM ID (PFAM got merged into InterPro with other databases).
  • Then found the corresponding InterPro entry (starts with IPR___)
  • Clicked on 'proteins' for that entry to get a list of all protein seqs in the database. You can filter by 'reviewed' to get better quality ones.
  • Unfortunately I couldn't figure out how to filter by taxonomy, but you could probably do it after downloading. For my purposes, because these genes are well conserved I didn't need a ton of representative sequences. So I just searched for some model organisms: homo sapiens, arabidopsis, aspergillus, c elegans, dictyostelium, paramecium.
  • Copied the sequences into a textfile and ran hmm-build on them.

An important note for ribosomal genes: databases are often not clear on the true orthologs between prokaryotes/eukaryotes. Gene names differ between groups. InterPro has attempted to map them to common entries but there are still discrepancies. In some cases you have to use a separate IPR entry to get the euk ortholog otherwise it will include mitochondria/chloroplast genes.

If it helps, I've attached a spreadsheet with the relevant info: universal_ribosomal_genes.xlsx

These have been validated in downstream analyses (e.g., MSAs).

Hope this helps!

Molly

PS: I'm currently doing a semester abroad in the UK. Interested in learning about Esox and potential bioinformatics job opportunities. Would you be free to connect sometime?

molly-kholodova avatar Nov 22 '24 11:11 molly-kholodova

Awesome, thank you @molly-kholodova! Good to know regarding the IPR entries and great that you labelled them within the spreadsheet.

Have you tried using these HMMs with GToTree? The data I was working with today only included a few eukaryotes, however the tool found no genes for them using the HMMs (incl. a Saccharomyces cerevisiae reference genome).

Yes, it'd be great to have a chat! Please reach out via https://www.esoxbiologics.com/contact and mention my name, then we can use email to arrange a time.

ezherman avatar Nov 22 '24 15:11 ezherman

No problem @ezherman! Appreciate the contact info.

Interesting that you got no hits with the default HMM set, usually the problem is getting too many. S cerivisae has a bunch of ribosomal gene duplications I think.

I have not tried the euk specific ones with GToTree since I was testing another tree software, but I would expect it to work. If hmmsearch is giving you hits for this set but not the other, maybe it's an issue with cutoffs.

molly-kholodova avatar Nov 22 '24 18:11 molly-kholodova

Heya, @ezherman :)

Do you still get no hits if you run it in "best-hit mode" by adding the -B flag? Or just looking at the count summary table produced from earlier runs, is it saying 0 for everything?

AstrobioMike avatar Nov 22 '24 22:11 AstrobioMike

Hi @AstrobioMike!

Yep, I had no hist with -B: all 0s in SCG_hit_counts.tsv

ezherman avatar Nov 26 '24 13:11 ezherman

Ah, i see the deal now, @ezherman. GToTree relies on "gathering thresholds" (GA) to be included in the HMM files like this: image

And Molly was probably using a different method of filtering out spurious hits for these HMMs, as they don't have a GA included.

GAs are a manually curated cutoff, as noted in this article in the "Thresholds and compatibility" section:

The gathering threshold is the bit score threshold that a sequence must match or exceed in order for it to be deemed significant and therefore belonging to that family. In Pfam, we manually define our thresholds such that no known false positives are permitted in any family.

A GA cutoff can just be manually added to any HMM file, it just needs to be empirically derived with some testing to feel confident about the chosen cutoff score.

Sorry for the confusion!

AstrobioMike avatar Nov 26 '24 14:11 AstrobioMike

Hi, sorry to jump in on a somewhat old ticket, I am wondering if perhaps BUSCO (https://busco.ezlab.org/) might be the way for incorporating support for eukaryotes. Basically, they have curated a ton of "core-ish" HMM sets for different lineages, including bacteria and eukaryotes (https://busco-data.ezlab.org/v5/data/lineages/).

The HMMs are not concatenated and don't have GA thresholds in them, but there is a separate "scores_cutoff" file which lists what I presume might be scores useful as GA. Might be a straightforward script to just get them in the format needed for GToTree.

Thank you again for developing GToTree Mike, it is super convenient!

raufs avatar Mar 26 '25 00:03 raufs

Heya, @raufs!

Yea that's a good idea – looking into the eukaryote set and just adding in their provided score cutoffs as GA values should let me make a cat'd HMM ready-to-rock pretty easily like you suggest 👍 I'm a little worried about adding euk-specific support without also adding in a euk-specific gene-caller (as I don't actually know if its really terrible to just use prodigal on euks or not...). But for starters it'd probably be fine to keep things as they are with just some explicit notifications if using the euk-specific set.

I'm currently working on a huge overhaul, so it'll be a little before I get to adding this. But thank you for the idea and thinking through what'd be needed to pull a busco set in! And thanks for the kind words about GToTree 🙂

AstrobioMike avatar Mar 26 '25 01:03 AstrobioMike

That all sounds great! I am also primarily a bacteria person and indeed eukaryotic gene calling is not a fun experience, but there were recent advances for mapping BUSCO to genomes directly (one by the BUSCO folks themselves and one by Huang and Li : https://academic.oup.com/bioinformatics/article/39/10/btad595/7284108). The latter seems to just align proteins used for constructing the HMMs to genomes directly via miniprot (which is amazing and super fast) and then uses the HMMs to confirm / determine which hits are the best. I think newer miniprot versions might also just directly output protein sequences which can then be used for the alignments, but at the very least they give info (including frameshifts) needed to extract out the sequence from the target genome.

raufs avatar Mar 26 '25 01:03 raufs