FastOMA icon indicating copy to clipboard operation
FastOMA copied to clipboard

gene with Expansion & Contraction

Open Johnsonzcode opened this issue 1 year ago • 6 comments

Hi @sinamajidian I am finishing this pipeline with 8 species and got the result from the HTML image

so how do I fetch the genes with Expansion & Contraction from the result files? I am a newbie in this field, thank you for your time!

Johnsonzcode avatar Jan 11 '25 04:01 Johnsonzcode

Hi! please take a look at this tutorial notebook and let me know if you have any questions.

sinamajidian avatar Jan 11 '25 04:01 sinamajidian

The tutorial you sent me seems like report.html, but it didn't show me how to retrieve the genes belonging to gains/duplications/losses for a specific species. I can find the code like this

phylostratigraphy = os.path.join(output_folder, "phylostratigraphy.html")
treeprofile = ham_analysis.create_tree_profile(outfile=phylostratigraphy)

from IPython.display import IFrame
IFrame(os.path.basename(phylostratigraphy), width=800, height=600)

but it only shows the result of the genes' number of gains/duplications/losses.

I would like to retrieve the genes belonging to gains/duplications/losses for a specific species.

Thank you !

Johnsonzcode avatar Jan 11 '25 08:01 Johnsonzcode

After exploring the pyHam, I found following code would help me, but I can't still get duplicated genes and lost genes

from pyham import Ham

# 加载物种树和 OrthoXML 文件
species_tree_path = "species_tree_checked.nwk"
orthoxml_path = "FastOMA_HOGs.orthoxml"

# 初始化 pyHAM 分析对象
ham_analysis = Ham(species_tree_path, orthoxml_path)

# 获取目标物种的基因组对象
target_species = "Anas_platyrhynchos"
target_genome = ham_analysis.get_extant_genome_by_name(target_species)

# 获取祖先基因组对象
ancestral_genome = ham_analysis.get_ancestral_genome_by_name("Anas_platyrhynchos/Anser_cygnoides")

# 比较基因组并获取进化事件的基因名称
comparison = ham_analysis.compare_genomes_vertically(target_genome, ancestral_genome)

# 获取基因扩增事件的基因名称
gained_genes = comparison.get_gained()
gained_gene_names = [gene.prot_id for gene in gained_genes]

for gene_id in gained_gene_names:
    print(gene_id)

But when I tried the same way about duplicated genes and lost genes, It didn't work.

>>> duplicated_gene_names = [gene.prot_id for gene in duplicated_genes]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'HOG' object has no attribute 'prot_id'

Why did the same type of result, in the same way didn't work?

Johnsonzcode avatar Jan 11 '25 11:01 Johnsonzcode

Sorry I should have mentioned the specific location in the notebook. There are few points here to clarify here. Firstly these are definitions of reported values in the pyham phylostratigraphy:

- lost: numbers of HOGs that have been lost in between this node and its parent.
- gain: numbers of HOGs that have "emerged" at this node.
- retained: numbers of HOGs that have stay retained (in term of copy numbers) in between this node and its parent.
- - dupl: numbers of HOGs that have arose by duplication in between this node and its parent. 

we can think of a HOG as an ancestral gene. When there are 10 HOGs in an ancestral (internal) node, it means the ancestral genome has 10 ancestral genes.

Consider comparing two ancestral nodes:

ancestral_genome_10 = ham.get_ancestral_genome_by_name("internal_NODE_10")
ancestral_genome_11 = ham.get_ancestral_genome_by_name("internal_NODE_11")
vertical_ = ham.compare_genomes_vertically(ancestral_genome_10, ancestral_genome_11) # The order doesn't matter!
gained= vertical_.get_gained()
lost= vertical_.get_lost()
duplicated= vertical_.get_duplicated()
print(list(gained)[0], list(dup)[0], list(duplicated)[0])

<HOG(HOG:0000616_11)> <HOG(HOG:0014967.1c.5b_59)> <HOG(HOG:0000037.1d.9b.2a_59)>

So, all the elements are HOGs. Note that a hog has descendant genes that can be accessed with [descendant_gene.prot_id for descendant_gene in hogi.get_all_descendant_genes()].

Now let's compare an extant genome with ancestral genome (similar to your code).

ancestral_genome_10= ham.get_ancestral_genome_by_name("internal_node10")
extant_genome = ham.get_extant_genome_by_name("extant_human")
comparison = ham.compare_genomes_vertically(ancestral_genome_10,extant_genome)
gained= comparison.get_gained()
lost= comparison.get_lost()
duplicated= comparison.get_duplicated()
print(list(gained)[0], list(dup)[0], list(duplicated)[0])

Gene(1000000022)  <HOG(HOG:0014967.1c.5b_59)>  <HOG(HOG:0014967.1c.5b_59)>

The gained ones correspond to the child node, which is extant species, so they are reported as extant genes.

So, it depends whether both elements of comparison are ancestral genome or only one of them. When one is extant, you can access to its extant genes directly. When both are ancestral genomes, the ancestral genes of each are hog at that level.

Pro tip: number of duplication events is different than number of duplicated genes. A duplication event can results in one or more genes in the pyham framework.comparison.get_duplicated() is a dictionary {<HOG(HOG:0000037.1d.9b.2a_59)>: [<HOG()>, <HOG()>], <HOG(HOG:0000064.4e_59)>: [<HOG()>, <HOG()>], showing such relationship. The number of keys is the number of duplication events. The sum of length of all values (python dic) is the total number of duplicated genes (=number of genes resulting from duplication, reported in phylostratigraphy).

It looks a bit complicated. Please let me know which part is not clear.

sinamajidian avatar Jan 13 '25 20:01 sinamajidian

OK, I see. When a gene is not extant in the species, I can access it with [descendant_gene.prot_id for descendant_gene in hogi.get_all_descendant_genes()]. vertical_.get_xx() could only obtain the events number rather than the gene number.

Thank you for your detailed explanation!

Johnsonzcode avatar Jan 14 '25 02:01 Johnsonzcode

By the way, what's the difference between this methodology and CAFE5 when talking about Expansion & Contraction?

Johnsonzcode avatar Jan 14 '25 02:01 Johnsonzcode