clusterProfiler icon indicating copy to clipboard operation
clusterProfiler copied to clipboard

Issue defining "universe" in enricher function

Open GretchenSeim opened this issue 5 years ago • 18 comments

Currently trying to perform an enrichment using enricher and a self-curated TERM2GENE list and self defined universe of genes. However even though I am specifying a universe it seems to be pulling the default universe - aka the list of genes in the TERM2GENE list. I have ~8000 genes in my universe and it is just pulling the 30 or so in m TERM2GENE list. Wondering if there is something I'm doing wrong or if there is some sort of bug. There isn't any sort of error message - I can just see in the enrichment result that the universe is much smaller than the 8000 or so genes I'd like it to be.

##defining my universe universe2<-rownames(DEs)

this results in a character vector of 8563

then I run the enricher as such

kk2<-enricher(gene=rownames(cluster2), pvalueCutoff = 0.05, universe=universe2, TERM2GENE=TERM2GENE)

and my enrichment result shows a universe of only 52 - the number of genes in my TERM2GENE list

GretchenSeim avatar Jul 16 '20 01:07 GretchenSeim

I have similar problem - I think perhaps "universe" definition is working in a way I don't understand?

I've defined universe parameter to all protein-coding genes in my annotation (about 20k symbols), and then ran enrichment vs. MsigDB Hallmark dataset. However, the background was only 4313 genes - exactly the size of the intersection between the "universe" vector and all symbols in H database.

This is changing p-values a lot, so it would be nice to sort it out.

apredeus avatar Mar 12 '21 23:03 apredeus

Yes I agree with both of you. The universe is just an intersection between the specified vector and all genes in the "TERM2GENE" list. My guess is that the null distribution of p values are generated from randomly sampling the universe genes and then computing the overlap to genes belonging to a term.

Jeff-Gui avatar Mar 09 '22 06:03 Jeff-Gui

I also have the same problem. I have ~27K genes as universe genes, ~25k of them have GO annotation terms and as TERM2GENE, then I choose ~10k interesting genes to conduct run enricher. However, the denominator of bgratio is ~25k, and the denominator of generatio is ~ 9k which is the intersect of ~10k interesting genes and ~25k have GO terms. May I ask if this is normal and correct?

Weihan

Weihankk avatar Jun 09 '22 07:06 Weihankk

Maybe it's more accurate to limit the background to genes that are mapped to the database? like some say here: https://www.biostars.org/p/439428/

avishai987 avatar Aug 01 '22 12:08 avishai987

Hi, Prof.Yu, I do have the same issue when using enricher to do the over presentation analysis. The current enricher/DOSE::enricher_internal() used modified hypergeometric test with genes in the (TERM2GENE) as background. This is debatable as discussed:

  1. https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009935
  2. https://academic.oup.com/bib/article/22/1/545/5722384

Would be great if we could define our own bg/universe genes in the new versions, rather than the intersect of ourbg with unique(TERM2GENE$gene). If there are really something we need to trim, perhaps we could trim TERM2GENE, removing genes not appeared in the universe, rather than the opposite. As a clusterProfile User, we occasional analyze pathway level expressions using our customized genesets, which might only contain several pathways, and hundreds of genes. universe defined as all genes in DEG analysis, rather than that in TERM2GENE, make more sense.

Best regards,

RaymondSHANG avatar Sep 14 '22 02:09 RaymondSHANG

@RaymondSHANG Please read the doc of enricher():

universe background genes. If missing, the all genes listed in the database (eg TERM2GENE table) will be used as background.

huerqiang avatar Sep 20 '22 04:09 huerqiang

Hi,@huerqiang, Thanks for the reply. I checked the R functions for enricher(). enricher() --> enricher_internal --> DOSE:::enricher_internal(). And, in DOSE:enricher_internal.R, lines 61-68:

if(!is.null(universe)) {
        if (is.character(universe)) {
            extID <- intersect(extID, universe)
        } else {
            ## https://github.com/YuLab-SMU/clusterProfiler/issues/217
            message("`universe` is not in character and will be ignored...")
        }
}

And line 100:

N <- rep(length(extID), length(M))

When you got intersect(extID, universe), even though universe is provided, only the intersect is selected as the background. Based on my current understanding, N should be:

N <- rep(length(universe), length(M))

I think all above posts in the issue are talking about this.

Thank you,

RaymondSHANG avatar Sep 20 '22 17:09 RaymondSHANG

I think recalculating N in DOSE will solve the problem.

N <- rep(length(universe), length(M))

TERM2GENE is a kind of global variable, change TERM2GENE may result changes elsewhere in the whole package.

On Thu, Oct 20, 2022 at 9:46 AM Graeme Grimes @.***> wrote:

Would adding all the genes in the universe as a term in the term2gene database fix this? e.g.

define a TERM2GENE data.frame then add another term including all genes in universe

t2uni<-data.frame(term="universe",gene=universe) TERM2GENE=rbind(TERM2GENE,t2uni)

— Reply to this email directly, view it on GitHub https://github.com/YuLab-SMU/clusterProfiler/issues/283#issuecomment-1285861708, or unsubscribe https://github.com/notifications/unsubscribe-auth/AE6GPCGYR5QDEUMPMXA7QVLWEFZPLANCNFSM4O3IJJ5Q . You are receiving this because you were mentioned.Message ID: @.***>

-- Raymond "Yuan" SHANG, Ph.D.

RaymondSHANG avatar Oct 20 '22 17:10 RaymondSHANG

Hi,

We have also faced with this problem. From our point of view, it would be correct to filter the genesets (gmt) to obtain those genes which are present also in your universe. On the other hand, filtering the universe with the genes in the genesets (intersection between universe and gmt) will greatly affect the p value obtained due to the fact that you are reducing the universe (N) to the geneset collection (eg. GO has many annotated genes ~25K, but theoretically a gmt could have only one geneset). A quick solution for the time being, not the best one though, could be adding your universe to your gmt as a pathway, so all the genes in your universe will be part of the term2gene and the intersection will not affect your universe size nor your significance level.

It would be nice to find a solution to be able to avoid obtaining underestimated significance.

Thank you, Ariadna

ariadnaaterrades avatar Nov 11 '22 13:11 ariadnaaterrades

Hi all,

I am experiencing the same problem at the moment. Is this going to be fixed or is it necessary to adjust manually, as suggested by @RaymondSHANG ?

How did you proceed in this case?

Cheers

christianmaueroeder avatar Oct 26 '23 21:10 christianmaueroeder

@RaymondSHANG I led to the same conclusion. Do you plan a PR to DOSE?

SamGG avatar Mar 06 '24 13:03 SamGG

If you want to keep the universe untouched, aka, not performing intersection with genes that have annotations, you can force it to do so via options(enrichment_force_universe = TRUE).

Please try it and give me feedback if there is something not work as expected.

GuangchuangYu avatar Apr 23 '24 09:04 GuangchuangYu

If you want to keep the universe untouched, aka, not performing intersection with genes that have annotations, you can force it to do so via options(enrichment_force_universe = TRUE).

Please try it and give me feedback if there is something not work as expected.

Thanks for this option, however, I think what is done now is taking all genes that are stored in OrgDb irrespective of the values specified by universe. For example, after using org.Mm.eg.db, I get a background of 28891 genes, thus, all mouse genes and not only the genes that are expressed in my target cells.

christianmaueroeder avatar Apr 23 '24 16:04 christianmaueroeder

> require(clusterProfiler)
> data(geneList, package="DOSE")
> de = names(geneList)[1:100]
> length(geneList)
[1] 12495
> sample(names(geneList), 8000) -> bg
>
> x = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
> head(x, 2)
                   ID                Description GeneRatio  BgRatio
GO:1903047 GO:1903047 mitotic cell cycle process     29/99 390/7388
GO:0000278 GO:0000278         mitotic cell cycle     31/99 465/7388
                 pvalue     p.adjust       qvalue
GO:1903047 9.513903e-15 1.211324e-11 1.126778e-11
GO:0000278 1.856229e-14 1.211324e-11 1.126778e-11
                                                                                                                                                                        geneID    
GO:1903047              8318/55143/991/2305/4605/9833/10403/6241/55165/9787/51203/22974/4751/27338/890/983/4085/9837/5080/81930/3832/2146/64151/9212/1111/9319/3833/51514/6790    
GO:0000278 8318/55143/991/2305/4605/9833/10403/79733/6241/55165/9787/220134/51203/22974/4751/27338/890/983/4085/9837/5080/81930/3832/2146/64151/9212/1111/9319/3833/51514/6790    
           Count
GO:1903047    29
GO:0000278    31

@christianmaueroeder The universe parameter should work for enrichGO as demonstrated above.

GuangchuangYu avatar Apr 24 '24 13:04 GuangchuangYu

Thanks for your feedback. If you don't mind, let's use a reproducible example for discussion.

library(DOSE)
#> 
#> DOSE v3.29.3  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
#> 
#> If you use DOSE in published research, please cite:
#> Guangchuang Yu, Li-Gen Wang, Guang-Rong Yan, Qing-Yu He. DOSE: an R/Bioconductor package for Disease Ontology Semantic and Enrichment analysis. Bioinformatics 2015, 31(4):608-609
library(clusterProfiler)
#> clusterProfiler v4.10.0  For help: https://yulab-smu.top/biomedical-knowledge-mining-book/
#> 
#> If you use clusterProfiler in published research, please cite:
#> T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. The Innovation. 2021, 2(3):100141
#> 
#> Attaching package: 'clusterProfiler'
#> The following object is masked from 'package:stats':
#> 
#>     filter

options(width = 100)  # characters width for reprex

# generate data
data(geneList, package="DOSE")
de = names(geneList)[1:100]
length(geneList)
#> [1] 12495
set.seed(1)
bg = c(de, sample(names(geneList)[-(1:100)], 7900, replace = FALSE))
length(unique(bg))
#> [1] 8000

# universe with intersect
options(enrichment_force_universe=FALSE)  # do intersect
getOption("enrichment_force_universe")
#> [1] FALSE
x = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
#> 
x[1:6, 1:5]
#>                    ID                    Description GeneRatio  BgRatio       pvalue
#> GO:0007059 GO:0007059         chromosome segregation     34/99 205/7404 5.613697e-29
#> GO:0000278 GO:0000278             mitotic cell cycle     44/99 470/7404 3.274607e-27
#> GO:0098813 GO:0098813 nuclear chromosome segregation     30/99 161/7404 4.844323e-27
#> GO:1903047 GO:1903047     mitotic cell cycle process     41/99 400/7404 1.011285e-26
#> GO:0000280 GO:0000280               nuclear division     32/99 207/7404 3.366601e-26
#> GO:0000819 GO:0000819   sister chromatid segregation     27/99 127/7404 6.121863e-26

# no universe
y = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP")
y[1:6, 1:5]
#>                    ID                          Description GeneRatio   BgRatio       pvalue
#> GO:0007059 GO:0007059               chromosome segregation     34/99 424/18870 2.432364e-31
#> GO:0098813 GO:0098813       nuclear chromosome segregation     30/99 312/18870 6.474483e-30
#> GO:0000819 GO:0000819         sister chromatid segregation     27/99 225/18870 1.557846e-29
#> GO:0000070 GO:0000070 mitotic sister chromatid segregation     25/99 184/18870 9.747875e-29
#> GO:0140014 GO:0140014             mitotic nuclear division     28/99 274/18870 1.225729e-28
#> GO:0000280 GO:0000280                     nuclear division     32/99 441/18870 4.745944e-28

# universe without intersect
options(enrichment_force_universe=TRUE)  # no intersect
getOption("enrichment_force_universe")
#> [1] TRUE
z = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
z[1:6, 1:5]
#>                    ID                          Description GeneRatio   BgRatio       pvalue
#> GO:0007059 GO:0007059               chromosome segregation     34/99 424/18870 2.432364e-31
#> GO:0098813 GO:0098813       nuclear chromosome segregation     30/99 312/18870 6.474483e-30
#> GO:0000819 GO:0000819         sister chromatid segregation     27/99 225/18870 1.557846e-29
#> GO:0000070 GO:0000070 mitotic sister chromatid segregation     25/99 184/18870 9.747875e-29
#> GO:0140014 GO:0140014             mitotic nuclear division     28/99 274/18870 1.225729e-28
#> GO:0000280 GO:0000280                     nuclear division     32/99 441/18870 4.745944e-28

# size of the universe is not intersected
options(enrichment_force_universe=FALSE)  # do intersect
options(enrichment_N_from_universe = TRUE)  # size untouched
assignInNamespace("enricher_internal", DOSE:::enricher_internal, getNamespace("clusterProfiler"))

w = enrichGO(de, OrgDb='org.Hs.eg.db', ont= "BP", universe=bg)
#> N_from_universe is 8000 vs 7404
w[1:6, 1:5]
#>                    ID                    Description GeneRatio  BgRatio       pvalue
#> GO:0007059 GO:0007059         chromosome segregation     34/99 205/8000 4.493552e-30
#> GO:0000278 GO:0000278             mitotic cell cycle     44/99 470/8000 1.378001e-28
#> GO:0098813 GO:0098813 nuclear chromosome segregation     30/99 161/8000 5.177092e-28
#> GO:1903047 GO:1903047     mitotic cell cycle process     41/99 400/8000 5.216023e-28
#> GO:0000280 GO:0000280               nuclear division     32/99 207/8000 3.167084e-27
#> GO:0000819 GO:0000819   sister chromatid segregation     27/99 127/8000 8.103928e-27

Created on 2024-04-25 with reprex v2.1.0

When a bg is provided, the bg is filtered before computing the enrichment, here 7404 => x. When no bg is provided, or, a bg is provided and no intersect is set, the full set is used as background, here 18870 => y and z. What I expect is that the bg is taken as is, here 8000 => w.

In the end, on this example, there is no big difference between x and w, except the p-value.

Maybe I am wrong, but let's discuss. @RaymondSHANG @christianmaueroeder @guidohooiveld

To be noticed, the reported length of the gene list is 99 instead of 100.

SamGG avatar Apr 25 '24 18:04 SamGG