fgsea icon indicating copy to clipboard operation
fgsea copied to clipboard

Non-deterministic output with seed set, `nproc = 1`, and newest version

Open chriswzou opened this issue 2 years ago • 3 comments

Hi! Thanks for putting this package together. I'm running GSEA on a series of perturbations like so:

fgsea.output.list <- list()
for (name in names(ranked_gene_set_list)) {
  gsea_results <- fgsea(pathways = msigdbr_list, stats = ranked_gene_set_list[[name]],
                        maxSize = 500, eps = 0.0, nproc = 1) # allow arbitrarily low p-values
  fgsea.output.list[[name]] <- gsea_results
}

I've noticed that despite setting a seed, the size of fgsea.output.list changes every time I run the program. Of course the results don't truly differ by much, but I would love for it to be exactly reproducible, especially when I'm setting a seed. Any tips? I'm using version 1.27.1.

chriswzou avatar Jan 22 '24 09:01 chriswzou

@chriswzou fgsea results are supposed to be reproducible, so this looks like a bug. However, we would need more info from you. Specifically, can you provide a minimal example with the corresponding where you have this problem? It doesn't need to be super small, but we should be able to run it on our side and compare the results. Include set.seed command and show which results are different from run to run.

assaron avatar Jan 22 '24 19:01 assaron

Hi, I think I am encountering the same issue. I am getting a different number of significantly enriched pathways with each run. This occurs with or without set.seed() as well as with or without nproc = 1. On my own data, I am seeing anywhere from 110 - 130 pathways returned as being significant (padj < .05) when using reactome gene sets.

library(fgsea)

data(exampleRanks)
data(examplePathways)

set.seed(20)
fgsea_results <- list()

for (i in 1:10) {
  fgsea_result <- fgsea(
    pathways = examplePathways,
    stats = exampleRanks,
    eps = 0.0,
    minSize = 1,
    maxSize = 1000,
    nproc = 1)

  fgsea_results[[i]] <- fgsea_result
}

sig.pvals <- lapply(fgsea_results, function(df) {
  sum(df$pval < 0.05)
})

The output:

> print(sig.pvals)
[[1]]
[1] 287

[[2]]
[1] 301

[[3]]
[1] 293

[[4]]
[1] 298

[[5]]
[1] 292

[[6]]
[1] 294

[[7]]
[1] 298

[[8]]
[1] 303

[[9]]
[1] 301

[[10]]
[1] 300
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed

Matrix products: default
BLAS:   /usr/lib64/R/lib/libRblas.so 
LAPACK: /usr/lib64/R/lib/libRlapack.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: America/New_York
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] fgsea_1.28.0

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5         cli_3.6.2           rlang_1.1.3        
 [4] cowplot_1.1.2       generics_0.1.3      jsonlite_1.8.8     
 [7] data.table_1.14.10  glue_1.7.0          colorspace_2.1-0   
[10] scales_1.3.0        fansi_1.0.6         grid_4.3.2         
[13] munsell_0.5.0       tibble_3.2.1        lifecycle_1.0.4    
[16] compiler_4.3.2      dplyr_1.1.4         codetools_0.2-19   
[19] Rcpp_1.0.12         pkgconfig_2.0.3     fastmatch_1.1-4    
[22] BiocParallel_1.36.0 lattice_0.21-9      R6_2.5.1           
[25] tidyselect_1.2.0    utf8_1.2.4          pillar_1.9.0       
[28] parallel_4.3.2      magrittr_2.0.3      Matrix_1.6-1.1     
[31] tools_4.3.2         gtable_0.3.4        ggplot2_3.4.4 

macsennett avatar Jan 23 '24 16:01 macsennett

@macsennett your example works as it is supposed to work: it is reproducible, that is you can run the same code again and will get the same sig.pvals vector. You are getting different numbers in sig.pvals because calling fgsea changes random state, so it's different before each fgsea call.

The following code produces the same number of 287 significant hits (note how set.seed command is just before fgsea call):

library(fgsea)

data(exampleRanks)
data(examplePathways)

fgsea_results <- list()

for (i in 1:10) {
  set.seed(20)
  fgsea_result <- fgsea(
    pathways = examplePathways,
    stats = exampleRanks,
    eps = 0.0,
    minSize = 1,
    maxSize = 1000,
    nproc = 1)

  fgsea_results[[i]] <- fgsea_result
}

sig.pvals <- lapply(fgsea_results, function(df) {
  sum(df$pval < 0.05)
})

assaron avatar Jan 23 '24 23:01 assaron