Non-deterministic output with seed set, `nproc = 1`, and newest version
Hi! Thanks for putting this package together. I'm running GSEA on a series of perturbations like so:
fgsea.output.list <- list()
for (name in names(ranked_gene_set_list)) {
gsea_results <- fgsea(pathways = msigdbr_list, stats = ranked_gene_set_list[[name]],
maxSize = 500, eps = 0.0, nproc = 1) # allow arbitrarily low p-values
fgsea.output.list[[name]] <- gsea_results
}
I've noticed that despite setting a seed, the size of fgsea.output.list changes every time I run the program. Of course the results don't truly differ by much, but I would love for it to be exactly reproducible, especially when I'm setting a seed. Any tips? I'm using version 1.27.1.
@chriswzou fgsea results are supposed to be reproducible, so this looks like a bug. However, we would need more info from you. Specifically, can you provide a minimal example with the corresponding where you have this problem? It doesn't need to be super small, but we should be able to run it on our side and compare the results. Include set.seed command and show which results are different from run to run.
Hi, I think I am encountering the same issue. I am getting a different number of significantly enriched pathways with each run. This occurs with or without set.seed() as well as with or without nproc = 1. On my own data, I am seeing anywhere from 110 - 130 pathways returned as being significant (padj < .05) when using reactome gene sets.
library(fgsea)
data(exampleRanks)
data(examplePathways)
set.seed(20)
fgsea_results <- list()
for (i in 1:10) {
fgsea_result <- fgsea(
pathways = examplePathways,
stats = exampleRanks,
eps = 0.0,
minSize = 1,
maxSize = 1000,
nproc = 1)
fgsea_results[[i]] <- fgsea_result
}
sig.pvals <- lapply(fgsea_results, function(df) {
sum(df$pval < 0.05)
})
The output:
> print(sig.pvals)
[[1]]
[1] 287
[[2]]
[1] 301
[[3]]
[1] 293
[[4]]
[1] 298
[[5]]
[1] 292
[[6]]
[1] 294
[[7]]
[1] 298
[[8]]
[1] 303
[[9]]
[1] 301
[[10]]
[1] 300
> sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: x86_64-suse-linux-gnu (64-bit)
Running under: openSUSE Tumbleweed
Matrix products: default
BLAS: /usr/lib64/R/lib/libRblas.so
LAPACK: /usr/lib64/R/lib/libRlapack.so; LAPACK version 3.11.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: America/New_York
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] fgsea_1.28.0
loaded via a namespace (and not attached):
[1] vctrs_0.6.5 cli_3.6.2 rlang_1.1.3
[4] cowplot_1.1.2 generics_0.1.3 jsonlite_1.8.8
[7] data.table_1.14.10 glue_1.7.0 colorspace_2.1-0
[10] scales_1.3.0 fansi_1.0.6 grid_4.3.2
[13] munsell_0.5.0 tibble_3.2.1 lifecycle_1.0.4
[16] compiler_4.3.2 dplyr_1.1.4 codetools_0.2-19
[19] Rcpp_1.0.12 pkgconfig_2.0.3 fastmatch_1.1-4
[22] BiocParallel_1.36.0 lattice_0.21-9 R6_2.5.1
[25] tidyselect_1.2.0 utf8_1.2.4 pillar_1.9.0
[28] parallel_4.3.2 magrittr_2.0.3 Matrix_1.6-1.1
[31] tools_4.3.2 gtable_0.3.4 ggplot2_3.4.4
@macsennett your example works as it is supposed to work: it is reproducible, that is you can run the same code again and will get the same sig.pvals vector. You are getting different numbers in sig.pvals because calling fgsea changes random state, so it's different before each fgsea call.
The following code produces the same number of 287 significant hits (note how set.seed command is just before fgsea call):
library(fgsea)
data(exampleRanks)
data(examplePathways)
fgsea_results <- list()
for (i in 1:10) {
set.seed(20)
fgsea_result <- fgsea(
pathways = examplePathways,
stats = exampleRanks,
eps = 0.0,
minSize = 1,
maxSize = 1000,
nproc = 1)
fgsea_results[[i]] <- fgsea_result
}
sig.pvals <- lapply(fgsea_results, function(df) {
sum(df$pval < 0.05)
})