The same `set.seed` but slightly different results on different platforms
I am running the following code on macOS and on Linux (via a Posit Cloud machine; see the sessionInfo() output at the end).
install.packages(c("RANN", "nonprobsvy"))
library(RANN)
library(nonprobsvy)
data(jvs)
data(admin)
B <- 50
sample_d <- list()
samples_admin <- list()
res_y <- numeric(B)
formula_y <- single_shift ~ region + private + nace + size
formula_x <- ~ region + private + nace + size
for (i in 1:B) {
  set.seed(i)
  ## bootstrap sample of the admin data (same seed on both platforms)
  samples <- sample(1:NROW(admin), size = NROW(admin), replace = TRUE)
  sample_d[[i]] <- table(table(samples))
  admin_v <- admin[samples, ]
  samples_admin[[i]] <- admin_v ## saving datasets for comparison between platforms
  ## fit the outcome model on the bootstrap sample
  m1 <- glm(formula_y, data = admin_v, family = binomial())
  pred_nonprob <- predict(m1, admin_v, type = "response")
  pred_prob <- predict(m1, jvs, type = "response")
  ## 5-nearest-neighbour matching on the predicted probabilities
  nn_match <- nn2(data = pred_nonprob, query = pred_prob, k = 5)
  y_prob_fitted <- apply(nn_match$nn.idx, 1, FUN = function(x) mean(admin_v$single_shift[x]))
  res_y[i] <- weighted.mean(y_prob_fitted, jvs$weight)
}
I was expecting to get exactly the same results on both machines, but when I ran mean(res_y) and sd(res_y) I got slightly different results, which is problematic from a reproducibility point of view.
mean(res_y) ## 0.707916329284751 (on macOS) vs 0.707920956236746 (on Linux)
sd(res_y) ## 0.0229188965037077 (on macOS) vs 0.0229194232827521 (on Linux)
The difference for the mean is -4.626952e-06 and for the SD it is -5.26779e-07. So it is not big, but I would expect the differences to be smaller (say, on the order of 1e-10 or 1e-16).
I am wondering what the reason for this may be. The nn2 function sometimes returns a different set of neighbors, and this affects the reproducibility of results across platforms.
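For reference, one way to compare the stored results between the two machines is to save res_y on each platform and diff the vectors afterwards; a minimal sketch (the file names are only illustrative):

## on each platform, after the loop finishes
saveRDS(res_y, file = "res_y_macos.rds") ## "res_y_linux.rds" on the other machine

## later, on a single machine with both files available
res_mac <- readRDS("res_y_macos.rds")
res_lin <- readRDS("res_y_linux.rds")
max(abs(res_mac - res_lin))                    ## largest per-replication difference
all.equal(res_mac, res_lin, tolerance = 1e-10) ## comparison with an explicit tolerance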
Session info for macOS
R version 4.4.2 (2024-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.4.1
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] pl_PL.UTF-8/pl_PL.UTF-8/pl_PL.UTF-8/C/pl_PL.UTF-8/pl_PL.UTF-8
time zone: Europe/Warsaw
tzcode source: internal
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] nonprobsvy_0.2.1 survey_4.4-2 survival_3.8-3 Matrix_1.7-3
[5] RANN_2.6.2
loaded via a namespace (and not attached):
[1] codetools_0.2-20 mitools_2.4 doParallel_1.0.17
[4] ncvreg_3.15.0 lattice_0.22-7 splines_4.4.2
[7] iterators_1.0.14 parallel_4.4.2 foreach_1.5.2
[10] DBI_1.2.3 formula.tools_1.7.1 compiler_4.4.2
[13] tools_4.4.2 Rcpp_1.0.14 tcltk_4.4.2
[16] operator.tools_1.6.3 MASS_7.3-65
Session info for Linux (Posit Cloud)
R version 4.4.3 (2025-02-28)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 20.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3; LAPACK version 3.9.0
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8 LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8
[6] LC_MESSAGES=C.UTF-8 LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
time zone: UTC
tzcode source: system (glibc)
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] nonprobsvy_0.2.1 survey_4.4-2 survival_3.8-3 Matrix_1.7-2 RANN_2.6.2
loaded via a namespace (and not attached):
[1] codetools_0.2-20 mitools_2.4 doParallel_1.0.17 ncvreg_3.15.0 lattice_0.22-6
[6] splines_4.4.3 iterators_1.0.14 parallel_4.4.3 foreach_1.5.2 DBI_1.2.3
[11] formula.tools_1.7.1 compiler_4.4.3 tools_4.4.3 Rcpp_1.0.14 operator.tools_1.6.3
[16] MASS_7.3-64
I would suspect either the presence of ties or points that are extremely close together. I'm no expert on set.seed, but I'm not certain you can expect consistent results across platforms.
More generally, it's not clear to me that your example isolates cross-platform variation in the random draws or the glm calls. You may need to run those once, save the resulting R objects, and then run the nn2 call on those same saved objects on each platform.
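Something along these lines could isolate the nn2 step (a rough sketch, untested; object and file names are only placeholders):

## on one platform: run the stochastic part (sampling + glm) once and save the nn2 inputs
set.seed(1)
samples <- sample(1:NROW(admin), size = NROW(admin), replace = TRUE)
admin_v <- admin[samples, ]
m1 <- glm(single_shift ~ region + private + nace + size, data = admin_v, family = binomial())
saveRDS(list(pred_nonprob = predict(m1, admin_v, type = "response"),
             pred_prob = predict(m1, jvs, type = "response"),
             admin_v = admin_v), file = "nn2_inputs.rds")

## on each platform: run only the matching step on the same saved inputs
inp <- readRDS("nn2_inputs.rds")
nn_match <- nn2(data = inp$pred_nonprob, query = inp$pred_prob, k = 5)
saveRDS(nn_match$nn.idx, file = "nn_idx_this_platform.rds") ## compare these index matrices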
Thank you for your prompt answer. I have checked the data, and this is not related to the data or the model results.
The `samples <- sample(1:NROW(admin), size = NROW(admin), replace = TRUE)` line produces exactly the same samples (I have compared the sample_d list objects that I stored on both platforms), and the glm results are the same too (i.e., the coefficients agree up to 1e-16).
EDIT:
OK, so I ran the code on exactly the same datasets generated in the simulation study and got the same differences, i.e., -4.626952e-06 for the mean and -5.26779e-07 for the SD (not much, but it can influence some results).
Hi, I verified that the issue is due to the numerical approximation in the glm function (controlled via the glm.control(epsilon=) argument), not to RANN and the underlying ANN library. So I am closing this issue.
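For anyone running into the same issue, one thing that may help (a sketch under the assumption that the small cross-platform differences come from where the IRLS iterations stop; the default epsilon is 1e-8) is tightening the convergence tolerance of the logistic regression fitted inside the loop:

## tighter convergence criterion for the glm fit
m1 <- glm(formula_y, data = admin_v, family = binomial(),
          control = glm.control(epsilon = 1e-12, maxit = 100))

This should make the fitted probabilities, and hence the nn2 matching, less sensitive to platform-specific floating-point differences, although it does not guarantee bitwise identical results.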