big.matrix alters R's RNG seeds
If I run the following sequence of function calls multiple times, the output of the final rbinom call is always the same:
set.seed(123)
x <- rbinom(25, 1, 0.2)
m <- matrix(1, nrow = 5, ncol = 5)
rbinom(25, 1, 0.2)
but if I use big.matrix, the output of the final rbinom call is different each time:
set.seed(123)
x <- rbinom(25, 1, 0.2)
m <- big.matrix(nrow = 5, ncol = 5, init = 1)
rbinom(25, 1, 0.2)
Any idea why this is happening? I suspected somewhere in the C++ code a 'random' function is called but I couldn't find anything to that effect.
Can you create a file-backed big.matrix with a binary file name and see if you see the same behavior?
If I remember correctly, the big.matrix example creates a shared resource. Since a name is not provided, we generate one using sample(), which consumes random numbers and so advances the RNG state. If the name already exists, we generate another one and check whether it is in use.
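To illustrate the mechanism described above, here is a minimal base R sketch. It does not call bigmemory at all: the sample() call is a hypothetical stand-in for the internal name generation, and the letters-based name is an assumption made only for illustration. It shows that any intervening sample() call moves the RNG state, so a later rbinom() draws from a different point in the stream.

```r
# Seed and make the first draw, as in the original report.
set.seed(123)
x <- rbinom(25, 1, 0.2)

# Snapshot the RNG state before the stand-in call.
seed_before <- .GlobalEnv$.Random.seed

# Hypothetical stand-in for the random name generation; the real
# bigmemory internals may differ.
tmp_name <- paste(sample(letters, 10, replace = TRUE), collapse = "")

# Snapshot the RNG state afterwards and compare.
seed_after <- .GlobalEnv$.Random.seed
print(identical(seed_before, seed_after))  # FALSE: the state has advanced
```

If the number of names tried varies from run to run (for example because an earlier name is still in use), the amount by which the state advances would vary too, which would be consistent with the outputs changing between runs.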
The following gives consistent output across repeated runs:
set.seed(123)
x <- rbinom(25, 1, 0.2)
m <- filebacked.big.matrix(nrow = 5, ncol = 5, init = 1, backingfile = "test", descriptorfile = "test.desc")
rbinom(25, 1, 0.2)
but the output is not the same as that of the plain R matrix sequence above: the last rbinom calls still don't match. This may be unavoidable. If so, it has implications for writing code that reproduces base R results when random functions are involved.
Were there any further thoughts regarding this or is this likely an unavoidable consequence of big.matrix objects?
I'm not sure. I would have thought they should be the same. Let's leave it open for now.
You can use a "stealth" sample function that undoes changes to .GlobalEnv$.Random.seed upon exit, e.g. from https://github.com/HenrikBengtsson/parallelly/blob/037d84d6b9a8f1695328198e32c8284d9362bd88/R/utils.R#L108-L119:
## A version of base::sample() that does not change .Random.seed
stealth_sample <- function(x, size = length(x), replace = FALSE, ...) {
  oseed <- .GlobalEnv$.Random.seed
  on.exit({
    if (is.null(oseed)) {
      rm(list = ".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    } else {
      .GlobalEnv$.Random.seed <- oseed
    }
  })
  sample(x, size = size, replace = replace, ...)
}
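If you're willing to introduce a package dependency, there's also withr::with_preserve_seed(sample(...)); note that withr::with_seed() expects a seed as its first argument, i.e. withr::with_seed(seed, sample(...)).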
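As a quick check of the approach above, the sketch below repeats the stealth_sample() definition so that it runs on its own, then compares a plain sequence against one with an intervening stealth_sample() call. The random name built from letters is a hypothetical stand-in for whatever the package generates internally.

```r
## A version of base::sample() that does not change .Random.seed
stealth_sample <- function(x, size = length(x), replace = FALSE, ...) {
  oseed <- .GlobalEnv$.Random.seed
  on.exit({
    if (is.null(oseed)) {
      rm(list = ".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    } else {
      .GlobalEnv$.Random.seed <- oseed
    }
  })
  sample(x, size = size, replace = replace, ...)
}

# Baseline: two consecutive draws with nothing in between.
set.seed(123)
x <- rbinom(25, 1, 0.2)
baseline <- rbinom(25, 1, 0.2)

# Same sequence with a stealth_sample() call between the draws.
set.seed(123)
x <- rbinom(25, 1, 0.2)
tmp_name <- paste(stealth_sample(letters, 10, replace = TRUE), collapse = "")
with_stealth <- rbinom(25, 1, 0.2)

print(identical(baseline, with_stealth))  # TRUE: the RNG state was restored
```

Because the saved state is written back on exit, the second rbinom() call starts from exactly the same point in the stream as in the baseline.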
Thanks Henrik! I'm not opposed to the extra dependency and will fold this into the package.
In addition to the original report, I'd like to add that using bigmemory::as.big.matrix induces a similar (but not identical) issue. Running the following snippet four times shows that the output is periodic, with the fourth execution repeating the output of the first:
set.seed(123)
m1 <- matrix(1)
m2 <- bigmemory::as.big.matrix(m1)
rnorm(1)
From a 'fresh' instance of R, the outputs of the first three runs are always
-0.6250393
0.5539177
0.7799651
However, if one opens a second instance of R, the output of the second instance picks up from the first instance's next output and then continues with two further values that are unique to that instance. That is, if in instance 1 I run the snippet twice and obtain
-0.6250393
0.5539177
then open up instance 2 and run the snippet, we find
0.7799651
0.3796395
1.005739
after which instance 2 begins repeating, provided instance 1 is left untouched. The behaviour is a little more complicated when going back and forth between instances. It seems as if the sequence in the new instance takes off from what the old one would have generated next, but I'm not sure this fully characterizes the behaviour.