
big.matrix alters R's RNG seeds

Open cdeterman opened this issue 9 years ago • 7 comments

If I run the following sequence of function calls multiple times, the output is always the same:

set.seed(123)
x <- rbinom(25, 1, 0.2)
m <- matrix(1, nrow = 5, ncol = 5)
rbinom(25, 1, 0.2)

but if I use big.matrix, the output of the final rbinom call is different each time:

set.seed(123)
x <- rbinom(25, 1, 0.2)
m <- big.matrix(nrow = 5, ncol = 5, init = 1)
rbinom(25, 1, 0.2)

Any idea why this is happening? I suspected somewhere in the C++ code a 'random' function is called but I couldn't find anything to that effect.

cdeterman avatar May 27 '16 12:05 cdeterman

Can you create a file-backed big.matrix with a binary file name and see if you see the same behavior?

If I remember correctly, the big.matrix example creates a shared resource. Since a name is not provided, we create one using sample. If the name already exists, then we create another one and check whether it is in use.
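To illustrate why a sample call made for name generation would matter, here is a minimal sketch in base R. The helper make_name is a hypothetical stand-in for whatever bigmemory does internally (it is not the package's actual code); the point is only that any draw from the RNG between two calls shifts the stream that the second call sees.

```r
# Hypothetical stand-in for internal name generation; NOT bigmemory's real code.
make_name <- function(n = 8) {
  paste(sample(letters, n, replace = TRUE), collapse = "")
}

# Run A: nothing touches the RNG between the two rbinom calls.
set.seed(123)
a1 <- rbinom(25, 1, 0.2)
a2 <- rbinom(25, 1, 0.2)
seed_a <- .Random.seed

# Run B: a name is generated with sample() in between.
set.seed(123)
b1 <- rbinom(25, 1, 0.2)
nm <- make_name()            # consumes random numbers
b2 <- rbinom(25, 1, 0.2)
seed_b <- .Random.seed

identical(a1, b1)            # TRUE: same seed, same first draw
identical(seed_a, seed_b)    # FALSE: the RNG state has diverged
```

Because the RNG state differs after run B, b2 is drawn from a different point in the stream than a2, which is consistent with the mismatch against the plain matrix sequence reported above.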

kaneplusplus avatar May 27 '16 16:05 kaneplusplus

The following allows consistency when replicated:

set.seed(123)
x <- rbinom(25, 1, 0.2)
m <- filebacked.big.matrix(nrow = 5, ncol = 5, init = 1, backingfile = "test", descriptorfile = "test.desc")
rbinom(25, 1, 0.2)

but the output is not the same as for the plain R matrix call sequence above: the final rbinom calls still don't match. This may be unavoidable. If so, it has implications for writing code that reproduces base R results when 'random' functions are involved.

cdeterman avatar May 27 '16 16:05 cdeterman

Were there any further thoughts regarding this or is this likely an unavoidable consequence of big.matrix objects?

cdeterman avatar Jun 02 '16 13:06 cdeterman

I'm not sure. I would have thought they should be the same. Let's leave it open for now.

kaneplusplus avatar Jun 02 '16 19:06 kaneplusplus

You can use a "stealth" sample function that undoes changes to .GlobalEnv$.Random.seed upon exit, e.g. from https://github.com/HenrikBengtsson/parallelly/blob/037d84d6b9a8f1695328198e32c8284d9362bd88/R/utils.R#L108-L119:

## A version of base::sample() that does not change .Random.seed
stealth_sample <- function(x, size = length(x), replace = FALSE, ...) {
  oseed <- .GlobalEnv$.Random.seed
  on.exit({
    if (is.null(oseed)) {
      rm(list = ".Random.seed", envir = .GlobalEnv, inherits = FALSE)
    } else {
      .GlobalEnv$.Random.seed <- oseed
    }
  })
  sample(x, size = size, replace = replace, ...)
}

If you're willing to introduce a package dependency, there's also withr::with_preserve_seed(sample(...)), or withr::with_seed(seed, sample(...)) if a fixed seed is acceptable.
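The same save-and-restore idea can be wrapped around an arbitrary expression rather than only sample(). The following is a rough sketch in base R; preserve_seed is a hypothetical helper name used for illustration and is not part of bigmemory, parallelly, or withr.

```r
# Hypothetical helper: evaluate `expr`, then put .Random.seed back as it was.
preserve_seed <- function(expr) {
  genv <- globalenv()
  had_seed <- exists(".Random.seed", envir = genv, inherits = FALSE)
  if (had_seed) old_seed <- get(".Random.seed", envir = genv, inherits = FALSE)
  on.exit({
    if (had_seed) {
      assign(".Random.seed", old_seed, envir = genv)
    } else if (exists(".Random.seed", envir = genv, inherits = FALSE)) {
      rm(list = ".Random.seed", envir = genv)
    }
  })
  force(expr)  # the lazily passed expression is evaluated here
}

# Reference run: no RNG use between the two rbinom calls.
set.seed(123)
x1 <- rbinom(25, 1, 0.2)
y1 <- rbinom(25, 1, 0.2)

# Same run, but with a random name generated in between under preserve_seed().
set.seed(123)
x2 <- rbinom(25, 1, 0.2)
nm <- preserve_seed(paste(sample(letters, 8, replace = TRUE), collapse = ""))
y2 <- rbinom(25, 1, 0.2)

identical(y1, y2)  # TRUE: the name generation left the RNG stream untouched
```

One consequence of this design is that the generated name is itself a deterministic function of the current seed, so two calls in a row without any other RNG use would produce the same name. Whether that matters depends on how name collisions are handled.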

HenrikBengtsson avatar Oct 25 '20 16:10 HenrikBengtsson

Thanks Henrik! I'm not opposed to the extra dependency and will fold it into the package.

kaneplusplus avatar Oct 29 '20 15:10 kaneplusplus

In addition to the original report, I'd like to add that bigmemory::as.big.matrix induces a similar (but not identical) issue. Running the following snippet four times shows that the output is periodic, repeating the first output on the fourth execution:

set.seed(123)
m1 <- matrix(1)
m2 <- bigmemory::as.big.matrix(m1)
rnorm(1)

From a 'fresh' instance of R, the outputs of the first three runs are always

-0.6250393
0.5539177
0.7799651

However, if one opens a second instance of R, then the output of the second instance begins with the first instance's next output and continues with two other values that are unique to that instance. That is, if in instance 1 I run the snippet twice and obtain

-0.6250393
0.5539177

then open up instance 2 and run the snippet, we find

0.7799651
0.3796395
1.005739

after which instance 2 will begin repeating if instance 1 is untouched. The behaviour is a little more complicated when going back and forth between instances. It seems as if the sequence in the new instance picks up from what the old one would have generated next, but I'm not sure this fully characterizes the behaviour.
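One way to narrow down which call is responsible would be to compare .Random.seed before and after each step. The sketch below uses a hypothetical helper, rng_changed (not part of any package), and only base R calls so that it runs without bigmemory; the bigmemory call from this thread is left as a comment.

```r
# Hypothetical diagnostic: does evaluating `expr` alter the global RNG state?
rng_changed <- function(expr) {
  get_seed <- function() {
    if (exists(".Random.seed", envir = globalenv(), inherits = FALSE)) {
      get(".Random.seed", envir = globalenv(), inherits = FALSE)
    } else {
      NULL
    }
  }
  before <- get_seed()
  force(expr)                      # evaluate the expression under test
  !identical(before, get_seed())
}

set.seed(123)
plain_matrix <- rng_changed(matrix(1, nrow = 5, ncol = 5))  # FALSE: no draws
one_draw     <- rng_changed(runif(1))                       # TRUE: state advanced

# With bigmemory installed, the calls discussed here could be checked the same way:
# rng_changed(bigmemory::as.big.matrix(matrix(1)))
```

This only detects changes to the current session's RNG state; it would not by itself explain the cross-instance behaviour described above.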

dfleis avatar Jun 04 '21 02:06 dfleis