R - Which seed did this split?

Question

Usually we fix a seed number to produce the same split every time we run the code. So the code

set.seed(12345)
data <- (1:100)
train <- sample(data, 50)
test <- (1:100)[-train]

always gives the same train and test sets (since we fixed the seed).

Now assume that I have data, train, and test. Is there a way to find out which seed number was used to produce train and test from data?

Can you please clarify: has the seed been set using an integer within `set.seed`, but you don't know the number? Or has the seed not been set, and you are trying to reproduce the random state of the system? – Roland, Sep 16 '16 at 14:49

David Robinson · Answer 1 · 2016-09-16T14:10:04.563

8

It's not possible to know with absolute mathematical certainty, but if you have a suspicion about the range in which the seed lies, you can check every seed in that range by "brute force" and see which one leads to the same result.

For example, you could check seeds from 1 to a million with the following code:

tests <- sapply(1:1e6, function(s) {
  set.seed(s)
  this_train <- sample(data, 50)

  all(this_train == train)
})

which(tests)
# 12345

A few notes:

If your dataset or your sample is much smaller, you will start getting collisions: multiple seeds that give the same output. For example, if you were sampling 5 from 10 rather than 50 from 100, there are 34 seeds in the 1:1e6 range that would produce the same result.
If you have no idea how the seed was set, you'd have to check from -.Machine$integer.max to .Machine$integer.max, which on my computer requires about 4.3 billion checks (that will take a while, and you may have to get clever about not storing all the results).
If random numbers were generated after the set.seed(), you'd need to replicate that same behavior between the set.seed and sample lines in your function.
The behavior of sample after a seed is set may differ in very old versions of R, so you may not be able to reproduce a split created on an earlier version.
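
The note about not storing all results could be handled by stopping at the first match instead of building a vector of a million logicals. What follows is a minimal sketch rather than part of the original answer: `find_seed` and its arguments are hypothetical names, and it assumes the split came from `sample(data, size)` called immediately after `set.seed()` on the same R version.

```r
# Hypothetical helper: scan candidate seeds and stop at the first match,
# so no vector of per-seed results is kept in memory.
find_seed <- function(data, train, seeds) {
  for (s in seeds) {
    set.seed(s)
    # identical() compares the whole vector at once, order included
    if (identical(sample(data, length(train)), train)) {
      return(s)
    }
  }
  NA_integer_  # no seed in the range reproduced the split
}

# Usage with the split from the question
set.seed(12345)
data <- 1:100
train <- sample(data, 50)
found <- find_seed(data, train, 1:20000)
```

The function overwrites the global RNG state on every iteration, so it should be run in a session where that does not matter. If the split may have been created with R older than 3.6.0, calling `RNGkind(sample.kind = "Rounding")` before the search selects the earlier sampling behavior.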

edited Sep 16 '16 at 14:10

answered Sep 16 '16 at 13:30

David Robinson


What happens if after setting the seed and before extracting the sample a few random numbers are generated? I tried inserting `invisible(runif(50))` before `train <- sample(data, 50)` and wasn't able to find the seed with your procedure. – nicola Sep 16 '16 at 13:44
@nicola Regarding your second comment, you're right of course; I was thinking of differences between older versions of R (added link). Regarding the first, you'd have to add those random generations in the function right after the `set.seed`. – David Robinson Sep 16 '16 at 13:52
Indeed. So it must be specified that the above works only if the sample we are interested in has been generated right after setting the seed or if the number of random numbers generated between the seed and the sample is known. In the latter case, you have to add something like `runif(n)` after the `set.seed` part in your solution. – nicola Sep 16 '16 at 14:00
@nicola I looked into it and confirmed that you won't be able to recover the state after several random generations if you don't know what random generations were done in between. From `?set.seed` about the (default) Mersenne-Twister: `A twisted GFSR with period 2^19937 - 1 and equidistribution in 623 consecutive dimensions (over the whole period). The ‘seed’ is a 624-dimensional set of 32-bit integers plus a current position in that set`. This would be impossible to brute force and would have collisions. – David Robinson Sep 16 '16 at 14:02
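
The case discussed in these comments, where a known number of random draws happens between `set.seed()` and `sample()`, could be handled along the following lines. This is only a sketch: `find_seed_after` and `burn_in` are hypothetical names, and the intermediate draws are assumed to be exactly one `runif(50)` call, as in nicola's test.

```r
# Hypothetical variant: replay the known intermediate draws before sampling.
find_seed_after <- function(data, train, seeds, burn_in) {
  for (s in seeds) {
    set.seed(s)
    burn_in()  # must consume the RNG exactly as the original code did
    if (identical(sample(data, length(train)), train)) {
      return(s)
    }
  }
  NA_integer_  # no seed in the range reproduced the split
}

# Reproduce nicola's scenario: 50 uniform draws before the split
set.seed(12345)
data <- 1:100
invisible(runif(50))
train <- sample(data, 50)

found <- find_seed_after(data, train, 12000:13000, function() runif(50))
```

If the intermediate draws are unknown, this approach does not apply, which is the limitation described in the last comment.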

score 0 · Answer 2 · answered Sep 16 '16 at 13:27

0

No, this is not possible. Multiple seeds can produce the same series of data. It's non-reversible.

answered Sep 16 '16 at 13:27

Sijmen



It can't be known with perfect, 100% certainty, but the number of possible samples of 50 from 100 (`choose(100, 50)`) is about 10^29. The number of possible seeds is `.Machine$integer.max * 2 + 1` = 4294967295. If the outputs are spread roughly uniformly over those samples (they won't be perfectly, but random number generators are meant to approach that), the probability of another seed producing the same output is therefore on the order of 1/10^19. – David Robinson Sep 16 '16 at 13:33
Fair point. I admit I did not take the time to do the math behind it, just the logic. Thanks for the comment. – Sijmen Sep 16 '16 at 13:40
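
The arithmetic in the comment above can be checked directly in R. This is a rough sketch that assumes every possible output is equally likely for a given seed; the variable names are illustrative only. It also counts ordered outcomes, since `sample()` fixes the order as well as the membership, and applies the same reasoning to the 5-from-10 case mentioned in the first answer.

```r
n_seeds <- .Machine$integer.max * 2 + 1   # 4294967295 candidate seeds

# Unordered samples of 50 from 100, as counted in the comment
n_sets <- choose(100, 50)                 # about 1e29
collision_scale <- n_seeds / n_sets       # a very small number

# sample() also fixes the order, so ordered outcomes may be the better count
n_ordered <- exp(lfactorial(100) - lfactorial(50))

# Small case from the first answer: 5 from 10, seeds 1 to 1e6
n_small <- factorial(10) / factorial(5)   # 30240 ordered outcomes
expected_matches <- 1e6 / n_small         # roughly 33
```

Under the uniformity assumption, about 33 matching seeds are expected in the 1:1e6 range for the small case, which is in line with the 34 reported in the first answer.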
