12

Using the random class, and a seed of time(NULL) the uniform distribution always gives the same first output, even with different compilings, but after the first output behaves as you would expect a pseudorandom number generator to behave.

Is this by construction, or am I using it incorrectly?

MWE:

#include <ctime>
#include <iostream>
#include <random>

using namespace std;

default_random_engine gen(time(NULL));
uniform_int_distribution<int> dist(10,200);

int main()
{
    for(int i = 0; i < 5; i++)
        cout<<dist(gen)<<endl;

    return 0;
}

The first three times I ran this program I get outputs of:

57
134
125
136
112

Before the second try I decided to delete empty line between uniform_int_distribution and int main() just to see if the seed was based on compile time, as you can see, that didn't matter.

57
84
163
42
146

Just running again:

57
73
181
160
46

So on my runs I keep getting 57 first, which of course isn't the end of the world, if I want different outputs I can throw away the first output. But it brings into question whether this is by design (if so why?) or if I am misusing the generator somehow (if so how?).

user2386276
  • 548
  • 5
  • 19
  • 1
    I am able to replicate the problem on Ubuntu 14.04 with both G++ 4.8.2 and Intel compiler 15.0.0. Interesting... – Nicu Stiurca Oct 20 '14 at 22:26
  • Yes, I've only tried this on my Ubuntu 14.10, g++ 4.8.2 compiler, glad I'm not crazy though. – user2386276 Oct 20 '14 at 22:32
  • 1
    Add a constant like 1000000 to time(NULL) and you will see a different first value. What you see is expected for something like an LCG. – user515430 Oct 20 '14 at 23:03
  • 2
    I am also able to replicate the problem. And when I output the first value of `gen()` from each run, I get a different starting number in each run, so it looks like the problem is in `uniform_int_distribution<>` instead of `default_random_engine`. The upper 12 bits of the first value from `gen()` are always the same (probably because they're the same in `time(NULL)`). A probably related (and unanswered) stackoverflow question: http://stackoverflow.com/questions/21843172/c-uniform-int-distribution-always-returning-min-on-first-invocation – Michael Burr Oct 21 '14 at 08:17
  • 2
    And a quick test shows that if I reverse the bytes in the `time(NULL)` seed, the problem goes away. So `default_random_engine` and/or `uniform_int_distribution<>` seem to be sensitive (I'm sure that isn't the right word) to the state of the high bits of the seed. – Michael Burr Oct 21 '14 at 08:25

3 Answers3

9

I'm not sure what's going wrong (yet!), but you can still initialize by time as follows without hitting the problem (borrowed from here).

#include <ctime>
#include <iostream>
#include <random>
#include <chrono>

using namespace std;

unsigned seed1 = std::chrono::system_clock::now().time_since_epoch().count();

default_random_engine gen(seed1); //gen(time(NULL));
uniform_int_distribution<int> dist(10,200);

int main()
{
    for(int i = 0; i < 5; i++)
        cout<<dist(gen)<<endl;

    return 0;
}

You can also use the random device, which is non-determinstic (it steals timing information from your key strokes, mouse movements, and other sources to generate unpredictable numbers). This is the strongest seed you can choose but the computer's clock is the better way to go if you don't need strong guarantees because the computer can run out of "randomness" if you use it to often (it takes many key strokes and mouse movements to generate a single truly random number).

std::random_device rd;
default_random_engine gen(rd());

Running

cout<<time(NULL)<<endl;
cout<<std::chrono::system_clock::now().time_since_epoch().count()<<endl;
cout<<rd()<<endl;

on my machine generates

1413844318
1413844318131372773
3523898368

so the chrono library is providing a significantly larger number and more rapidly changing number (that's in nanoseconds) than the ctime library. The random_device is producing non-deterministic numbers which are all over the map. So it seems as though the seeds ctime is producing may be too close together somehow and thus map partly to the same internal state?

I made another program which looks like this:

#include <iostream>
#include <random>
using namespace std;

int main(){
  int oldval           = -1;
  unsigned int oldseed = -1;

  cout<<"Seed\tValue\tSeed Difference"<<endl;
  for(unsigned int seed=0;seed<time(NULL);seed++){
    default_random_engine gen(seed);
    uniform_int_distribution<int> dist(10,200);
    int val = dist(gen);
    if(val!=oldval){
      cout<<seed<<"\t"<<val<<"\t"<<(seed-oldseed)<<endl;
      oldval  = val;
      oldseed = seed;
    }
  }
}

As you can see, this simply prints out the first output value for every possible random seed up to the current time along with the seed and number of previous seeds which had the same value. An excerpt of the output looks like this:

Seed  Value Seed Difference
0 10  1
669 11  669
1338  12  669
2007  13  669
2676  14  669
3345  15  669
4014  16  669
4683  17  669
5352  18  669
6021  19  669
6690  20  669
7359  21  669
8028  22  669
8697  23  669
9366  24  669
10035 25  669
10704 26  669
11373 27  669
12042 28  669
12711 29  669
13380 30  669
14049 31  669

So for every new first number there are 669 seeds which give that first number. Because the second and third numbers are different we are still generating unique internal states. I think we would have to understand much more about the default_random_engine in order to understand what is special about 669 (which can be factored into 3 and 223).

Given this, it's clear why the chrono and random_device libraries work better: the seeds they generate are always more than 669 apart. Keep in mind that even if the first number is the same what matters in many programs is that the sequence of numbers generated by distinct.

Richard
  • 56,349
  • 34
  • 180
  • 251
  • okay that's interesting, so its seems to be related to the ctime implementation, though I don't understand much of the use of chrono – user2386276 Oct 20 '14 at 22:15
  • I would favour the random device here, since we'd be using it only once. – E_net4 Oct 20 '14 at 22:33
  • @E_net4 (tagging you in case you could help) Wow, that comment about the random device is awesome, I was not aware it existed, thanks for that. When you say it can run out of randomness if it is used too often, what defines too often? Like I can only use it X times and then I can never ever get a random number again, or like X times a day? or like X times before recompiling? I don't necessarily need specifics, but a general idea of what you meant. – user2386276 Oct 20 '14 at 22:37
  • Your edit about `chrono` being more precise than `ctime` is interesting, I just now ran the code a few more times and it is starting with `10` for all of the runs now. So it appears to update the first term, just slowly. – user2386276 Oct 20 '14 at 22:41
  • Using two different global variables you might run into an initialization order problem. – Mark Ransom Oct 20 '14 at 22:55
  • I definitely agree, the first number is in some sense less important, as long as we are getting other good output, and I agree that definitely makes sense why `chrono` and `random_device` are better. But if you have it, I would still like a bit better of an explanation of how often, roughly, I could use a random_device. Or more specifically what I would do, in order to be able to use it again later; do I just need to wait a day? or just smash my keys more a moment? – user2386276 Oct 20 '14 at 22:57
  • 1
    @MarkRansom, if you look at my second seed-searching program you'll note that it uses no global variables but still demonstrates the OP's problem. Therefore, order of initialization is not at issue here. If you consider OP's program along with my first program, you'll also note that there is _only one_ possible order of initialization for the globals in question. – Richard Oct 20 '14 at 22:58
  • 1
    @user2386276, your computer is continually collecting entropy. So if you run your program not too often than `random_device` will also have random numbers from you. If you run your program often (as in, probably, many times per second) then you will exhaust your entropy supply. You can play with this on Linux by typing `cat /dev/random`. The first time you run this it will likely generate many characters of random stuff. On subsequent runs it will generate fewer characters. If you wait a while and run it again, you'll have rebuilt your store of entropy. – Richard Oct 20 '14 at 23:01
  • 1
    So @user2386276, I wouldn't worry about it too too much. Also, the [documentation](http://www.cplusplus.com/reference/random/random_device/) for `random_device` says it will throw an exception if entropy is unavailable. In that case, you should probably fall back to using the system clock (otherwise this is a great way to DOS your program). Play with `cat`, read up on entropy generation a bit (search around), and eventually you'll have a feel for it. – Richard Oct 20 '14 at 23:03
  • That's exactly what I was looking for, thanks for your help with this problem, been bothering me all day. Plus now I know how to test what the problem is. – user2386276 Oct 20 '14 at 23:03
  • Your first code snippet definitely has the potential for global order initialization problems, that's what I was commenting on - it has nothing to do with OP's original problem or any other code in your answer. – Mark Ransom Oct 20 '14 at 23:06
  • @MarkRansom: Since the OOI of `dist` and `gen` in that snippet cannot be responsible for OP's problem and `gen` cannot be initialized without `seed1` having been initialized I don't see how any problems could arise. – Richard Oct 20 '14 at 23:24
  • What makes you think that `seed1` must be initialized before `gen`? It's not because one depends on the other. – Mark Ransom Oct 21 '14 at 00:36
  • @MarkRansom, given that this discussion has, as you say, "nothing to do with OP's original problem" I don't think it's worth continuing it. If you disagree, I'd suggest that you provide links to specific documentation, examples, or references elucidating the problem you claim exists. There is insufficient room in this margin. – Richard Oct 21 '14 at 02:25
  • I know this is late. I'm not sure there is nothing special about 669 and `default_random_engine`, that was just the seed difference that was produced because of the chosen range `(10,200)`. Indeed for the small range `(0,50)` you get a seed difference of `2505` because a larger seed difference is needed to get a different mapped value in to that smaller range. A bigger range would create a smaller seed difference and so on. If you examine the output of a `default_random_engine` object you see the first value barely changes when the seed does which is the source of the issue. – Silversonic Dec 13 '15 at 15:24
3

Using std::default_random_engine is like saying "Surprise me!" in a bad restaurant. The only thing you know for certain is that the result will be poor - since the generators provided by <random> are all deficient - but you won't even know which particular defects you have to deal with.

The Mersenne Twister can be a decent choice if - and only if - it is properly seeded, and therein lies the rub. Ideally, every bit of the seed should affect every bit of the resulting generator state with equal probability; as you discovered, that is just not the case in common implementations of std::mersenne_twister_engine.

Mersenne Twisters are normally initialised with the output of a simpler PRNG that has in turn been seeded by whatever entropy is available. This effectively stretches the seed entropy of the simpler PRNG over the huge state of the twister. The makers of the standard thoughtfully provided the seed_seq interface for this purpose; however, it seems that the library does not contain any adapters for using a generator as a seed sequence.

There is also the discrepancy between two different conceptions of seeding. On the generator side, the seeding function should take the entropy that was passed in and map it faithfully onto the generator state, ensuring that no entropy is lost in the process. On the user side, it's "Take these numbers and give me wildly different sequences", where 'these numbers' is { 1, 2, 3, ... } or clock() output.

In other words, the seed entropy is offered in a form that is not suitable for initialising generator state directly; small seed differences give small state differences. This is especially problematic with huge lagged generators like the Mersenne Twister or the lagged Fibonacci that powers the std::ranluxXX generators.

A bit-mixing function - a bijective function where every bit of the output depends on every bit of the input with equal probability - can help with making seeds like 1, 2, 3 or clock() output more useful for seeding. The murmur hash mixer comes close to this ideal, by achieving almost perfect diffusion (32-bit version shown):

uint32_t murmur_mix32 (uint32_t x)
{
   x ^= x >> 16;
   x *= 0x85EBCA6B;
   x ^= x >> 13;
   x *= 0xC2B2AE35;
   x ^= x >> 16;

   return x;
}

The function is bijective, hence it does not lose any entropy at all. This means you can use it to improve any seed without any danger of making things worse.

Another quick fix - without the effort of making a seed_seq - is to call discard() on the generator with a parameter that depends on the (murmur-mixed) seed. However, the effect on huge generators like the Mersenne Twister is somewhat limited, since their state evolves extremely slowly and they need hundreds of thousands of iterations for full recovery from deficient states.

DarthGizka
  • 4,347
  • 1
  • 24
  • 36
  • It makes sense that the standard library wouldn't have something that was the best possible, definitely not cryptographically secure, but it gets the job done if you need to roll a dice or something. You spoke on needing to seed the mersenne twister properly, would the `random_device` not be a great choice for this? It is toted as a _true_ random number generator after all. – user2386276 Oct 23 '14 at 00:06
  • All of the generators in the standard library fail standard PRNG tests like [TestU01](http://simul.iro.umontreal.ca/testu01/tu01.html) *systematically*, except for the ranlux generators with a 'large enough' luxury factor. You have to be circumspect with these generators and choose them carefully. What I find odd is that the standard does not contain any good choice at all apart from the simple LCG which will probably always have its uses (as long as you understand the caveats). What happened to "at least acceptable engine behavior for relatively casual, inexpert, and/or lightweight use"? – DarthGizka Oct 24 '14 at 14:18
  • random_device is not guaranteed to be available; otherwise it would be a good choice for obtaining a random seed. However, one should probably resist the temptation of making a seed_seq by wrapping a random_device, as that would make program runs non-repeatable. Better to extract a reasonable fixed quantity of bits - 32, 64 or 256 as the case may be - that can be printed to the output or a log and later fed to the program if a repeat run is desired. The fixed amount of entropy can then be made into a seed_seq of the desired size using a simple, robust generator like xorshift*. – DarthGizka Oct 24 '14 at 14:23
  • Unfortunately, the standard does not contain any simple, robust generators. Your best bet would probably be ranlux24; however, it is neither simple nor robust and in fact it is itself one of the generators for which one would normally employ a seed generator. The LCG generators are simple enough but the periodicity of their output bits (lowest bit has period 2, the two lowest together have period 4 and so on) means that there is potential for unforeseen interaction with the generator to be seeded. If I were stuck with the library I'd take an LCG as seed rng and pass output through murmur_mix() – DarthGizka Oct 24 '14 at 14:33
0

the seed you are using could be introducing a bias, if using a different seed yields the same results then the generator itself is not properly written.

I would suggest testing with different seeds to draw a conclusion.

Farai
  • 180
  • 2
  • 12
  • That's a good idea. I'll admit to not knowing how to seed the generator a different way. I can just feed it numbers (like 10 or 127) but that will obviously give the exact same output everytime. Richard's comment above seems to suggest that it is the seed that is introducing bias. – user2386276 Oct 20 '14 at 22:17