1

So, I have to generate a int vector of length N in which only n casual elements are 1, the others are 0. In order to do this, I created a vector of length N, initialized it so that the first n elements are 1 and the others are 0, and then started to shuffle it with a simple function which takes a vector of int, generates random numbers from 0 to N and reshuffles the vector following the output of rand. My question is: how many times can I "trust" the random generator to give me different sequences of numbers so that I can get every time a different vector? If I run this function, let's say, 1 million times, do I obtain 1 million different combinations (provided that there are more than 1m different ways to reorder my vector)? If not, how should I proceed? There's someway to check whether or not I'm generating a sequence that has been previously generated?

Edit: Regarding the possible Flawedness of the algorithm, here it is (I do srand(time(NULL)) only once in my main function, before calling this one):

 void Shuffle(int vector[],int n, int N)
 {
  int i = 0;
  int j = 0;
  for(i=0;i<n;i++)
     {
      j = rand() % N;
      if(j!=i)
         swap(&vector[i],&vector[j]);
     }
  }

Where swap is a function to swap the elements of a vector. I don't see why should be flawed. Do i get some result more likely than others? I know the Fisher-Yates Shuffle algorithm, I just wrote this to save some time on execution...

  • 2
    Sounds like you have a bug in your shuffle. https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle and in particular https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle#Implementation_errors – Karoly Horvath Aug 18 '16 at 15:38
  • If you are using rand to generate a permutation by calling it N times and using those random values as indices of the output order, yes that's a bug. It's unclear exactly what you mean by "reshuffle". – Mark Lakata Aug 18 '16 at 15:42
  • My function does that: it's a for on the first n elements, for every element generates a random number from 0 to N-1, and if that number it's different from the position of the element, switch the two positions. Is it correct? – Francesco Di Lauro Aug 18 '16 at 15:48
  • This isn't just due to the total sequence length of your RNG. It's entirely possible to have repeated subsequences, even if the total sequence length is way higher than you need. – Oliver Charlesworth Aug 18 '16 at 15:58

5 Answers5

3
  1. As mentioned in a comment, that shuffle algorithm is flawed. You should use the Fisher-Yates shuffle. The proof that the algorithm is biased is relatively simple: Consider the probability that the sequence of 1s and 0s is unaltered by the algorithm. That will happen if every one of the n random numbers selected is less than n, which has a probability of (n/N)n or nn/Nn. The correct probability is 1/(N choose n), which is n!/(N×(N-1)×…(N-n+1)). If n is small relative to N, the latter expression is quite close to n!/Nn. Since nn is quite a lot bigger than n!, the probability of the algorithm producing the unaltered sequence is much larger than it should be. (Sequences with most but not all of the 1s in their original spots are also over-produced, but not as dramatically.)

  2. You should never call srand more than once in any program (unless you really know what you are doing). srand seeds the random number generator; after seeding, you should just call rand each time you need a new number. (This point is motivated by the title for the question, and the fact that using srand incorrectly seems to be very common.)

  3. The standard C library rand function does not provide any guarantees about quality, and some implementations have distressingly small ranges and short cycles. But they are probably good enough to do even a million random shuffles.

  4. Even if your random number generator produced distinct sequences every time, and even if you fixed your shuffle function to do a proper Knuth-Yates shuffle, you will still get repeats because the vector you are shuffling has repeated values. As a result, two different shuffles can produce the same sequence. Consider the simple case where n is 2, so your initial vector is a two 1s followed by N-2 0s. The two 1s are indistinguishable from each other, so if your first step swaps to position k and your second one to position l, that will produce exactly the same result as swapping first to l and then to k.


I think what you really want to do is construct a random combination of n out of N objects. There are N choose n such possible combinations; ideally, each such combination should be generated with equal probability.

Below are a few algorithms which accomplish this. All of them are O(N) time, because it is impossible to fill in a boolean vector of length N in less than linear time. However, if you could live with just the list of indices of the 1s, then the second algorithm is O(n), or O(n log n) if you need the indices to be sorted. A true O(n) algorithm which produces indices in sorted order can be found in the paper referenced in this answer, and that might be appropriate if N is very large and n is reasonably small.

The following function is used by several of the algorithms. It could be improved, as its comment indicates, but it will work fine with a good RNG. rand() is not a good RNG.

/* This is not a good implementation of rand_range
 * because some rand() implementations exhibit poor randomness
 * of low-order bits. (And also the bias issue if RAND_MAX is
 * small.) Better random number generators exist :) */
/* Produces a random integer in the half-open range [lo, hi) */
int rand_range(int lo, int hi) {
  return lo + rand() % (hi - lo);
}

1. Reservoir sampling

A simple algorithm which works with large sample sizes is reservoir sampling:

/* vec must be a vector of size at least N. Randomly
 * fills the vector with n 1s and N-n 0s.
 */ 
void random_fill(int vec[], int N, int n) {
  int i;
  for (i = 0; n; ++i) {
    if (rand_range(0, N-i) < n) {
      vec[i] = 1;
      --n;
    }
    else
      vec[i] = 0;
  }
  for (; i < N; ++i) vec[i] = 0;
}

2. Shuffling indices

Another possibility is to generate the indices of the 1s by doing a prefix shuffle on the list of indices:

int random_fill(int vec[], int N, int n) {
  /* For simplicity, use a temporary vector */
  int* inds = malloc(N * sizeof *inds);
  for (int i = 0; i < N; ++i) inds[i] = i;
  for (int i = 0; i < n; ++i) {
    int j = rand_range(i, N);
    int t = inds[j]; inds[j] = inds[i]; inds[i] = t;
  }
  for (int i = 0; i < N; ++i) vec[i] = 0;
  for (int i = 0; i < n; ++i) vec[inds[i]] = 1;
  free(inds);
}

3. Select from the enumerated sequence

If N choose n is not too big (that is, you can compute it without integer overflow), one way of generating a random sequence is to choose a random integer less than N choose n and then produce the combination with that ordinal using some enumeration of possible sequences. (If you are using rand(), you should be aware that even if N choose n is computable without overflow, it might still be greater than RAND_MAX, in which case rand() will not generate the full range of possible ordinals.)

The above reservoir sampling algorithm can be adapted directly to produce an enumeration:

/* Fills vec with the kth sequence of n 1s and N-n 0s, using
 * an unspecified ordinal sequence; every value of k between 0
 * and (N choose n) - 1 produces a distinct sequence.
 */
void ordinal_fill(int vec[], int N, int n, int k) {
  for (int i = 0; N; ++i, --N) {
    int r = (k * n) % N;
    if (r < n) {
      vec[i] = 1;
      k = (k * n) / N;
      --n;
    } else {
      vec[i] = 0;
      k = (k * (N - n)) / N;
    }
  }
}

(live on ideone)

The program above does not make any assumptions about the ordinal value other than that it is positive and fits in an integer. In effect, it will be taken modulo N choose n, although that value is never explicitly computed. If you used uint64_t instead of int and a random number generator which could produce random numbers in a large range, you could generate random sequences by just feeding the function a random number. Of course, this wouldn't guarantee that the sequences were unique.

In essence, the function works by using the ordinal value (k) as the source of the "random" numbers required by the reservoir sampling algorithm. Every ordinal number (mod N choose n) corresponds to a different sequence (proof left as an exercise). Because the ordinal space is partitioned by modulo rather than magnitude, the ordinal sequence is probably not particularly useful as a sequence, but it is guaranteed to be a complete sequence.

Partitioning by magnitude (using something like the combinatorial numbering system) might be faster -- it doesn't require division, for example -- but it would require efficient access to binomial numbers, which the above function does not need. If the binomial coefficients are computed for each step, then a division will be needed, removing a lot of the speed advantage.

Community
  • 1
  • 1
rici
  • 234,347
  • 28
  • 237
  • 341
  • This is a good implementation, but what bothers me is that potentially could take a long time when I run it. I mean that it could take (N-n)n steps (if I'm not mistaken) to give me the vector. The algorithm I'm using it's done in only n steps. – Francesco Di Lauro Aug 18 '16 at 17:17
  • @FrancescoDiLauro: Yes, but this one is unbiased. A very simple test will show the bias in your algorithm. Faster algorithms exist but they rely on numerical analysis which is too complicated to include here. – rici Aug 18 '16 at 17:22
  • @FrancescoDiLauro: Added an indirect reference to a paper which provides a faster algorithm. Here's a simple proof of the bias in your algorithm: the probability that the "shuffle" results in the original vector is (n/N)^n. But it should be 1/(N choose n). Say n is 2; the first expression is 4/N^2 and the second one is 2/(N(N-1)), or about 2/N^2, so the original vector is about twice as likely as it should be. – rici Aug 18 '16 at 17:26
  • On 2nd thought I agree, but it is the names `lo, hi` that threw me off. The names sound like it should be `[lo, hi]` inclusive both ends. Perhaps a name different than `hi`? (Your added comment will do - UV) – chux - Reinstate Monica Aug 18 '16 at 17:37
  • @FrancescoDiLauro: OK, added another algo. I don't think it will actually be any faster in practice. – rici Aug 18 '16 at 17:37
  • Another way to do it could be simply initialize to 0 the vector, then generate n random numbers (check if they are different) from 0 to N and then change the values to 1? I think this is a good and simple unbiased solution.... – Francesco Di Lauro Aug 18 '16 at 17:39
  • @FrancescoDiLauro: That is unbiased but it gets slow when n is close to N, because you end up rejecting a lot of random numbers. The index shuffle I propose differs only in the initialization. – rici Aug 18 '16 at 17:40
  • All of these algos use N steps, including initialization. The first one needs N-n random numbers; the second one needs only n. None of them need (N-n)n steps; perhaps that was a typo (I didn't notice it earlier). – rici Aug 18 '16 at 17:55
  • Your first algo says you could have "low bit bias". Could you elaborate? Do you mean for small values of N? – Francesco Di Lauro Aug 18 '16 at 18:06
  • @FrancescoDiLauro: It's not the algo; it's the function which produces random numbers in a range. Some rand() implementations are [LCGs](https://en.wikipedia.org/wiki/Linear_congruential_generator), which exhibit very poor randomness of the low-order bits. – rici Aug 18 '16 at 18:59
  • It can be even simpler, generate the array of zeroes and ones 00...01...11, and shuffle it based on some seed, using knuth shuffle. – 2501 Aug 19 '16 at 15:44
  • @2501: That's essentially what my first algorithm does. It is even simpler, because it doesn't require a swap; it knows the probability of getting a 1 or a 0 without having to look at a random place in the vector. – rici Aug 19 '16 at 16:12
  • Yeah, it is very clever. Now if we could only get rid of that modulo op. – 2501 Aug 19 '16 at 16:20
  • @2501: if you know that the range of the RNG is a power of two, it is easy to partition the range using a multiply and a shift. Another argument for using an RNG with more precise specifications. – rici Aug 19 '16 at 18:00
  • I do like the last algo you used, it's O(N) and i do have to generate only one random number per vector. I think I'll go for it! – Francesco Di Lauro Aug 20 '16 at 23:01
  • @rici Also, there's a typo in the first algo, i should go from 0 to N – Francesco Di Lauro Aug 20 '16 at 23:25
  • @FrancescoDiLauro: If you are referring to `for (i = 0; n; ++i)`, then that is not a typo. The loop terminates when `n` is reduced to 0; at that point `i` is certainly not greater than `N` (unless `n` were greater than `N` to start with, which is an invalid input). The second loop terminates when `i` reaches `N`. (The third algorithm could have been optimized in the same way. I just didn't bother.) – rici Aug 20 '16 at 23:38
  • @francesco: i just saw your comment where you say that typical values of N and n are 500 and 120. That is way too big for the enumeration algorithm, even with 64-bit integers. – rici Aug 21 '16 at 03:03
  • Let's say n=1 and N=100. First iteration: i=0, the algo generates a number, let's suppose it's greater than 1, so the algo puts vec[0]=1. Then i becomes 1, and again let's say the rand function generates another number greater than 1, there's another 0, then the for cycle ends (i becomes = n) and so the vec is filled with 0. No place for the 1?Am i missing something? – Francesco Di Lauro Aug 21 '16 at 21:06
  • @FrancescoDiLauro: You are not exactly missing something; you are inventing something. The loop condition is `n`, not `i < n`. Every time a one is inserted, `n` is decremented so `n` is always the number of ones left to insert and the loop terminates when no more ones are needed. When `n` is decremented, the probability of selecting a one also changes; that probability is always the number of 1s needed divided by the number of slots left to fill. – rici Aug 21 '16 at 23:09
  • Forgive me, but i read that the cycle is on i, so could potentially become i =n without n lowering, so it end cycle without putting ones... sorry I don't get it. It's the ++i condition instead of i++? I'm totally lost here :-) – Francesco Di Lauro Aug 22 '16 at 08:20
  • Sorry I'm an idiot, for some reason I've read for(i=0;i – Francesco Di Lauro Aug 22 '16 at 08:53
2

Number all the possible sequences (read the enumeration part of the Wikipedia article about combinations), then select each one sequentially (optionally after randomizing).

#1   - 1 1 1 1 1 1 1 1 0 0 0
#2   - 1 1 1 1 1 1 1 0 1 0 0
#3   - 1 1 1 1 1 1 1 0 0 1 0
#4   - 1 1 1 1 1 1 1 0 0 0 1
#4   - 1 1 1 1 1 1 0 1 1 0 0
...
#165 - 0 0 0 1 1 1 1 1 1 1 1
pmg
  • 106,608
  • 13
  • 126
  • 198
  • +1, but I have to admit it's not obvious to me how you'd programmatically go from the ordinal to the specific sequence... – Oliver Charlesworth Aug 18 '16 at 16:20
  • @OliverCharlesworth You mean how to iterate efficiently preserving n digits? – 2501 Aug 18 '16 at 16:45
  • @2501: I mean given e.g. `4`, how do you get to `1 1 1 1 1 1 1 0 0 0 1` (in the above example) without iterating over all previous sequences? – Oliver Charlesworth Aug 18 '16 at 16:50
  • @2501: You use the combinatorial number system, as mentioned in that wikipedia page, to decompose the ordinal. Or you can use the reservoir sampling algorithm, using the ordinal to generate the successive random numbers. – rici Aug 18 '16 at 17:05
  • @rici Of course sampling is really easy, that should work. Here let me push you over 100k :-) – 2501 Aug 18 '16 at 18:59
  • @rici - Reservoir sampling - isn't that back to the OP's original problem? (How to guarantee unique random number sequences.) – Oliver Charlesworth Aug 18 '16 at 19:50
  • @OliverCharlesworth: I added the ordinal->sequence via reservoir sampling algorithm to my [answer](http://stackoverflow.com/a/39022796/1566221). – rici Aug 19 '16 at 16:08
1

You can calculate the CRC of your sequence, which will be sensitive to the order. Find a public domain implementation of CRC32 and save the 32 bit value for each sequence. If the CRCs differ, the sequence is different. If the CRCs are the same, they are probably the same (there is a 1/4 billion chance they will have the same CRCs but different sequences).

Mark Lakata
  • 19,989
  • 5
  • 106
  • 123
0

The period length of a lot of PRNGs is know, see for example the thesis of Amy Glen "On the Period Length of Pseudorandom Number Sequences" [2002]. The period of the actual rand() implementation of your LibC is unknown, you need to look it up in the sources (e.g.: glibc-2.22/stdlib/rand_r.c) and compute it yourself (howto in the thesis above) or the documentation (unlikely).

Please be aware that you need a period of x * N with x the number of freshly shuffled vectors and N the length of that vector and also, obviously, that the chance that two shuffles result in the same outcome is inversely proportional to the size of N, that is, the smaller N the higher the chance that you get two equal vectors.

If you want to minimize that risk you need something with a good and guaranteed avalanche, like a cryptographic checksum. They are computationally expensive but you can use the sums directly (you said you need zeros and ones only) and concatenate if necessary. The problems with very small 'N' won't go away but are minimized.

It also depends a bit on what you plan to do, what you need it for. Sometimes, especially with Monte Carlo methods one PRNG seems to get better results than another one.

[2002] Glen, Amy. "On the Period Length of Pseudorandom Number Sequences." University of Adelaide, Australia (2002). Available at thesis.pdf (downloaded at 2016.08.18)

deamentiaemundi
  • 5,502
  • 2
  • 12
  • 20
0

If I run this function, let's say, 1 million times, do I obtain 1 million different combinations (?)

OP has code

srand(time(NULL)); // only once in my main function

time() typically returns a integer count of seconds. Then any call to the program - in the same second - will produce identical results, so testing may need to wait about 12 days to test 1 million combinations.

Even if another source is used for calling srand(unsigned seed), seed only takes on the the values of [0...UINT_MAX] which may be as small as 65,636 different values. (Typically unsigned has 4,294,967,296 different values).

Verify the range of unsigned and consider additional sources of "random" initialization such as process ID or dev/random

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • So i assume that I'm "guaranteed" 65,635 different random numbers. Am I wrong? – Francesco Di Lauro Aug 18 '16 at 18:23
  • @Francesco Di Lauro (65,536 vs. 65,535). Though not guaranteed by the spec, reasonable to assume. The details and quality of `srand()/rand()` are not specific. – chux - Reinstate Monica Aug 18 '16 at 18:55
  • @FrancescoDiLauro: RAND_MAX might be as small as 37267 so you are really only guaranteed 32768 different values. Depending on the RNG, there might be a (much) longer cycle even though individual values repeat. Or there might not be; the standard is lenient. Consider using a better RNG if it matters to you. – rici Aug 18 '16 at 19:10
  • @rici and OP Detail: The 65,536 or more answered/commented here is about the number of _different starting points_ in the random number generation sequence code will experience. This differs from the _range of values_ returned from `rand()` which is `[0...RAND_MAX]` and `RAND_MAX <= INT_MAX <= UINT_MAX`. Further the _cycle_, as commented by rici, could be _far_ longer before the _pattern_ repeats. – chux - Reinstate Monica Aug 18 '16 at 20:54
  • @chux: Another detail: 7.22.2.1/5: "The value of the RAND_MAX macro shall be at least 32767." Since `rand` returns a positive `int`, `32767 <= RAND_MAX <= INT_MAX`. `srand` does accept an `unsigned` but that has no impact on `RAND_MAX`. – rici Aug 18 '16 at 21:58
  • Also, your comment about having to wait 12 days implies that each run of the program only creates a single sequence. I don't think that is necessarily the case. – rici Aug 18 '16 at 22:01
  • @rici True. OP did say " If I run this function, let's say, 1 million times" and I took that to imply once per program and the program run many times. OTOH multiple times per program invocation does make more sense. Unfortunately the post is scant on these details. – chux - Reinstate Monica Aug 18 '16 at 22:16
  • My program has to generate 100 vectors, with something like n=120, and N=500 (numbers are made up but the order is correct) – Francesco Di Lauro Aug 20 '16 at 22:54