As mentioned in a comment, that shuffle algorithm is flawed. You should use the Fisher-Yates shuffle. The proof that the algorithm is biased is relatively simple: consider the probability that the sequence of 1s and 0s is unaltered by the algorithm. That will happen if every one of the n random numbers selected is less than n, which has a probability of (n/N)^n, i.e. n^n/N^n. The correct probability is 1/(N choose n), which is n!/(N×(N−1)×…×(N−n+1)). If n is small relative to N, the latter expression is quite close to n!/N^n. Since n^n is quite a lot bigger than n!, the probability of the algorithm producing the unaltered sequence is much larger than it should be. (Sequences with most but not all of the 1s in their original spots are also over-produced, but not as dramatically.)
You should never call srand more than once in any program (unless you really know what you are doing). srand seeds the random number generator; after seeding, you should just call rand each time you need a new number. (This point is motivated by the title of the question, and by the fact that using srand incorrectly seems to be very common.)
The standard C library rand function does not provide any guarantees about quality, and some implementations have distressingly small ranges and short cycles. But they are probably good enough for even a million random shuffles.
Even if your random number generator produced distinct sequences every time, and even if you fixed your shuffle function to do a proper Fisher-Yates shuffle, you would still get repeats, because the vector you are shuffling has repeated values. As a result, two different shuffles can produce the same sequence. Consider the simple case where n is 2, so your initial vector is two 1s followed by N−2 0s. The two 1s are indistinguishable from each other, so if your first step swaps to position k and your second one to position l, that will produce exactly the same result as swapping first to l and then to k.
I think what you really want to do is construct a random combination of n out of N objects. There are N choose n such possible combinations; ideally, each combination should be generated with equal probability.
Below are a few algorithms which accomplish this. All of them are O(N) time, because it is impossible to fill in a boolean vector of length N in less than linear time. However, if you could live with just the list of indices of the 1s, then the second algorithm is O(n), or O(n log n) if you need the indices to be sorted. A true O(n) algorithm which produces the indices in sorted order can be found in the paper referenced in this answer, which might be appropriate if N is very large and n is reasonably small.
The following function is used by several of the algorithms. It could be improved, as its comment indicates, but it will work fine with a good RNG. rand() is not a good RNG.
/* This is not a good implementation of rand_range
* because some rand() implementations exhibit poor randomness
* of low-order bits. (And also the bias issue if RAND_MAX is
* small.) Better random number generators exist :) */
/* Produces a random integer in the half-open range [lo, hi) */
int rand_range(int lo, int hi) {
    return lo + rand() % (hi - lo);
}
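For reference, one common way to remove the modulo bias mentioned in the comment is rejection sampling. This variant is my own sketch, not part of the original answer, and it still inherits rand()'s weak low-order bits:

```c
#include <stdlib.h>

/* rand_range with the modulo bias removed by rejection sampling.
 * Draws of rand() that fall in the incomplete final block of
 * [0, RAND_MAX] are discarded, so every residue mod (hi - lo) is
 * equally likely. Assumes 0 < hi - lo <= RAND_MAX. */
int rand_range_unbiased(int lo, int hi) {
    int span = hi - lo;
    /* Largest acceptable draw: just below the highest multiple of
     * span that fits among the RAND_MAX + 1 possible values.
     * Written this way to avoid overflowing RAND_MAX + 1. */
    int max_valid = RAND_MAX - ((RAND_MAX % span) + 1) % span;
    int r;
    do {
        r = rand();
    } while (r > max_valid);
    return lo + r % span;
}
```

The expected number of rejected draws is below one per call, since at most span − 1 of the RAND_MAX + 1 values are discarded.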
1. Reservoir sampling
A simple algorithm which works with large sample sizes is reservoir sampling:
/* vec must be a vector of size at least N. Randomly
 * fills the vector with n 1s and N-n 0s.
 */
void random_fill(int vec[], int N, int n) {
    int i;
    for (i = 0; n; ++i) {
        if (rand_range(0, N - i) < n) {
            vec[i] = 1;
            --n;
        }
        else
            vec[i] = 0;
    }
    for (; i < N; ++i) vec[i] = 0;
}
2. Shuffling indices
Another possibility is to generate the indices of the 1s by doing a prefix shuffle on the list of indices:
void random_fill(int vec[], int N, int n) {
    /* For simplicity, use a temporary vector */
    int* inds = malloc(N * sizeof *inds);
    for (int i = 0; i < N; ++i) inds[i] = i;
    for (int i = 0; i < n; ++i) {
        int j = rand_range(i, N);
        int t = inds[j]; inds[j] = inds[i]; inds[i] = t;
    }
    for (int i = 0; i < N; ++i) vec[i] = 0;
    for (int i = 0; i < n; ++i) vec[inds[i]] = 1;
    free(inds);
}
3. Select from the enumerated sequence
If N choose n is not too big (that is, if you can compute it without integer overflow), one way of generating a random sequence is to choose a random integer less than N choose n and then produce the combination with that ordinal using some enumeration of the possible sequences. (If you are using rand(), be aware that even if N choose n is computable without overflow, it might still be greater than RAND_MAX, in which case rand() will not generate the full range of possible ordinals.)
The above reservoir sampling algorithm can be adapted directly to produce an enumeration:
/* Fills vec with the kth sequence of n 1s and N-n 0s, using
 * an unspecified ordinal sequence; every value of k between 0
 * and (N choose n) - 1 produces a distinct sequence.
 */
void ordinal_fill(int vec[], int N, int n, int k) {
    for (int i = 0; N; ++i, --N) {
        int r = (k * n) % N;
        if (r < n) {
            vec[i] = 1;
            k = (k * n) / N;
            --n;
        } else {
            vec[i] = 0;
            k = (k * (N - n)) / N;
        }
    }
}
(live on ideone)
The function above does not make any assumptions about the ordinal value other than that it is non-negative and fits in an int. In effect, the ordinal is taken modulo N choose n, although that value is never explicitly computed. If you used uint64_t instead of int, and a random number generator which could produce random numbers in a large range, you could generate random sequences by just feeding the function a random number. Of course, this wouldn't guarantee that the sequences were unique.
In essence, the function works by using the ordinal value k as the source of the "random" numbers required by the reservoir sampling algorithm. Every ordinal number (mod N choose n) corresponds to a different sequence (proof left as an exercise). Because the ordinal space is partitioned by modulo rather than by magnitude, the ordinal sequence is probably not particularly useful as a sequence, but it is guaranteed to be complete.
Partitioning by magnitude (using something like the combinatorial number system) might be faster -- it doesn't require division, for example -- but it would require efficient access to binomial coefficients, which the above function does not need. If the binomial coefficients were computed at each step, a division would be needed, removing much of the speed advantage.