2

I have a list:

a = [1,2,1,1,3,5,6,2]

I want to select, say 3 elements at random from this list, but they must all be different.

I need to preserve the 'weight' of each element, so sampling from set(a) is not possible.
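For concreteness, here is a quick illustrative check of the multiplicities that need to be preserved (using `collections.Counter`; this snippet is just to demonstrate the point):

```python
from collections import Counter

a = [1, 2, 1, 1, 3, 5, 6, 2]

# the weights that need to be preserved: 1 occurs three times, 2 twice, 3/5/6 once each
print(Counter(a))

# sampling from set(a) would make every value equally likely, losing the weights
print(set(a))
```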

So far, my solution is:

import random

condition = False
while condition == False:
    mysample = random.sample(a, 3)
    if len(set(mysample)) != len(mysample):
        condition = False
    else:
        condition = True

But this forces me to re-sample as many times as it takes for the elements to all be different. This works fine for small samples, but for large samples my code becomes very inefficient...

amphetamachine
  • It can return duplicates if you give it a list with duplicates. The OP wants a list with no duplicates but that is still weighted such that more common elements are more likely to appear than less common ones. – John Kugelman Mar 03 '15 at 21:51
  • @aruisdante `random.sample(a, 3)` yielded `[1, 1, 6]` on my machine. It returned two of the three `1`s in the list. – John Kugelman Mar 03 '15 at 21:52
  • 1
    Sample one element, then remove all occurrences of that value from the original list. Repeat until you have the required number of elements. Have a look here: http://stackoverflow.com/questions/1157106/remove-all-occurences-of-a-value-from-a-python-list – Calum Mar 03 '15 at 21:53
  • @JohnKugelman Ah wait, they do say later that it's uniqueness based on index, not value. They don't highlight that well. My bad. – aruisdante Mar 03 '15 at 21:53

4 Answers

3

You can shuffle and take the first three non-repeated elements:

import random

your_list = [1, 2, 1, 1, 3, 5, 6, 2]  # the OP's list

random.shuffle(your_list)
three_elements = set()
for v in your_list:
    if len(three_elements) == 3:
        break
    three_elements.add(v)
John Kugelman
Luís Bianchin
1
from random import choice

a = [1, 2, 1, 1, 3, 5, 6, 2]  # the OP's list

l = []
seen = set()
while len(l) < 3:
    ch = choice(a)
    if ch not in seen:
        l.append(ch)
        seen.add(ch)
print(l)

Depending on the ratio of distinct values to total elements, different approaches will have different advantages:

In [6]: import random

In [7]: a = [choice(range(10000)) for _ in range(100000)]

In [8]: %%timeit
random.shuffle(a)
three_elements = set()
for v in a:
    if len(three_elements) == 5000:
        break
    if not v in three_elements:
        three_elements.add(v)
   ...: 
10 loops, best of 3: 36.5 ms per loop

In [9]: %%timeit                          
l = []
seen = set()
while len(l) < 5000:
    ch = choice(a)
    if ch not in seen:
        l.append(ch)
        seen.add(ch)
   ...: 
100 loops, best of 3: 5.16 ms per loop

Running your code, I had to kill the process after 10 minutes as it was still going, so whatever you choose will be a major improvement.

Using shuffle would be more efficient if you had a greater ratio of repeats to distinct items in the list and you wanted a very large sample size; otherwise the cost of shuffling makes it less efficient than simply using a set and choice.
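To check where the crossover lies for your own data, the two approaches can be timed side by side (the function names below are mine, not from the answers):

```python
import random
import timeit

def sample_by_shuffle(a, k):
    # shuffle a copy, then take the first k distinct values
    b = list(a)
    random.shuffle(b)
    picked = set()
    for v in b:
        if len(picked) == k:
            break
        picked.add(v)
    return picked

def sample_by_choice(a, k):
    # draw with choice(), skipping values already seen
    l, seen = [], set()
    while len(l) < k:
        ch = random.choice(a)
        if ch not in seen:
            l.append(ch)
            seen.add(ch)
    return l

a = [random.choice(range(10000)) for _ in range(100000)]
for fn in (sample_by_shuffle, sample_by_choice):
    t = timeit.timeit(lambda: fn(a, 5000), number=3)
    print(fn.__name__, t)
```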

Padraic Cunningham
0
import random

mysample = []
count = 0
while count < sampleSize:  # where sampleSize is the number of values you want
    s = random.sample(a, 1)[0]
    # keep only the other values so s cannot be drawn again
    a = [x for x in a if x != s]
    mysample.append(s)
    count += 1
Calum
0

This is probably more complicated than necessary, but here is an implementation that uses a modified version of reservoir sampling.

import itertools
import random

def element_at(iterable, index, default=None):
    return next(itertools.islice(iterable, index, None), default)

def sample_unique(iterable, size):
    S = set()
    for index, item in enumerate(iterable):
        if len(S) < size:
            S.add(item)
        else:
            r = random.randint(0, index)
            if r < size:
                other = element_at(S, r)
                if item not in S:
                    S.remove(other)
                    S.add(item)
    return S
Timothy Shields