17

Currently I'm using something like this:

    images = Image.all()
    count = images.count()
    random_numb = random.randrange(1, count)
    image = Image.get_by_id(random_numb)

But it turns out that IDs in the App Engine datastore don't start from 1. I have two images in the datastore, and their IDs are 6001 and 7001.

Is there a better way to retrieve random images?

user216171

4 Answers

18

The datastore is distributed, so IDs are non-sequential: two datastore nodes need to be able to generate an ID at the same time without causing a conflict.

To get a random entity, you can attach a random float between 0 and 1 to each entity on create. Then to query, do something like this:

    import random

    rand_num = random.random()
    entity = MyModel.all().order('rand_num').filter('rand_num >=', rand_num).get()
    if entity is None:
        # Nothing at or above the random point: wrap around to the first entity
        entity = MyModel.all().order('rand_num').get()

Edit: Updated fall-through case per Nick's suggestion.
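The query above assumes every entity already carries a `rand_num` property. Here is a minimal sketch of both halves (assigning the number on create, then picking with wrap-around), using a plain sorted list in place of the datastore index so it runs anywhere. `FakeIndex`, `add`, and `pick` are hypothetical names for illustration, not App Engine APIs.

```python
import bisect
import random


class FakeIndex(object):
    """In-memory stand-in for a datastore index on rand_num (illustration only)."""

    def __init__(self):
        self._rows = []  # sorted list of (rand_num, name) tuples

    def add(self, name, rand_num=None):
        # On create, attach a random float in [0, 1) to the entity.
        if rand_num is None:
            rand_num = random.random()
        bisect.insort(self._rows, (rand_num, name))

    def pick(self, r=None):
        # Rough equivalent of order('rand_num').filter('rand_num >=', r).get()
        if not self._rows:
            return None
        if r is None:
            r = random.random()
        i = bisect.bisect_left(self._rows, (r,))
        if i == len(self._rows):
            # Nothing at or above r: wrap around to the first entity.
            i = 0
        return self._rows[i][1]


index = FakeIndex()
index.add("a", 0.2)
index.add("b", 0.7)
print(index.pick(0.5))  # the first rand_num >= 0.5 belongs to "b"
print(index.pick(0.9))  # nothing >= 0.9, so the pick wraps around to "a"
```

Calling `index.pick()` with no argument draws a fresh random point each time, which is the normal usage.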

Drew Sears
  • 4
    In the `entity is None` case, you should simply fetch the first entity, ordered by `rand_num`, thus treating the entities like a circular buffer. The way you're currently doing it makes the last entity very slightly more likely to be picked than all the others. – Nick Johnson May 15 '11 at 22:51
  • How would this query be indexed if at all? I had to implement a solution for this and did not choose this method for fear of lack of efficiency of this type of query. Not sure if my fears are founded (see solution below). – Will Curran May 23 '11 at 13:40
  • Every property automatically includes an ascending and descending index, unless you explicitly disable it. The code above should be efficient at scale. I've updated it to reflect Nick's revision. – Drew Sears May 23 '11 at 14:11
  • 7
    This solution only works well for datastores with many entities and where there is a lot of churn. If there are just a few entities that don't often change, then some entities will tend to be chosen far more than others. In the extreme case, if a datastore had only two entities with rand_num .01 and .99 respectively, then with the ">=" query above the second will be chosen almost every time. To fix this, (1) change rand_num regularly (perhaps on every query) and (2) randomly use the ">=" and "<" operators. – speedplane Jul 09 '12 at 05:32
  • 1
    @DrewSears I have already created 10000 entities, so I cannot attach a new random number to each entity, or at least I don't know of a way to do it. In that case, how can I get a random entity? Thanks – John Mar 24 '13 at 08:25
  • @speedplane if you randomize the operators, the trick of checking for `None` and taking the first entity to create a circular buffer will not work, because in the "<" case you would have to take the last entity instead of the first. Alternatively, one can randomize the order between "rand_num" and "-rand_num" instead. – Allan Veloso Dec 30 '18 at 22:21
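The skew discussed in the comments above can be put in numbers. With the `>=` query plus wrap-around, an entity is returned whenever the random draw lands in the gap just below its `rand_num`, so its selection probability equals the size of that gap. A small sketch, with `pick_probabilities` as a made-up helper name:

```python
def pick_probabilities(rand_nums):
    """Chance that each stored rand_num is returned by the '>=' query with wrap-around."""
    ordered = sorted(rand_nums)
    probs = {}
    for i, value in enumerate(ordered):
        if i == 0:
            # The lowest entity catches draws at or below its own value,
            # plus every draw above the highest value (the wrap-around case).
            gap = value + (1.0 - ordered[-1])
        else:
            gap = value - ordered[i - 1]
        probs[value] = gap
    return probs


probs = pick_probabilities([0.01, 0.99])
print(probs)
```

For the two-entity case with `rand_num` values of .01 and .99, the gaps come out at about 0.02 and 0.98, which is why the choice is so lopsided when there are few entities. With many entities the gaps are all small and the effect fades.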
10

Another solution, if you don't want to add an additional property: keep a set of keys in memory.

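    import random

    from google.appengine.ext import db

    # Get all the keys, not the entities
    q = ItemUser.all(keys_only=True).filter('is_active =', True)
    item_keys = q.fetch(2000)

    # Get a random set of those keys, in this case up to 20
    # (random.sample raises ValueError if asked for more keys than exist)
    random_keys = random.sample(item_keys, min(20, len(item_keys)))

    # Get those entities with a single batch get
    items = db.get(random_keys)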

The code above illustrates the basic method: fetch only the keys, then build a random subset to use in a batch get. You could keep that set of keys in memory (for example in memcache), add to it as you create new ItemUser entities, and have a method that returns n random entities. You'll have to implement some overhead to manage the cached keys. I prefer this solution if you query for random elements often (I assume a batch get for n entities is more efficient than a query for n entities).
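One possible shape for that key-management layer is sketched below. `KeyPool` and its methods are hypothetical names, the keys are plain strings standing in for datastore keys, and the pool lives in a local list rather than memcache, so treat it as an outline rather than a drop-in implementation.

```python
import random


class KeyPool(object):
    """Hypothetical in-memory pool of entity keys (not an App Engine API)."""

    def __init__(self, keys=()):
        self._keys = list(keys)

    def add(self, key):
        # Call this whenever a new entity is created.
        self._keys.append(key)

    def sample(self, n):
        # Never ask for more keys than the pool holds;
        # random.sample would raise ValueError otherwise.
        n = min(n, len(self._keys))
        return random.sample(self._keys, n)


pool = KeyPool(["key-%d" % i for i in range(100)])
pool.add("key-100")
random_keys = pool.sample(20)
# On App Engine these keys would then be passed to a batch get such as db.get().
print(len(random_keys))
```

A real version would also need to drop keys when entities are deleted and to rebuild the pool when the cache is evicted.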

Will Curran
6

I think Drew Sears's answer above (attach a random float to each entity on create) has a potential problem: not every item has an equal chance of getting picked. For example, if there are only 2 entities, and one gets a rand_num of 0.2499 and the other gets 0.25, then with the query as written the 0.2499 one will get picked almost all the time, because any random number outside the tiny gap between the two values selects it. This might or might not matter to your application. You could fix this by changing the rand_num of an entity every time it is selected, but that means each read also requires a write.

And pix's answer will always select the first key.

Here's the best general-purpose solution I could come up with:

    num_images = Image.all().count()
    image = None  # stays None if there are no images
    if num_images:
        offset = random.randrange(num_images)
        image = Image.all().fetch(1, offset)[0]

No additional properties are needed, but the downside is that count() and fetch() with an offset both get slower as the number of images grows, because the datastore has to walk over the counted or skipped results.
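To see why picking by offset sidesteps the non-sequential ID problem from the question, here is a rough stand-in that uses a dictionary keyed by made-up IDs instead of the datastore. The `images` data and the `random_image` helper are assumptions for illustration only.

```python
import random

# Stand-in for the datastore: IDs are non-sequential, as in the question.
images = {6001: "first.png", 7001: "second.png", 12001: "third.png"}


def random_image(store):
    """Pick by position (offset) rather than by ID, mirroring fetch(1, offset)."""
    count = len(store)
    if count == 0:
        return None
    offset = random.randrange(count)
    # Sorting the IDs gives a stable order to apply the offset to.
    image_id = sorted(store)[offset]
    return store[image_id]


print(random_image(images))
```

The offset only depends on how many entities there are, so gaps between IDs never produce a miss.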

Dan Tasse
  • To solve the count() performance problem, I use sharding counters (https://developers.google.com/appengine/articles/sharding_counters) – John Mar 24 '13 at 08:31
1

Another (less efficient) method, which requires no setup:

    import random

    query = MyModel.all(keys_only=True)

    # query.filter("...")

    selected_key = None
    n = 0
    for key in query:
        # Replace the current choice with probability 1/(n+1)
        if random.randint(0, n) == 0:
            selected_key = key
        n += 1

    # just in case the query is empty
    if selected_key is None:
        entry = None
    else:
        entry = MyModel.get(selected_key)
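The loop above is a form of reservoir sampling with a reservoir of one: the item at position n replaces the current choice with probability 1/(n+1), which works out to every key being equally likely by the end. A standalone sketch over a plain iterable (`pick_one` is a hypothetical helper, not part of any App Engine API) that checks this empirically:

```python
import collections
import random


def pick_one(iterable):
    """Return one uniformly random item from an iterable of unknown length."""
    selected = None
    for n, item in enumerate(iterable):
        # Item n (0-based) replaces the current choice with probability 1/(n+1).
        if random.randint(0, n) == 0:
            selected = item
    return selected


random.seed(0)  # fixed seed so repeated runs give the same tallies
tallies = collections.Counter(pick_one("abcd") for _ in range(40000))
print(sorted(tallies.items()))
```

Each of the four letters should end up with roughly a quarter of the picks. The price is that every pick iterates over the whole result set, which is why this method is the less efficient one.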
pix