Python: randomly assigning titles to documents in a corpus

Question

I have a large document corpus, D, which is basically a Python list of n filtered tweets.

For example, D[0] is "New Exploit to 'Hack Android Phones Remotely' threatens Millions of Devices"

Also, n is of the order 10^4.
Say there's another list, Z, of m = 10 topics for my documents, one of which I wish to randomly assign to each document:

Z = ['hack', 'tools', 'android', 'google', 'anonymous', ... ].

How do I go about creating an n x 2 array, such that the assignment of topics is (as close as possible to) a truly random process?

Edit:

I'm not sure how to code this. Sorry if the explanation is a little vague, but there isn't much information to give. I simply want a way to map from Z to D, randomly (to obtain an n x 2 array not an n x m array, honest mistake).

It would be helpful if you clarify your question with a simple example using small values of n and m. Also, you should post your own attempt at coding this. — PM 2Ring, Mar 18 '16 at 11:09
@PM2Ring I've added as much detail as I could. There's not a lot going on in the code itself. I simply want to map from Z to D, *randomly*. — , Mar 18 '16 at 11:45
I can show you how to build a Python list of _n_ rows. The _i_ th row consists of _m_ tuples. Each tuple pairs the _i_ th tweet with one of the _m_ topics, in random order. Would that help? — PM 2Ring, Mar 18 '16 at 12:05
@PM2Ring yes, that should work. I realised that I don't need an n x m matrix at all. — , Mar 18 '16 at 12:13
Take a look at [random.choice](https://docs.python.org/3/library/random.html#random.choice); numpy may provide something similar, but I don't know numpy. — PM 2Ring, Mar 18 '16 at 12:13
Turns out there's also a [numpy.random.choice](http://docs.scipy.org/doc/numpy-1.10.0/reference/generated/numpy.random.choice.html). Worked, thanks! — , Mar 18 '16 at 12:32

score 0 · Accepted Answer · edited May 23 '17 at 11:58

I think this is what you are after.

>>> import random
>>> D = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> Z = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
>>> [[i, random.choice(Z)] for i in D]  # output varies from run to run
[[1, 'a'], [2, 'd'], [3, 'c'], [4, 'f'], [5, 'b'], [6, 'g'], [7, 'f'], [8, 'f'], [9, 'f']]

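This list comprehension iterates through D (your corpus) and pairs each element with a randomly chosen element of Z (your topics). Each choice is made independently, so the same topic can be assigned to more than one document.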
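As a rough sketch of the same idea on tweet-like data, the snippet below draws all the topics in one call with `random.choices` (available in Python 3.6+) and then zips them with the documents. The first tweet and the five topics are taken from the question; the other two tweets are made-up placeholders, and the names `topics` and `assignments` are only illustrative.

```python
import random

# First tweet is from the question; the other two are hypothetical placeholders.
D = [
    "New Exploit to 'Hack Android Phones Remotely' threatens Millions of Devices",
    "Placeholder tweet number two",
    "Placeholder tweet number three",
]
# Only the five topics actually listed in the question.
Z = ['hack', 'tools', 'android', 'google', 'anonymous']

# Draw one topic per document, uniformly and with replacement.
topics = random.choices(Z, k=len(D))

# Pair each document with its topic to get an n x 2 list of lists.
assignments = [[doc, topic] for doc, topic in zip(D, topics)]

for row in assignments:
    print(row)
```

Because the draws are with replacement, some topics may appear several times and others not at all, which is what independent uniform assignment implies.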

Tuples might be a better choice than lists for the individual pairs, though, since they are more commonly used to represent a fixed collection of different things; see this answer for when to use lists vs. tuples.
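A tuple-based variant could look something like the following. This is only a sketch: the helper `assign_topics` and its `seed` parameter are hypothetical names introduced here, not part of the original answer. Passing a seed makes the assignment repeatable, which can help when debugging; leaving it as `None` lets Python seed the generator itself.

```python
import random

def assign_topics(docs, topics, seed=None):
    """Pair each document with one uniformly chosen topic as a (doc, topic) tuple."""
    # A private generator avoids touching the global random state.
    rng = random.Random(seed)
    return [(doc, rng.choice(topics)) for doc in docs]

D = [1, 2, 3, 4, 5, 6, 7, 8, 9]
Z = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

pairs = assign_topics(D, Z, seed=42)
print(pairs)
```

Each pair is immutable, so a document cannot accidentally have its topic changed in place later on.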

answered Mar 18 '16 at 12:15 · SiHa

I might, in fact, need tuples after all. Thank you. – Mar 18 '16 at 12:27
