0

I'm trying to efficiently select a random non-zero column index for each row of a large sparse SciPy matrix. I can't seem to figure out a vectorized way of doing it, so I'm resorting to a very slow Python loop:

random_columns = np.zeros((sparse_matrix.shape[0]))
for i,row in enumerate(sparse_matrix):
    random_columns[i] = (np.random.choice(row.nonzero()[1]))

My matrix is an approximately (4000000, 800) csr_matrix with almost every row having only one non-zero value, so the Python loop is killing performance. There has to be a better way!

EDIT I can make it about 2x faster by directly accessing the underlying data of the csr_matrix:

random_columns[i] = row.indices[np.random.choice(len(row.data))]
wxs
  • 5,617
  • 5
  • 36
  • 51
  • your code throw `i,np.random.choice(row.nonzero()[1]: tuple index out of range` – farhawa May 22 '15 at 00:14
  • Um well sparse_matrix needs to be 2 dimensional. This code works for me… – wxs May 22 '15 at 00:22
  • yeah but you are selecting the first dimension to define `np.zeros((sparse_matrix.shape[0]))` and that will give a 1d array? – farhawa May 22 '15 at 00:23
  • `row` is a row from `sparse_matrix` not from `random_columns`. – wxs May 22 '15 at 00:27
  • what is the shape of `sparse_matrix `? – farhawa May 22 '15 at 00:31
  • It sounds to me like you could use non_zero (which will give you a 2 by X array of the indexes of non zero value), and then simply eliminate the elements that have the same value for the row leaving one from each row. – David May 22 '15 at 00:32

1 Answers1

2

Have you looked at the underlying data representation for this, and other sparse formats?

For example, for small matrix

In [257]: M = sparse.rand(10,10,.1,format='csr')

In [258]: M
Out[258]: 
<10x10 sparse matrix of type '<class 'numpy.float64'>'
    with 10 stored elements in Compressed Sparse Row format>

In [259]: M.data
Out[259]: 
array([ 0.86390256,  0.85244302,  0.88549326,  0.78737361,  0.99918561,
        0.89862529,  0.86842524,  0.25714778,  0.4174032 ,  0.33137501])

In [260]: M.indices
Out[260]: array([1, 5, 8, 8, 9, 0, 3, 9, 4, 5], dtype=int32)

In [261]: M.indptr
Out[261]: array([ 0,  1,  1,  3,  4,  4,  5,  5,  7,  8, 10], dtype=int32)

For csr the indices are a bit obscure. Or rather, the column index for each nonzero value is present in M.indices, but it takes a bit of calculation to determine which ones belong to which row.

For other formats the connection is more obvious.

For lil we have 2 lists of lists

In [262]: Ml=M.tolil()

In [263]: Ml.data
Out[263]: 
array([[0.863902562935336], [], [0.8524430195076207, 0.8854932609233054],
       [0.7873736126927198], [], [0.9991856090158101], [],
       [0.8986252926235274, 0.8684252408594123], [0.2571477751356357],
       [0.4174032029993796, 0.3313750148434619]], dtype=object)

In [264]: Ml.rows
Out[264]: array([[1], [], [5, 8], [8], [], [9], [], [0, 3], [9], [4, 5]], dtype=object)

In [265]: [np.random.choice(x) for x in Ml.rows if x]
# some rows might not have any nonzero
Out[265]: [1, 5, 8, 9, 3, 9, 5]

In [268]: [np.random.choice(x.nonzero()[1]) for x in M if len(x.nonzero()[1])]
Out[268]: [1, 5, 8, 9, 0, 9, 4]

You can also take nonzero for the whole matrix

 In [274]: M.nonzero()
 Out[274]: 
 (array([0, 2, 2, 3, 5, 7, 7, 8, 9, 9], dtype=int32),
 array([1, 5, 8, 8, 9, 0, 3, 9, 4, 5], dtype=int32))

These are the same arrays that you would get with M.tocoo() and looking at the row and col attributes. In theory you could use groupby to get sublists of columns, and pick from those. But again you have lists or generators and iteration.

I don't know if anyone of these representations is faster or not.

There may be some limits to vectorizing the problem. The number of nonzeros (the inputs to choices) will differ by row. Some rows have non, others 1 or more. Anytime you encounter arrays or lists of different lengths, it is difficult to vectorize the operation. If you can't arrange the values into a regular 2d array you can't operate on them as a whole with array operations.


The lil format is worth looking at:

In [276]: timeit [np.random.choice(x.nonzero()[1]) for x in M if len(x.nonzero()[1])]
100 loops, best of 3: 4.24 ms per loop

In [289]: timeit [np.random.choice(row.indices) for row in M if len(row.indices)]
1000 loops, best of 3: 1.52 ms per loop
# 3x speedup using row.indices

In [277]: %%timeit
   .....: Ml=M.tolil()
   .....: [np.random.choice(x) for x in Ml.rows if x]
   .....: 
10000 loops, best of 3: 181 µs per loop
hpaulj
  • 221,503
  • 14
  • 230
  • 353
  • Thank you for the detailed answer! The `lil` format seems to do it for me, it has cut things down by a factor of 10. – wxs May 22 '15 at 15:22