
I'm using numpy.memmap to compute a large covariance matrix.

The question is: why is the multiplication in the second case (when I store A.T on disk) slower?

import numpy as np
import time

def matrix_append_row(fm, tp, mat):
    # check that the number of columns matches, then reopen the memmap with extra rows
    assert fm.shape[1] == mat.shape[1]
    rows = fm.shape[0] + mat.shape[0]
    new_fm = np.memmap(fm.filename, mode='r+', dtype=tp, shape=(rows, fm.shape[1]))
    new_fm[fm.shape[0]:rows, :] = mat[:]
    return new_fm
def generate_and_store_data(cols,batch,iter,tp):
    #create memmap file and append generated data in cycle

    #can't create empty array - need to store first entry by copy not by append
    fm= np.memmap('A.npy', dtype=tp, mode='w+', shape=(batch,cols))

    for i in xrange(iter):
        data= np.random.rand(batch,cols)*100  # uniform values in [0, 100)
        # print i
        # t0= time.time()
        if i==0:
            fm[:]= data[:]
        else:   
            fm= matrix_append_row(fm,tp,data)
        # print (time.time()-t0)

        # fm.flush()
        # print fm.shape

    return fm
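Since the final size is known up front (batch * iter rows), an alternative worth trying is to preallocate the whole memmap once and fill it slice by slice, which avoids the repeated reopen in matrix_append_row. A minimal sketch (the function name and file path are mine, not from the original code):

```python
import numpy as np

def generate_preallocated(cols, batch, iters, tp, path='A_prealloc.npy'):
    # Allocate the full file once instead of growing it every iteration.
    fm = np.memmap(path, dtype=tp, mode='w+', shape=(batch * iters, cols))
    for i in range(iters):
        fm[i * batch:(i + 1) * batch, :] = np.random.rand(batch, cols) * 100
    fm.flush()
    return fm
```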


A= generate_and_store_data(1000,5000,10,'float32')
#cov= A.T*A

# A memmaped file
print A.shape
M= A.shape[0]
N= A.shape[1]
#create cov mat
cov= np.memmap('cov.npy', dtype='float32', mode='w+', shape=(N,N))

# A.T doesn't copy data, it just returns a view?
t0= time.time()
np.dot(A.T,A,out= cov)
print (time.time()-t0)

print A.T.shape

cov_= np.memmap('cov_.npy', dtype='float32', mode='w+', shape=(N,N))
At= np.memmap('At.npy', dtype='float32', mode='w+', shape=(N,M))

t0= time.time()
At= A.T # only rebinds the name to the transposed view; the At memmap reference is lost
print (time.time()-t0)

t0= time.time()
At[:]= A.T # copy same as At[:]= (A.T)[:] ?
print (time.time()-t0)

t0= time.time()
At[:]= (A.T)[:]
print (time.time()-t0)

t0= time.time()
np.dot(At,A,out= cov_)
print (time.time()-t0)

The output is:

(50000, 1000)
2.84399986267
(1000, 50000)
0.0
2.17100000381
2.07899999619
2.73399996758
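To answer the inline question: yes, `A.T` is only a view and no data is copied. A quick way to verify this on a small array (`np.shares_memory` needs NumPy >= 1.11):

```python
import numpy as np

A = np.arange(12, dtype='float32').reshape(3, 4)
At = A.T  # transpose: a view over the same buffer, nothing is copied

print(np.shares_memory(A, At))    # True
print(A.flags['C_CONTIGUOUS'])    # True: rows are contiguous
print(At.flags['F_CONTIGUOUS'])   # True: the view is Fortran-ordered
```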

Update:

I also tried blockwise matrix multiplication, but it is very slow (maybe I need to tune block_size).

cov2= np.memmap('cov2.npy', dtype='float32', mode='w+', shape=(N,N))
block_size=100
t0= time.time()
for i_start in range(0, At.shape[0], block_size):
    for j_start in range(0, A.shape[1], block_size):
        for k_start in range(0, At.shape[1], block_size):
            cov2[i_start:i_start + block_size, j_start:j_start + block_size] += \
                np.dot(At[i_start:i_start + block_size, k_start:k_start + block_size],
                       A[k_start:k_start + block_size, j_start:j_start + block_size])
print (time.time()-t0)
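One reason the triple loop is slow is that every `+=` reads and writes the cov2 memmap. A sketch of a streaming alternative that reads each row block of A from disk once and keeps the (N, N) accumulator in RAM (the function name and default block size are my choices, not from the original post):

```python
import numpy as np

def blocked_gram(A, block_rows=5000):
    # Compute A.T.dot(A) by streaming row blocks of A;
    # the (N, N) result stays in memory, so the memmap is only read.
    N = A.shape[1]
    out = np.zeros((N, N), dtype=A.dtype)
    for start in range(0, A.shape[0], block_rows):
        block = np.asarray(A[start:start + block_rows])  # pull one block into RAM
        out += block.T.dot(block)
    return out
```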
mrgloom
  • Probably because of disk access. When I try to run your sample, I have "Name A is not defined". Can you tell us more about the data ? – Louis May 28 '14 at 07:55
  • @Louis A stored on disk as memmaped file 50000 x 1000 float32. See generate_and_store_data func. – mrgloom May 28 '14 at 08:57
  • I don't find your question very clear. Are you wondering about time differences of copying or of the dot product? Maybe you could indicate which times you're wondering about exactly. Additionally, you really need to repeat the timing a few times to be sure of it. – Midnighter May 28 '14 at 09:31
  • @Midnighter I compare dot product time due to something like http://stackoverflow.com/questions/17954990/performance-of-row-vs-column-operations-in-numpy but it seems using A.T benefits from some OS caching. – mrgloom May 28 '14 at 10:13
  • I understand why you do it but from the times that you posted, the first one is the slowest... so it's not very clear what you're asking for. – Midnighter May 28 '14 at 11:01
  • @Midnighter but comparing total time (2.84) < (2.07+2.73) so it's not useful to store transposed matrix on disk if it will not be reused. – mrgloom May 28 '14 at 11:08

1 Answer


So this is what I get for a smaller array even when the memory layout change is part of the timing:

arr = numpy.random.rand(500, 100)
arr *= 100

%timeit numpy.dot(arr.T, arr)
100 loops, best of 3: 7.52 ms per loop

%timeit numpy.dot(numpy.asfortranarray(arr.T), arr)
100 loops, best of 3: 7.53 ms per loop

%timeit numpy.dot(arr.T, numpy.asfortranarray(arr))
100 loops, best of 3: 5.26 ms per loop

I'm not sure why the size you posted has such outrageously different run times but here are the results:

arr = numpy.random.rand(50000, 1000)
arr *= 100
cov = numpy.zeros((1000, 1000))

%timeit numpy.dot(arr.T, arr, out=cov)
1 loops, best of 3: 12min 45s per loop

%timeit numpy.dot(numpy.asfortranarray(arr.T), arr, out=cov)
1 loops, best of 3: 12min 42s per loop

%timeit numpy.dot(arr.T, numpy.asfortranarray(arr), out=cov)
1 loops, best of 3: 7min 19s per loop

Of course, this also depends a lot on which BLAS implementation numpy was compiled against, but changing the memory layout seems well worth it.
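One way to see why the first two timings match: when `arr` is C-contiguous, its transpose is already Fortran-ordered, so `asfortranarray(arr.T)` makes no copy, whereas `asfortranarray(arr)` actually has to rearrange the data:

```python
import numpy as np

arr = np.random.rand(500, 100)

# The transpose of a C-contiguous array is already F-contiguous...
print(arr.T.flags['F_CONTIGUOUS'])                       # True
# ...so asfortranarray returns it without copying,
print(np.shares_memory(np.asfortranarray(arr.T), arr))   # True
# while asfortranarray(arr) must copy into a new F-ordered buffer.
print(np.shares_memory(np.asfortranarray(arr), arr))     # False
```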

Midnighter