
I would like to create, in TensorFlow, a function that for every row of given data X applies the softmax function only to some sampled classes, let's say 2 out of K total classes, and returns a matrix S, where S.shape = (N, K) (N: the number of rows of the given data, K: the total number of classes).

The matrix S would then contain zeros everywhere except at the indices of the sampled classes of each row, which hold the non-zero softmax values.

In plain Python I use advanced indexing, but in TensorFlow I cannot figure out how to do it. My initial question was this, where I present the numpy code.
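For reference, here is a minimal numpy sketch of the kind of thing I mean (the sizes N, K, D and the weight matrix W here are just small illustrative values, not the real ones):

```python
import numpy as np

# Hypothetical small sizes, for illustration only.
N, K, D, num_samps = 4, 6, 3, 2
rng = np.random.default_rng(0)

X = rng.standard_normal((N, D))   # data
W = rng.standard_normal((K, D))   # class weights

S = np.zeros((N, K))
for line in range(N):
    # sample num_samps distinct classes for this row
    sampled_ind = rng.choice(K, size=num_samps, replace=False)
    logits = X[line] @ W[sampled_ind].T             # shape (num_samps,)
    probs = np.exp(logits) / np.exp(logits).sum()   # softmax over the sample
    S[line, sampled_ind] = probs                    # advanced indexing
```

Each row of S sums to 1 over its two sampled columns and is zero elsewhere.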

So I tried to find a solution in TensorFlow, and the main idea was to treat S not as a 2-d matrix but as a 1-d array. The code looks like this:

import tensorflow as tf

# N, K, D and the data matrix X are assumed to be defined already
num_samps = 2
S = tf.Variable(tf.zeros(shape=(N * K,)))    # S flattened to 1-d
W = tf.Variable(tf.random_uniform((K, D)))
tfx = tf.placeholder(tf.float32, shape=(None, D))
# note: maxval is exclusive, so use K to make every class sampleable
sampled_ind = tf.random_uniform(dtype=tf.int32, minval=0, maxval=K, shape=[num_samps])
ar_to_sof = tf.matmul(tfx, tf.gather(W, sampled_ind), transpose_b=True)
updates = tf.reshape(tf.nn.softmax(ar_to_sof), shape=(num_samps,))
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
for line in range(N):
    inds_new = sampled_ind + line * K        # offsets into the flat S
    sess.run(tf.scatter_update(S, inds_new, updates), feed_dict={tfx: X[line:line + 1]})

S = tf.reshape(S, shape=(N, K))

This works, and the result is as expected. But it runs extremely slowly. Why is that happening? How could I make it faster?

Cfis Yoi
2 Answers


When programming in TensorFlow, it is crucial to understand the distinction between defining operations and executing them. Most functions starting with tf., when called in Python, add operations to the computation graph.

For example, when you do:

tf.scatter_update(S, inds_new, updates)

as well as:

inds_new = sampled_ind + line * K

multiple times inside the loop, your computation graph grows beyond what is necessary, filling all the memory and slowing things down enormously.

What you should do instead is define the computation once, before the loop:

init = tf.initialize_all_variables()
line_ph = tf.placeholder(tf.int32, shape=[])  # the current row index is fed in
inds_new = sampled_ind + line_ph * K
update_op = tf.scatter_update(S, inds_new, updates)
sess = tf.Session()
sess.run(init)
for line in range(N):
    sess.run(update_op, feed_dict={tfx: X[line:line + 1], line_ph: line})

This way your computation graph contains only one copy of inds_new and update_op. Note that when you execute update_op, inds_new will be implicitly executed too, as it is its parent in the computation graph.
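As a toy analogy in plain Python (not TensorFlow code), evaluating a node in a dependency graph first evaluates its parents, which is why running update_op implicitly runs inds_new:

```python
# Toy dependency graph: running a node runs its parents first,
# mirroring how sess.run(update_op) implicitly evaluates inds_new.
class Node:
    def __init__(self, name, parents=(), fn=lambda *a: None):
        self.name, self.parents, self.fn = name, parents, fn

    def run(self, log):
        vals = [p.run(log) for p in self.parents]  # parents evaluated first
        log.append(self.name)
        return self.fn(*vals)

# hypothetical values standing in for the tensors in the answer
sampled = Node("sampled_ind", fn=lambda: [1, 3])
inds_new = Node("inds_new", (sampled,), fn=lambda s: [i + 5 for i in s])
update = Node("update_op", (inds_new,), fn=lambda i: i)

order = []
update.run(order)
# order is ["sampled_ind", "inds_new", "update_op"]
```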

You should also know that update_op will probably give different results each time it is run, because sampled_ind is re-sampled on every execution; this is fine and expected.

By the way, a good way to debug this kind of problem is to visualize the computation graph with TensorBoard. In your code you add:

summary_writer = tf.train.SummaryWriter('some_logdir', sess.graph_def)

and then run in console:

tensorboard --logdir=some_logdir

On the served HTML page there will be a picture of the computation graph, where you can examine your tensors.

sygi
  • Thanks a lot, this is working and answers exactly my question! But the problem still remains: the numpy code to create the matrix S is still faster than this, and I use only TensorFlow functions... Do you know why that happens? Should I create a new op in C++ to get the speed-up? – Cfis Yoi Nov 16 '16 at 15:22
  • By faster you mean 20% faster, or 20x faster? TensorFlow being 20% slower on a CPU is expected behavior. Do you have a good, CUDA-enabled GPU (and a TensorFlow installation that uses it)? TensorFlow is meant for situations where you use a GPU/a cluster. – sygi Nov 16 '16 at 15:30

Keep in mind that tf.scatter_update will return the tensor S, which means a large memory copy on every session run, or even a network copy in a distributed environment. The solution, building on @sygi's answer, is:

update_op = tf.scatter_update(S, inds_new, updates)
update_op_op = update_op.op

Then in the session run, you do this:

sess.run(update_op_op)

This avoids copying the large tensor S back to the client.

samuel