TensorFlow: loss becomes 'NaN'

Question

I am training on CIFAR-10 on CPU with TensorFlow. At first the loss looked fine, but after step 10210 it starts to fluctuate and ends up becoming NaN.

My network is the CIFAR-10 CNN model from the TensorFlow website. Here are my settings:

image_size = 32
num_channels = 3
num_classes = 10
num_batches_to_run = 50000
batch_size = 128
eval_batch_size = 64
initial_learning_rate = 0.1
learning_rate_decay_factor = 0.1
num_epochs_per_decay = 350.0
moving_average_decay = 0.9999

The training output is shown below.

2017-05-12 21:53:05.125242: step 10210, loss = 4.99 (124.9 examples/sec; 1.025 sec/batch)
2017-05-12 21:53:13.960001: step 10220, loss = 7.55 (139.5 examples/sec; 0.918 sec/batch)
2017-05-12 21:53:23.491228: step 10230, loss = 6.63 (149.5 examples/sec; 0.856 sec/batch)
2017-05-12 21:53:33.355805: step 10240, loss = 8.08 (113.3 examples/sec; 1.129 sec/batch)
2017-05-12 21:53:43.007007: step 10250, loss = 7.18 (126.7 examples/sec; 1.010 sec/batch)
2017-05-12 21:53:52.650118: step 10260, loss = 16.61 (138.0 examples/sec; 0.928 sec/batch)
2017-05-12 21:54:02.537279: step 10270, loss = 9.60 (137.6 examples/sec; 0.930 sec/batch)
2017-05-12 21:54:12.390117: step 10280, loss = 46526.25 (145.5 examples/sec; 0.880 sec/batch)
2017-05-12 21:54:22.060741: step 10290, loss = 133479743509972411931057146822656.00 (130.4 examples/sec; 0.982 sec/batch)
2017-05-12 21:54:31.691058: step 10300, loss = nan (115.8 examples/sec; 1.105 sec/batch)

Any idea what is causing the NaN loss?

Could you decrease your learning rate to 0.01 or 0.001 and see how that goes? — Harsha Pokkalla, May 13 '17 at 02:50
This question is answered pretty robustly here: https://stackoverflow.com/questions/40050397/tensorflow-nan-loss-reasons/45339569#45339569 — Free Url, Aug 15 '17 at 19:37

score 7 · Answer 1 · answered May 13 '17 at 04:21

This happens a lot in practice when the learning rate is too high. I tend to start at 0.001 and adjust from there; 0.1 is on the very high side for most datasets, especially if you aren't dividing your loss by the batch size.
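To illustrate why a large learning rate can end in NaN, here is a minimal sketch on a toy problem, not the CIFAR-10 model. The function name `gradient_descent` and the `curvature` value are assumptions chosen for illustration. On f(w) = 0.5 * curvature * w**2, each update multiplies w by (1 - learning_rate * curvature), so the iterates grow without bound once that factor exceeds 1 in magnitude.

```python
import math


def gradient_descent(learning_rate, curvature=30.0, w0=1.0, steps=2000):
    """Minimize f(w) = 0.5 * curvature * w**2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = curvature * w            # derivative of f at w
        w = w - learning_rate * grad    # w is scaled by (1 - learning_rate * curvature)
    return 0.5 * curvature * w * w      # final loss


# |1 - 0.1 * 30| = 2 > 1: w doubles in magnitude every step, overflows
# to inf, and the next update computes inf - inf, which is nan.
loss_high = gradient_descent(learning_rate=0.1)

# |1 - 0.001 * 30| = 0.97 < 1: w shrinks toward the minimum at 0.
loss_low = gradient_descent(learning_rate=0.001)

print(math.isnan(loss_high), loss_low)
```

The same reasoning covers the batch-size remark: a loss summed over a batch of 128 has gradients roughly 128 times larger than the averaged loss, which acts like a learning rate 128 times higher.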

answered May 13 '17 at 04:21

Simba


score 1 · Answer 2 · answered May 13 '17 at 02:22

You can clip the gradients. If you are using Keras with the TensorFlow backend, you can do so as follows.

The parameters clipnorm and clipvalue can be used with all optimizers to control gradient clipping:

 from keras import optimizers

 # All parameter gradients will be clipped to
 # a maximum norm of 1.
 # (Older Keras releases spell the first argument `lr`.)
 sgd = optimizers.SGD(learning_rate=0.01, clipnorm=1.)

or

 from keras import optimizers
 # All parameter gradients will be clipped to
 # a maximum value of 0.5 and
 # a minimum value of -0.5.
 # (Older Keras releases spell the first argument `lr`.)
 sgd = optimizers.SGD(learning_rate=0.01, clipvalue=0.5)
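For readers who want to see what the two options do numerically, here is a rough pure-Python sketch. The helpers `clip_by_norm` and `clip_by_value` are hypothetical names written for this illustration, not Keras functions, and they operate on a single flat list of gradient values rather than on real tensors.

```python
import math


def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm (direction is kept)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def clip_by_value(grads, limit):
    """Clamp every gradient entry independently into [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grads]


grads = [3.0, -4.0]                       # L2 norm is 5.0
by_norm = clip_by_norm(grads, 1.0)        # scaled by 1/5, norm becomes 1.0
by_value = clip_by_value(grads, 0.5)      # each entry clamped separately

print(by_norm, by_value)
```

Norm clipping preserves the direction of the update and only shortens it, while value clipping can change the direction because each component is clamped on its own.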

score 0 · Answer 3 · answered May 14 '17 at 11:08

You might be using a cross-entropy loss and taking log(0). Just add a small constant inside the log.

(you might also want to look into gradient clipping)
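As a sketch of the small-constant fix, assuming the loss is the negative log of the probability predicted for the true class: the function name `cross_entropy` and the value `eps=1e-7` are illustrative choices, not taken from the CIFAR-10 code.

```python
import math


def cross_entropy(probs, label, eps=1e-7):
    """Negative log-likelihood of the true class, with eps added inside the log.

    Without eps, a predicted probability of exactly 0.0 would mean log(0):
    math.log raises ValueError in plain Python, and array libraries
    typically return -inf, which then propagates as inf or nan.
    """
    return -math.log(probs[label] + eps)


# The model assigns zero probability to the true class (index 2).
loss_bad_prediction = cross_entropy([0.7, 0.3, 0.0], label=2)

# A confident, correct prediction gives a loss close to zero.
loss_good_prediction = cross_entropy([0.0, 0.0, 1.0], label=2)

print(loss_bad_prediction, loss_good_prediction)
```

The constant caps the loss at -log(eps), about 16.1 here, instead of letting it become infinite. This only guards the log itself, so it does not help if the network outputs are already inf or nan.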

answered May 14 '17 at 11:08

Martin Thoma

