mxnet-ssd nan problem

I get the following nan:

Epoch[4] Batch [2140] Speed: 6.44 samples/sec Train-CrossEntropy=0.589661
Epoch[4] Batch [2140] Speed: 6.44 samples/sec Train-SmoothL1=1.534163
Epoch[4] Train-CrossEntropy=nan
Epoch[4] Train-SmoothL1=nan

None of the per-batch losses are nan, but the epoch loss is nan.

Nov 09 '17 14:11 KeyKy

Maybe the problem is auto_reset=True in Speedometer.

Nov 09 '17 15:11 KeyKy

Don't hide the nan problem; it means you have bad parameter settings or something wrong with the data. See if you still get nan with a smaller learning rate.

Nov 09 '17 19:11 zhreshold

I found this source code in mxnet's base_module.py:

# in every epoch, after each batch:
if batch_end_callback is not None:
    batch_end_params = BatchEndParam(epoch=epoch, nbatch=nbatch,
                                     eval_metric=eval_metric,
                                     locals=locals())
    for callback in _as_list(batch_end_callback):
        callback(batch_end_params)  # -> Speedometer

When auto_reset=True, Speedometer calls

if self.auto_reset:
    param.eval_metric.reset()

which sets num_inst = 0 and sum_metric = 0.0. So when the epoch loss is logged:

# one epoch of training is finished
for name, val in eval_metric.get_name_value():
    self.logger.info('Epoch[%d] Train-%s=%f', epoch, name, val)  # logs nan because eval_metric was just reset
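
To make the sequence above concrete, here is a minimal sketch that does not use MXNet at all. RunningMetric and run_epoch are hypothetical stand-ins written for illustration. The sketch assumes the metric reports nan when nothing has been accumulated since the last reset, and that the last batch of the epoch lands on a logging step.

```python
import math


class RunningMetric:
    """Hypothetical stand-in for an accumulating metric (not MXNet's class)."""

    def __init__(self, name):
        self.name = name
        self.reset()

    def reset(self):
        # The same two fields mentioned above.
        self.num_inst = 0
        self.sum_metric = 0.0

    def update(self, loss, count=1):
        self.sum_metric += loss
        self.num_inst += count

    def get(self):
        # Assumption: with nothing accumulated, the average is reported as nan.
        if self.num_inst == 0:
            return self.name, float("nan")
        return self.name, self.sum_metric / self.num_inst


def run_epoch(losses, frequent, auto_reset):
    """Feed per-batch losses and log every `frequent` batches,
    roughly like a speedometer-style callback (simplified)."""
    metric = RunningMetric("CrossEntropy")
    batch_values = []
    for nbatch, loss in enumerate(losses, start=1):
        metric.update(loss)
        if nbatch % frequent == 0:
            batch_values.append(metric.get()[1])
            if auto_reset:
                metric.reset()
    # End of epoch: read the metric once more, as the epoch log does.
    epoch_value = metric.get()[1]
    return batch_values, epoch_value


losses = [1.0, 0.5, 0.5, 0.25]
with_reset = run_epoch(losses, frequent=2, auto_reset=True)
without_reset = run_epoch(losses, frequent=2, auto_reset=False)
print(with_reset)     # per-batch values are finite, epoch value is nan
print(without_reset)  # epoch value is the running average over all batches
```

With auto_reset=True every logged batch value is finite, but the epoch value is read right after a reset and comes out as nan. With auto_reset=False the epoch value is the average over the whole epoch.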

Nov 10 '17 02:11 KeyKy

That is the validation metric; the train metric was not affected.

Nov 16 '17 06:11 zhreshold