
nan problem

Open KeyKy opened this issue 8 years ago • 4 comments

I get the following nan:

Epoch[4] Batch [2140] Speed: 6.44 samples/sec Train-CrossEntropy=0.589661
Epoch[4] Batch [2140] Speed: 6.44 samples/sec Train-SmoothL1=1.534163
Epoch[4] Train-CrossEntropy=nan
Epoch[4] Train-SmoothL1=nan

None of the per-batch losses is nan, but the epoch loss is nan.

KeyKy, Nov 09 '17 14:11

Maybe the problem is auto_reset=True in Speedometer.

KeyKy, Nov 09 '17 15:11

Don't hide the nan problem; it means you have bad parameter settings or something is wrong with the data. See if you still get nan with a smaller learning rate.

zhreshold, Nov 09 '17 19:11

I found this code in mxnet's base_module.py:

for each epoch:
    if batch_end_callback is not None:
        batch_end_params = BatchEndParam(epoch=epoch, nbatch=nbatch,
                                         eval_metric=eval_metric,
                                         locals=locals())
        for callback in _as_list(batch_end_callback):
            callback(batch_end_params)  # -> speedometer

When auto_reset=True, Speedometer calls

if self.auto_reset:
    param.eval_metric.reset()

which sets num_inst = 0 and self.sum_metric = 0.0. So when the epoch loss is logged:

# one epoch of training is finished
for name, val in eval_metric.get_name_value():
    self.logger.info('Epoch[%d] Train-%s=%f', epoch, name, val)  # val is nan because eval_metric was just reset
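
To make the effect concrete, here is a small sketch that does not use mxnet at all. RunningMean and run_epoch are hypothetical stand-ins that I wrote for illustration, and the callback logic is a simplified guess at what Speedometer does (log every `frequent` batches, then reset when auto_reset is on). It assumes the metric reports nan when it holds no samples. If the last batch of the epoch lands on a logging step, the metric is empty when the epoch value is read:

```python
import math


class RunningMean:
    """Hypothetical stand-in for an MXNet-style metric (not the real class)."""

    def __init__(self, name):
        self.name = name
        self.reset()

    def reset(self):
        self.num_inst = 0
        self.sum_metric = 0.0

    def update(self, value):
        self.sum_metric += value
        self.num_inst += 1

    def get(self):
        # Assumption: an empty metric reports nan instead of raising.
        if self.num_inst == 0:
            return self.name, float("nan")
        return self.name, self.sum_metric / self.num_inst


def run_epoch(losses, frequent, auto_reset):
    """Feed per-batch losses, 'log' every `frequent` batches, return the epoch value."""
    metric = RunningMean("CrossEntropy")
    for nbatch, loss in enumerate(losses):
        metric.update(loss)
        # Simplified Speedometer-like callback: read the metric, then maybe reset it.
        if nbatch > 0 and nbatch % frequent == 0:
            metric.get()
            if auto_reset:
                metric.reset()
    # Epoch-level value is read after the last batch callback has run.
    return metric.get()[1]


# Last batch index is 4, and 4 % 2 == 0, so the metric is reset right before the epoch read.
print(run_epoch([0.5] * 5, frequent=2, auto_reset=True))
# Last batch index is 3, so one sample is still in the metric at the epoch read.
print(run_epoch([0.5] * 4, frequent=2, auto_reset=True))
# Without auto_reset the metric keeps accumulating over the whole epoch.
print(run_epoch([0.5] * 5, frequent=2, auto_reset=False))
```

In the log above the last reported batch is 2140, which would fit this pattern if the logging interval divides 2140 and that was the final batch of the epoch.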

KeyKy, Nov 10 '17 02:11

That is the validation metric; the train metric is not affected.

zhreshold, Nov 16 '17 06:11