mxnet-ssd
nan problem
I get the following nan:
Epoch[4] Batch [2140] Speed: 6.44 samples/sec Train-CrossEntropy=0.589661
Epoch[4] Batch [2140] Speed: 6.44 samples/sec Train-SmoothL1=1.534163
Epoch[4] Train-CrossEntropy=nan
Epoch[4] Train-SmoothL1=nan
None of the per-batch losses is nan, but the epoch loss is.
Maybe the problem is auto_reset=True in Speedometer?
Don't hide the nan problem; it means you have bad parameter settings or something wrong with the data. See if you still get nan with a smaller learning rate.
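One quick way to rule out bad data is to scan the labels before training. This is only a rough sketch: find_bad_boxes is a made-up helper, and the row layout [class_id, xmin, ymin, xmax, ymax] is an assumption about how the labels are stored, so adjust it to your own format.

```python
import math


def find_bad_boxes(labels):
    """Return indices of label rows that contain NaN/inf or a degenerate box.

    Assumes each row is [class_id, xmin, ymin, xmax, ymax] (hypothetical layout).
    """
    bad = []
    for i, row in enumerate(labels):
        _, xmin, ymin, xmax, ymax = row
        if not all(math.isfinite(v) for v in row):
            bad.append(i)  # NaN or inf somewhere in the row
        elif xmax <= xmin or ymax <= ymin:
            bad.append(i)  # zero or negative width/height
    return bad


labels = [
    [0, 0.1, 0.1, 0.5, 0.5],           # fine
    [1, float('nan'), 0.1, 0.5, 0.5],  # NaN coordinate
    [2, 0.6, 0.2, 0.4, 0.8],           # xmax < xmin
]
print(find_bad_boxes(labels))
```

If this reports nothing and the nan persists with a smaller learning rate, the data is less likely to be the cause.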
I found this source code in mxnet's base_module.py:
for each epoch:
    if batch_end_callback is not None:
        batch_end_params = BatchEndParam(epoch=epoch, nbatch=nbatch,
                                         eval_metric=eval_metric,
                                         locals=locals())
        for callback in _as_list(batch_end_callback):
            callback(batch_end_params)  # -> speedometer
When auto_reset=True, Speedometer calls
    if self.auto_reset:
        param.eval_metric.reset()
which sets num_inst = 0 and sum_metric = 0.0. So when the epoch loss is logged:
# one epoch of training is finished
for name, val in eval_metric.get_name_value():
    self.logger.info('Epoch[%d] Train-%s=%f', epoch, name, val)  # return nan because eval_metric is reset.
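To illustrate the mechanism without MXNet installed, here is a minimal sketch. ToyMetric and run_epoch are made-up names, not MXNet code. The sketch assumes, as MXNet's EvalMetric does, that get() returns nan when num_inst is 0, and it only roughly mimics when Speedometer fires (every `frequent` batches, skipping batch 0).

```python
import math


class ToyMetric:
    """Toy stand-in for an MXNet-style eval metric (not the real class)."""

    def __init__(self, name):
        self.name = name
        self.reset()

    def reset(self):
        self.num_inst = 0
        self.sum_metric = 0.0

    def update(self, value):
        self.sum_metric += value
        self.num_inst += 1

    def get(self):
        # With nothing accumulated there is no mean to report.
        if self.num_inst == 0:
            return self.name, float('nan')
        return self.name, self.sum_metric / self.num_inst


def run_epoch(losses, frequent, auto_reset):
    """Accumulate per-batch losses and return the end-of-epoch (name, value)."""
    metric = ToyMetric('CrossEntropy')
    for nbatch, loss in enumerate(losses):
        metric.update(loss)
        # Speedometer-like callback: report every `frequent` batches.
        if nbatch > 0 and nbatch % frequent == 0:
            print('Batch [%d] Train-%s=%f' % ((nbatch,) + metric.get()))
            if auto_reset:
                metric.reset()
    # "one epoch of training is finished"
    return metric.get()


losses = [0.5, 0.6, 0.7, 0.8, 0.9]
name, val = run_epoch(losses, frequent=2, auto_reset=True)
print('Epoch Train-%s=%f (is nan: %s)' % (name, val, math.isnan(val)))
```

With five batches and frequent=2 the callback fires on the last batch (index 4), so the metric is empty when the epoch value is read and the result is nan. With auto_reset=False the same call returns the mean over all five batches. If the epoch does not end exactly on a callback, the epoch value is not nan but covers only the batches since the last reset.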
That is the validation metric; the training metric is not affected.