
training_loss is 'nan'

CrisCloseTheDoor opened this issue 3 years ago

training_loss becomes 'nan' after several epochs.

The error info is:

Traceback (most recent call last):
  File "/home/ubuntu/hyx/nuplan-master/nuplan/planning/script/run_training.py", line 65, in main
    engine.trainer.fit(model=engine.model, datamodule=engine.datamodule)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
    self.dispatch()
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
    self.accelerator.start_training(self)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
    return self.run_train()
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 909, in run_train
    self.training_type_plugin.reconciliate_processes(traceback.format_exc())
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 390, in reconciliate_processes
    if len(os.listdir(sync_dir)) == self.world_size:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpe4unhc9a'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
    self.train_loop.run_training_epoch()
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
    self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
    model_ref.optimizer_step(
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 325, in optimizer_step
    make_optimizer_step = self.precision_plugin.pre_optimizer_step(
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 93, in pre_optimizer_step
    result = lambda_closure()
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 732, in train_step_and_backward_closure
    result = self.training_step_and_backward(
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 845, in training_step_and_backward
    self._check_finite(result.loss)
  File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 856, in _check_finite
    raise ValueError(f'The loss returned in `training_step` is {loss}.')
ValueError: The loss returned in `training_step` is nan.

I have set lr=1e-5 and even 1e-9, but it doesn't help. Setting the parameter terminate_on_nan: false isn't helpful, either.

The problem is not caused by the dataset, because I use the one_percent_of_lv_strip_scenarios filter to keep the amount of training data small.

So how can I solve it?

CrisCloseTheDoor avatar Jul 22 '22 10:07 CrisCloseTheDoor

Hi @CrisCloseTheDoor,

I've assigned this to another colleague of mine. In the meantime, can you share with us the full command that you ran?

Ping @christopher-motional

patk-motional avatar Jul 25 '22 06:07 patk-motional

Thanks for your reply. Here is my full command:

experiment_name=vector_experiment
py_func=train
+training=training_vector_model
worker=sequential
scenario_builder=nuplan
scenario_filter=one_percent_of_boston_scenarios
lightning.trainer.params.max_epochs=100
data_loader.params.batch_size=16
data_loader.params.num_workers=16
data_loader.params.pin_memory=True
+lightning.trainer.find_unused_parameters=False
+lightning.distributed_training.scale_lr=1e-4
lightning.trainer.params.gradient_clip_val=0.3
cache.use_cache_without_dataset=True
cache.cache_path=***

After digging deeper, I found that the 'nan' values first show up in the predictions produced by the LaneGCN model, which then makes the loss 'nan'.
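
For anyone who needs to track down where the 'nan' first appears, here is a minimal PyTorch sketch (the function and variable names are illustrative, not part of nuplan-devkit) that registers a forward hook on every submodule and raises as soon as a module produces non-finite outputs, so the offending layer shows up in the error message:

import torch

def install_nan_hooks(model: torch.nn.Module) -> None:
    # Flag the first module whose forward output contains NaN/Inf.
    def make_hook(name):
        def hook(module, inputs, output):
            tensors = output if isinstance(output, (tuple, list)) else (output,)
            for t in tensors:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    raise RuntimeError(f"Non-finite output first appeared in module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# Hypothetical usage: call install_nan_hooks(model) on the planning model before trainer.fit(...)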

CrisCloseTheDoor avatar Jul 25 '22 06:07 CrisCloseTheDoor

I have solved this problem. The cause is that FP16 was used for efficiency, but some addition operations overflowed, which produced 'nan' values. A solution is to modify the config file nuplan/planning/script/config/training/lightning/default_lightning.yaml:

precision: 32 # 16
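
For context, float16 tops out around 65504, so a plain addition can overflow to inf, and a later subtraction turns that into nan. A small standalone illustration of the failure mode:

import torch

a = torch.tensor([60000.0], dtype=torch.float16)
b = torch.tensor([10000.0], dtype=torch.float16)
s = a + b      # exceeds the float16 range -> tensor([inf], dtype=torch.float16)
print(s)
print(s - s)   # inf - inf -> tensor([nan], dtype=torch.float16)

If you prefer not to edit the yaml file, the same setting can presumably be passed as a Hydra override on the training command line; the key path below is inferred from the other overrides shown above, so treat it as an assumption:

lightning.trainer.params.precision=32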

CrisCloseTheDoor avatar Aug 12 '22 04:08 CrisCloseTheDoor

Thanks for sharing your solution. This will be very helpful to other users. Sorry we could not get back to you in a timely manner.

patk-motional avatar Aug 23 '22 08:08 patk-motional