training_loss is 'nan'
The training_loss becomes 'nan' after several epochs.
The error info is:
Traceback (most recent call last):
File "/home/ubuntu/hyx/nuplan-master/nuplan/planning/script/run_training.py", line 65, in main
engine.trainer.fit(model=engine.model, datamodule=engine.datamodule)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
self._run(model)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 758, in _run
self.dispatch()
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 799, in dispatch
self.accelerator.start_training(self)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
self._results = trainer.run_stage()
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in run_stage
return self.run_train()
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 909, in run_train
self.training_type_plugin.reconciliate_processes(traceback.format_exc())
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 390, in reconciliate_processes
if len(os.listdir(sync_dir)) == self.world_size:
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpe4unhc9a'
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 871, in run_train
self.train_loop.run_training_epoch()
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 499, in run_training_epoch
batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 738, in run_training_batch
self.optimizer_step(optimizer, opt_idx, batch_idx, train_step_and_backward_closure)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 434, in optimizer_step
model_ref.optimizer_step(
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/core/lightning.py", line 1403, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 214, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/core/optimizer.py", line 134, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/accelerators/accelerator.py", line 325, in optimizer_step
make_optimizer_step = self.precision_plugin.pre_optimizer_step(
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 93, in pre_optimizer_step
result = lambda_closure()
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 732, in train_step_and_backward_closure
result = self.training_step_and_backward(
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 845, in training_step_and_backward
self._check_finite(result.loss)
File "/home/ubuntu/env/anaconda3/envs/nuplan3/lib/python3.9/site-packages/pytorch_lightning/trainer/training_loop.py", line 856, in _check_finite
raise ValueError(f'The loss returned in `training_step` is {loss}.')
ValueError: The loss returned in `training_step` is nan.
I have set lr=1e-5 and even 1e-9, but it doesn't help.
Setting the parameter terminate_on_nan: false isn't helpful either.
The problem is not the dataset, because I use the one_percent_of_lv_strip_scenarios filter to make sure the training data is not huge.
So how can I solve it?
Hi @CrisCloseTheDoor,
I've assigned this to another colleague of mine. In the meantime, can you share with us the full command that you ran?
Ping @christopher-motional
Thanks for your reply, here's my full command:
experiment_name=vector_experiment
py_func=train
+training=training_vector_model
worker=sequential
scenario_builder=nuplan
scenario_filter=one_percent_of_boston_scenarios
lightning.trainer.params.max_epochs=100
data_loader.params.batch_size=16
data_loader.params.num_workers=16
data_loader.params.pin_memory=True
+lightning.trainer.find_unused_parameters=False
+lightning.distributed_training.scale_lr=1e-4
lightning.trainer.params.gradient_clip_val=0.3
cache.use_cache_without_dataset=True
cache.cache_path=***
After a deeper search, I found that the 'nan' values show up in the predictions of the LaneGCN model, which leads to the 'nan' loss.
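(For anyone debugging the same thing: a minimal way to locate the module that first produces non-finite outputs is a forward hook. This is plain PyTorch, and model is whatever model the trainer wraps; it is only a sketch, not the exact code I used.)
import torch

def report_non_finite_outputs(model):
    # Attach a forward hook to every submodule; print the name of any
    # module whose output contains NaN or Inf so the culprit is easy to find.
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    print(f"non-finite output from module: {name}")
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))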
I have solved this problem. The reason is that FP16 was used for efficiency, but overflow occurred in some addition operations, which produced the 'nan' values.
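(A minimal sketch of the failure mode, with illustrative numbers: fp16 can only represent values up to about 65504, so a large sum overflows to inf, and the next operation on inf turns it into nan, which then propagates into the loss.)
import torch

x = torch.full((8,), 10000.0, dtype=torch.float16)
s = x.sum()
print(s)                # inf  (80000 is not representable in fp16)
print(s - s)            # nan
print(x.float().sum())  # 80000.0 -- the same sum is fine in fp32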
A solution is to modify the config file nuplan/planning/script/config/training/lightning/default_lightning.yaml
precision: 32 # 16
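If you'd rather not edit the YAML, the same setting can probably be overridden on the command line, assuming the key sits under lightning.trainer.params like the other trainer overrides above (I have not verified the exact path):
lightning.trainer.params.precision=32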
Thanks for sharing your solution. This will be very helpful to other users. Sorry we could not get back to you in a timely manner.