My training process is frozen
Hello, the training process starts without any problem, but after some time it freezes:
the console stops showing any progress,
and when I check the GPUs at that point, GPU-Util (not memory) is at 100% while the process is frozen (I think this is a clue to the problem).
I have tried changing parameters like batch_size, num_workers, etc., but it doesn't help.
Can anyone help?
My environment is on miniconda3 with CUDA 11.8, so the versions are: PyTorch 2.0.0, PyTorch Lightning 2.0.2.
Did you use DDP training? If so, maybe one of the spawned processes hit an error and quit, so the main process is still waiting for it.
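To see whether a rank is dying silently, you could turn on the distributed debug logging before launching training. This is only a minimal sketch (the env variables are standard PyTorch/NCCL knobs; where you set them in your own launch script is up to you):

# Sketch: enable verbose distributed logging so a crashed or desynced rank
# usually leaves a message in the log instead of silently hanging the others.
import os

os.environ["NCCL_DEBUG"] = "INFO"                 # NCCL collective-level logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra checks in torch.distributed

# ... then build the Trainer and call trainer.fit() as usual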
Yeah, I adjusted the Trainer like this to fit the Lightning version (currently running with a single GPU):
from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

trainer = Trainer(
    logger=wandb_logger,
    callbacks=[checkpoint_callback],
    max_epochs=config.trainer.epochs,
    default_root_dir=root_dir,
    devices=gpus,
    accelerator='cuda',
    benchmark=True,
    sync_batchnorm=True,
    precision=config.precision,
    log_every_n_steps=config.trainer.log_every_n_steps,
    overfit_batches=config.trainer.overfit_batches,
    fast_dev_run=config.trainer.fast_dev_run,
    inference_mode=True,
    check_val_every_n_epoch=config.trainer.check_val_every_n_epoch,
    strategy=DDPStrategy(find_unused_parameters=True),  # multi-GPU DDP strategy
)
Then is "DDPStrategy" related to this problem? I'm new at using Lightning, so how can I fix it?
You can try CPU training or single-GPU training mode to check whether anything goes wrong there.
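For example, a minimal sketch (reusing names like config from your snippet; they are placeholders for your own objects):

# Sketch: run the same setup on a single device to rule out DDP itself.
from pytorch_lightning import Trainer

# single-GPU run (no DDP strategy, no sync_batchnorm needed)
trainer = Trainer(accelerator='gpu', devices=1, max_epochs=config.trainer.epochs)

# or a pure CPU run to rule out CUDA/NCCL entirely
# trainer = Trainer(accelerator='cpu', devices=1, max_epochs=config.trainer.epochs)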
I already tried with a single GPU and had no problem. So isn't there any solution for multiple GPUs?
Try setting the DataLoader's num_workers to 0 and see if that helps.
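Something like this (a minimal sketch; train_dataset and the batch size are placeholders for your own code):

from torch.utils.data import DataLoader

# num_workers=0 loads batches in the main process, which removes
# worker subprocesses as a possible cause of the hang.
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    shuffle=True,
    num_workers=0,
)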