
My training process is frozen

Open whansk50 opened this issue 2 years ago • 5 comments

Hello, the training process starts without any problem, but after running for a while it freezes, like this:

[screenshot: console output with no further progress]

The console shows no further output.

I checked the GPUs at that point: GPU utilization (not memory) is pegged at 100% while the process is frozen, which I think is a clue:

[screenshot: nvidia-smi showing GPU-Util at 100%]

I tried adjusting parameters like batch_size, num_workers, etc., but it didn't help.

Can anyone help?

My env is miniconda3 with CUDA 11.8, so the versions are: PyTorch 2.0.0, PyTorch Lightning 2.0.2.

whansk50 avatar Jun 07 '23 08:06 whansk50

Did you use DDP training? If so, a spawned process may have hit an error and exited, while the main process kept waiting for it.
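If the hang is inside a distributed collective, verbose logging usually shows which rank stopped responding. A minimal sketch (these environment variables are standard PyTorch/NCCL settings, not specific to this repo; set them before launching training):

```python
import os

# Print verbose NCCL logs so a collective op that never completes is visible.
os.environ["NCCL_DEBUG"] = "INFO"
# Ask torch.distributed to log mismatched collectives and unused parameters.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
```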

jiangxiluning avatar Jul 06 '23 08:07 jiangxiluning

Yeah, I adjusted the Trainer like this to fit the Lightning version (currently running with a single GPU):

trainer = Trainer(
    logger=wandb_logger,
    callbacks=[checkpoint_callback],
    max_epochs=config.trainer.epochs,
    default_root_dir=root_dir,
    devices=gpus,
    accelerator='cuda',
    benchmark=True,
    sync_batchnorm=True,
    precision=config.precision,
    log_every_n_steps=config.trainer.log_every_n_steps,
    overfit_batches=config.trainer.overfit_batches,
    fast_dev_run=config.trainer.fast_dev_run,
    inference_mode=True,
    check_val_every_n_epoch=config.trainer.check_val_every_n_epoch,
    strategy=DDPStrategy(find_unused_parameters=True),
)

Then is "DDPStrategy" related to this problem? I'm new at using Lightning, so how can I fix it?

whansk50 avatar Jul 07 '23 02:07 whansk50

You can use CPU training or single-GPU training mode to check whether anything bad happens.
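A minimal single-device Trainer for that sanity check might look like this (a sketch of Lightning 2.x Trainer arguments; `model` is assumed to be your LightningModule):

```python
from pytorch_lightning import Trainer

# One process, one device: if training still hangs here,
# the problem is not DDP inter-process communication.
trainer = Trainer(accelerator="gpu", devices=1, max_epochs=1)
# Or rule out CUDA entirely:
# trainer = Trainer(accelerator="cpu", devices=1, max_epochs=1)
# trainer.fit(model)  # `model` is assumed to exist in your script
```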

jiangxiluning avatar Jul 08 '23 11:07 jiangxiluning

I already tried with a single GPU and had no problem. So isn't there any solution with multiple GPUs?

whansk50 avatar Jul 09 '23 23:07 whansk50

Try setting the DataLoader's num_workers to 0 and see.
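For example (a sketch with a placeholder dataset; in this repo you would change the num_workers value passed to your actual DataLoader or config instead):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# num_workers=0 loads batches in the main process, so worker-process
# deadlocks are ruled out if the freeze disappears.
dataset = TensorDataset(torch.arange(10).float())  # placeholder data
loader = DataLoader(dataset, batch_size=4, num_workers=0)
for (batch,) in loader:
    print(batch.shape[0])
```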

jiangxiluning avatar Jul 10 '23 09:07 jiangxiluning