
The classification training exits after the 26th epoch

Open · sheshap opened this issue 5 years ago · 4 comments

I just downloaded the code and followed the instructions (installed Python 3.6 and PyTorch 1.5 with CUDA 10.2).

When I started training with the multi-GPU script, the training abruptly exited without any error after the 26th epoch:

Epoch 26: : 450it [00:19, 23.47it/s, loss=0.359, train_acc=0.875, v_num=1, val_acc=0.897, val_loss=0.323]

I am using 4 RTX 2080 Ti GPUs.

sheshap avatar Nov 15 '20 05:11 sheshap

I have a similar issue with the same versions. Did you manage to solve it?

lokneey avatar Jan 13 '21 14:01 lokneey

I got the same issue with a single GPU (I tried both a 1080 Ti and a 2080 Ti). I'm not sure whether this is related to the PyTorch version. I used Python 3.7, PyTorch 1.6.0, and CUDA 10.1.

bowenc0221 avatar Jan 15 '21 19:01 bowenc0221

@erikwijmans Could you kindly help us with this issue? Thanks.

sheshap avatar Mar 03 '21 14:03 sheshap

Hi, I'm not sure, but you can change the value of the "patience" parameter in early_stop_callback. If you comment out that line, the code will keep running until the final epoch.

```python
# early_stop_callback = pl.callbacks.EarlyStopping(patience=5)
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    monitor="val_acc",
    mode="max",
    save_top_k=2,
    filepath=os.path.join(
        cfg.task_model.name, "{epoch}-{val_loss:.2f}-{val_acc:.3f}"
    ),
    verbose=True,
)
trainer = pl.Trainer(
    gpus=list(cfg.gpus),
    max_epochs=cfg.epochs,
    # early_stop_callback=early_stop_callback,
    checkpoint_callback=checkpoint_callback,
    distributed_backend=cfg.distrib_backend,
)
```

This is around line 33 of train.py. (My English is limited...)
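
If you would rather keep early stopping instead of disabling it entirely, here is a minimal sketch of a more tolerant setup. It assumes the same PyTorch Lightning version this repo pins, where EarlyStopping accepts monitor/mode/patience and the Trainer still takes early_stop_callback directly; the patience value of 50 is just an illustrative choice.

```python
# Keep early stopping, but watch validation accuracy and allow many
# non-improving epochs before quitting (values are illustrative).
early_stop_callback = pl.callbacks.EarlyStopping(
    monitor="val_acc",  # stop based on validation accuracy instead of val_loss
    mode="max",         # higher accuracy is better
    patience=50,        # number of validation checks with no improvement before stopping
)

trainer = pl.Trainer(
    gpus=list(cfg.gpus),
    max_epochs=cfg.epochs,
    early_stop_callback=early_stop_callback,
    checkpoint_callback=checkpoint_callback,
    distributed_backend=cfg.distrib_backend,
)
```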

Jcxloyal avatar Mar 18 '21 07:03 Jcxloyal