The classification training exits after the 26th epoch
I just downloaded the code and followed the instructions (installed Python 3.6, PyTorch 1.5 with CUDA 10.2).
When I started training with the multiple-GPU script, training abruptly exited without any error after the 26th epoch:
Epoch 26: : 450it [00:19, 23.47it/s, loss=0.359, train_acc=0.875, v_num=1, val_acc=0.897, val_loss=0.323]
I am using 4 RTX 2080 Ti GPUs.
I have a similar issue with the same versions. Did you manage to solve it?
I got the same issue with a single GPU (I tried both a 1080 Ti and a 2080 Ti). I'm not sure whether this is related to the PyTorch version. I used Python 3.7, PyTorch 1.6.0, and CUDA 10.1.
@erikwijmans Could you please help us with this issue? Thanks.
Hi, I'm not sure, but you can change the value of the "patience" parameter in the early_stop_callback; if you comment out that line, the code will keep running until the last epoch.
```python
#early_stop_callback = pl.callbacks.EarlyStopping(patience=5)
checkpoint_callback = pl.callbacks.ModelCheckpoint(
    monitor="val_acc",
    mode="max",
    save_top_k=2,
    filepath=os.path.join(
        cfg.task_model.name, "{epoch}-{val_loss:.2f}-{val_acc:.3f}"
    ),
    verbose=True,
)
trainer = pl.Trainer(
    gpus=list(cfg.gpus),
    max_epochs=cfg.epochs,
    #early_stop_callback=early_stop_callback,
    checkpoint_callback=checkpoint_callback,
    distributed_backend=cfg.distrib_backend,
)
```
This is from around line 33 of train.py.
(my English is limited...)
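As an alternative to commenting the callback out entirely, a minimal sketch of keeping early stopping but relaxing it could look like the following. The monitor, mode, and patience values here are only illustrative, and this assumes the same older PyTorch Lightning API that train.py already uses (the early_stop_callback and distributed_backend Trainer arguments); newer Lightning versions configure callbacks differently.

```python
# Sketch (untested): keep early stopping, but make it less aggressive.
# Assumes cfg and checkpoint_callback are defined as in the snippet above.
import pytorch_lightning as pl

early_stop_callback = pl.callbacks.EarlyStopping(
    monitor="val_acc",   # watch validation accuracy instead of the default val_loss
    mode="max",          # higher accuracy counts as an improvement
    patience=20,         # illustrative value: wait 20 epochs without improvement
)

trainer = pl.Trainer(
    gpus=list(cfg.gpus),
    max_epochs=cfg.epochs,
    early_stop_callback=early_stop_callback,
    checkpoint_callback=checkpoint_callback,
    distributed_backend=cfg.distrib_backend,
)
```

Monitoring val_acc with mode="max" matches the checkpoint callback above, so the stopping criterion and the saved checkpoints would agree on what "improvement" means.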