TQDMProgressBar bug (?) causes DDP to hang
https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/callbacks/progress/tqdm_progress.py#L285-L286
Is this^ a safe implementation for DDP?
Based on my understanding, this goes through the stack trace below and, if certain conditions are met, ultimately triggers a sync operation on ResultMetrics. However, self.train_progress_bar.disable is only False (i.e. the bar is enabled) on the rank-zero process, so the metrics computation will only succeed if there happens to be another code path on non-rank-zero devices requesting self._logger_connector.metrics at the same time.
Stack trace
https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/callbacks/progress/progress_bar.py#L180-L201
https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/trainer.py#L1675-L1683
https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py#L253-L258
https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py#L232-L237
https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/connectors/logger_connector/result.py#L471-L476
https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/connectors/logger_connector/result.py#L425-L440
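In condensed pseudo-code (my own simplification of the linked chain above, not the literal source), the hook does roughly this:

def on_train_epoch_end(self, trainer, pl_module):
    if not self.train_progress_bar.disable:  # True only on the rank-zero process
        # get_metrics -> trainer.progress_bar_metrics -> LoggerConnector.metrics
        #   -> _ResultCollection.metrics -> ResultMetric.compute()
        #   -> a distributed sync that needs participation from ALL ranks
        self.train_progress_bar.set_postfix(self.get_metrics(trainer, pl_module))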
I ran into this when debugging an issue in my training script where DDP was hanging at the end of the first epoch. Changing the original code chunk as follows resolved the deadlock for me:
metrics = self.get_metrics(trainer, pl_module)
if not self.train_progress_bar.disable:
    self.train_progress_bar.set_postfix(metrics)
To provide a little more detail, I have a CustomCallback that calls trainer.strategy.barrier() during on_train_epoch_end. At the end of the first training epoch, the rank-zero process hangs on TQDMProgressBar.on_train_epoch_end and the non-rank-zero processes hang on CustomCallback.on_train_epoch_end.
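For reference, the callback in question does something like this (a simplified sketch; the full minimal repro is further down in this thread):

import pytorch_lightning as pl

class CustomCallback(pl.Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        # Any collective call here waits for every rank, so it deadlocks if
        # rank 0 is still blocked inside TQDMProgressBar.on_train_epoch_end.
        trainer.strategy.barrier()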
cc @justusschock @lantiga
Hi @tchittesh, I tried running a multi-GPU DDP setup with a similar callback and TQDMProgressBar, but I wasn't able to reproduce the hanging issue.
Here's my code:
import lightning as L
from lightning.pytorch.callbacks import TQDMProgressBar
from lightning.pytorch.demos.boring_classes import BoringModel


class MyModel(BoringModel):
    def __init__(self):
        super().__init__()
        self.latest_loss = 0

    def training_step(self, batch, batch_idx):
        loss = super().training_step(batch, batch_idx)["loss"]
        self.log("epoch_loss", loss, on_step=False, on_epoch=True, sync_dist=True)
        self.latest_loss = loss.item()
        return {"loss": loss}


class Bar(TQDMProgressBar):
    def on_train_epoch_end(self, trainer, pl_module):
        # barrier before progress bar update
        trainer.strategy.barrier()
        super().on_train_epoch_end(trainer, pl_module)

    def get_metrics(self, trainer, model):
        items = super().get_metrics(trainer, model)
        items["loss"] = model.latest_loss
        items.pop("v_num", None)
        return items


if __name__ == "__main__":
    bar = Bar()
    model = MyModel()
    trainer = L.Trainer(
        strategy="ddp",
        accelerator="gpu",
        devices=4,
        max_epochs=10,
        callbacks=[bar],
        log_every_n_steps=1,
    )
    trainer.fit(model)
- Output on a T4 machine with 4 GPUs:
⚡ ~ python main.py
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params | Mode
-----------------------------------------
0 | layer | Linear | 66 | train
-----------------------------------------
66 Trainable params
0 Non-trainable params
66 Total params
0.000 Total estimated model params size (MB)
1 Modules in train mode
0 Modules in eval mode
Sanity Checking: | | 0/? [00:00<?, ?it/s]/home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:433: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:433: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
Epoch 9: 100%|██████████| 16/16 [00:00<00:00, 214.59it/s, loss=0.0919]`Trainer.fit` stopped: `max_epochs=10` reached.
Epoch 9: 100%|██████████| 16/16 [00:00<00:00, 203.04it/s, loss=0.0919]
⚡ ~
Am I missing something in the setup that triggers the deadlock?
Here's a minimal reproduction of the deadlock (and the fix for it).
Some notes:
- It's important to instantiate `CustomCallback` in `LightningModule.configure_callbacks()` instead of passing it directly to the trainer (a quick way to check the resolved callback order is sketched after these notes).
  - If we pass it into the trainer, the order of callbacks is `CustomCallback -> TQDMProgressBar -> ModelCheckpoint`. In this case, the `CustomCallback` will execute correctly. Then, rank 0 will try to sync/compute `ResultMetrics` for the `TQDMProgressBar`. Rank 1 will skip past `TQDMProgressBar`, but it'll get lucky because `ModelCheckpoint` will also request to sync/compute the same `ResultMetrics`.
  - Passing it into `LightningModule.configure_callbacks()` will change the order of callbacks to `TQDMProgressBar -> CustomCallback -> ModelCheckpoint`. In this case, rank 0 will try to compute `ResultMetrics` for the `TQDMProgressBar`, but rank 1 will skip past the `TQDMProgressBar` and get stuck at the `dist.gather_object` sync point in `CustomCallback`.
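If it helps to confirm which order you actually get, a tiny inspection callback along these lines (the class name is mine, not part of the repro) prints the resolved order:

import pytorch_lightning as pl

class PrintCallbackOrder(pl.Callback):
    def on_fit_start(self, trainer, pl_module):
        # Purely for inspection: show the callback order the Trainer resolved.
        if trainer.global_rank == 0:
            print([type(cb).__name__ for cb in trainer.callbacks])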
import torch.distributed as dist

import pytorch_lightning as pl
import pytorch_lightning.callbacks
from pytorch_lightning.demos.boring_classes import BoringModel


def on_train_epoch_end_monkey_patch(self, trainer, pl_module):
    metrics = self.get_metrics(trainer, pl_module)
    if not self.train_progress_bar.disable:
        self.train_progress_bar.set_postfix(metrics)
    if self._leave:
        self.train_progress_bar.close()


# Uncomment this monkey patch to fix the deadlock:
# pytorch_lightning.callbacks.TQDMProgressBar.on_train_epoch_end = on_train_epoch_end_monkey_patch


class MyModel(BoringModel):
    def training_step(self, batch, batch_idx):
        loss = super().training_step(batch, batch_idx)["loss"]
        self.log("loss", loss, on_step=False, on_epoch=True, sync_dist=True)
        return {"loss": loss}

    def configure_callbacks(self):
        return [CustomCallback()]


class CustomCallback(pl.Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        obj = 10
        objs = [None] * trainer.world_size
        dist.gather_object(
            obj,
            objs if pl_module.global_rank == 0 else None,
            dst=0,
        )


if __name__ == "__main__":
    model = MyModel()
    trainer = pl.Trainer(
        strategy="ddp",
        accelerator="gpu",
        devices=2,
        max_epochs=10,
    )
    trainer.fit(model)
Hi @tchittesh, thanks for the detailed repro! Another way to fix this DDP hang is to adjust `_reorder_callbacks()` so that callbacks run in the order `tuner_callbacks -> other_callbacks -> progress_bar_callbacks -> checkpoint_callbacks`, i.e. change
https://github.com/Lightning-AI/pytorch-lightning/blob/4dc1dbce6365f15d1ec8b42410fc79df86d748fa/src/lightning/pytorch/trainer/connectors/callback_connector.py#L214-L239
to something like:
@staticmethod
def _reorder_callbacks(callbacks: list[Callback]) -> list[Callback]:
    tuner_callbacks: list[Callback] = []
    other_callbacks: list[Callback] = []
    progress_bar_callbacks: list[Callback] = []
    checkpoint_callbacks: list[Callback] = []

    for cb in callbacks:
        if isinstance(cb, (BatchSizeFinder, LearningRateFinder)):
            tuner_callbacks.append(cb)
        elif isinstance(cb, Checkpoint):
            checkpoint_callbacks.append(cb)
        elif isinstance(cb, ProgressBar):
            progress_bar_callbacks.append(cb)
        else:
            other_callbacks.append(cb)

    return tuner_callbacks + other_callbacks + progress_bar_callbacks + checkpoint_callbacks
This preserves the relative order within each group and avoids rank 0 being blocked while syncing metrics, so it also fixes the reported hang. Any concerns or edge cases I might be missing?
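For illustration (a hypothetical example, not code from any PR), a user-supplied callback list would end up ordered like this:

from lightning.pytorch.callbacks import Callback, ModelCheckpoint, TQDMProgressBar

class MyCallback(Callback):  # stands in for any user callback, e.g. one with a collective op
    pass

callbacks = [ModelCheckpoint(), MyCallback(), TQDMProgressBar()]
# With the proposed _reorder_callbacks, the execution order becomes:
#   [MyCallback, TQDMProgressBar, ModelCheckpoint]
# so user callbacks run before the progress bar's rank-0-only metrics sync,
# and checkpoint callbacks still run last.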
cc: @SkafteNicki
@deependujha yeah, I think just changing the _reorder_callbacks method is a good strategy for solving this issue. I cannot really see an edge case where this change would be a problem. Please go ahead and send a PR.
I don't understand. Isn't this just hiding the bug in TQDMProgressBar? If someone later comes in and disables the ModelCheckpoint callback, won't they hit the same deadlock? Could you provide some more color on why reordering callbacks is a better solution than computing metrics on all ranks?
Works fine even with enable_checkpointing=False.
@tchittesh @deependujha let's start out with a PR that implements the appropriate test, e.g. something that would fail with the current codebase, and then implement the fix to make sure that it indeed fixes the error.
Agreed that we should add a test here.
However, I strongly believe that reordering callbacks is not a good fix. Using some pseudo-code for simplicity, there's a
if global_rank == 0:
    sync_and_compute_result_metrics()
block in TQDMProgressBar, which will deadlock unless non-rank 0 processes also call sync_and_compute_result_metrics() somewhere else. This means that in isolation, TQDMProgressBar.on_train_epoch_end() will always hang with DDP!
We typically don't observe this on master, because the ModelCheckpoint callback tries to sync_and_compute_result_metrics() as well, on non-rank 0 processes. Similarly, with the proposed reordering fix, the deadlock "goes away" because LoggerConnector also runs sync_and_compute_result_metrics() on non-rank 0 processes. However, the correctness of TQDMProgressBar ideally shouldn't depend on the presence of other hooks that get executed afterwards.
Let me know if I'm overlooking something -- I hope we can reach some consensus on this issue.
@tchittesh I totally get your logic, and maybe the right thing here is to compute metrics on all ranks. I hopefully have time later this week to dive into the code and figure out if there is something we are overlooking.
I guess what I tried to state in my previous message is that we should be able to define a test where this problem is triggered in isolation. It probably involves disabling ModelCheckpoint and the LoggerConnector. In this case, if you are correct, the reordering should still not work, and we can conclude that computing on all ranks is the correct way.
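Something along these lines might be a starting point (a rough sketch with hypothetical names; on the current codebase it would be expected to hang or time out rather than pass, and it would need whatever DDP test harness the suite already uses):

from lightning.pytorch import Trainer
from lightning.pytorch.callbacks import TQDMProgressBar
from lightning.pytorch.demos.boring_classes import BoringModel

class SyncLossModel(BoringModel):
    def training_step(self, batch, batch_idx):
        loss = super().training_step(batch, batch_idx)["loss"]
        # on_epoch + sync_dist puts a ResultMetric sync on the epoch-end path
        self.log("loss", loss, on_step=False, on_epoch=True, sync_dist=True)
        return {"loss": loss}

def test_tqdm_progress_bar_does_not_deadlock_in_isolation(tmp_path):
    # The progress bar alone (no ModelCheckpoint, no logger) should not hang under DDP.
    trainer = Trainer(
        default_root_dir=tmp_path,
        accelerator="cpu",
        devices=2,
        strategy="ddp_spawn",
        max_epochs=1,
        limit_train_batches=2,
        limit_val_batches=0,
        enable_checkpointing=False,  # remove the "lucky" sync from ModelCheckpoint
        logger=False,
        callbacks=[TQDMProgressBar()],
    )
    trainer.fit(SyncLossModel())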
Hi @SkafteNicki ! I have a local branch that adds targeted tests and fixes this issue. How can I push my branch to this repo and open a PR?
β― git push -u origin bugfix/21264_pbardeadlock
remote: Permission to Lightning-AI/pytorch-lightning.git denied to tchittesh.
fatal: unable to access 'https://github.com/Lightning-AI/pytorch-lightning.git/': The requested URL returned error: 403
@tchittesh the right procedure is to fork the project first; you have full push privileges to your own fork. In that fork you can recreate your branch and push (or work directly on master/main, since it is your fork). After pushing, you should be able to create a PR between your forked repo's branch and master on Lightning.
You should be able to do it quickly by adding your fork as a remote on your laptop and then pushing to that instead:
# 1. Make sure you're on the correct branch
git checkout bugfix/21264_pbardeadlock
# 2. Add your fork as a new remote (replace USERNAME with your GitHub username)
git remote add fork https://github.com/USERNAME/pytorch-lightning.git
# 3. Verify remotes (you should now see both origin and fork)
git remote -v
# 4. Push your branch to your fork
git push -u fork bugfix/21264_pbardeadlock
Awesome! I opened a PR above^.
I'm not sure if the CI failures are related to my changes; AFAICT, those tests have enable_progress_bar=False.
Recently rebased, and the tests pass! @SkafteNicki, if you could take a look when you get the chance, that would be great, thank you! https://github.com/Lightning-AI/pytorch-lightning/pull/21322