
TQDMProgressBar bug (?) causes DDP to hang

Open tchittesh opened this issue 6 months ago • 14 comments

https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/callbacks/progress/tqdm_progress.py#L285-L286

Is this^ a safe implementation for DDP?

Based on my understanding, this triggers the call chain below, which ultimately performs a sync operation on ResultMetrics if certain conditions are met. However, self.train_progress_bar.disable is only False (i.e. the bar is enabled) on the rank-zero process, so the metrics computation will only succeed if some other codepath on the non-rank-zero ranks happens to request self._logger_connector.metrics at the same time.

Stack trace

https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/callbacks/progress/progress_bar.py#L180-L201

https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/trainer.py#L1675-L1683

https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py#L253-L258

https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py#L232-L237

https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/connectors/logger_connector/result.py#L471-L476

https://github.com/Lightning-AI/pytorch-lightning/blob/918a1a62635e41b72131fc153920594f0f64cc5b/src/lightning/pytorch/trainer/connectors/logger_connector/result.py#L425-L440

I ran into this when debugging an issue in my training script where DDP was hanging at the end of the first epoch. Changing the original code chunk as such resolved the deadlock for me.

metrics = self.get_metrics(trainer, pl_module)
if not self.train_progress_bar.disable: 
    self.train_progress_bar.set_postfix(metrics) 
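To illustrate why the guard placement matters, here's a stdlib-only toy simulation (these are stand-ins, not Lightning's actual classes): a collective op only completes when every rank enters it, so moving get_metrics() outside the disable guard is what lets every rank participate.

```python
def buggy_hook(rank, synced):
    # Original code: get_metrics() -- which may sync/compute ResultMetrics,
    # a collective op under sync_dist=True -- sits inside the disable guard,
    # and only rank 0 has the progress bar enabled.
    if rank == 0:  # stand-in for: not self.train_progress_bar.disable
        synced.append(rank)  # stand-in for: self.get_metrics(...)

def fixed_hook(rank, synced):
    # Fixed code: every rank computes the metrics, so every rank enters
    # the collective; only rank 0 then touches the progress bar.
    synced.append(rank)  # stand-in for: self.get_metrics(...)
    if rank == 0:
        pass  # stand-in for: self.train_progress_bar.set_postfix(metrics)

buggy, fixed = [], []
for rank in range(2):  # simulate two DDP ranks
    buggy_hook(rank, buggy)
    fixed_hook(rank, fixed)
print(buggy)  # [0] -- rank 1 never enters the collective -> deadlock
print(fixed)  # [0, 1] -- all ranks enter the collective -> no hang
```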

To provide a little more detail, I have a CustomCallback that calls trainer.strategy.barrier() during on_train_epoch_end. At the end of the first training epoch, the rank-zero process hangs on TQDMProgressBar.on_train_epoch_end and the non-rank-zero processes hang on CustomCallback.on_train_epoch_end.

cc @justusschock @lantiga

tchittesh avatar Oct 04 '25 01:10 tchittesh

Hi @tchittesh , I tried running a multi-GPU DDP setup with a similar callback and TQDMProgressBar, but I wasn’t able to reproduce the hanging issue.

Here’s my code:
import lightning as L
from lightning.pytorch.callbacks import TQDMProgressBar
from lightning.pytorch.demos.boring_classes import BoringModel

class MyModel(BoringModel):
    def __init__(self):
        super().__init__()
        self.latest_loss = 0

    def training_step(self, batch, batch_idx):
        loss = super().training_step(batch, batch_idx)["loss"]
        self.log("epoch_loss", loss, on_step=False, on_epoch=True, sync_dist=True)
        self.latest_loss = loss.item()
        return {"loss": loss}

class Bar(TQDMProgressBar):
    def on_train_epoch_end(self, trainer, pl_module):
        # barrier before progress bar update
        trainer.strategy.barrier()
        super().on_train_epoch_end(trainer, pl_module)

    def get_metrics(self, trainer, model):
        items = super().get_metrics(trainer, model)
        items["loss"] = model.latest_loss
        items.pop("v_num", None)
        return items

if __name__ == "__main__":
    bar = Bar()
    model = MyModel()
    trainer = L.Trainer(
        strategy="ddp",
        accelerator="gpu",
        devices=4,
        max_epochs=10,
        callbacks=[bar],
        log_every_n_steps=1,
    )
    trainer.fit(model)

  • output on T4 machine with 4 gpus:
⚡ ~ python main.py
💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------

LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name  | Type   | Params | Mode 
-----------------------------------------
0 | layer | Linear | 66     | train
-----------------------------------------
66        Trainable params
0         Non-trainable params
66        Total params
0.000     Total estimated model params size (MB)
1         Modules in train mode
0         Modules in eval mode
Sanity Checking: |                                                                                                                                                                      | 0/? [00:00<?, ?it/s]/home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:433: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
/home/zeus/miniconda3/envs/cloudspace/lib/python3.12/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:433: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=7` in the `DataLoader` to improve performance.
Epoch 9: 100%|████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 214.59it/s, loss=0.0919]`Trainer.fit` stopped: `max_epochs=10` reached.
Epoch 9: 100%|████████████████████████████████████████████████████████████████| 16/16 [00:00<00:00, 203.04it/s, loss=0.0919]
⚡ ~

Am I missing something in the setup that triggers the deadlock?

deependujha avatar Oct 04 '25 08:10 deependujha

Here's a minimal reproduction of the deadlock (and the fix for it).

Some notes:

  • It's important to instantiate CustomCallback in LightningModule.configure_callbacks() instead of directly passing it to the trainer.
    • If we pass it into the trainer, the order of callbacks is CustomCallback -> TQDMProgressBar -> ModelCheckpoint. In this case, the CustomCallback will execute correctly. Then, rank 0 will try to sync/compute ResultMetrics for the TQDMProgressBar. Rank 1 will skip past TQDMProgressBar, but it'll get lucky because ModelCheckpoint will also request to sync/compute the same ResultMetrics.
    • Passing it into LightningModule.configure_callbacks() will change the order of callbacks to TQDMProgressBar -> CustomCallback -> ModelCheckpoint. In this case, rank 0 will try to compute ResultMetrics for the TQDMProgressBar, but rank 1 will skip past the TQDMProgressBar and get stuck at the dist.gather_object sync point in CustomCallback.
import torch.distributed as dist
import pytorch_lightning as pl
import pytorch_lightning.callbacks
from pytorch_lightning.demos.boring_classes import BoringModel

def on_train_epoch_end_monkey_patch(self, trainer, pl_module):
    metrics = self.get_metrics(trainer, pl_module)
    if not self.train_progress_bar.disable:
        self.train_progress_bar.set_postfix(metrics)
    if self._leave:
        self.train_progress_bar.close()

# Uncomment this monkey patch to fix the deadlock:
# pytorch_lightning.callbacks.TQDMProgressBar.on_train_epoch_end = on_train_epoch_end_monkey_patch

class MyModel(BoringModel):
    def training_step(self, batch, batch_idx):
        loss = super().training_step(batch, batch_idx)["loss"]
        self.log("loss", loss, on_step=False, on_epoch=True, sync_dist=True)
        return {"loss": loss}

    def configure_callbacks(self):
        return [CustomCallback()]

class CustomCallback(pl.Callback):
    def on_train_epoch_end(self, trainer, pl_module):
        obj = 10
        objs = [None] * trainer.world_size
        dist.gather_object(
            obj,
            objs if pl_module.global_rank == 0 else None,
            dst=0,
        )

if __name__ == "__main__":
    model = MyModel()
    trainer = pl.Trainer(
        strategy="ddp",
        accelerator="gpu",
        devices=2,
        max_epochs=10,
    )
    trainer.fit(model)

tchittesh avatar Oct 04 '25 23:10 tchittesh

Hi @tchittesh, thanks for the detailed repro! Another way to fix this DDP hang is to adjust _reorder_callbacks() so that the callback groups run in the order:

tuner_callbacks -> other_callbacks -> progress_bar_callbacks -> checkpoint_callbacks.

https://github.com/Lightning-AI/pytorch-lightning/blob/4dc1dbce6365f15d1ec8b42410fc79df86d748fa/src/lightning/pytorch/trainer/connectors/callback_connector.py#L214-L239

to be something like:

@staticmethod
def _reorder_callbacks(callbacks: list[Callback]) -> list[Callback]:
    tuner_callbacks: list[Callback] = []
    other_callbacks: list[Callback] = []
    progress_bar_callbacks: list[Callback] = []
    checkpoint_callbacks: list[Callback] = []

    for cb in callbacks:
        if isinstance(cb, (BatchSizeFinder, LearningRateFinder)):
            tuner_callbacks.append(cb)
        elif isinstance(cb, Checkpoint):
            checkpoint_callbacks.append(cb)
        elif isinstance(cb, ProgressBar):
            progress_bar_callbacks.append(cb)
        else:
            other_callbacks.append(cb)

    return tuner_callbacks + other_callbacks + progress_bar_callbacks + checkpoint_callbacks
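The reordering above is a stable partition, so relative order within each group is preserved. A standalone sketch (with hypothetical marker classes standing in for Lightning's tuner/progress-bar/checkpoint base classes) shows the behavior:

```python
# Hypothetical stand-ins for BatchSizeFinder/LearningRateFinder,
# ProgressBar, and Checkpoint.
class TunerCB: ...
class ProgressCB: ...
class CheckpointCB: ...
class OtherCB: ...

def reorder(callbacks):
    """Stable partition into tuner -> other -> progress bar -> checkpoint,
    mirroring the proposed _reorder_callbacks()."""
    tuner, other, progress, checkpoint = [], [], [], []
    for cb in callbacks:
        if isinstance(cb, TunerCB):
            tuner.append(cb)
        elif isinstance(cb, CheckpointCB):
            checkpoint.append(cb)
        elif isinstance(cb, ProgressCB):
            progress.append(cb)
        else:
            other.append(cb)
    return tuner + other + progress + checkpoint

cbs = [ProgressCB(), OtherCB(), CheckpointCB(), OtherCB(), TunerCB()]
out = reorder(cbs)
print([type(cb).__name__ for cb in out])
# ['TunerCB', 'OtherCB', 'OtherCB', 'ProgressCB', 'CheckpointCB']
```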

This preserves the relative order within each group and avoids rank 0 being blocked while syncing metrics; in my testing it fixes the bug. Any concerns or edge cases I might be missing?

cc: @SkafteNicki

deependujha avatar Oct 06 '25 05:10 deependujha

@deependujha yeah, I think just changing the _reorder_callbacks method is a good strategy for solving this issue. I can't really see an edge case where this change would be a problem. Please go ahead and send a PR.

SkafteNicki avatar Oct 06 '25 06:10 SkafteNicki

I don't understand. Isn't this just hiding the bug in TQDMProgressBar? If someone later comes in and disables the ModelCheckpoint callback, won't they hit the same deadlock? Could you provide some more color on why reordering callbacks is a better solution than computing metrics on all ranks?

tchittesh avatar Oct 06 '25 08:10 tchittesh

Works fine even with enable_checkpointing=False.

deependujha avatar Oct 06 '25 08:10 deependujha

@tchittesh @deependujha let's start out with a PR that implements the appropriate test, e.g. something that would fail with the current codebase, and then implement the fix to make sure that it indeed fixes the error.

SkafteNicki avatar Oct 06 '25 11:10 SkafteNicki

Agreed that we should add a test here.

However, I strongly believe that reordering callbacks is not a good fix. Using some pseudo-code for simplicity, there's a

if global_rank == 0:
    sync_and_compute_result_metrics()

block in TQDMProgressBar, which will deadlock unless non-rank 0 processes also call sync_and_compute_result_metrics() somewhere else. This means that in isolation, TQDMProgressBar.on_train_epoch_end() will always hang with DDP!

We typically don't observe this on master, because the ModelCheckpoint callback tries to sync_and_compute_result_metrics() as well, on non-rank 0 processes. Similarly, with the proposed reordering fix, the deadlock "goes away" because LoggerConnector also runs sync_and_compute_result_metrics() on non-rank 0 processes. However, the correctness of TQDMProgressBar ideally shouldn't depend on the presence of other hooks that get executed afterwards.
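To make that dependency concrete, here's a stdlib-only sketch (the helper name is hypothetical) of the rule that a collective op only returns once every rank in the process group has entered it:

```python
def collective_completes(world_size, participating_ranks):
    """A collective op (all_gather, barrier, ...) only returns once every
    rank in the process group has entered it."""
    return set(participating_ranks) == set(range(world_size))

# TQDMProgressBar in isolation: only rank 0 enters the sync -> deadlock.
print(collective_completes(2, participating_ranks=[0]))  # False

# With ModelCheckpoint (or LoggerConnector) running afterwards on the
# non-zero ranks, every rank eventually enters the same sync, so the
# hang "goes away" -- but only by accident.
print(collective_completes(2, participating_ranks=[0, 1]))  # True
```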

Let me know if I'm overlooking something -- I hope we can reach some consensus on this issue.

tchittesh avatar Oct 06 '25 19:10 tchittesh

@tchittesh I totally get your logic, and maybe the right thing here is to compute metrics on all ranks. I hopefully have time later this week to dive into the code and figure out whether there is something we are overlooking.

I guess what I tried to state in my previous message is that we should be able to define a test where this problem is triggered in isolation. It probably involves disabling ModelCheckpoint and LoggerConnector. In that case, if you are correct, the reordering should still not work, and we can conclude that computing on all ranks is the correct fix.

SkafteNicki avatar Oct 07 '25 05:10 SkafteNicki

Hi @SkafteNicki ! I have a local branch that adds targeted tests and fixes this issue. How can I push my branch to this repo and open a PR?

❯ git push -u origin bugfix/21264_pbardeadlock
remote: Permission to Lightning-AI/pytorch-lightning.git denied to tchittesh.
fatal: unable to access 'https://github.com/Lightning-AI/pytorch-lightning.git/': The requested URL returned error: 403

tchittesh avatar Oct 29 '25 04:10 tchittesh

@tchittesh the right procedure is to fork the project first. You have full push privileges to your fork, so you can recreate your branch there and push (or work directly on master/main, since it is your fork). After pushing, you should be able to create a PR between your fork's branch and master on Lightning.

SkafteNicki avatar Oct 29 '25 05:10 SkafteNicki

You should be able to do this quickly by just adding your fork as a remote on your laptop and pushing to that instead:

# 1. Make sure you’re on the correct branch
git checkout bugfix/21264_pbardeadlock

# 2. Add your fork as a new remote (replace USERNAME with your GitHub username)
git remote add fork https://github.com/USERNAME/pytorch-lightning.git

# 3. Verify remotes (you should now see both origin and fork)
git remote -v

# 4. Push your branch to your fork
git push -u fork bugfix/21264_pbardeadlock

SkafteNicki avatar Oct 29 '25 05:10 SkafteNicki

awesome! I opened a PR above^.

I'm not sure if the CI failures are related to my changes. AFAICT, those tests have enable_progress_bar=False.


tchittesh avatar Oct 30 '25 06:10 tchittesh

recently rebased, and the tests pass! @SkafteNicki if you could take a look when you get the chance, that would be great, thank you! https://github.com/Lightning-AI/pytorch-lightning/pull/21322

tchittesh avatar Nov 10 '25 06:11 tchittesh