
Fix ddp_notebook CUDA fork check to allow passive initialization

Open arrdel opened this issue 4 months ago • 1 comments

What does this PR do?

Fixes #21389

This PR fixes the overly strict CUDA fork check in the ddp_notebook strategy, which was causing false positives in notebook environments like Kaggle.

Problem

The previous implementation used torch.cuda.is_initialized(), which returns True even when CUDA has only been initialized passively (e.g., during library imports, device availability checks, or model loading). This caused the error:

RuntimeError: Lightning can't create new processes if CUDA is already initialized.

This happened even when users didn't explicitly call any CUDA functions, making it impossible to use ddp_notebook in many legitimate scenarios.
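To see whether a notebook session is already in this state before launching, one can inspect the two flags directly. The snippet below is only a diagnostic sketch: the helper name `cuda_state` is hypothetical and not part of Lightning, and `_is_in_bad_fork` is a private PyTorch function that may be absent in some versions, so it is looked up defensively.

```python
def cuda_state():
    """Return CUDA initialization flags, or None if PyTorch is not installed.

    Hypothetical helper for illustration; not part of Lightning or PyTorch.
    """
    try:
        import torch
    except ImportError:
        return None

    # `_is_in_bad_fork` is private and may be missing, so look it up defensively.
    bad_fork = getattr(torch.cuda, "_is_in_bad_fork", None)
    return {
        "initialized": torch.cuda.is_initialized(),
        "in_bad_fork": bad_fork() if callable(bad_fork) else None,
    }


print(cuda_state())
```

If `initialized` is already True right after the imports in a fresh notebook, the old check would have raised the error above even though no CUDA function was called explicitly.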

Solution

This fix uses PyTorch's internal torch.cuda._is_in_bad_fork() function, which detects an actual bad fork state (a forked process whose parent had already initialized CUDA) rather than any CUDA initialization.

The implementation includes a fallback to the old check for older PyTorch versions that don't have _is_in_bad_fork.
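The check-with-fallback logic described above can be sketched roughly as follows. This is not the code from the PR: the function name `cuda_blocks_fork` is hypothetical, and the `cuda` argument stands for the `torch.cuda` module so that the sketch runs without PyTorch installed. The two `SimpleNamespace` objects are stand-ins that only mimic the attributes involved.

```python
from types import SimpleNamespace


def cuda_blocks_fork(cuda) -> bool:
    """Return True if forking new processes looks unsafe for this `cuda` module.

    Hypothetical helper; in real use `cuda` would be `torch.cuda`.
    """
    # Prefer the private bad-fork check when the PyTorch version provides it.
    bad_fork_check = getattr(cuda, "_is_in_bad_fork", None)
    if callable(bad_fork_check):
        return bool(bad_fork_check())
    # Fallback for older versions: the stricter "is CUDA initialized" check.
    return bool(cuda.is_initialized())


# Stand-ins for illustration: CUDA passively initialized in both cases.
newer_cuda = SimpleNamespace(is_initialized=lambda: True, _is_in_bad_fork=lambda: False)
older_cuda = SimpleNamespace(is_initialized=lambda: True)

print(cuda_blocks_fork(newer_cuda))  # False: initialized, but not in a bad fork
print(cuda_blocks_fork(older_cuda))  # True: falls back to the old, stricter check
```

Looking the function up with `getattr` rather than comparing version strings keeps the fallback tied to what is actually available, which matters for a private API.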

Testing

  • [x] Code follows style guidelines
  • [x] Changes preserve backward compatibility
  • [x] Fallback exists for older PyTorch versions

📚 Documentation preview 📚: https://pytorch-lightning--21402.org.readthedocs.build/en/21402/

arrdel avatar Dec 03 '25 20:12 arrdel

Codecov Report

:x: Patch coverage is 0% with 11 lines in your changes missing coverage. Please review.
:white_check_mark: Project coverage is 79%. Comparing base (79ffe50) to head (f002d00).
:warning: Report is 13 commits behind head on master.
:white_check_mark: All tests successful. No failed tests found.

:exclamation: There is a different number of reports uploaded between BASE (79ffe50) and HEAD (f002d00).

HEAD has 3345 uploads less than BASE

Flag               BASE (79ffe50)   HEAD (f002d00)
cpu                777              30
lightning_fabric   195              0
pytest             390              0
python3.12         233              9
python3.12.7       232              9
lightning          388              15
python3.11         156              6
python3.10         78               3
python             78               3
pytorch2.1         78               6
pytest-full        387              30
pytorch_lightning  194              15
pytorch2.6         39               3
pytorch2.4.1       38               3
pytorch2.3         39               3
pytorch2.2.2       39               3
pytorch2.5.1       38               3
pytorch2.9         39               3
pytorch2.7         39               3
pytorch2.8         38               3

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21402     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         269      266      -3     
  Lines       23804    23772     -32     
=========================================
- Hits        20626    18730   -1896     
- Misses       3178     5042   +1864     

codecov[bot] avatar Dec 05 '25 10:12 codecov[bot]

Thanks @arrdel, could you update the changelogs? Other than that, your PR seems fine to me :)

justusschock avatar Dec 17 '25 13:12 justusschock

Thanks @justusschock! I've updated the Fabric changelog as requested. The entry documents the fix for the DDP notebook CUDA fork check to allow passive initialization. The changelog is now complete and ready for review.

arrdel avatar Dec 17 '25 18:12 arrdel