Wu Houming
When I run `torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:29500 --nnodes=1 --nproc-per-node=4 test_pipeline_schedule.py --schedules gpipe`, I get the following output:

```shell
[2023-12-03 08:40:53,722] torch.distributed.run: [WARNING]
[2023-12-03 08:40:53,722] torch.distributed.run: [WARNING] *****************************************
[2023-12-03 08:40:53,722] torch.distributed.run: [WARNING] Setting...
```
I tried converting the slurm script (i.e., prof_steps.sh) to torchrun and running it directly, but it hangs when NCCL is used as the collective_backend. The torchrun script is as follows:...
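For reference, this is roughly what I mean by running it directly with torchrun: a minimal single-node sketch, not my actual prof_steps.sh. The two debug exports are illustrative additions to help localize the NCCL hang, and the entry script here is just the test file from above.

```shell
#!/bin/bash
# Minimal sketch of a direct (non-slurm) launch; assumes a single node with 4 GPUs.
export NCCL_DEBUG=INFO                 # print NCCL init/collective logs
export TORCH_DISTRIBUTED_DEBUG=DETAIL  # extra c10d-level diagnostics

torchrun \
  --rdzv-backend=c10d \
  --rdzv-endpoint=localhost:29500 \
  --nnodes=1 \
  --nproc-per-node=4 \
  test_pipeline_schedule.py --schedules gpipe
```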