StableCascade

How to set "use_fsdp=True" with "SLURM_LOCALID" and "SLURM_PROCID" for multi-GPU training?

Open terrificdm opened this issue 1 year ago • 7 comments

For single-GPU training, every time I run the script I have to "export SLURM_LOCALID=0", "export SLURM_PROCID=0" and "export SLURM_NNODES=1" before the training starts successfully. My question is about multi-GPU training (suppose I have 4 GPUs): how can I set "use_fsdp=True" together with "SLURM_LOCALID" and "SLURM_PROCID"? If I use the single-GPU configuration (e.g. export SLURM_LOCALID=0), the training hangs forever without any notification. Can someone give me an example configuration? Thanks.
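
(For reference, those three exports are roughly equivalent to setting the variables at the very top of the train script, before anything reads the environment; an untested sketch, not part of the repo:)

    # Rough equivalent of the shell exports above, for single-GPU runs.
    import os
    os.environ.setdefault("SLURM_LOCALID", "0")
    os.environ.setdefault("SLURM_PROCID", "0")
    os.environ.setdefault("SLURM_NNODES", "1")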

terrificdm avatar Feb 18 '24 15:02 terrificdm

I also want to know this. When I heard there was fsdp I was hoping maybe I could train by splitting between my RTX 3090 and RTX 4060TI 16GB. However the SLURM errors prevent me from getting the trainer to recognize both GPUs, and it's clear that I won't be able to train these models effectively with the 3090 alone.

Goldenkoron avatar Feb 18 '24 20:02 Goldenkoron

  1. Run the training script with torchrun instead of srun (e.g., torchrun --standalone --nnodes 1 --nproc_per_node 2 train/~~~~)
  2. Exchange SLURM_LOCALID, SLURM_PROCID, SLURM_NNODES for LOCAL_RANK, RANK, WORLD_SIZE in your code (see the fuller sketch below):

         def setup_ddp(self, experiment_id, single_gpu=False):
             if not single_gpu:
                 local_rank = int(os.environ.get("LOCAL_RANK"))
                 process_id = int(os.environ.get("RANK"))
                 world_size = int(os.environ.get("WORLD_SIZE"))  # * torch.cuda.device_count()
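
In context, step 2 amounts to something like the following sketch. Only the three environment-variable reads come from the steps above; the attribute names (self.device, self.process_id, self.world_size, self.is_main_node) and the init_process_group call are illustrative assumptions, not the repo's exact code.

    import os
    import torch
    from torch.distributed import init_process_group

    def setup_ddp(self, experiment_id, single_gpu=False):
        if not single_gpu:
            # torchrun sets these; srun/SLURM only sets the SLURM_* variants.
            local_rank = int(os.environ.get("LOCAL_RANK"))   # was SLURM_LOCALID
            process_id = int(os.environ.get("RANK"))         # was SLURM_PROCID
            world_size = int(os.environ.get("WORLD_SIZE"))   # was SLURM_NNODES * torch.cuda.device_count()

            self.process_id = process_id
            self.world_size = world_size
            self.is_main_node = process_id == 0
            self.device = torch.device(f"cuda:{local_rank}")
            torch.cuda.set_device(local_rank)

            # torchrun --standalone also provides MASTER_ADDR/MASTER_PORT,
            # so the default env:// rendezvous is enough here.
            init_process_group(backend="nccl", rank=process_id, world_size=world_size)
        else:
            self.process_id = 0
            self.world_size = 1
            self.is_main_node = True
            self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

With that change, the torchrun launch from step 1 spawns one process per GPU, and use_fsdp=True should then be able to shard the model across them.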

I followed this but it doesn't seem to work; the training still gets stuck.

universewill avatar Feb 22 '24 13:02 universewill

My CPU is working very hard according to top. Am I training on the CPU now for some reason? I turned on FSDP and am running on 2 A100s with CUDA.
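
A quick, hypothetical way to check is to print what each process actually sees (a generic probe, not tied to the trainer):

    # Hypothetical per-process probe; RANK is set by torchrun, SLURM_PROCID by srun.
    import os
    import torch

    rank = os.environ.get("RANK", os.environ.get("SLURM_PROCID", "0"))
    print(f"rank {rank}: cuda_available={torch.cuda.is_available()}, "
          f"visible_gpus={torch.cuda.device_count()}")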

heyalexchoi avatar Mar 01 '24 22:03 heyalexchoi

  1. Run the training script with torchrun instead of srun (e.g., torchrun --standalone --nnodes 1 --nproc_per_node 2 train/~~~~)
  2. Exchange SLURM_LOCALID, SLURM_PROCID, SLURM_NNODES for LOCAL_RANK, RANK, WORLD_SIZE in your code:

         def setup_ddp(self, experiment_id, single_gpu=False):
             if not single_gpu:
                 local_rank = int(os.environ.get("LOCAL_RANK"))
                 process_id = int(os.environ.get("RANK"))
                 world_size = int(os.environ.get("WORLD_SIZE"))  # * torch.cuda.device_count()

I followed this but it doesn't seem to work; the training still gets stuck.

I also found that my training process gets stuck, and I am confused about how to use srun to run the training.

Unified-Robots avatar Mar 02 '24 15:03 Unified-Robots

  2. local_rank = int(os.environ.get("LOCAL_RANK"))
     process_id = int(os.environ.get("RANK"))
     world_size = int(os.environ.get("WORLD_SIZE"))

Have you solved the problem?

zoumaguanxin avatar Mar 06 '24 09:03 zoumaguanxin

Training also gets stuck for me when using srun.

rese1f avatar Mar 11 '24 03:03 rese1f

  1. Run the training script with torchrun instead of srun (e.g., torchrun --standalone --nnodes 1 --nproc_per_node 2 train/~~~~)
  2. Exchange SLURM_LOCALID, SLURM_PROCID, SLURM_NNODES for LOCAL_RANK, RANK, WORLD_SIZE in your code:

         def setup_ddp(self, experiment_id, single_gpu=False):
             if not single_gpu:
                 local_rank = int(os.environ.get("LOCAL_RANK"))
                 process_id = int(os.environ.get("RANK"))
                 world_size = int(os.environ.get("WORLD_SIZE"))  # * torch.cuda.device_count()

I followed this but it doesn't seem to work; the training still gets stuck.

Tried this and it's working for me. Don't forget to also change the SLURM_LOCALID string to LOCAL_RANK in your train script.
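
For the train script itself, the change is just where the device index is read from; a hypothetical before/after (variable names are illustrative, not verified against the repo):

    import os
    import torch

    # before (srun/SLURM): the device index came from SLURM_LOCALID
    # device = torch.device(f"cuda:{int(os.environ['SLURM_LOCALID'])}")

    # after (torchrun): read the index from LOCAL_RANK instead
    device = torch.device(f"cuda:{int(os.environ.get('LOCAL_RANK', 0))}")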


FYI: step #2 is located in core/__init__.py.

tikhonlavrev avatar Jun 04 '24 17:06 tikhonlavrev