How to set "use_fsdp=True" with "SLURM_LOCALID" and "SLURM_PROCID" for multi-GPU training?
For single-GPU training, every time I run the script I have to "export SLURM_LOCALID=0", "export SLURM_PROCID=0", and "export SLURM_NNODES=1" before the training starts successfully. My question is: for multi-GPU training (suppose I have 4 GPUs), how can I set "use_fsdp=True" together with "SLURM_LOCALID" and "SLURM_PROCID"? If I reuse the single-GPU configuration (e.g. export SLURM_LOCALID=0), the training hangs forever without any notification. Can someone give me an example configuration? Thanks.
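For context, the single-GPU workaround described above amounts to faking a one-process SLURM layout before launching the script; a rough sketch (the script path is just a placeholder, not a real path from this repo):

```bash
# Pretend to be rank 0 of a single-node, single-process SLURM job.
export SLURM_LOCALID=0
export SLURM_PROCID=0
export SLURM_NNODES=1
python train/~~~~   # placeholder for the actual training entry point
```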
I also want to know this. When I heard there was FSDP support, I was hoping I could train by splitting the model between my RTX 3090 and RTX 4060 Ti 16GB. However, the SLURM errors prevent the trainer from recognizing both GPUs, and it's clear that I won't be able to train these models effectively with the 3090 alone.
- Run the training script with torchrun instead of srun (e.g., torchrun --standalone --nnodes 1 --nproc_per_node 2 train/~~~~)
- Replace SLURM_LOCALID, SLURM_PROCID, and SLURM_NNODES with LOCAL_RANK, RANK, and WORLD_SIZE in your code, i.e. in setup_ddp (a fuller sketch follows below this list):

```python
def setup_ddp(self, experiment_id, single_gpu=False):
    if not single_gpu:
        local_rank = int(os.environ.get("LOCAL_RANK"))
        process_id = int(os.environ.get("RANK"))
        world_size = int(os.environ.get("WORLD_SIZE"))  # * torch.cuda.device_count()
```
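For reference, here is a minimal, self-contained sketch of what the renamed setup might look like when launched with torchrun. The setup_ddp signature is copied from the snippet above; the torch.cuda.set_device and init_process_group calls are the standard PyTorch pattern for env:// rendezvous, not necessarily exactly what this repo's code does, so treat it as an illustration rather than a drop-in patch.

```python
import os

import torch
import torch.distributed as dist


def setup_ddp(self, experiment_id, single_gpu=False):
    # Launch with, e.g.: torchrun --standalone --nnodes 1 --nproc_per_node 2 train/~~~~
    # torchrun exports LOCAL_RANK, RANK, WORLD_SIZE (and MASTER_ADDR/MASTER_PORT)
    # for every worker process, so no SLURM variables are needed.
    if not single_gpu:
        local_rank = int(os.environ["LOCAL_RANK"])   # GPU index on this node
        process_id = int(os.environ["RANK"])         # global rank across all nodes
        world_size = int(os.environ["WORLD_SIZE"])   # total number of worker processes

        # Pin this process to its GPU before creating the process group.
        torch.cuda.set_device(local_rank)

        # env:// rendezvous works out of the box under torchrun.
        dist.init_process_group(
            backend="nccl",
            rank=process_id,
            world_size=world_size,
        )
```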
I followed this, but it doesn't seem to work; the training still gets stuck.
My CPU is working very hard according to top. Am I training on the CPU for some reason? FSDP is turned on, and I'm running on 2 A100s with CUDA.
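If you suspect the work is landing on the CPU, a quick sanity check from inside each training process is to confirm CUDA is visible and that the parameters actually sit on a CUDA device (watching nvidia-smi in another terminal tells you much the same thing). Heavy CPU load on its own can also just be the data-loading workers. In the sketch below, model is a stand-in for whatever module the trainer builds, not a name from this repo.

```python
import os

import torch

# Run inside the training process on each rank, e.g. right after model creation.
print("CUDA available: ", torch.cuda.is_available())
print("Visible GPUs:   ", torch.cuda.device_count())
print("RANK/LOCAL_RANK:", os.environ.get("RANK"), os.environ.get("LOCAL_RANK"))

# 'model' is a placeholder for the trainer's module; uncomment where it exists.
# print("Parameter device:", next(model.parameters()).device)
```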
My training process also gets stuck, and I am also confused about how to use srun to run the training.
Regarding step 2:

```python
local_rank = int(os.environ.get("LOCAL_RANK"))
process_id = int(os.environ.get("RANK"))
world_size = int(os.environ.get("WORLD_SIZE"))
```
Have you solved the problem?
Also stuck during training when launching with srun.
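When a run just hangs with no output, turning on the standard PyTorch/NCCL debug logging (these are generic environment variables, not anything specific to this repo) usually reveals which rank is stuck waiting in rendezvous or in a collective:

```bash
# Verbose logs from NCCL and torch.distributed, then launch as suggested above.
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
torchrun --standalone --nnodes 1 --nproc_per_node 2 train/~~~~
```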
Tried this and it's working for me. Don't forget to change the SLURM_LOCALID string to LOCAL_RANK in your training script.
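If you want to be sure no SLURM_* reads are left behind, a plain grep over the repository (nothing project-specific) lists every remaining occurrence to rename:

```bash
# Find all remaining reads of SLURM environment variables in the Python sources.
grep -rn "SLURM_" --include="*.py" .
```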
FYI: the code for step #2 is located in /core/init.py.