Open-Sora
Multi-node training with Slurm
How can I do multi-node training using Slurm?
I read that I have to add some lines of code from here.
I added the following to the main function, before dist.init_process_group:
import colossalai
colossalai.launch_from_slurm(
    host=master_host,
    port=29500
)
where master_host is hard-coded to one of the allocated nodes. Then, in the Slurm batch script, I run:
srun python scripts/train.py config my_config.py
However, it does not work: the job just hangs.
Can someone explain to me how to do it?
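One way to avoid hard-coding master_host is to derive it from the Slurm environment, so that every task agrees on the same host. The following is a minimal sketch, not a confirmed fix for the hang. It assumes the script is started by srun (so SLURM_JOB_NODELIST is set) and that the scontrol command is available on the compute nodes. The names resolve_master_host and expand_with_scontrol are hypothetical helpers for illustration and are not part of colossalai.

```python
import os
import subprocess


def expand_with_scontrol(nodelist):
    """Expand a compressed Slurm node list (e.g. "node[01-04]") into hostnames.

    Assumes the `scontrol` binary is on the PATH of the compute node.
    """
    out = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return out.split()


def resolve_master_host(env=None, expand=expand_with_scontrol):
    """Return the first hostname of the allocation, for use as the master host.

    `env` and `expand` are injectable so the function can be exercised
    without a running Slurm job (hypothetical parameters for this sketch).
    """
    env = os.environ if env is None else env
    nodelist = env.get("SLURM_JOB_NODELIST")
    if not nodelist:
        raise RuntimeError("SLURM_JOB_NODELIST is not set; is this running under Slurm?")
    hosts = expand(nodelist)
    if not hosts:
        raise RuntimeError(f"could not expand node list {nodelist!r}")
    # Every task sees the same node list, so every task picks the same host.
    return hosts[0]
```

The result would then be passed as host=resolve_master_host() in the colossalai.launch_from_slurm call shown above. This only removes the hard-coded hostname; the chosen port must still be reachable between nodes, and srun must start one task per process the training script expects.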
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 7 days since being marked as stale.