
Multi-node training with Slurm

Open rob-hen opened this issue 1 year ago • 1 comments

How can I do multi-node training using Slurm?

I read that I have to add some lines of code from here.

I added the following to the main function, before dist.init_process_group:

import colossalai

colossalai.launch_from_slurm(
    host=master_host,
    port=29500
)

where master_host is hard-coded to one of the nodes assigned to the job. Then, in the Slurm batch file, I run srun python scripts/train.py config my_config.py

However, it does not work. The job just hangs.

Can someone explain to me how to do it?
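
For reference, here is a minimal sketch of how the master host could be derived from the Slurm environment instead of being hard-coded, so that every task resolves the same address. It rests on assumptions, not on anything confirmed in this thread: the helper names resolve_master_host, scontrol_hostnames and describe_rank are hypothetical, the job is assumed to be started with srun so that SLURM_JOB_NODELIST, SLURM_PROCID and SLURM_NPROCS are set, and scontrol is assumed to be available on the compute nodes.

```python
import subprocess
from typing import Callable, Dict, List, Mapping


def scontrol_hostnames(nodelist: str) -> List[str]:
    """Expand a compact Slurm node list (e.g. "node[01-04]") using scontrol."""
    out = subprocess.run(
        ["scontrol", "show", "hostnames", nodelist],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line.strip() for line in out.splitlines() if line.strip()]


def resolve_master_host(
    env: Mapping[str, str],
    expand: Callable[[str], List[str]] = scontrol_hostnames,
) -> str:
    """Return the first host of the allocation; every task gets the same answer."""
    nodelist = env.get("SLURM_JOB_NODELIST") or env.get("SLURM_NODELIST")
    if not nodelist:
        raise RuntimeError("No Slurm node list found; was the script started with srun?")
    hosts = expand(nodelist)
    if not hosts:
        raise RuntimeError(f"Could not expand node list: {nodelist!r}")
    return hosts[0]


def describe_rank(env: Mapping[str, str]) -> Dict[str, int]:
    """Collect the per-task values set by srun, useful to print before launching."""
    return {
        "rank": int(env["SLURM_PROCID"]),
        "world_size": int(env["SLURM_NPROCS"]),
        "local_rank": int(env.get("SLURM_LOCALID", "0")),
    }


# Hypothetical use inside the training script (not executed here):
#   import os, colossalai
#   print(describe_rank(os.environ))
#   colossalai.launch_from_slurm(host=resolve_master_host(os.environ), port=29500)
```

Printing describe_rank(os.environ) from each task before the launch call is one way to check that srun actually started the expected number of tasks. If world_size is smaller than the number of GPUs in the job, the rendezvous could wait indefinitely for ranks that never start, which would look like a hang.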

rob-hen, Aug 06 '24 13:08

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot], Aug 14 '24 01:08

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot], Aug 22 '24 01:08