
Multi-host parallel training cannot recognize all GPUs

Open · zhaoanbei opened this issue 3 years ago

Hi, I am following https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion and trying to use 2 hosts, each with 1 T4 GPU. In the config I changed lightning: trainer: accelerator: 'gpu' devices: 2

And got this error:

lightning.fabric.utilities.exceptions.MisconfigurationException: You requested gpu: [0, 1] But your machine only has: [0]

Is there another config option I should set so it recognizes multiple hosts?
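For reference, a sketch of how this part of the config is usually written for a multi-node Lightning run (assuming the standard Lightning trainer keys used by the diffusion example): devices counts GPUs per host, and the number of hosts goes in a separate num_nodes key, so devices: 2 asks for two GPUs on one machine.

```yaml
lightning:
  trainer:
    accelerator: 'gpu'
    # GPUs per host, not the total across hosts
    devices: 1
    # number of participating hosts
    num_nodes: 2
    # distributed data parallel across the hosts
    strategy: 'ddp'
```

On top of this, each host typically needs the rendezvous environment variables (MASTER_ADDR, MASTER_PORT, NODE_RANK) set before launch so the processes on the two machines can find each other.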

zhaoanbei avatar Mar 03 '23 01:03 zhaoanbei

We have fixed it and will release a detailed blog post together with the AWS team. Thanks.

binmakeswell avatar Mar 07 '23 05:03 binmakeswell

We are happy to release this blog post together: https://aws.amazon.com/cn/blogs/china/run-colossal-ai-based-distributed-finetune-tasks-on-sagemaker/ Thanks.

binmakeswell avatar Apr 27 '23 08:04 binmakeswell