ColossalAI
Multi-host parallel training cannot recognize all GPUs
Hi, I am following https://github.com/hpcaitech/ColossalAI/tree/main/examples/images/diffusion and trying to train across 2 hosts, each with 1 T4 GPU. In the config I set `lightning.trainer.accelerator: 'gpu'` and `lightning.trainer.devices: 2`.
This produced the following error:
> lightning.fabric.utilities.exceptions.MisconfigurationException: You requested gpu: [0, 1] But your machine only has: [0]
Is there another config option I should set so that both hosts are recognized?
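For context: in PyTorch Lightning, `devices` is the number of GPUs per node, not the total across all hosts, which is consistent with the error above (each machine only sees GPU `[0]`). A sketch of the likely fix, assuming the example keeps its trainer settings under `lightning.trainer` as in the question:

```yaml
# Hedged sketch, not the official example config:
# with 2 hosts x 1 T4 each, request 1 device per node on 2 nodes.
lightning:
  trainer:
    accelerator: 'gpu'
    devices: 1      # GPUs available on each host
    num_nodes: 2    # number of participating hosts
```

The launcher on each host (e.g. `torchrun` or the cluster environment) still needs to provide the usual rendezvous settings (master address/port, node rank) so the two processes can find each other.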
We have fixed it and will release a detailed blog post together with the AWS team. Thanks.
https://aws.amazon.com/cn/blogs/china/run-colossal-ai-based-distributed-finetune-tasks-on-sagemaker/ We are happy to release this blog post together. Thanks.