
Using DeepSpeed on different servers with different CUDA versions

Open yt2639 opened this issue 2 years ago • 0 comments

Hi friends, I'm having trouble using DeepSpeed on different servers that have different CUDA versions.

I installed DeepSpeed via pip install deepspeed, and the version is 0.5.8.

The servers all share the same conda environment. One of them has CUDA 11.7, and I installed PyTorch 1.13.1 built for CUDA 11.7 in the shared environment, so DeepSpeed runs smoothly on that server. But when I try to run DeepSpeed on the other servers, which have different CUDA versions (for example 11.8, 11.6, or 12.0), it reports this error:

Exception: Installed CUDA version 11.8 does not match the version torch was compiled with 11.7, unable to compile cuda/cpp extensions without a matching cuda version.
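To make the mismatch concrete, here is a minimal Python sketch of the kind of comparison that would produce this error. It is an illustration under stated assumptions, not DeepSpeed's actual code: the helper names (parse_nvcc_release, cuda_versions_match) and the sample nvcc text are hypothetical, and the check simply compares the major.minor release reported by the system toolkit against the version PyTorch was built with.

```python
import re

# Hypothetical sample of the last lines printed by `nvcc --version`
# on a server whose system toolkit is CUDA 11.8.
SAMPLE_NVCC_OUTPUT = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Cuda compilation tools, release 11.8, V11.8.89\n"
)


def parse_nvcc_release(nvcc_output: str) -> str:
    """Extract the 'major.minor' CUDA release from `nvcc --version` text."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no CUDA release found in the nvcc output")
    return f"{match.group(1)}.{match.group(2)}"


def cuda_versions_match(installed: str, torch_compiled: str) -> bool:
    """Return True only if both versions agree on major and minor."""
    return installed.split(".")[:2] == torch_compiled.split(".")[:2]


if __name__ == "__main__":
    installed = parse_nvcc_release(SAMPLE_NVCC_OUTPUT)
    # In the environment described in this issue, torch.version.cuda
    # would be "11.7"; it is hard-coded here so the sketch runs
    # without PyTorch installed.
    torch_compiled = "11.7"
    if not cuda_versions_match(installed, torch_compiled):
        print(
            f"Installed CUDA version {installed} does not match the "
            f"version torch was compiled with {torch_compiled}"
        )
```

On a real server, the installed value would come from running nvcc --version and the compiled value from torch.version.cuda, so printing both on each machine is a quick way to see which servers would hit the error.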

Technically, I could create a separate conda environment for each server and build PyTorch from source there, so that PyTorch is compiled against that server's CUDA version. But this is extremely inconvenient. Also, although I only installed PyTorch 1.13.1 with CUDA 11.7, PyTorch itself works fine on the other servers (with CUDA 11.8, 11.6, and 12.0). So I'm guessing DeepSpeed might have similar support for running across different CUDA versions. Is there any workaround for this issue?

Thanks! Shane

yt2639, Mar 14 '23 08:03