fix environment variable export bug for MultiNodeRunner

Open TideDra opened this issue 1 year ago • 0 comments

In some multi-node environments such as SLURM, some environment variables contain special characters that trigger errors when they are exported.

For example, requesting two nodes with 64 CPUs each under SLURM sets SLURM_JOB_CPUS_PER_NODE=64(x2). Exporting this variable with runner.add_export adds the command export SLURM_JOB_CPUS_PER_NODE=64(x2) when launching subprocesses, which causes a bash syntax error because ( is a shell metacharacter, not a literal character in an unquoted word:

[2024-08-07 16:56:24,651] [INFO] [runner.py:568:main] cmd = pdsh -S -f 1024 -w server22,server27 export PYTHONPATH=/public/home/grzhang/code/CLIP-2;  export SLURM_JOB_CPUS_PER_NODE=64(x2); ...
server22: bash: -c: line 0: syntax error near unexpected token `('

This PR simply wraps the exported environment variable values in a pair of double quotes (") so that the shell treats them as strings.
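The difference can be sketched in a few lines of Python. This is only an illustration of the quoting idea, not the actual DeepSpeed code: the helper names below are hypothetical, and the `shlex.quote` variant is an alternative shown for comparison, not what this PR does.

```python
import shlex


def export_unquoted(key: str, val: str) -> str:
    # Hypothetical helper: mimics the failing behaviour, where the value
    # is pasted into the command line with no quoting at all.
    return f"export {key}={val}"


def export_double_quoted(key: str, val: str) -> str:
    # Hypothetical helper: the approach described in this PR, wrapping
    # the value in a pair of double quotes.
    return f'export {key}="{val}"'


def export_shlex_quoted(key: str, val: str) -> str:
    # Alternative (NOT part of this PR): let the standard library pick
    # the quoting; shlex.quote single-quotes any value with unsafe chars.
    return f"export {key}={shlex.quote(val)}"


key, val = "SLURM_JOB_CPUS_PER_NODE", "64(x2)"
print(export_unquoted(key, val))       # bash rejects the bare "("
print(export_double_quoted(key, val))  # parsed as a plain string
print(export_shlex_quoted(key, val))   # also parsed as a plain string
```

Double quotes are enough for values like 64(x2). Inside double quotes, bash still expands $ and backticks, and a literal " in the value would end the quoting early, so a value containing those characters would need further escaping.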

TideDra, Aug 08 '24 05:08