DeepSpeed CPU Support
@ShadenSmith
Enables:
- Training/inference for most workloads on CPU systems
- DeepSpeed development on systems without GPUs (including personal machines) to save compute resources
Most code changes boil down to guards on torch.cuda.is_available(), especially in the tests.
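As a rough illustration of that guard pattern (a minimal sketch, not code taken from this PR), the check can be centralized so callers never touch CUDA directly. The helper names `select_device` and `select_backend` are hypothetical, and the choice of `gloo` as the CPU communication backend is an assumption:

```python
from typing import Optional

# Probe CUDA once; fall back gracefully if PyTorch is not installed.
try:
    import torch
    _CUDA_AVAILABLE = torch.cuda.is_available()
except ImportError:
    _CUDA_AVAILABLE = False


def select_device(cuda_available: Optional[bool] = None) -> str:
    """Return the device string to use ("cuda" or "cpu").

    Hypothetical helper: pass an explicit flag to override the probe,
    which is handy in tests.
    """
    if cuda_available is None:
        cuda_available = _CUDA_AVAILABLE
    return "cuda" if cuda_available else "cpu"


def select_backend(cuda_available: Optional[bool] = None) -> str:
    """Return a torch.distributed backend name for the chosen device.

    Assumption: NCCL on GPU systems, Gloo on CPU-only systems.
    """
    if cuda_available is None:
        cuda_available = _CUDA_AVAILABLE
    return "nccl" if cuda_available else "gloo"
```

A test that needs a GPU could then skip itself when `select_device()` returns "cpu" instead of failing on a CUDA call.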
The following features don't currently work on CPU (but should; I'm working on them):
- [ ] ZeRO stage 3
- [ ] aio
The following features don't work on CPU (and shouldn't):
- Fused kernels
- Autocasting
- Some GPU elastic functionality
- CPU optimizers (the kernels call CUDA stream synchronization)
- One-bit optimizers (they rely heavily on cupy)
- Coalesced collectives (PyTorch only supports reduce_scatter for the NCCL backend)
It seems there is still an issue with the CPU backend. I tried to run the cifar example on this branch and hit the following error:
deepspeed cifar10_deepspeed.py --deepspeed_config ds_config.json
[2022-09-07 03:26:52,142] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Traceback (most recent call last):
  File "/mnt/nfs-aicluster/liangan1/anaconda3/envs/neox/bin/deepspeed", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/nfs-aicluster/liangan1/gpt-neox/DeepSpeed/bin/deepspeed", line 6, in <module>
    main()
  File "/mnt/nfs-aicluster/liangan1/gpt-neox/DeepSpeed/deepspeed/launcher/runner.py", line 382, in main
    raise RuntimeError("Unable to proceed, no GPU resources available")
RuntimeError: Unable to proceed, no GPU resources available
This PR is still under development. cifar was failing because I hadn't yet pushed CPU support for DeepSpeed's runner; it's been added now if you want to try again. Note that the cifar example only works with my changes from https://github.com/Quentin-Anthony/DeepSpeedExamples/tree/cpu-support, and most other examples will probably fail unless similar edits are made.
Thanks for your quick reply. The cifar DeepSpeed sample works with 2 processes after a few small changes.
[1, 2000] loss: 1.692
[2022-09-08 14:25:57,361] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
[2022-09-08 14:25:57,362] [INFO] [timer.py:212:stop] 0/2000, RunningAvgSamplesPerSec=565.1165218730467, CurrSamplesPerSec=435.23204337477546, MemAllocated=0.0GB, MaxMemAllocated=0.0GB
[1, 2000] loss: 1.681
[2022-09-08 14:26:56,002] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
[2022-09-08 14:26:56,003] [INFO] [timer.py:212:stop] 0/4000, RunningAvgSamplesPerSec=555.6760088717376, CurrSamplesPerSec=624.6171258376769, MemAllocated=0.0GB, MaxMemAllocated=0.0GB
Closing this PR due to age and divergence from current develop branch.