DeepSpeed CPU Support

@ShadenSmith

Enables:

Training/inference for most workloads on CPU systems
DeepSpeed development on systems without GPUs (including personal machines) to save compute resources

Most code changes boil down to guards on torch.cuda.is_available(), especially in the tests.
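As an illustration of the kind of guard meant here, a minimal sketch follows. It is not code from this PR: the helper names `cuda_available` and `pick_device` and the test class are hypothetical, and the only real API assumed is `torch.cuda.is_available()`. The sketch also treats a missing torch install as "no GPU".

```python
import importlib.util
import unittest


def cuda_available() -> bool:
    """Hypothetical helper: True only if torch is importable and sees a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        # Assumption for this sketch: no torch installed means no GPU path.
        return False
    import torch
    return torch.cuda.is_available()


def pick_device() -> str:
    """Hypothetical helper: choose the device string for the current machine."""
    return "cuda" if cuda_available() else "cpu"


class GpuOnlyTest(unittest.TestCase):
    # GPU-only tests are skipped (not failed) on CPU-only machines.
    @unittest.skipUnless(cuda_available(), "requires a CUDA device")
    def test_runs_only_on_gpu(self):
        self.assertEqual(pick_device(), "cuda")
```

Skipping rather than failing keeps the test suite usable on machines without GPUs, which is the development scenario this PR targets.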

The following features don't currently work on CPU (but should; I'm working on it):

[ ] - ZeRO stage 3
[ ] - aio

The following features don't work on CPU (and shouldn't):

Fused kernels
Autocasting
Some GPU elastic functionality
CPU optimizers (the kernels call CUDA stream synchronization)
One-bit optimizers (they heavily rely on cupy)
Coalesced collectives (PyTorch only supports reduce_scatter for the NCCL backend)

Aug 20 '22 20:08 Quentin-Anthony

All CLA requirements met.

Aug 20 '22 20:08 ghost

It seems there is still an issue with the CPU backend. I tried to use this branch to run the cifar example and hit the following error:

deepspeed cifar10_deepspeed.py --deepspeed_config ds_config.json
[2022-09-07 03:26:52,142] [WARNING] [runner.py:178:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Traceback (most recent call last):
  File "/mnt/nfs-aicluster/liangan1/anaconda3/envs/neox/bin/deepspeed", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/mnt/nfs-aicluster/liangan1/gpt-neox/DeepSpeed/bin/deepspeed", line 6, in <module>
    main()
  File "/mnt/nfs-aicluster/liangan1/gpt-neox/DeepSpeed/deepspeed/launcher/runner.py", line 382, in main
    raise RuntimeError("Unable to proceed, no GPU resources available")
RuntimeError: Unable to proceed, no GPU resources available

Sep 07 '22 07:09 liangan1

This PR is still under development. The cifar example was failing because I hadn't yet pushed CPU support for DeepSpeed's runner; that has now been added if you want to try again. Note that the cifar example only works with my changes from https://github.com/Quentin-Anthony/DeepSpeedExamples/tree/cpu-support, and most other examples will probably fail unless similar edits are made.
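For readers following along, the runner behavior discussed here can be pictured roughly as below. This is a hedged sketch, not the actual change to runner.py: the function `count_local_slots` and its parameters are hypothetical, and only the error message is taken from the traceback in this thread.

```python
def count_local_slots(gpu_count: int, allow_cpu: bool = True) -> int:
    """Hypothetical sketch: how many local worker slots a launcher could use.

    gpu_count: number of visible GPUs (for example, from torch.cuda.device_count()).
    allow_cpu: whether falling back to a single CPU process is permitted.
    """
    if gpu_count > 0:
        # One slot per GPU, as in the GPU-only behavior.
        return gpu_count
    if allow_cpu:
        # Assumption for this sketch: fall back to one CPU process.
        return 1
    # Message copied from the traceback reported above.
    raise RuntimeError("Unable to proceed, no GPU resources available")
```

With `allow_cpu=False` the function reproduces the failure reported above; with the default it proceeds on CPU instead.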

Sep 07 '22 21:09 Quentin-Anthony

Thanks for your quick reply. The cifar DeepSpeed sample works with 2 processes after a few small changes.

[1,  2000] loss: 1.692
[2022-09-08 14:25:57,361] [INFO] [logging.py:68:log_dist] [Rank 0] step=2000, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
[2022-09-08 14:25:57,362] [INFO] [timer.py:212:stop] 0/2000, RunningAvgSamplesPerSec=565.1165218730467, CurrSamplesPerSec=435.23204337477546, MemAllocated=0.0GB, MaxMemAllocated=0.0GB
[1,  2000] loss: 1.681
[2022-09-08 14:26:56,002] [INFO] [logging.py:68:log_dist] [Rank 0] step=4000, skipped=0, lr=[0.001], mom=[[0.8, 0.999]]
[2022-09-08 14:26:56,003] [INFO] [timer.py:212:stop] 0/4000, RunningAvgSamplesPerSec=555.6760088717376, CurrSamplesPerSec=624.6171258376769, MemAllocated=0.0GB, MaxMemAllocated=0.0GB

Sep 08 '22 05:09 liangan1

Closing this PR due to age and divergence from current develop branch.

Aug 25 '23 23:08 jomayeri