
Does DeepSpeed support pure bf16 training?

Open hjc3613 opened this issue 1 year ago • 9 comments

When training a 70B+ model, the GPU memory cost is very large. As mentioned in deepspeed-readthedocs-io-en-stable.pdf, the total memory is about 18xN bytes, where N is the number of parameters; with offload, the CPU memory requirement can exceed 2 TB. The main factor is the 32-bit states (Adam optimizer states + gradients + master copy of the model params).

I would like DeepSpeed to support pure bf16 training. With pure bf16, meaning all states (model params + gradients + optimizer states) are kept only in bf16, the memory cost may drop to about 8xN bytes, and in most cases bf16 is accurate enough. I have tested this with llama-recipes (https://github.com/meta-llama/llama-recipes): the result of pure bf16 training is very similar to DeepSpeed mixed precision, and llama-recipes with pure bf16 can train a 70B model on one node (8x80G A800) by freezing half of the layers. A rough byte-accounting sketch is below.
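For reference, a rough sketch of the byte accounting behind the 18xN and 8xN figures (approximate; activation memory, buffers, and fragmentation are not counted, so treat these as lower bounds):

```python
# Rough per-parameter byte accounting for Adam-based training.

def bytes_per_param_mixed_precision():
    # bf16 params + bf16 grads + fp32 master params + fp32 momentum + fp32 variance
    return 2 + 2 + 4 + 4 + 4   # = 16 bytes; ~18xN once buffers/overhead are included

def bytes_per_param_pure_bf16():
    # bf16 params + bf16 grads + bf16 momentum + bf16 variance
    return 2 + 2 + 2 + 2       # = 8 bytes

n = 70e9  # 70B parameters
print(f"mixed precision model states: ~{n * bytes_per_param_mixed_precision() / 1e9:.0f} GB")
print(f"pure bf16 model states:       ~{n * bytes_per_param_pure_bf16() / 1e9:.0f} GB")
```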

Using mmap to offload optimizer states to disk: is it possible to offload the gradients and optimizer states to disk and then load them back quickly with mmap? A conceptual sketch follows.
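This is not an existing DeepSpeed feature, just a conceptual sketch of the mmap idea: keep the Adam moments in memory-mapped files on disk and let the OS page cache decide what stays resident. bf16 is approximated with float16 because numpy has no native bfloat16 dtype; the file names and parameter count are hypothetical.

```python
# Conceptual sketch only: Adam moments backed by memory-mapped files on disk.
import numpy as np

n_params = 1_000_000  # hypothetical size, for illustration

exp_avg = np.memmap("exp_avg.bin", dtype=np.float16, mode="w+", shape=(n_params,))
exp_avg_sq = np.memmap("exp_avg_sq.bin", dtype=np.float16, mode="w+", shape=(n_params,))

def adam_step(param, grad, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # Read-modify-write against the mmapped buffers; bias correction omitted for brevity.
    exp_avg[:] = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq[:] = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    param -= lr * exp_avg / (np.sqrt(exp_avg_sq) + eps)
    return param
```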


hjc3613 avatar Jul 21 '24 12:07 hjc3613

Hi @hjc3613, you can offload to NVMe instead of CPU memory; please check out NVMe offload. A config sketch is below.
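A minimal ZeRO-3 NVMe offload config sketch (the nvme_path is a placeholder for a local NVMe mount, and the buffer/aio values should be tuned for your hardware):

```python
# Sketch: ZeRO-3 config dict with optimizer and parameter states offloaded to NVMe.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
            "pin_memory": True,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
            "pin_memory": True,
        },
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
    },
    "train_micro_batch_size_per_gpu": 1,
}
```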

mklf avatar Aug 09 '24 06:08 mklf

> Hi @hjc3613, you can offload to NVMe instead of CPU memory; please check out NVMe offload.

Thanks for your reply. I have tested NVMe offload, but it failed; related issue: https://github.com/microsoft/DeepSpeed/issues/4888. I am confident that my configs are correct and that the NVMe disk is installed properly.

hjc3613 avatar Aug 09 '24 06:08 hjc3613

Pure bf16 is better than NVMe offload, because it can keep all params, gradients, and Adam optimizer states in GPU memory and so runs faster than any offload method, while consuming only about 1/3 of the GPU memory of DeepSpeed's mixed precision training strategy.

hjc3613 avatar Aug 09 '24 06:08 hjc3613

You can achieve that by setting fp32_optimizer_states=False when initializing DeepSpeedCPUAdam; this param was added to DeepSpeed in version 0.14.3.

Note: if you are using the transformers Trainer, it creates the optimizer in its internal implementation, so to my knowledge you can't set this optimizer param without a hack.
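A minimal sketch of what this could look like when you construct the optimizer yourself (here "model" is assumed to be your already-built module, and CPU optimizer offload is assumed to be enabled in the ZeRO config):

```python
# Sketch: keep the Adam moments in low precision instead of fp32
# (requires DeepSpeed >= 0.14.3 and DeepSpeedCPUAdam, i.e. CPU offload).
from deepspeed.ops.adam import DeepSpeedCPUAdam

optimizer = DeepSpeedCPUAdam(
    model.parameters(),           # hypothetical `model`; replace with your module
    lr=1e-5,
    weight_decay=0.0,
    fp32_optimizer_states=False,  # keep optimizer states in low precision instead of fp32
)
```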

mklf avatar Aug 09 '24 08:08 mklf

Thanks a lot, I will give it a try.

hjc3613 avatar Aug 09 '24 09:08 hjc3613

> You can achieve that by setting fp32_optimizer_states=False when initializing DeepSpeedCPUAdam; this param was added to DeepSpeed in version 0.14.3.
>
> Note: if you are using the transformers Trainer, it creates the optimizer in its internal implementation, so to my knowledge you can't set this optimizer param without a hack.

Hi, I tested this method, but it gives me an error: "fp32_optimizer_states extra fields not permitted", which suggests DeepSpeed does not support this config. My DeepSpeed version is 0.14.4.

hjc3613 avatar Aug 09 '24 14:08 hjc3613

Make sure you are using DeepSpeedCPUAdam; you can find the signature here: DeepSpeedCPUAdam. A sketch of wiring it into deepspeed.initialize is below.
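To clarify the wiring: the flag belongs to the optimizer constructor, not to the ds_config JSON (declaring it in the config is what produces the "extra fields not permitted" error). A hedged sketch, assuming "model" and "ds_config" already exist:

```python
# Sketch: build DeepSpeedCPUAdam yourself and hand it to deepspeed.initialize,
# instead of declaring an "optimizer" block in the ds_config.
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-5, fp32_optimizer_states=False)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,          # assumed to exist
    optimizer=optimizer,  # external optimizer; DeepSpeed will not create its own
    config=ds_config,     # ZeRO config with CPU (or NVMe) optimizer offload enabled
)
```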

mklf avatar Aug 09 '24 23:08 mklf

Thank you, I used it in the wrong place.

hjc3613 avatar Aug 09 '24 23:08 hjc3613

> Make sure you are using DeepSpeedCPUAdam; you can find the signature here: DeepSpeedCPUAdam.

I am running it successfully now, but the only drawback is that this param is only supported by DeepSpeedCPUAdam, which already implies offload mode. When offloading, bf16 optimizer states are not really needed; fp32 is more appropriate since CPU memory is sufficient. Can non-offload mode support pure bf16? I just want to save GPU memory with all parameters staying on the GPU.

hjc3613 avatar Aug 11 '24 11:08 hjc3613