
Does DeepSpeed support pure bf16 training?

Open hjc3613 opened this issue 1 year ago • 9 comments

When training a 70B+ model, the GPU memory cost is very large. As mentioned in deepspeed-readthedocs-io-en-stable.pdf, the total memory is about 18xN bytes, where N is the number of parameters; with offload, the CPU memory requirement can exceed 2 TB. The main factor is the 32-bit states (Adam optimizer states + gradients + master copy of the model params).

I would like DeepSpeed to support pure bf16 training. With pure bf16, meaning all states (model params + gradients + optimizer states) are kept only in bf16, the memory cost may drop to about 8xN bytes, and in most cases bf16 is accurate enough. I have tested this with llama-recipes (https://github.com/meta-llama/llama-recipes): the result of pure bf16 training is very similar to DeepSpeed mixed precision, and llama-recipes with pure bf16 can train a 70B model on one node (8x80G A800) by freezing half of the layers. A rough byte-accounting sketch is below.
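For reference, a rough sketch of the byte accounting behind the 18xN and 8xN figures (approximate; activation memory, buffers, and fragmentation are not counted, so treat these as lower bounds):

```python
# Rough per-parameter byte accounting for Adam-based training.

def bytes_per_param_mixed_precision():
    # bf16 params + bf16 grads + fp32 master params + fp32 momentum + fp32 variance
    return 2 + 2 + 4 + 4 + 4   # = 16 bytes; ~18xN once buffers/overhead are included

def bytes_per_param_pure_bf16():
    # bf16 params + bf16 grads + bf16 momentum + bf16 variance
    return 2 + 2 + 2 + 2       # = 8 bytes

n = 70e9  # 70B parameters
print(f"mixed precision model states: ~{n * bytes_per_param_mixed_precision() / 1e9:.0f} GB")
print(f"pure bf16 model states:       ~{n * bytes_per_param_pure_bf16() / 1e9:.0f} GB")
```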

Using mmap to offload optimizer states to disk: is it possible to offload the gradients and optimizer states to disk and then load them back quickly with mmap? A conceptual sketch follows.
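This is not an existing DeepSpeed feature, just a conceptual sketch of the mmap idea: keep the Adam moments in memory-mapped files on disk and let the OS page cache decide what stays resident. bf16 is approximated with float16 because numpy has no native bfloat16 dtype; the file names and parameter count are hypothetical.

```python
# Conceptual sketch only: Adam moments backed by memory-mapped files on disk.
import numpy as np

n_params = 1_000_000  # hypothetical size, for illustration

exp_avg = np.memmap("exp_avg.bin", dtype=np.float16, mode="w+", shape=(n_params,))
exp_avg_sq = np.memmap("exp_avg_sq.bin", dtype=np.float16, mode="w+", shape=(n_params,))

def adam_step(param, grad, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    # Read-modify-write against the mmapped buffers; bias correction omitted for brevity.
    exp_avg[:] = beta1 * exp_avg + (1 - beta1) * grad
    exp_avg_sq[:] = beta2 * exp_avg_sq + (1 - beta2) * grad * grad
    param -= lr * exp_avg / (np.sqrt(exp_avg_sq) + eps)
    return param
```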


hjc3613 avatar Jul 21 '24 12:07 hjc3613

Hi @hjc3613, you can offload to NVMe instead of CPU memory; please check out NVMe offload. A config sketch is below.
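A minimal ZeRO-3 NVMe offload config sketch (the nvme_path is a placeholder for a local NVMe mount, and the buffer/aio values should be tuned for your hardware):

```python
# Sketch: ZeRO-3 config dict with optimizer and parameter states offloaded to NVMe.
ds_config = {
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
            "pin_memory": True,
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",  # placeholder path
            "pin_memory": True,
        },
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
    },
    "train_micro_batch_size_per_gpu": 1,
}
```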

mklf avatar Aug 09 '24 06:08 mklf

> Hi @hjc3613, you can offload to NVMe instead of CPU memory; please check out NVMe offload.

Thanks for your reply. I have tested NVMe offload, but it failed; related issue: https://github.com/microsoft/DeepSpeed/issues/4888. I am confident that my configs are correct and that the NVMe disk is installed properly.

hjc3613 avatar Aug 09 '24 06:08 hjc3613

Pure bf16 is better than NVMe offload, because it can keep all params, gradients, and Adam optimizer states in GPU memory and so runs faster than any offload method, while consuming only about 1/3 of the GPU memory of DeepSpeed's mixed precision training strategy.

hjc3613 avatar Aug 09 '24 06:08 hjc3613

You can achieve that by setting fp32_optimizer_states=False when initializing DeepSpeedCPUAdam; this param was added to DeepSpeed in version 0.14.3.

Note: if you are using the transformers Trainer, it creates the optimizer in its internal implementation, so to my knowledge you can't set this optimizer param without a hack.
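A minimal sketch of what this could look like when you construct the optimizer yourself (here "model" is assumed to be your already-built module, and CPU optimizer offload is assumed to be enabled in the ZeRO config):

```python
# Sketch: keep the Adam moments in low precision instead of fp32
# (requires DeepSpeed >= 0.14.3 and DeepSpeedCPUAdam, i.e. CPU offload).
from deepspeed.ops.adam import DeepSpeedCPUAdam

optimizer = DeepSpeedCPUAdam(
    model.parameters(),           # hypothetical `model`; replace with your module
    lr=1e-5,
    weight_decay=0.0,
    fp32_optimizer_states=False,  # keep optimizer states in low precision instead of fp32
)
```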

mklf avatar Aug 09 '24 08:08 mklf

Thanks a lot, I will give it a try.

hjc3613 avatar Aug 09 '24 09:08 hjc3613

> You can achieve that by setting fp32_optimizer_states=False when initializing DeepSpeedCPUAdam; this param was added to DeepSpeed in version 0.14.3.
>
> Note: if you are using the transformers Trainer, it creates the optimizer in its internal implementation, so to my knowledge you can't set this optimizer param without a hack.

Hi, I tested this method, but it gives me an error: "fp32_optimizer_states extra fields not permitted", which suggests DeepSpeed does not support this config. My DeepSpeed version is 0.14.4.

hjc3613 avatar Aug 09 '24 14:08 hjc3613

Make sure you are using DeepSpeedCPUAdam; you can find the signature here: DeepSpeedCPUAdam. A sketch of wiring it into deepspeed.initialize is below.
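To clarify the wiring: the flag belongs to the optimizer constructor, not to the ds_config JSON (declaring it in the config is what produces the "extra fields not permitted" error). A hedged sketch, assuming "model" and "ds_config" already exist:

```python
# Sketch: build DeepSpeedCPUAdam yourself and hand it to deepspeed.initialize,
# instead of declaring an "optimizer" block in the ds_config.
import deepspeed
from deepspeed.ops.adam import DeepSpeedCPUAdam

optimizer = DeepSpeedCPUAdam(model.parameters(), lr=1e-5, fp32_optimizer_states=False)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,          # assumed to exist
    optimizer=optimizer,  # external optimizer; DeepSpeed will not create its own
    config=ds_config,     # ZeRO config with CPU (or NVMe) optimizer offload enabled
)
```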

mklf avatar Aug 09 '24 23:08 mklf

Thank you, I used it in the wrong place.

hjc3613 avatar Aug 09 '24 23:08 hjc3613

> Make sure you are using DeepSpeedCPUAdam; you can find the signature here: DeepSpeedCPUAdam.

I am running it successfully now, but the only drawback is that this param is only supported by DeepSpeedCPUAdam, which already implies offload mode. When offloading, bf16 optimizer states are not really needed; fp32 is more appropriate since CPU memory is sufficient. Can non-offload mode support pure bf16? I just want to save GPU memory with all parameters staying on the GPU.

hjc3613 avatar Aug 11 '24 11:08 hjc3613