[BUG] DeepSpeed loads the whole codegen model into GPU
I am trying 4-way tensor-parallel sharding for "Salesforce/codegen-16B-mono" on 4 A10 GPUs (24 GiB each). The torch dtype is torch.half.
My math says (please double-check): 16B parameters at half precision is roughly 30 GiB, so with correct sharding each GPU should only hold about 7.5 GiB of weights, well within 24 GiB. Instead I got a GPU OOM.
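For reference, here is the back-of-the-envelope calculation (weights only; activations, KV cache, and loading buffers are not counted):

```python
# Rough per-GPU weight memory for tensor-parallel sharding.
# Assumes fp16 (2 bytes/param) and counts weights only.
params = 16e9          # Salesforce/codegen-16B-mono, ~16B parameters
bytes_per_param = 2    # torch.half
world_size = 4         # 4x A10 (24 GiB each)

total_gib = params * bytes_per_param / 2**30
per_gpu_gib = total_gib / world_size
print(f"total weights: {total_gib:.1f} GiB, per GPU: {per_gpu_gib:.1f} GiB")
# total weights: ~29.8 GiB, per GPU: ~7.5 GiB -> should fit in 24 GiB
```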
I found this and tried the injection policy from it. That made the model loadable, but I then hit a reshape error. It seems DeepSpeed needs a special config for the codegen model to make it both loadable and runnable.
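Roughly what I tried looks like the sketch below (not my exact script; I am assuming the block class `CodeGenBlock` and the output-projection names `attn.out_proj` / `mlp.fc_out` from the Hugging Face CodeGen implementation):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM
from transformers.models.codegen.modeling_codegen import CodeGenBlock

model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-16B-mono", torch_dtype=torch.half
)

# Manual injection policy: tell DeepSpeed which linear layers' outputs
# need an all-reduce when the block is sharded across GPUs.
# CodeGenBlock / attn.out_proj / mlp.fc_out are assumptions based on the
# Hugging Face CodeGen implementation.
model = deepspeed.init_inference(
    model,
    mp_size=4,                      # tensor-parallel degree
    dtype=torch.half,
    injection_policy={CodeGenBlock: ("attn.out_proj", "mlp.fc_out")},
)
```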
Any suggestions? cc @RezaYazdaniAminabadi
Hi @xiejw, codegen is not currently supported because it has a fused qkv, and you're right that we need a special case for it.
Thanks @molly-smith. Do you have any suggestions on how to make it work first, and then make it fast? I think I am OK with a non-fused qkv given that 16B is quite large, so the GPUs will be busy for a while anyway (I could be wrong, though).
That would be much appreciated.
Hi @xiejw,
Can you try this PR, using the kernels as well as mp>1, and see if it works for you? Thanks, Reza
Hi @RezaYazdaniAminabadi
I tried to patch your code manually, but it is not clear to me how to test it.
With deepspeed 0.8.1, new changes throw errors like
assert AutoTP.supported(model), "Automatic policy not supported for model. Please provide policy."
both with and without your changes (manually patched). Passing --use_kernel gives the same error.
deepspeed 0.8.0 is the version I originally used, but its folder structure is different; for example, there is no deepspeed/module_inject/containers folder.
How can I test your PR? Thanks
Hi @xiejw,
Thanks for trying this out. Let me try it on my side again and see if I can repro the same issue. Thanks, Reza
@xiejw, are you trying this the same way I described in the PR?
@xiejw, can you please try this again, passing --replace_method 'auto' when running inference-test.py?
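On the API side, that roughly corresponds to something like the sketch below (not the exact contents of inference-test.py; the model name and mp_size are just your setup from above):

```python
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Salesforce/codegen-16B-mono", torch_dtype=torch.half
)

# replace_method='auto' lets DeepSpeed pick the replacement policy itself,
# and replace_with_kernel_inject=True enables the fused inference kernels
# (what --use_kernel toggles in the example script).
model = deepspeed.init_inference(
    model,
    mp_size=4,                       # launch with `deepspeed --num_gpus 4 ...`
    dtype=torch.half,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
```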