[BUG] Model running OOM after calling deepspeed.init_inference
Hi all,
I am trying to replicate the DeepSpeed Inference basic tutorial with the following script on a VM with 1x 40GB A100:
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load GPT-NeoX-20B in fp16; device_map="auto" spreads it over the available GPU.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,
    # load_in_8bit=True,
    device_map="auto",
)

# Wrap the model for DeepSpeed Inference with kernel injection.
ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
The model loads fine and consumes ~39 GB out of ~41 GB, but after calling deepspeed.init_inference I get a CUDA OOM error. Is there something I am missing? Does DeepSpeed need some extra overhead, meaning my VRAM is simply too low? How much additional VRAM do I need to load this model for DeepSpeed inference?
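To make the overhead question more concrete, this is roughly how I am checking VRAM around the two steps (a minimal sketch using only the standard torch.cuda counters; the report_vram helper name and the call sites are just for illustration):

import torch

def report_vram(tag: str) -> None:
    # Print currently allocated / reserved CUDA memory in GiB.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{tag}: allocated={allocated:.1f} GiB, reserved={reserved:.1f} GiB")

# Called right after from_pretrained() (shows roughly the 39 GB mentioned above)
# and intended to be called again after deepspeed.init_inference(), which is
# where the OOM happens.

So the question is really: how much headroom on top of the fp16 weights does the kernel injection / init_inference step need?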
Thanks a lot!