[BUG] Model running OOM after calling deepspeed.init_inference
Hi all,
I am trying to replicate the DeepSpeed Inference basic tutorial with the following script on a VM with 1x 40GB A100:
import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load GPT-NeoX-20B in fp16; device_map="auto" spreads it over the available GPU.
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,
    # load_in_8bit=True,
    device_map="auto",
)

# Wrap the model for DeepSpeed Inference with kernel injection.
ds_model = deepspeed.init_inference(
    model=model,
    mp_size=1,
    dtype=torch.float16,
    replace_method="auto",
    replace_with_kernel_inject=True,
)
The model loads fine and consumes ~39 GB out of ~41 GB, but after calling deepspeed.init_inference I get a CUDA OOM error. Is there something I am missing? Does DeepSpeed need some extra overhead, meaning my VRAM is simply too low? How much additional VRAM do I need to load this model for DeepSpeed inference?
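To make the overhead question more concrete, this is roughly how I am checking VRAM around the two steps (a minimal sketch using only the standard torch.cuda counters; the report_vram helper name and the call sites are just for illustration):

import torch

def report_vram(tag: str) -> None:
    # Print currently allocated / reserved CUDA memory in GiB.
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{tag}: allocated={allocated:.1f} GiB, reserved={reserved:.1f} GiB")

# Called right after from_pretrained() (shows roughly the 39 GB mentioned above)
# and intended to be called again after deepspeed.init_inference(), which is
# where the OOM happens.

So the question is really: how much headroom on top of the fp16 weights does the kernel injection / init_inference step need?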
Thanks a lot!