
[BUG] Inference fails with "mat1 and mat2 shapes cannot be multiplied" for Llama model.

Open sgsdxzy opened this issue 2 years ago • 10 comments

Describe the bug Inference fails with RuntimeError: mat1 and mat2 shapes cannot be multiplied (15x4096 and 2048x11008) when trying to make Llama work on 2 GPUs.

To Reproduce Steps to reproduce the behavior:

  1. This is a working example to run inference with 1 GPU:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/llama-7b")
model = AutoModelForCausalLM.from_pretrained("models/llama-7b", torch_dtype=torch.half, device_map="auto")
model.cuda()

batch = tokenizer(
    "The primary use of LLaMA is research on large language models, including",
    return_tensors="pt", 
    add_special_tokens=False
)
batch = {k: v.cuda() for k, v in batch.items()}
generated = model.generate(batch["input_ids"], max_length=100)
print(tokenizer.decode(generated[0]))

And the results:

CUDA_VISIBLE_DEVICES=1 python ds_test.py 
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:06<00:00,  4.80it/s]
 The primary use of LLaMA is research on large language models, including the BERT model.

\subsection{Learning Language Models}

LLaMA is a tool for training large language models. It is designed to be used with the BERT model, but it can also be used with other large language models.

LLaMA is a tool for training large language models. It is designed to be used with the BERT model, but it can also
  2. Change the above script to use DeepSpeed inference with mp=2:
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

tokenizer = AutoTokenizer.from_pretrained("models/llama-7b")
model = AutoModelForCausalLM.from_pretrained("models/llama-7b")
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.up_proj')}
)

batch = tokenizer(
    "The primary use of LLaMA is research on large language models, including",
    return_tensors="pt", 
    add_special_tokens=False
)
batch = {k: v.cuda() for k, v in batch.items()}
generated = model.generate(batch["input_ids"], max_length=100)
print(tokenizer.decode(generated[0]))
  3. Run the script:
CUDA_VISIBLE_DEVICES=1,2 deepspeed ds_test2.py 
[2023-03-26 10:21:41,179] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=1,2: setting --include=localhost:1,2
[2023-03-26 10:21:41,204] [INFO] [runner.py:550:main] cmd = /home/sgsdxzy/mambaforge/envs/textgen/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ds_test2.py
[2023-03-26 10:21:42,533] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [1, 2]}
[2023-03-26 10:21:42,533] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-03-26 10:21:42,534] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-03-26 10:21:42,534] [INFO] [launch.py:162:main] dist_world_size=2
[2023-03-26 10:21:42,534] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=1,2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:08<00:00,  3.88it/s]
[2023-03-26 10:22:30,419] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-03-26 10:22:30,420] [WARNING] [config_utils.py:75:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-26 10:22:30,420] [INFO] [logging.py:93:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:08<00:00,  3.85it/s]
[2023-03-26 10:22:30,531] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-03-26 10:22:30,532] [WARNING] [config_utils.py:75:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-26 10:22:30,533] [INFO] [logging.py:93:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-03-26 10:22:33,059] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Both ranks fail with the same traceback:

Traceback (most recent call last):
  File "/home/sgsdxzy/Programs/text-generation-webui/ds_test2.py", line 21, in <module>
    generated = model.generate(batch["input_ids"], max_length=100)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 588, in _generate
    return self.module.generate(*inputs, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1416, in generate
    return self.greedy_search(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2211, in greedy_search
    outputs = self(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 151, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/layers.py", line 20, in forward
    output = torch.matmul(input, self.weight.transpose(-1, -2))
RuntimeError: mat1 and mat2 shapes cannot be multiplied (15x4096 and 2048x11008)

[2023-03-26 10:22:39,595] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3269
[2023-03-26 10:22:39,613] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3270
[2023-03-26 10:22:39,613] [ERROR] [launch.py:324:sigkill_handler] ['/home/sgsdxzy/mambaforge/envs/textgen/bin/python3.10', '-u', 'ds_test2.py', '--local_rank=1'] exits with return code = 1

Expected behavior

  1. Inference should work with mp=2 and give similar results to the 1-GPU run.
  2. I have an extra question: how should I load the model directly onto the GPUs, preferably directly in 8-bit? The above script loads the model into system RAM first in each process, which loads the weights twice; that is inefficient and does not work on low-RAM systems. from_pretrained supports device_map to load directly onto the corresponding GPU and load_in_8bit to load directly in 8-bit, but how should I supply these for DeepSpeed inference to work? (See the sketch below.)
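
For reference, this is the plain transformers behaviour being referred to; a minimal sketch of the existing from_pretrained options (load_in_8bit needs the bitsandbytes package), not a DeepSpeed answer:

from transformers import AutoModelForCausalLM

# Plain transformers can place and quantize weights at load time; the open
# question above is how to combine this with deepspeed.init_inference.
model = AutoModelForCausalLM.from_pretrained(
    "models/llama-7b",
    device_map="auto",   # load weights directly onto the available GPUs
    load_in_8bit=True,   # requires bitsandbytes
)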

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio package with pacman
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0
deepspeed install path ........... ['/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.3, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8

System info (please complete the following information):

  • OS: Archlinux
  • 2080Ti 22G x4 on single node
  • Using git+https://github.com/huggingface/transformers
  • Python version: 3.10.9

Additional context The LLaMA model has a structure like this:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 5120, padding_idx=0)
    (layers): ModuleList(
      (0-39): 40 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (k_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (v_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (o_proj): Linear(in_features=5120, out_features=5120, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=5120, out_features=13824, bias=False)
          (down_proj): Linear(in_features=13824, out_features=5120, bias=False)
          (up_proj): Linear(in_features=5120, out_features=13824, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=5120, out_features=32000, bias=False)
)

So is injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.up_proj')} correct?

sgsdxzy avatar Mar 26 '23 06:03 sgsdxzy

@sgsdxzy Can you share models/llama-7b with me, or any similar model from HF, so that I can reproduce?

satpalsr avatar Mar 30 '23 10:03 satpalsr

@satpalsr I think you can use https://huggingface.co/decapoda-research/llama-7b-hf. If you are using the latest transformers, you may need to change LLaMATokenizer to LlamaTokenizer in tokenizer_config.json, and LLaMAForCausalLM to LlamaForCausalLM in config.json.
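
A hypothetical helper for those renames (the field names tokenizer_class and architectures are assumed from a standard HF checkout; adjust the path to your local copy):

import json

for fname, key, new in [
    ("tokenizer_config.json", "tokenizer_class", "LlamaTokenizer"),
    ("config.json", "architectures", ["LlamaForCausalLM"]),
]:
    path = f"models/llama-7b/{fname}"   # hypothetical local path
    with open(path) as f:
        cfg = json.load(f)
    cfg[key] = new                      # replace the old LLaMA* names
    with open(path, "w") as f:
        json.dump(cfg, f, indent=2)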

sgsdxzy avatar Mar 30 '23 12:03 sgsdxzy

@sgsdxzy Looks like it should be 'mlp.down_proj' instead of 'mlp.up_proj': injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.down_proj')}
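
With that change, the init call from the reproduction script above would look like this (a minimal sketch of the corrected call; the tuple names the final output projections of the attention and MLP blocks, which are the layers DeepSpeed all-reduces across ranks):

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

tokenizer = AutoTokenizer.from_pretrained("models/llama-7b")
model = AutoModelForCausalLM.from_pretrained("models/llama-7b")
# o_proj / down_proj are the last linear layers of attention and MLP;
# their outputs are all-reduced across the two tensor-parallel ranks.
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.down_proj')}
)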

satpalsr avatar Mar 30 '23 14:03 satpalsr

@satpalsr Thanks, TP inference works now! So my next question: is it possible to split-load this checkpoint using the meta device? I find that if I use AutoModelForCausalLM.from_pretrained to load the model, it is replicated on each GPU rather than sharded, so each GPU must have enough VRAM to hold the entire model. I tried something like

config = AutoConfig.from_pretrained(model_name)
with deepspeed.OnDevice(dtype=torch.half, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.half)
model = model.eval()

model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    checkpoint="checkpoints.json",
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.down_proj')}
)

and the contents of checkpoints.json:

{"type": "ds_model", 
"checkpoints": ["path_to_pytorch_model-00000-of-00033.bin", 
.......,
"path_to_pytorch_model-00033-of-00033.bin"], 
"version": 1.0}

However it fails with

Traceback (most recent call last):
  File "/home/sgsdxzy/Programs/text-generation-webui/ds_test.py", line 20, in <module>
    model = deepspeed.init_inference(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 96, in __init__
    self._load_checkpoint(config.checkpoint)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 426, in _load_checkpoint
    load_path, checkpoint, quantize_config = sd_loader.load(self._config.tensor_parallel.tp_size,
AttributeError: 'dict' object has no attribute 'load'

sgsdxzy avatar Mar 30 '23 15:03 sgsdxzy

> @satpalsr Thanks, TP inference works now! So my next question: is it possible to split-load this checkpoint using the meta device? [...] However it fails with [...] AttributeError: 'dict' object has no attribute 'load'

I also met this problem, did you solve it?

cgd-bot avatar Mar 31 '23 04:03 cgd-bot

same problem.

shizhediao avatar Mar 31 '23 15:03 shizhediao

same :(

alextsao1999 avatar Apr 07 '23 02:04 alextsao1999

Just checking back to see if any of you solved it. Meta tensors do not seem to be materialised properly in DeepSpeed here.

satpalsr avatar Apr 11 '23 12:04 satpalsr

Same problem: how do you split-load the checkpoint using the meta device when using deepspeed.init_inference?

mynewstart avatar Jun 01 '23 10:06 mynewstart

Same!

chintan-ushur avatar Jun 09 '23 16:06 chintan-ushur