[BUG] Inference fails with "mat1 and mat2 shapes cannot be multiplied" for Llama model.
Describe the bug
Inference fails with RuntimeError: mat1 and mat2 shapes cannot be multiplied (15x4096 and 2048x11008) when trying to run Llama on 2 GPUs.
To Reproduce
Steps to reproduce the behavior:
- This is a working example that runs inference on 1 GPU:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("models/llama-7b")
model = AutoModelForCausalLM.from_pretrained("models/llama-7b", torch_dtype=torch.half, device_map="auto")
model.cuda()
batch = tokenizer(
    "The primary use of LLaMA is research on large language models, including",
    return_tensors="pt",
    add_special_tokens=False
)
batch = {k: v.cuda() for k, v in batch.items()}
generated = model.generate(batch["input_ids"], max_length=100)
print(tokenizer.decode(generated[0]))
And the results:
CUDA_VISIBLE_DEVICES=1 python ds_test.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:06<00:00, 4.80it/s]
The primary use of LLaMA is research on large language models, including the BERT model.
\subsection{Learning Language Models}
LLaMA is a tool for training large language models. It is designed to be used with the BERT model, but it can also be used with other large language models.
LLaMA is a tool for training large language models. It is designed to be used with the BERT model, but it can also
- Change the above script to use deepspeed inference with mp=2:
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.models.llama.modeling_llama import LlamaDecoderLayer
tokenizer = AutoTokenizer.from_pretrained("models/llama-7b")
model = AutoModelForCausalLM.from_pretrained("models/llama-7b")
model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.up_proj')}
)
batch = tokenizer(
    "The primary use of LLaMA is research on large language models, including",
    return_tensors="pt",
    add_special_tokens=False
)
batch = {k: v.cuda() for k, v in batch.items()}
generated = model.generate(batch["input_ids"], max_length=100)
print(tokenizer.decode(generated[0]))
- Run the script
CUDA_VISIBLE_DEVICES=1,2 deepspeed ds_test2.py
[2023-03-26 10:21:41,179] [WARNING] [runner.py:186:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected CUDA_VISIBLE_DEVICES=1,2: setting --include=localhost:1,2
[2023-03-26 10:21:41,204] [INFO] [runner.py:550:main] cmd = /home/sgsdxzy/mambaforge/envs/textgen/bin/python3.10 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMSwgMl19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None ds_test2.py
[2023-03-26 10:21:42,533] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [1, 2]}
[2023-03-26 10:21:42,533] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-03-26 10:21:42,534] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-03-26 10:21:42,534] [INFO] [launch.py:162:main] dist_world_size=2
[2023-03-26 10:21:42,534] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=1,2
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:08<00:00, 3.88it/s]
[2023-03-26 10:22:30,419] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-03-26 10:22:30,420] [WARNING] [config_utils.py:75:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-26 10:22:30,420] [INFO] [logging.py:93:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 33/33 [00:08<00:00, 3.85it/s]
[2023-03-26 10:22:30,531] [INFO] [logging.py:93:log_dist] [Rank -1] DeepSpeed info: version=0.8.3, git-hash=unknown, git-branch=unknown
[2023-03-26 10:22:30,532] [WARNING] [config_utils.py:75:_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-03-26 10:22:30,533] [INFO] [logging.py:93:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
[2023-03-26 10:22:33,059] [INFO] [comm.py:652:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/home/sgsdxzy/Programs/text-generation-webui/ds_test2.py", line 21, in <module>
    generated = model.generate(batch["input_ids"], max_length=100)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 588, in _generate
    return self.module.generate(*inputs, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1416, in generate
    return self.greedy_search(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2211, in greedy_search
    outputs = self(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    hidden_states = self.mlp(hidden_states)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 151, in forward
    return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/module_inject/layers.py", line 20, in forward
    output = torch.matmul(input, self.weight.transpose(-1, -2))
RuntimeError: mat1 and mat2 shapes cannot be multiplied (15x4096 and 2048x11008)
[2023-03-26 10:22:39,595] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3269
[2023-03-26 10:22:39,613] [INFO] [launch.py:318:sigkill_handler] Killing subprocess 3270
[2023-03-26 10:22:39,613] [ERROR] [launch.py:324:sigkill_handler] ['/home/sgsdxzy/mambaforge/envs/textgen/bin/python3.10', '-u', 'ds_test2.py', '--local_rank=1'] exits with return code = 1
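For what it's worth, the shapes in the error line up with the up_proj weight having been sharded along its input dimension (4096 / 2 = 2048 for mp_size=2) while the incoming hidden states are still full-width. A minimal sketch that reproduces just the shape mismatch, with plain torch and no DeepSpeed (my interpretation, not a confirmed diagnosis):

import torch

hidden = torch.randn(15, 4096)      # [seq_len, hidden_size] activations, still full-width
w_shard = torch.randn(11008, 2048)  # up_proj weight with its input dim split 4096 -> 2048

# mirrors deepspeed/module_inject/layers.py line 20: torch.matmul(input, self.weight.transpose(-1, -2))
out = torch.matmul(hidden, w_shard.transpose(-1, -2))
# RuntimeError: mat1 and mat2 shapes cannot be multiplied (15x4096 and 2048x11008)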
Expected behavior
- The inference should work with mp=2 and give results similar to the 1-GPU run.
- I have an extra question: how should I load the model directly onto the GPUs, preferably directly in 8-bit? The above script first loads the weights into system RAM in each process, so they are loaded twice; this is inefficient and does not work on low-RAM systems.
from_pretrained supports device_map to load directly onto the corresponding GPU and load_in_8bit to load directly in 8-bit (sketched below), but how should I supply these for DeepSpeed inference to work?
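To illustrate the kind of loading I mean (plain transformers, no DeepSpeed; load_in_8bit needs bitsandbytes installed, and whether this composes with init_inference is exactly my question):

from transformers import AutoModelForCausalLM

# single-process loading sketch: weights go straight to the GPUs, optionally in 8-bit
model = AutoModelForCausalLM.from_pretrained(
    "models/llama-7b",
    device_map="auto",   # dispatch shards across available GPUs instead of loading into system RAM
    load_in_8bit=True,   # requires bitsandbytes
)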
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio package with pacman
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0
deepspeed install path ........... ['/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.8.3, unknown, unknown
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
System info (please complete the following information):
- OS: Archlinux
- 2080Ti 22G x4 on single node
- Using git+https://github.com/huggingface/transformers
- Python version: 3.10.9
Additional context
The LLaMA model has a structure like this (the print below is from the 13B variant; the 7B model used in the repro has hidden size 4096 and intermediate size 11008, which are the numbers in the error):
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 5120, padding_idx=0)
(layers): ModuleList(
(0-39): 40 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=5120, out_features=5120, bias=False)
(k_proj): Linear(in_features=5120, out_features=5120, bias=False)
(v_proj): Linear(in_features=5120, out_features=5120, bias=False)
(o_proj): Linear(in_features=5120, out_features=5120, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=5120, out_features=13824, bias=False)
(down_proj): Linear(in_features=13824, out_features=5120, bias=False)
(up_proj): Linear(in_features=5120, out_features=13824, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=5120, out_features=32000, bias=False)
)
So is injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.up_proj')} correct?
@sgsdxzy Can you share models/llama-7b, or any similar model from HF, so that I can reproduce?
@satpalsr I think you can use https://huggingface.co/decapoda-research/llama-7b-hf
If you are using the latest transformers, you may need to change LLaMATokenizer to LlamaTokenizer in tokenizer_config.json, and LLaMAForCausalLM to LlamaForCausalLM in config.json (a rough sketch of that edit follows).
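A sketch of that edit, assuming a local model directory and that the class names live under the usual tokenizer_class / architectures keys:

import json
from pathlib import Path

model_dir = Path("models/llama-7b")  # adjust to your local checkout

# fix the tokenizer class name
tok_cfg_path = model_dir / "tokenizer_config.json"
tok_cfg = json.loads(tok_cfg_path.read_text())
tok_cfg["tokenizer_class"] = "LlamaTokenizer"   # was "LLaMATokenizer"
tok_cfg_path.write_text(json.dumps(tok_cfg, indent=2))

# fix the model class name
cfg_path = model_dir / "config.json"
cfg = json.loads(cfg_path.read_text())
cfg["architectures"] = ["LlamaForCausalLM"]     # was ["LLaMAForCausalLM"]
cfg_path.write_text(json.dumps(cfg, indent=2))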
@sgsdxzy Looks like it should be 'mlp.down_proj' instead of 'mlp.up_proj': injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.down_proj')}
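For reference, the init_inference call from the repro script with only that tuple changed (a sketch of the suggestion, not an officially documented policy):

model = deepspeed.init_inference(
    model,
    mp_size=2,
    dtype=torch.half,
    # name the two linear layers whose outputs need an all-reduce across ranks:
    # the attention output projection and the MLP down projection
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.down_proj')},
)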
@satpalsr Thanks, TP inference works now!
So my next question: is it possible to split-load this checkpoint using the meta device? I find that if I use AutoModelForCausalLM.from_pretrained to load the model, it is replicated on each GPU rather than sharded, so each GPU must have enough VRAM to hold the entire model.
I tried something like
config = AutoConfig.from_pretrained(model_name)
with deepspeed.OnDevice(dtype=torch.half, device="meta"):
    model = AutoModelForCausalLM.from_config(config, torch_dtype=torch.half)
model = model.eval()
model = deepspeed.init_inference(
    model,
    mp_size=world_size,
    dtype=torch.half,
    checkpoint="checkpoints.json",
    injection_policy={LlamaDecoderLayer: ('self_attn.o_proj', 'mlp.down_proj')}
)
and the contents of checkpoints.json:
{"type": "ds_model",
"checkpoints": ["path_to_pytorch_model-00000-of-00033.bin",
.......,
"path_to_pytorch_model-00033-of-00033.bin"],
"version": 1.0}
However, it fails with:
Traceback (most recent call last):
  File "/home/sgsdxzy/Programs/text-generation-webui/ds_test.py", line 20, in <module>
    model = deepspeed.init_inference(
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/__init__.py", line 311, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 96, in __init__
    self._load_checkpoint(config.checkpoint)
  File "/home/sgsdxzy/mambaforge/envs/textgen/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 426, in _load_checkpoint
    load_path, checkpoint, quantize_config = sd_loader.load(self._config.tensor_parallel.tp_size,
AttributeError: 'dict' object has no attribute 'load'
I also ran into this problem. Did you solve it?
same problem.
same :(
Just checking back to see if any of you solved it. The meta tensors do not seem to be materialised properly by deepspeed here.
Same problem. How do you split-load the checkpoint using the meta device when using deepspeed.init_inference?
Same!