DeepSpeed Inference produces incorrect output on llama when the input has padding and kernel injection is used
Describe the bug
I am trying to do batch inference, so the inputs need padding. When using replace_with_kernel_inject=True, the engine output is incorrect; setting replace_with_kernel_inject=False produces correct output.
I also tried disabling the padding, and that works as well.
So it seems that kernel injection does not handle padded inputs correctly.
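In short, a minimal sketch of the two configurations (only the kernel-injection flag differs; model loading, tokenization, and generation are exactly as in the full script below):

# Incorrect output when the batch contains padding:
engine = deepspeed.init_inference(model, mp_size=1, dtype=torch.half,
                                  max_out_tokens=1024,
                                  replace_with_kernel_inject=True)

# Correct output on the same padded batch:
engine = deepspeed.init_inference(model, mp_size=1, dtype=torch.half,
                                  max_out_tokens=1024,
                                  replace_with_kernel_inject=False)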
To Reproduce
Script:
import os

import torch
import torch.distributed as dist
import deepspeed
from transformers import AutoModelForCausalLM, GenerationConfig, AutoTokenizer

dist.init_process_group("nccl", world_size=int(os.environ["WORLD_SIZE"]))
# setup_distributed_slurm()

model_name = "lmsys/vicuna-13b-v1.3"  # 'gpt2'
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)

texts = [
    "what is lemon?",
    "tell me a story about moon",
    "what is kv cache in decoder-only transformoer models?",
    "explain e=mc^2",
]

input_ids = []
attention_masks = []
for text in texts:
    # Pad every prompt to the same length so the batch can be concatenated.
    model_inputs = tokenizer(text, max_length=96, padding="max_length", truncation=True,
                             return_tensors="pt")
    # If padding is disabled, the output is correct:
    # model_inputs = tokenizer(text, return_tensors="pt")
    input_ids.append(model_inputs["input_ids"])
    attention_masks.append(model_inputs["attention_mask"])
input_ids = torch.cat(input_ids, dim=0).cuda()
attention_masks = torch.cat(attention_masks, dim=0).cuda()

generation_config = GenerationConfig(
    bos_token_id=1,
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=1024,
)

model = deepspeed.init_inference(model,
                                 mp_size=1, dtype=torch.half,
                                 checkpoint=None,
                                 max_out_tokens=1024,
                                 replace_with_kernel_inject=True)  # replace_with_kernel_inject=False works fine

gen_out = model.generate(input_ids=input_ids,
                         attention_mask=attention_masks,
                         generation_config=generation_config)
outputs = tokenizer.batch_decode(gen_out, skip_special_tokens=True)
for out in outputs:
    print(out)
Run this file with torch distributed.
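For reference, I launch it with something along these lines (the file name repro.py is just a placeholder for wherever the script is saved):

torchrun --nproc_per_node=1 repro.py

torchrun sets the WORLD_SIZE environment variable that the script reads in dist.init_process_group.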
Expected behavior
Expect the output to be normal text like:
['what is lemon?\n\nLemon is a fruit, which is commonly used as a flavoring agent in food and drinks. It is a small, round citrus fruit that is typically yellow when ripe, but can also be green when unripe. The lemon tree is a small evergreen tree that is native to Asia, but is now grown in many other regions of the world, including the United States, Europe, and Africa. Lemons are typically grown in warm, sunny climates, and they are often grown in orchards or groves. Lemons are often used in cooking and baking, ....
Incorrect output (when replace_with_kernel_inject=True):
['what is lemon?\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n
ds_report output
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/mnt/cuda-11.8'
DeepSpeed general environment info:
torch install path ............... ['/mnt/miniconda3/envs/pt20llm/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/mnt/miniconda3/envs/pt20llm/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.10.0+aef6c65, aef6c65, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
System info (please complete the following information):
- OS: CentOS 7
- GPU count and types: 1 machine x 1 A100
Have you tried inference with the llama 70B model too?
I did not try 70B; it is too big. Is this related to the model definition/architecture?
I have the same problem with the vicuna-7b model:
Traceback (most recent call last):
File "/mnt/cfs/eansonyan/llm_call/main_new.py", line 691, in <module>
model_engine = deepspeed.init_inference(
File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 346, in init_inference
engine = InferenceEngine(model, config=ds_inference_config)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 158, in __init__
self._apply_injection_policy(config)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 418, in _apply_injection_policy
replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 385, in replace_transformer_layer
replaced_module = replace_module(model=model,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 634, in replace_module
replaced_module, _ = _replace_module(model, policy, state_dict=sd)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 694, in _replace_module
_, layer_id = _replace_module(child,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 694, in _replace_module
_, layer_id = _replace_module(child,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 670, in _replace_module
replaced_module = policies[child.__class__][0](child,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 315, in replace_fn
new_module = replace_with_policy(child,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 254, in replace_with_policy
_container.apply_tensor_parallelism(mp_replace)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/containers/features/meta_tensor.py", line 36, in apply_tensor_parallelism
super().apply_tensor_parallelism(mp_replace, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/containers/features/hybrid_engine.py", line 89, in apply_tensor_parallelism
self.attention_qkv_mp(mp_replace, reversed_dim=reversed_dim)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/containers/features/split_qkv.py", line 49, in attention_qkv_mp
super().attention_qkv_mp(mp_replace)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/containers/base.py", line 240, in attention_qkv_mp
self.module.attention.attn_qkvw = mp_replace.strided_copy(self.module.attention.attn_qkvw,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/auto_tp.py", line 75, in strided_copy
self.merge_assert(src_shape[outer_dim], dst_shape[self.out_dim])
File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/auto_tp.py", line 42, in merge_assert
assert dim1 > dim2, \
AssertionError: Merging tensors is not allowed here! Please use deepspeed load_checkpoint for merging your checkpoints before replacing the transformer layer with inference-kernels
replace_with_kernel_inject=True also fails to load llama3-8B-Instruct.
I have the same problem with Llama 3.2 3B.