
DeepSpeed Inference not working on llama when input has padding and kernel injection is used

Open KimmiShi opened this issue 2 years ago • 4 comments

Describe the bug I am trying to do batch inference, so the inputs need padding. When using replace_with_kernel_inject=True, the engine output is incorrect; setting replace_with_kernel_inject=False produces correct output.

I also tried disabling the padding, and that also works (see the no-padding sketch after the repro script below).

So it seems that kernel injection does not handle padded inputs correctly.

To Reproduce Script:

import os

import torch
import torch.distributed as dist
import deepspeed
from transformers import AutoModelForCausalLM, GenerationConfig, AutoTokenizer


dist.init_process_group("nccl", world_size=int(os.environ["WORLD_SIZE"]))
# setup_distributed_slurm()

model_name="lmsys/vicuna-13b-v1.3" #'gpt2'   #
model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
tokenizer = AutoTokenizer.from_pretrained(model_name)


texts = ["what is lemon?","tell me a story about moon", "what is kv cache in decoder-only transformer models?", "explain e=mc^2",]

input_ids =[]
attention_masks=[]
for text in texts:
    model_inputs = tokenizer(text, max_length=96, padding="max_length",truncation=True,
                            return_tensors="pt")
    # if not padding, works fine
    # model_inputs = tokenizer(text,
    #                          return_tensors="pt")
    input_ids.append(model_inputs["input_ids"])
    attention_masks.append(model_inputs["attention_mask"])

input_ids = torch.cat(input_ids, dim=0).cuda()
attention_masks = torch.cat(attention_masks, dim=0).cuda()

generation_config = GenerationConfig(
    bos_token_id=1,
    do_sample=True,
    top_k=20,
    top_p=0.9,
    temperature=0.7,
    max_new_tokens=1024,
)

model = deepspeed.init_inference(model,
                        mp_size=1, dtype=torch.half,
                        checkpoint=None,
                        max_out_tokens = 1024,
                        replace_with_kernel_inject=True)    # replace_with_kernel_inject=False works fine

gen_out = model.generate(input_ids=input_ids,
                    attention_mask=attention_masks,
                    generation_config=generation_config,)
outputs = tokenizer.batch_decode(gen_out, skip_special_tokens=True)

for out in outputs:
    print(out)

Run this file with torch distributed (e.g., via torchrun).
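
For reference, the no-padding path that works is roughly the following sketch, reusing tokenizer, model (after init_inference), generation_config, and texts from the script above; running one prompt per generate() call is my reading of "disable the padding":

# No-padding variant: each prompt is tokenized without padding and generated
# on its own, so the attention mask never contains padding positions.
for text in texts:
    model_inputs = tokenizer(text, return_tensors="pt").to("cuda")
    gen_out = model.generate(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        generation_config=generation_config,
    )
    print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])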

Expected behavior Expect the output to be normal text like:

['what is lemon?\n\nLemon is a fruit, which is commonly used as a flavoring agent in food and drinks. It is a small, round citrus fruit that is typically yellow when ripe, but can also be green when unripe. The lemon tree is a small evergreen tree that is native to Asia, but is now grown in many other regions of the world, including the United States, Europe, and Africa. Lemons are typically grown in warm, sunny climates, and they are often grown in orchards or groves. Lemons are often used in cooking and baking, ....

Incorrect output (when replace_with_kernel_inject=True):

['what is lemon?\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n

ds_report output

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [YES] ...... [OKAY]
--------------------------------------------------
No CUDA runtime is found, using CUDA_HOME='/mnt/cuda-11.8'
DeepSpeed general environment info:
torch install path ............... ['/mnt/miniconda3/envs/pt20llm/lib/python3.9/site-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/mnt/miniconda3/envs/pt20llm/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.10.0+aef6c65, aef6c65, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8

System info (please complete the following information):

  • OS: CentOS 7
  • GPU count and types: 1 machine x 1 A100

KimmiShi avatar Jul 14 '23 09:07 KimmiShi

Have you tried running inference with the llama 70B model too?

puneeshkhanna avatar Jul 27 '23 13:07 puneeshkhanna

Have you tried running inference with the llama 70B model too?

I did not try 70B; it is too big. Is this related to the model definition/architecture?

KimmiShi avatar Jul 28 '23 08:07 KimmiShi

I have the same problem with the vicuna-7b model.

yemin1996 avatar Nov 01 '23 08:11 yemin1996

Traceback (most recent call last):
  File "/mnt/cfs/eansonyan/llm_call/main_new.py", line 691, in <module>
    model_engine = deepspeed.init_inference(
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/__init__.py", line 346, in init_inference
    engine = InferenceEngine(model, config=ds_inference_config)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 158, in __init__
    self._apply_injection_policy(config)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/inference/engine.py", line 418, in _apply_injection_policy
    replace_transformer_layer(client_module, self.module, checkpoint, config, self.config)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 385, in replace_transformer_layer
    replaced_module = replace_module(model=model,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 634, in replace_module
    replaced_module, _ = _replace_module(model, policy, state_dict=sd)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 694, in _replace_module
    _, layer_id = _replace_module(child,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 694, in _replace_module
    _, layer_id = _replace_module(child,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 670, in _replace_module
    replaced_module = policies[child.__class__][0](child,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 315, in replace_fn
    new_module = replace_with_policy(child,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 254, in replace_with_policy
    _container.apply_tensor_parallelism(mp_replace)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/containers/features/meta_tensor.py", line 36, in apply_tensor_parallelism
    super().apply_tensor_parallelism(mp_replace, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/containers/features/hybrid_engine.py", line 89, in apply_tensor_parallelism
    self.attention_qkv_mp(mp_replace, reversed_dim=reversed_dim)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/containers/features/split_qkv.py", line 49, in attention_qkv_mp
    super().attention_qkv_mp(mp_replace)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/containers/base.py", line 240, in attention_qkv_mp
    self.module.attention.attn_qkvw = mp_replace.strided_copy(self.module.attention.attn_qkvw,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/auto_tp.py", line 75, in strided_copy
    self.merge_assert(src_shape[outer_dim], dst_shape[self.out_dim])
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/module_inject/auto_tp.py", line 42, in merge_assert
    assert dim1 > dim2, \
AssertionError: Merging tensors is not allowed here! Please use deepspeed load_checkpoint            for merging your checkpoints before replacing the transformer layer with            inference-kernels

With replace_with_kernel_inject=True, llama3-8B-Instruct cannot be loaded.
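
One fallback worth trying is initializing the engine without kernel injection, which the original report above notes produces correct output; a minimal sketch (the checkpoint id and mp_size=1 are assumptions):

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Load the HF model as usual; kernel injection is skipped below, so the
# replace_transformer_layer path that raises the assertion is never taken.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

engine = deepspeed.init_inference(
    model,
    mp_size=1,                           # assumed single-GPU setup
    dtype=torch.half,
    replace_with_kernel_inject=False,    # avoid the kernel-injection replace path
)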

eanson023 avatar Aug 11 '24 03:08 eanson023

I have the same problem with Llama 3.2 3B.

PhamGiaMinh avatar Mar 04 '25 09:03 PhamGiaMinh