[BUG] Different outputs by original model and inference engine
Describe the bug
The original HuggingFace model and the DeepSpeed inference engine produce different outputs for the same inputs.

To Reproduce
Code to reproduce:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import deepspeed

test1 = "Hello world, I'm"
test2 = "Sampling test is very"

if __name__ == '__main__':
    # Original fp16 model on GPU.
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    or_model = model.to('cuda').to(torch.half)
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    # Batch of four prompts: two copies of each test string.
    x = [tokenizer.encode(test1), tokenizer.encode(test1),
         tokenizer.encode(test2), tokenizer.encode(test2)]
    x = torch.as_tensor(x).to('cuda')

    # Same checkpoint wrapped by the DeepSpeed inference engine with kernel injection.
    ds_engine = deepspeed.init_inference(GPT2LMHeadModel.from_pretrained('gpt2'),
                                         mp_size=1,
                                         dtype=torch.half,
                                         checkpoint=None,
                                         replace_method='auto',
                                         replace_with_kernel_inject=True)
    ds_model = ds_engine.module

    or_output = or_model.generate(x, max_length=20)
    ds_output = ds_model.generate(x, max_length=20)
    or_samples = [tokenizer.decode(or_output[i]) for i in range(or_output.shape[0])]
    ds_samples = [tokenizer.decode(ds_output[i]) for i in range(ds_output.shape[0])]
    assert or_samples == ds_samples
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/user/.local/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/home/user/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
System info:
- Ubuntu 20.04.4
- 1 x A5000
- Python 3.8.10
- transformers==4.21.1
Launcher context
python main.py
@reymondzzzz I am able to reproduce these results. I'm not exactly sure why the outputs don't match. If I use the HuggingFace pipeline API, the problem goes away. Can you confirm this behavior on your system?
import torch
import deepspeed
from transformers import pipeline
test1 = "Hello world, I'm"
test2 = "Sampling test is very"
pipe = pipeline('text-generation', 'gpt2', device=0, framework='pt')
pipe.model = pipe.model.half()
or_output = pipe([test1, test2], max_length=20, do_sample=False)
pipe.model = deepspeed.init_inference(pipe.model,
                                      mp_size=1,
                                      dtype=torch.half,
                                      checkpoint=None,
                                      replace_method='auto',
                                      replace_with_kernel_inject=True)
ds_output = pipe([test1, test2], max_length=20, do_sample=False)
print(or_output)
print(ds_output)
assert or_output == ds_output
@mrwyattii Thank you for your answer! This test works, but I have a custom GPT-2 model and I want to use DeepSpeed for inference. I don't use Hugging Face or the pipeline API in my project, and there I still get different results between plain PyTorch and DeepSpeed inference.
I think the CUDA kernels have a problem, because the second string in ds_output looks strange, like a mix of test1 and test2.
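One way to narrow this down is to compare the raw token IDs instead of the decoded strings. Here is a minimal debugging sketch, not from the original thread, which assumes the or_output and ds_output tensors and the tokenizer from the repro script above are still in scope:

# Hypothetical debugging snippet: compare raw token IDs per sequence to see
# where the DeepSpeed output starts to diverge from the original model.
for i in range(or_output.shape[0]):
    if or_output[i].shape != ds_output[i].shape:
        # Different generated lengths are already a divergence worth reporting.
        print(f"sequence {i}: different lengths "
              f"({or_output[i].shape[0]} vs {ds_output[i].shape[0]})")
        continue
    if torch.equal(or_output[i], ds_output[i]):
        print(f"sequence {i}: token IDs match")
        continue
    # First position where the two generations disagree.
    first_diff = (or_output[i] != ds_output[i]).nonzero(as_tuple=True)[0][0].item()
    print(f"sequence {i}: diverges at position {first_diff}")
    print("  original :", tokenizer.decode(or_output[i]))
    print("  deepspeed:", tokenizer.decode(ds_output[i]))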
@reymondzzzz - can you please add do_sample=False to your script?
Please modify the lines where you call HF generate for both the original model and the DS model as follows:
ds_output = ds_model.generate(x, max_length=20, do_sample=False)
or_output = or_model.generate(x, max_length=20, do_sample=False)
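For completeness, a minimal sketch of how the end of the repro script would look with sampling disabled, reusing the variable names from the script above:

# Sketch only: same comparison as before, but with do_sample=False so both
# models decode greedily and the outputs are deterministic.
or_output = or_model.generate(x, max_length=20, do_sample=False)
ds_output = ds_model.generate(x, max_length=20, do_sample=False)
or_samples = [tokenizer.decode(or_output[i]) for i in range(or_output.shape[0])]
ds_samples = [tokenizer.decode(ds_output[i]) for i in range(ds_output.shape[0])]
assert or_samples == ds_samples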
@reymondzzzz - closing this issue for now. If the above method does not work, please reopen the issue.