[BUG] Different outputs by original model and inference engine
Describe the bug
The original HuggingFace model and the DeepSpeed inference engine produce different outputs for the same inputs.

To Reproduce
Code to reproduce:
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import deepspeed

test1 = "Hello world, I'm"
test2 = "Sampling test is very"

if __name__ == '__main__':
    # Original fp16 model on GPU.
    model = GPT2LMHeadModel.from_pretrained('gpt2')
    or_model = model.to('cuda').to(torch.half)
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    # Batch of four prompts: two copies of each test string.
    x = [tokenizer.encode(test1), tokenizer.encode(test1),
         tokenizer.encode(test2), tokenizer.encode(test2)]
    x = torch.as_tensor(x).to('cuda')

    # Same checkpoint wrapped by the DeepSpeed inference engine with kernel injection.
    ds_engine = deepspeed.init_inference(GPT2LMHeadModel.from_pretrained('gpt2'),
                                         mp_size=1,
                                         dtype=torch.half,
                                         checkpoint=None,
                                         replace_method='auto',
                                         replace_with_kernel_inject=True)
    ds_model = ds_engine.module

    or_output = or_model.generate(x, max_length=20)
    ds_output = ds_model.generate(x, max_length=20)
    or_samples = [tokenizer.decode(or_output[i]) for i in range(or_output.shape[0])]
    ds_samples = [tokenizer.decode(ds_output[i]) for i in range(ds_output.shape[0])]
    assert or_samples == ds_samples
ds_report output
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/user/.local/lib/python3.8/site-packages/torch']
torch version .................... 1.11.0+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.5
deepspeed install path ........... ['/home/user/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.7.0, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.11, cuda 11.3
System info:
- Ubuntu 20.04.4
- 1 x A5000
- Python 3.8.10
- transformers==4.21.1
Launcher context
python main.py
@reymondzzzz I am able to reproduce these results. I'm not exactly sure why the outputs don't match. If I use the HuggingFace pipeline API, the problem goes away. Can you confirm this behavior on your system?
import torch
import deepspeed
from transformers import pipeline
test1 = "Hello world, I'm"
test2 = "Sampling test is very"
pipe = pipeline('text-generation', 'gpt2', device=0, framework='pt')
pipe.model = pipe.model.half()
or_output = pipe([test1, test2], max_length=20, do_sample=False)
pipe.model = deepspeed.init_inference(pipe.model,
                                      mp_size=1,
                                      dtype=torch.half,
                                      checkpoint=None,
                                      replace_method='auto',
                                      replace_with_kernel_inject=True)
ds_output = pipe([test1, test2], max_length=20, do_sample=False)
print(or_output)
print(ds_output)
assert or_output == ds_output
@mrwyattii Thank you for your answer! This test works, but I have a custom GPT-2 model and I want to use DeepSpeed for inference. I don't use Hugging Face or the pipeline API in my project, and there I still get different results between plain PyTorch and DeepSpeed inference.
I think the CUDA kernels have a problem, because the second string in ds_output looks strange, like a mix of test1 and test2.
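One way to narrow this down is to compare the raw token IDs instead of the decoded strings. Here is a minimal debugging sketch, not from the original thread, which assumes the or_output and ds_output tensors and the tokenizer from the repro script above are still in scope:

# Hypothetical debugging snippet: compare raw token IDs per sequence to see
# where the DeepSpeed output starts to diverge from the original model.
for i in range(or_output.shape[0]):
    if or_output[i].shape != ds_output[i].shape:
        # Different generated lengths are already a divergence worth reporting.
        print(f"sequence {i}: different lengths "
              f"({or_output[i].shape[0]} vs {ds_output[i].shape[0]})")
        continue
    if torch.equal(or_output[i], ds_output[i]):
        print(f"sequence {i}: token IDs match")
        continue
    # First position where the two generations disagree.
    first_diff = (or_output[i] != ds_output[i]).nonzero(as_tuple=True)[0][0].item()
    print(f"sequence {i}: diverges at position {first_diff}")
    print("  original :", tokenizer.decode(or_output[i]))
    print("  deepspeed:", tokenizer.decode(ds_output[i]))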
@reymondzzzz - can you please add do_sample=False to your script?
Please modify the lines where you call HF generate for both the original model and the DS model as follows:
ds_output = ds_model.generate(x, max_length=20, do_sample=False)
or_output = or_model.generate(x, max_length=20, do_sample=False)
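For completeness, a minimal sketch of how the end of the repro script would look with sampling disabled, reusing the variable names from the script above:

# Sketch only: same comparison as before, but with do_sample=False so both
# models decode greedily and the outputs are deterministic.
or_output = or_model.generate(x, max_length=20, do_sample=False)
ds_output = ds_model.generate(x, max_length=20, do_sample=False)
or_samples = [tokenizer.decode(or_output[i]) for i in range(or_output.shape[0])]
ds_samples = [tokenizer.decode(ds_output[i]) for i in range(ds_output.shape[0])]
assert or_samples == ds_samples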
@reymondzzzz - closing this issue for now. If the above method does not work, please reopen the issue.