[BUG] Wrong logits/outputs when using HFOPTLayerPolicy on OPT model
**Describe the bug**
DeepSpeed-optimized OPT inference produces garbage output when HFOPTLayerPolicy is used for kernel injection. Below are the L2 and L1 differences when HFOPTLayerPolicy is used vs. when it is not used on OPT-350m. The bug affects other OPT sizes (tested on 13B) and other DeepSpeed versions (tested v0.7.5 and the main branch from GitHub). It also affects downstream accuracy: for example, on LAMBADA the model gets 0% accuracy with HFOPTLayerPolicy.
Numerical differences between the DeepSpeed-enabled version and the pure HF version on the phrase "This is test " are shown below:
| | avg L2 diff (FP16) | rel L2 diff (FP16) | avg L2 diff (FP32) | rel L2 diff (FP32) |
|---|---|---|---|---|
| HFOPTLayerPolicy is enabled | 196.887 | 3.929 | 196.811 | 3.925 |
| HFOPTLayerPolicy is disabled | 0.0 | 0.0 | 0.0 | 0.0 |
Here the avg L2 diff is computed as $\frac{1}{N}\sum_i (x_i-y_i)^2$ and the rel L2 diff as $\frac{\sum_i (x_i-y_i)^2}{\sum_i y_i^2}$, where $x$ is the output of the DeepSpeed model and $y$ is the output of the HF model (ground truth).
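For concreteness, a minimal sketch of these two metrics in PyTorch (the helper name `l2_metrics` is hypothetical; `x` and `y` are assumed to be the FP32 logit tensors of the DeepSpeed and HF models, respectively):

```python
import torch

def l2_metrics(x: torch.Tensor, y: torch.Tensor):
    # avg L2 diff: (1/N) * sum_i (x_i - y_i)^2
    avg_l2 = (x - y).pow(2).mean()
    # rel L2 diff: sum_i (x_i - y_i)^2 / sum_i y_i^2
    rel_l2 = (x - y).pow(2).sum() / y.pow(2).sum()
    return avg_l2.item(), rel_l2.item()
```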
**To Reproduce**
To reproduce the behavior, run the following script with different options. The script runs the DeepSpeed version and compares its output against the pure Hugging Face version.
Script:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import deepspeed
import torch
from deepspeed.module_inject.replace_policy import HFOPTLayerPolicy
import transformers


def simple_output_comparision(model_id, dtype1, dtype2, use_dp=False, use_policy=True):
    device0 = "cuda:0"
    device1 = "cuda:1"
    # Model 1: optionally wrapped with DeepSpeed inference (kernel injection)
    model1 = AutoModelForCausalLM.from_pretrained(model_id).to(device0, dtype=dtype1)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    if use_dp:
        model1 = deepspeed.init_inference(
            model1,
            mp_size=1,
            dtype=dtype1,
            replace_method='auto',
            injection_policy={transformers.models.opt.modeling_opt.OPTDecoderLayer: HFOPTLayerPolicy} if use_policy else {},
            replace_with_kernel_inject=True)
    # Model 2: plain Hugging Face reference on a second GPU
    model2 = AutoModelForCausalLM.from_pretrained(model_id).to(device1, dtype=dtype2)
    test_input = """This is test """
    encodings = tokenizer(test_input, return_tensors="pt")
    input_ids_m1 = encodings.input_ids.to(device0)
    output_logits_m1 = model1(input_ids_m1)['logits'].cpu().to(dtype=torch.float32)
    input_ids_m2 = input_ids_m1.to(device1)
    output_logits_m2 = model2(input_ids_m2)['logits'].cpu().to(dtype=torch.float32)
    # Compare logits: relative L2, average L2, average and max elementwise L1
    relative_avg_l2 = (output_logits_m1 - output_logits_m2).pow(2).sum() / output_logits_m2.pow(2).sum()
    avg_l2 = (output_logits_m1 - output_logits_m2).pow(2).mean()
    avg_l1 = (output_logits_m1 - output_logits_m2).abs().mean()
    max_l1 = (output_logits_m1 - output_logits_m2).abs().max()
    print(f"L2^2: avg_l2={avg_l2} avg_relative_l2={relative_avg_l2}")
    print(f"L1: avg_l1={avg_l1} max_elementwise_l1={max_l1}")


# Baseline run: DeepSpeed enabled, injection policy disabled (outputs match HF)
simple_output_comparision("facebook/opt-350m", dtype1=torch.float16, dtype2=torch.float16, use_dp=True, use_policy=False)
```
**Expected behavior**
We expect numerically close outputs, e.g. a rel L2 difference around $10^{-5}$ or less for the FP32 comparison (ideally 0.0). Instead we see a huge discrepancy.
**ds_report output**
```
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.12.1+cu113
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.7.5+unknown, unknown, unknown
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.6
```
**System info (please complete the following information):**
- OS: Ubuntu
- GPU: A10G x8
- Python 3.8.10
Hi @akamaster, I was able to recreate your issue. There was an issue with OPT injection that has been resolved in the latest DeepSpeed release, v0.8.0. If you use that version, you should see better accuracy, and the logits should be the same using HFOPTLayerPolicy vs. replace_method='auto'. I am still looking into why there is a significant difference in logits compared with plain HF.
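A minimal sketch of how one might verify this after upgrading (assumes `pip install "deepspeed>=0.8.0"` and that `simple_output_comparision` from the repro script above is in scope):

```python
import deepspeed
import torch

print(deepspeed.__version__)  # should report 0.8.0 or newer

# Re-run the comparison with the policy enabled; per the comment above, the
# policy-enabled path should now agree with replace_method='auto' (the
# residual difference vs. plain HF was still under investigation).
simple_output_comparision("facebook/opt-350m",
                          dtype1=torch.float16, dtype2=torch.float16,
                          use_dp=True, use_policy=True)
```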
Hi @molly-smith!
replace_with_kernel_inject=True on OPT (galactica-6.7b) still produces garbage on deepspeed==0.8.3.
Is this problem being fixed?