
[BUG] Incorrect logits/loss/outputs on GPT-NEOX-20B


Describe the bug

The DeepSpeed (DS) optimized GPT-NEOX-20B model produces incorrect logits and loss, which hurts overall model accuracy. The numerical differences on the final logits for a test phrase (see code) are as follows:

| comparison | avg L2 | rel L2 | avg L1 | max elem. diff |
| --- | --- | --- | --- | --- |
| DS fp32 vs ACC fp32 | 23.070 | 0.00819222 | 3.458 | 36.657 |
| DS fp16 vs ACC fp16 | 24.700 | 0.0087669 | 3.497 | 40.742 |

We expect the relative L2 to be around $10^{-5}$ or smaller. The computed loss on this test phrase is 4.08 (DS FP32) vs 2.42 (HF/ACC FP32), and 4.34 (DS FP16) vs 2.43 (HF/ACC FP16). This is a huge difference in the computed loss/logits, and it does affect the overall quality of the model. The bug manifests on longer text sequences, probably due to accumulation of numerical errors (see the prefix-length check after the reproduction steps below). The bug affects the latest (mainline) DeepSpeed as well as version 0.7.5; other versions might be affected too, but we did not test them.

Note: we are comparing against Hugging Face Accelerate output, since the model does not fit on a single GPU on my machine.

The following formulas are used:

  • avg L2 diff is computed as $\frac{1}{N}\sum_i (x_i-y_i)^2$
  • rel L2 diff as $\frac{\sum_i (x_i-y_i)^2}{\sum_i y_i^2}$
  • avg L1 diff is computed as $\frac{1}{N}\sum_i |x_i-y_i|$

where $x$ is the output of model 1 (DS or HF) and $y$ is the output of model 2 (ground truth).
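
For clarity, these metrics map to a few simple tensor operations; a minimal sketch (the same expressions used in compare_outputs in the script below):

import torch

def rel_l2(x, y):
    # relative squared L2 distance between two tensors
    return (x - y).pow(2).sum() / y.pow(2).sum()

def avg_l2(x, y):
    # mean squared element-wise difference
    return (x - y).pow(2).mean()

def avg_l1(x, y):
    # mean absolute element-wise difference
    return (x - y).abs().mean()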

To Reproduce

To reproduce the behavior, run the following script with the different options. The script runs the DeepSpeed version and compares its output against the pure Hugging Face/Accelerate version.

Script:

from transformers import AutoModelForCausalLM, AutoTokenizer
import deepspeed 
import torch
import os
import transformers

TEST_INPUT = """DeepSpeed is an open source deep learning optimization library for PyTorch.[1] The library is designed to reduce computing power and memory use and to train large distributed models with better parallelism on existing computer hardware.[2][3] DeepSpeed is optimized for low latency, high throughput training. It includes the Zero Redundancy Optimizer (ZeRO) for training models with 1 trillion or more parameters.[4] Features include mixed precision training, single-GPU, multi-GPU, and multi-node training as well as custom model parallelism. The DeepSpeed source code is licensed under MIT License and available on GitHub.[5]

The team claimed to achieve up to a 6.2x throughput improvement, 2.8x faster convergence, and 4.6x less communication.[6] """

def simple_output_DS(dtype):
    model_id = 'EleutherAI/gpt-neox-20b'
    model = AutoModelForCausalLM.from_pretrained(model_id).to(device='cpu', dtype=dtype)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    world_size = int(os.getenv('WORLD_SIZE', '1'))
    local_rank = int(os.getenv('LOCAL_RANK', '0'))

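    # Wrap the model for DeepSpeed inference: mp_size sets the tensor-parallel degree
    # (taken from WORLD_SIZE, which the launcher sets) and replace_with_kernel_inject=True
    # swaps in DeepSpeed's fused transformer inference kernels.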
    model = deepspeed.init_inference(model, 
        mp_size=world_size, 
        dtype=dtype, 
        replace_method='auto', 
        replace_with_kernel_inject=True)

    encodings = tokenizer(TEST_INPUT, return_tensors="pt")

    input_ids = encodings.input_ids.to(f'cuda:{local_rank}')
    out = model(input_ids, labels=input_ids)
    output_logits = out['logits'].cpu().to(dtype=torch.float32)
    if local_rank == 0:
        torch.save((output_logits, out['loss']), f"logits-DS-{dtype}.pt")
        print(out['loss'])



def simple_output_ACC(dtype):
    local_rank = int(os.getenv('LOCAL_RANK', '0'))
    model_id = 'EleutherAI/gpt-neox-20b'
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto").to(dtype=dtype)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    
    encodings = tokenizer(TEST_INPUT, return_tensors="pt")

    input_ids = encodings.input_ids.to(f'cuda:{local_rank}')
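    # With device_map="auto" the model is sharded across GPUs, so move the labels to the
    # device that holds the output embedding layer.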
    out = model(input_ids, labels=input_ids.clone().to(model.get_output_embeddings().weight.device))
    output_logits = out['logits'].cpu().to(dtype=torch.float32)
    if local_rank == 0:
        torch.save((output_logits, out['loss']),  f"logits-HF-{dtype}.pt")
        print(out['loss'])

def compare_outputs(dtype):
    output_logits_m1, loss_m1 = torch.load(f"logits-DS-{dtype}.pt")
    output_logits_m2, loss_m2 = torch.load(f"logits-HF-{dtype}.pt")


    relative_avg_l2 = (output_logits_m1 - output_logits_m2).pow(2).sum()/output_logits_m2.pow(2).sum()
    avg_l2 = (output_logits_m1 - output_logits_m2).pow(2).mean()
    avg_l1 = (output_logits_m1-output_logits_m2).abs().mean()
    max_l1 = (output_logits_m1-output_logits_m2).abs().max()
    print(f"loss ds: {loss_m1.item()}")
    print(f"loss hf: {loss_m2.item()}")
    print(f"L2^2: avg_l2={avg_l2:.8f} avg_relative_l2={relative_avg_l2:.8f}; (in scientific notation: {relative_avg_l2:.5e}")
    print(f"L1: avg_l1={avg_l1:.8f} max_elementwise_l1={max_l1:.8f}")

import argparse
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Command line tool to check LLM correctness')
    parser.add_argument('--local_rank', type=int)
    parser.add_argument('--use_ds', action='store_true')
    parser.add_argument('--use_hf', action='store_true')
    parser.add_argument('--compare', action='store_true')
    parser.add_argument('--dtype', choices=['fp16', 'fp32'], default='fp16')

    args = parser.parse_args()

    if args.dtype == 'fp32':
        dtype = torch.float32
    elif args.dtype == 'fp16':
        dtype = torch.float16

    if args.use_ds:
        simple_output_DS(dtype)
    elif args.use_hf:
        simple_output_ACC(dtype)
    elif args.compare:
        compare_outputs(dtype)

Save the code above as comparison.py and execute it as:

deepspeed comparison.py --dtype fp16 --use_ds
python comparison.py --dtype fp16 --use_hf
python comparison.py --dtype fp16 --compare
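
If you want to control how many GPUs the DeepSpeed step is sharded across (mp_size is taken from WORLD_SIZE, which the launcher sets), something like the following should work, assuming the standard deepspeed launcher's --num_gpus flag:

deepspeed --num_gpus 2 comparison.py --dtype fp16 --use_ds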

Note: you need to have HF's accelerate installed.

Expected behavior

We expect numerically close outputs, e.g. a relative L2 difference around $10^{-5}$ or less for the fp32 comparison (ideally 0.0). Instead, we see a huge discrepancy.
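
To probe the hypothesis that the discrepancy grows with sequence length, one cheap check is to compare the saved logits over growing prefixes: since the model is causal, the logits at the first n positions should not depend on later tokens. A minimal sketch (a hypothetical helper, not part of comparison.py; it assumes the file names produced by the script above, e.g. logits-DS-torch.float16.pt):

import torch

def rel_l2_by_prefix(dtype_tag="torch.float16", lengths=(32, 64, 128, 256)):
    # Load the logits saved by comparison.py and measure the relative L2
    # error on growing prefixes of the sequence.
    ds_logits, _ = torch.load(f"logits-DS-{dtype_tag}.pt")
    hf_logits, _ = torch.load(f"logits-HF-{dtype_tag}.pt")
    for n in lengths:
        d, h = ds_logits[:, :n, :], hf_logits[:, :n, :]
        rel = (d - h).pow(2).sum() / h.pow(2).sum()
        print(f"first {n} tokens: rel_l2 = {rel:.5e}")

If the relative error stays small for short prefixes and climbs for longer ones, that would support the accumulation hypothesis.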

ds_report output

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
async_io ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
sparse_attn ............ [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.8/dist-packages/torch']
torch version .................... 1.12.1+cu113
deepspeed install path ........... ['/usr/local/lib/python3.8/dist-packages/deepspeed']
deepspeed info ................... 0.8.0+fe728e3e, fe728e3e, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 1.12, cuda 11.3

System info (please complete the following information):

  • OS: Ubuntu
  • GPU: A10G x8
  • Python 3.8.10

akamaster avatar Feb 02 '23 06:02 akamaster

Hi @akamaster, I ran this with the latest DeepSpeed and observed better results for fp16:

loss ds: 2.4296875
loss hf: 2.431640625
L2^2: avg_l2=0.10383601 avg_relative_l2=0.00003690 (in scientific notation: 3.68962e-05)
L1: avg_l1=0.21918750 max_elementwise_l1=2.79687500

molly-smith avatar Mar 10 '23 01:03 molly-smith

Also, I think we do not support fp32. @RezaYazdaniAminabadi can you confirm?

molly-smith avatar Mar 10 '23 01:03 molly-smith

fp32 is not currently supported, but we may add that soon.

molly-smith avatar Mar 23 '23 01:03 molly-smith