
[BUG] Deepspeed inference fp16 gives different results than HuggingFace with FlanT5-XL

brevity2021 opened this issue 2 years ago

Describe the bug
I'm playing with some text generation using vanilla FlanT5-XL under DeepSpeed inference.

When both use fp16, the DeepSpeed inference generation result diverges from the HuggingFace result (and the DeepSpeed result has some repetition). When using bf16, the DeepSpeed inference generation result is the same as the HuggingFace result (and the HuggingFace results are the same whether torch_dtype=torch.float16 or torch_dtype=torch.bfloat16).

I'm wondering what causes the difference - is there something HF uses to clamp the weights to the fp16 range (as T5 was pretrained using bf16) that DeepSpeed inference doesn't use? Thanks!
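For context, my understanding is that the HF T5 blocks clamp intermediate hidden states to the fp16 range when the model runs in float16, which DeepSpeed's injected kernels may or may not replicate. A rough sketch of that idea (illustrative only, not the exact upstream code):

import torch

def clamp_to_fp16_range(hidden_states: torch.Tensor) -> torch.Tensor:
    # Sketch: clamp activations just below the float16 max so they don't
    # overflow to inf (T5 was pretrained in bf16, which has a wider range).
    if hidden_states.dtype == torch.float16:
        clamp_value = torch.finfo(torch.float16).max - 1000
        hidden_states = torch.clamp(hidden_states, min=-clamp_value, max=clamp_value)
    return hidden_states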

To Reproduce

I was running the following script on a g5.2xlarge instance.

import os
import torch
import transformers
import deepspeed
from transformers import T5Tokenizer, T5ForConditionalGeneration

if __name__ == "__main__":
    local_rank = int(os.getenv("LOCAL_RANK", "0"))
    world_size = int(os.getenv("WORLD_SIZE", "1"))

    tokenizer_params = {
        "return_tensors": "pt",
        "truncation": True,
        "padding": "max_length",
        "max_length": 512,
    }

    inference_params = {"num_beams": 1, "max_length": 256, "early_stopping": False}
    t5_tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xl", device_map="auto")
    model = T5ForConditionalGeneration.from_pretrained(
        "google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16
    ).eval()

    test_sents = [
        "Either verbal or physical punishment increase aggression in children.",
        "Tell me if i am doing it right .",
    ]

    inputs = t5_tokenizer(test_sents, **tokenizer_params).to("cuda")

    # Baseline: HuggingFace generation with the fp16 model.
    with torch.inference_mode():
        hf_output = model.generate(**inputs, **inference_params)

    # Wrap the same model with DeepSpeed inference at fp16.
    dp_model = deepspeed.init_inference(
        model, mp_size=world_size, dtype=torch.float16
    ).eval()
    with torch.inference_mode():
        dp_output = dp_model.generate(**inputs, **inference_params)

    print("Dp outtput:")
    dp_decoded = t5_tokenizer.batch_decode(dp_output, skip_special_tokens=True)
    print(dp_decoded)
    print("HF output:")
    hf_decoded = t5_tokenizer.batch_decode(hf_output, skip_special_tokens=True)
    print(hf_decoded)

DP output:
['y verbal physical punishments verbal physical punishment', 'tell me if i am doing it right']
HF output:
['Physical punishment is more likely to increase aggression in children.', 'i am trying to get a job.']

If we change both dtypes to bfloat16:

model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xl", device_map="auto", torch_dtype=torch.bfloat16
).eval()

dp_model = deepspeed.init_inference(
    model, mp_size=world_size, dtype=torch.bfloat16
).eval()

the results will be the same:

DP output:
['Physical punishment is more likely to increase aggression in children.', 'i am trying to get a job.']
HF output:
['Physical punishment is more likely to increase aggression in children.', 'i am trying to get a job.']
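
As a quick diagnostic (a hypothetical check, not part of the repro above), one could scan the original bf16 weights for values that would overflow when cast to fp16:

import torch
from transformers import T5ForConditionalGeneration

# Load the original checkpoint in bf16 and look for weights that fall
# outside the representable float16 range (illustrative check only).
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-xl", torch_dtype=torch.bfloat16
)
fp16_max = torch.finfo(torch.float16).max
for name, param in model.named_parameters():
    n_overflow = (param.abs() > fp16_max).sum().item()
    if n_overflow:
        print(f"{name}: {n_overflow} values exceed the fp16 range")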

Expected behavior
DeepSpeed should output the same result as HuggingFace when using fp16.

Ds_report output

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/xx/deepspeed_test_env/lib/python3.9/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/home/xx/deepspeed_test_env/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.8.3, unknown, unknown
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7


System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 1x NVIDIA A10G
  • Hugging Face Transformers version: 4.27.4
  • Python version: 3.9.1

brevity2021 · Apr 10 '23 17:04