
GPT2 + FP8 example does not work

Open feihugis opened this issue 2 years ago • 14 comments

Branch/Tag/Commit

main

Docker Image Version

nvcr.io/nvidia/pytorch:23.02-py3

GPU name

H100 MIG

CUDA Driver

525.85.12

Reproduced Steps

1. `docker run --privileged --gpus '"device=MIG-***"' --network=host --shm-size 32g --memory 128g --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --name mig-test -itd nvcr.io/nvidia/pytorch:23.02-py3`

2. Run the following commands inside the docker:

#!/bin/bash

git clone https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
cmake -DSM=90 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -DENABLE_FP8=ON ..
make -j

git clone https://huggingface.co/gpt2-xl
python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o ./models/huggingface-models/c-model/gpt2-xl -i_g 1

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir models/345m/ -p
unzip megatron_lm_345m_v0.0.zip -d ./models/345m

export PYTHONPATH=$PWD/..:${PYTHONPATH}
python3 ../examples/pytorch/gpt/utils/megatron_fp8_ckpt_convert.py \
      -i ./models/345m/release \
      -o ./models/345m/c-model/ \
      -i_g 1 \
      -head_num 16 \
      -trained_tensor_parallel_size 1

python3 ../examples/pytorch/gpt/gpt_summarization.py \
        --data_type fp8 \
        --lib_path ./lib/libth_transformer.so \
        --summarize \
        --ft_model_location ./models/345m/c-model/ \
        --hf_model_location ./gpt2-xl/

Received the below errors:

Reusing dataset cnn_dailymail (/workdir/datasets/ccdv/ccdv___cnn_dailymail/3.0.0/3.0.0/0107f7388b5c6fae455a5661bcd134fc22da53ea75852027040d8d1e997f101f)
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1084.45it/s]
top_k: 1
top_p: 0.0
int8_mode: 0
random_seed: 5
temperature: 1
max_seq_len: 1024
max_batch_size: 1
repetition_penalty: 1
vocab_size: 50304
tensor_para_size: 1
pipeline_para_size: 1
lib_path: ./lib/libth_transformer.so
ckpt_path: ./models/345m/c-model/1-gpu
hf_config: {'activation_function': 'gelu_new', 'architectures': ['GPT2LMHeadModel'], 'attn_pdrop': 0.1, 'bos_token_id': 50256, 'embd_pdrop': 0.1, 'eos_token_id': 50256, 'initializer_range': 0.02, 'layer_norm_epsilon': 1e-05, 'model_type': 'gpt2', 'n_ctx': 1024, 'n_embd': 1600, 'n_head': 25, 'n_layer': 48, 'n_positions': 1024, 'output_past': True, 'resid_pdrop': 0.1, 'summary_activation': None, 'summary_first_dropout': 0.1, 'summary_proj_to_labels': True, 'summary_type': 'cls_index', 'summary_use_proj': True, 'task_specific_params': {'text-generation': {'do_sample': True, 'max_length': 50}}, 'vocab_size': 50257}
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INVALID_VALUE /home/.../FasterTransformer/src/fastertransformer/utils/cublasFP8MMWrapper.cu:277

feihugis avatar Mar 22 '23 05:03 feihugis

Can you try the docker image recommended in the document?

byshiue avatar Mar 22 '23 05:03 byshiue

Can you try the docker image recommended in the document?

Thanks @byshiue for your quick reply! Do you mean this docker image nvcr.io/nvidia/pytorch:22.09-py3?

feihugis avatar Mar 22 '23 05:03 feihugis

Just tried nvcr.io/nvidia/pytorch:22.09-py3, but still got the same error message.

feihugis avatar Mar 22 '23 06:03 feihugis

I also ran the commands below to tune the GEMMs, but FP8 is multiple times slower than FP16 in 8 of the 11 cases (please check the last column (speedup) in the table below). Is this expected?

./bin/gpt_gemm 8 1 32 12 128 6144 51200 4 1 1
./bin/gpt_gemm 8 1 32 12 128 6144 51200 1 1 1
batch_size seq_len head_num size_per_head dataType ### batchCount n m k algoId customOption tile numSplitsK swizzle reductionScheme workspaceSize stages exec_time speedup
8 32 12 128 4 ### 1 4608 256 1536 52 0 31 1 0 0 131 36 0.013807 0.825235
8 32 12 128 4 ### 96 32 32 128 52 0 31 1 0 0 131 36 0.009926 2.618997
8 32 12 128 4 ### 96 128 32 32 52 0 31 1 0 0 131 36 0.009689 2.590642
8 32 12 128 4 ### 1 1536 256 1536 52 0 20 1 0 0 131 36 0.011924 1.216735
8 32 12 128 4 ### 1 6144 256 1536 52 0 31 -1 0 2 131 36 0.013776 0.849793
8 32 12 128 4 ### 1 1536 256 6144 52 0 20 1 0 0 131 36 0.022243 1.277599
8 1 12 128 4 ### 1 4608 8 1536 52 0 31 1 0 0 131 36 0.013838 1.576082
8 1 12 128 4 ### 1 1536 8 1536 52 0 31 1 0 0 131 36 0.013509 2.182391
8 1 12 128 4 ### 1 6144 8 1536 52 0 31 1 0 0 131 36 0.013831 1.5266
8 1 12 128 4 ### 1 1536 8 6144 52 0 31 1 0 0 131 36 0.028986 2.848187
8 1 12 128 4 ### 1 51200 8 1536 52 0 31 1 0 0 131 36 0.028986 0.484068
8 32 12 128 1 ### 1 4608 256 1536 39 0 18 -1 0 2 131 35 0.016731  
8 32 12 128 1 ### 96 32 32 128 113 -1 -1 -1 -1 -1 -1 -1 0.00379  
8 32 12 128 1 ### 96 128 32 32 110 -1 -1 -1 -1 -1 -1 -1 0.00374  
8 32 12 128 1 ### 1 1536 256 1536 114 -1 -1 -1 -1 -1 -1 -1 0.0098  
8 32 12 128 1 ### 1 6144 256 1536 39 0 20 -1 0 2 131 35 0.016211  
8 32 12 128 1 ### 1 1536 256 6144 109 -1 -1 -1 -1 -1 -1 -1 0.01741  
8 1 12 128 1 ### 1 4608 8 1536 115 -1 -1 -1 -1 -1 -1 -1 0.00878  
8 1 12 128 1 ### 1 1536 8 1536 114 -1 -1 -1 -1 -1 -1 -1 0.00619  
8 1 12 128 1 ### 1 6144 8 1536 100 -1 -1 -1 -1 -1 -1 -1 0.00906  
8 1 12 128 1 ### 1 1536 8 6144 21 0 15 11 0 2 540672 16 0.010177  
8 1 12 128 1 ### 1 51200 8 1536 110 -1 -1 -1 -1 -1 -1 -1 0.05988  

feihugis avatar Mar 22 '23 23:03 feihugis

FP8 does not support GEMM tuning yet.

We cannot reproduce your issue. Can you run the program again with the environment variable FT_DEBUG_LEVEL=DEBUG set? We also suspect it may be caused by MIG. Can you try running on a full H100 first?
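For example (a minimal sketch; nvidia-smi -L is the standard way to check whether a MIG slice or the full GPU is visible inside the container):

# Confirm which device the container actually sees (full H100 vs. a MIG slice).
nvidia-smi -L

# Re-run the failing example with FasterTransformer's debug logging enabled.
export FT_DEBUG_LEVEL=DEBUG
python3 ../examples/pytorch/gpt/gpt_summarization.py \
        --data_type fp8 \
        --lib_path ./lib/libth_transformer.so \
        --summarize \
        --ft_model_location ./models/345m/c-model/ \
        --hf_model_location ./gpt2-xl/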

byshiue avatar Mar 23 '23 00:03 byshiue

Got the same error on both H100 and H100-MIG, as below:

[FT][ERROR] CUDA runtime error: an illegal memory access was encountered FasterTransformer/src/fastertransformer/models/gpt_fp8/GptFP8ContextDecoder.cc:243

feihugis avatar Mar 23 '23 18:03 feihugis

Can you post your scripts and full log?

byshiue avatar Mar 24 '23 00:03 byshiue

Can you post your scripts and full log?

Hi @byshiue, I created new docker containers to test it again. For nvcr.io/nvidia/pytorch:22.09-py3, I confirm it works well now (not quite sure why it failed last time), but I also noticed that FP8 is slower than FP16 based on the logging info below. That seems unexpected.

  • FP16:
Faster Transformers (total latency: 3.274726152420044 sec)
rouge1 : 21.664839291439336
rouge2 : 5.412904663794084
rougeL : 15.354753879504434
rougeLsum : 18.928251038380523
  • FP8:
Faster Transformers (total latency: 4.783168077468872 sec)
rouge1 : 23.47552607890326
rouge2 : 6.86675093270573
rougeL : 17.23756077641688
rougeLsum : 20.871397598597387

For nvcr.io/nvidia/pytorch:23.02-py3, it failed as below.

  • Do you know why the newer version of the PyTorch docker image causes the failure?
root@rl-dgxh-r72-u24:/workspace/FasterTransformer# python3 examples/pytorch/gpt/utils/megatron_fp8_ckpt_convert.py -i ./models/345m/release -o ./models/345m/c-model/ -i_g 1 -head_num 16 -trained_tensor_parallel_size 1

=============== Argument ===============
saved_dir: ./models/345m/c-model/
in_file: ./models/345m/release
infer_gpu_num: 1
head_num: 16
trained_tensor_parallel_size: 1
processes: 16
weight_data_type: fp32
load_checkpoints_to_cpu: 1
vocab_path: None
merges_path: None
========================================
[INFO] Spent 0:00:03.138548 (h:m:s) to convert the model
root@rl-dgxh-r72-u24:/workspace/FasterTransformer# python3 examples/pytorch/gpt/gpt_summarization.py --data_type fp8 --lib_path ./build/lib/libth_transformer.so --summarize --ft_model_location ./models/345m/c-model/ --hf_model_location ./gpt2-xl/
/workspace/FasterTransformer/examples/pytorch/gpt/utils/gpt.py:221: SyntaxWarning: assertion is always true, perhaps remove parentheses?
  assert(self.pre_embed_idx < self.post_embed_idx, "Pre decoder embedding index should be lower than post decoder embedding index.")
Reusing dataset cnn_dailymail (/workdir/datasets/ccdv/ccdv___cnn_dailymail/3.0.0/3.0.0/0107f7388b5c6fae455a5661bcd134fc22da53ea75852027040d8d1e997f101f)
100%|█████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 1038.62it/s]
top_k: 1
top_p: 0.0
int8_mode: 0
random_seed: 5
temperature: 1
max_seq_len: 1024
max_batch_size: 1
repetition_penalty: 1
vocab_size: 50304
tensor_para_size: 1
pipeline_para_size: 1
lib_path: ./build/lib/libth_transformer.so
ckpt_path: ./models/345m/c-model/1-gpu
hf_config: {'activation_function': 'gelu_new', 'architectures': ['GPT2LMHeadModel'], 'attn_pdrop': 0.1, 'bos_token_id': 50256, 'embd_pdrop': 0.1, 'eos_token_id': 50256, 'initializer_range': 0.02, 'layer_norm_epsilon': 1e-05, 'model_type': 'gpt2', 'n_ctx': 1024, 'n_embd': 1600, 'n_head': 25, 'n_layer': 48, 'n_positions': 1024, 'output_past': True, 'resid_pdrop': 0.1, 'summary_activation': None, 'summary_first_dropout': 0.1, 'summary_proj_to_labels': True, 'summary_type': 'cls_index', 'summary_use_proj': True, 'task_specific_params': {'text-generation': {'do_sample': True, 'max_length': 50}}, 'vocab_size': 50257}
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[WARNING] gemm_config.in is not found; using default GEMM algo
[FT][ERROR] CUDA runtime error: an illegal memory access was encountered /workspace/FasterTransformer/src/fastertransformer/models/gpt_fp8/GptFP8ContextDecoder.cc:243

Here is the script I used to run tests:

docker run  --privileged --gpus '"device=0"' --network=host --shm-size 32g --memory 128g --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --name ft-h100-test -itd nvcr.io/nvidia/pytorch:22.09-py3

git clone https://github.com/NVIDIA/FasterTransformer.git
cd FasterTransformer && git submodule init && git submodule update
mkdir build && cd build
cmake -DSM=90 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -DENABLE_FP8=ON ..
make -j
pip install -r ../examples/pytorch/gpt/requirement.txt

cd ../ && git clone https://huggingface.co/gpt2-xl
python examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o ./models/huggingface-models/c-model/gpt2-xl -i_g 1

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir models/345m/ -p && unzip megatron_lm_345m_v0.0.zip -d ./models/345m

export PYTHONPATH=$PWD:${PYTHONPATH}
export FT_DEBUG_LEVEL=DEBUG
python3 examples/pytorch/gpt/utils/megatron_fp8_ckpt_convert.py \
      -i ./models/345m/release \
      -o ./models/345m/c-model/ \
      -i_g 1 \
      -head_num 16 \
      -trained_tensor_parallel_size 1

python3 examples/pytorch/gpt/gpt_summarization.py \
        --data_type fp8 \
        --lib_path ./build/lib/libth_transformer.so \
        --summarize \
        --ft_model_location ./models/345m/c-model/ \
        --hf_model_location ./gpt2-xl/

feihugis avatar Mar 24 '23 22:03 feihugis

It may be that the cuBLASLt in the latest docker image has some updates. We haven't verified this version yet.

Regarding the FP16 vs. FP8 performance: FP8 only brings a speedup when the batch size is large enough, but the batch size in the example is only 1.
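To see the crossover you could sweep the batch size, for example (a sketch; note that --max_batch_size is an assumption based on the printed config entry, so check gpt_summarization.py's argument parser for the actual flag):

# Hypothetical sweep over batch sizes for fp16 vs. fp8.
for dt in fp16 fp8; do
  for bs in 1 8 32 64; do
    echo "== data_type=${dt}, batch_size=${bs} =="
    python3 examples/pytorch/gpt/gpt_summarization.py \
        --data_type "${dt}" \
        --max_batch_size "${bs}" \
        --lib_path ./build/lib/libth_transformer.so \
        --summarize \
        --ft_model_location ./models/345m/c-model/ \
        --hf_model_location ./gpt2-xl/
  done
done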

byshiue avatar Mar 28 '23 01:03 byshiue

Regarding the FP16 vs. FP8 performance: FP8 only brings a speedup when the batch size is large enough, but the batch size in the example is only 1.

I made some minor changes in gpt_summarization.py to make FP8 work with batched input. When the shape of line_encoded was [64, 658], FP16 took around 0.60 seconds but FP8 took around 1.77 seconds. Is that expected? I am not sure whether there are any issues in my dev environment setup. Could you please share some rough FP8 performance numbers if you have them? For larger input shapes, FP8 hit a runtime error without any log output, even with export FT_DEBUG_LEVEL=DEBUG set.
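A minimal sketch for localizing such silent illegal-access failures, using only standard CUDA tooling (CUDA_LAUNCH_BLOCKING and compute-sanitizer; neither is FasterTransformer-specific):

# Synchronous kernel launches: the error surfaces at the offending launch
# rather than at a later synchronization point.
export FT_DEBUG_LEVEL=DEBUG
CUDA_LAUNCH_BLOCKING=1 python3 examples/pytorch/gpt/gpt_summarization.py \
        --data_type fp8 \
        --lib_path ./build/lib/libth_transformer.so \
        --summarize \
        --ft_model_location ./models/345m/c-model/ \
        --hf_model_location ./gpt2-xl/

# Optional: report out-of-bounds accesses together with the faulting kernel name.
compute-sanitizer --tool memcheck python3 examples/pytorch/gpt/gpt_summarization.py \
        --data_type fp8 --lib_path ./build/lib/libth_transformer.so --summarize \
        --ft_model_location ./models/345m/c-model/ --hf_model_location ./gpt2-xl/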

feihugis avatar Mar 29 '23 22:03 feihugis

Can you use nsys to profile your test? For example,

nsys profile -o fp8 python gpt_summarization.py --data_type fp8
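A slightly fuller sketch with standard Nsight Systems options (on older nsys versions the report extension is .qdrep rather than .nsys-rep):

# Capture CUDA API calls, kernels, and NVTX ranges into fp8.nsys-rep.
nsys profile -t cuda,nvtx -o fp8 python3 gpt_summarization.py --data_type fp8

# Summarize kernel and memcpy time from the captured report.
nsys stats fp8.nsys-rep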

byshiue avatar Mar 30 '23 01:03 byshiue

Hi @byshiue, I ran into the same issue as mentioned by feihugis, using the example with batch_size=1. Any update on this issue? (screenshot attached)

wohenniubi avatar Jul 12 '23 19:07 wohenniubi

Hi @feihugis, I have some questions and would appreciate your advice.

  1. Why use --hf_model_location ./gpt2-xl/? megatron_lm_345m does not have a corresponding HF model format, and the FP8 model currently does not support HF FP8 weights, so this parameter seems to be of little use here.
  2. When I use --weights_data_type fp16, the model fails to load, as follows:

Reusing dataset cnn_dailymail (/root/.cache/huggingface/datasets/cnn_dailymail/3.0.0/3.0.0/a88add311826ada312ac2321c9f0cf00dcc10f72c4b832fbab9dadae87052b48)
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 272.51it/s]
top_k: 1
top_p: 0.0
int8_mode: 0
random_seed: 5
temperature: 1
max_seq_len: 1024
max_batch_size: 1
repetition_penalty: 1
vocab_size: 50304
tensor_para_size: 1
pipeline_para_size: 1
lib_path: ./lib/libth_transformer.so
ckpt_path: ../models/345m/c-model/1-gpu
hf_config: {'activation_function': 'gelu_new', 'architectures': ['GPT2LMHeadModel'], 'attn_pdrop': 0.1, 'bos_token_id': 50256, 'embd_pdrop': 0.1, 'eos_token_id': 50256, 'initializer_range': 0.02, 'layer_norm_epsilon': 1e-05, 'model_type': 'gpt2', 'n_ctx': 1024, 'n_embd': 1600, 'n_head': 25, 'n_layer': 48, 'n_positions': 1024, 'output_past': True, 'resid_pdrop': 0.1, 'summary_activation': None, 'summary_first_dropout': 0.1, 'summary_proj_to_labels': True, 'summary_type': 'cls_index', 'summary_use_proj': True, 'task_specific_params': {'text-generation': {'do_sample': True, 'max_length': 50}}, 'vocab_size': 50257}
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.wpe.bin only has 2097152, but request 4194304, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.wte.bin only has 103022592, but request 206045184, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.final_layernorm.bias.bin only has 2048, but request 4096, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.final_layernorm.weight.bin only has 2048, but request 4096, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.wte.bin only has 103022592, but request 206045184, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.layers.0.input_layernorm.bias.bin only has 2048, but request 4096, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.layers.0.input_layernorm.weight.bin only has 2048, but request 4096, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.layers.0.attention.query_key_value.bias.0.bin only has 6144, but request 12288, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.layers.0.attention.dense.bias.bin only has 2048, but request 4096, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.layers.0.post_attention_layernorm.bias.bin only has 2048, but request 4096, loading model fails!
[FT][WARNING] file ../models/345m/c-model/1-gpu/model.layers.0.post_attention_layernorm.weight.bin only has 2048, but request 4096, loading model fails!

Can ../examples/pytorch/gpt/utils/megatron_fp8_ckpt_convert.py and ../examples/pytorch/gpt/gpt_summarization.py only use fp32 weights here?
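The byte counts in these warnings are consistent with fp16 weight files being read by a loader that requests fp32 sizes. A quick arithmetic check (a sketch; the 345M model has hidden_size 1024, 1024 positions, and a padded vocab of 50304):

# model.wpe.bin: 1024 positions * 1024 hidden * 4 bytes (fp32) = 4194304 (requested)
#                1024 * 1024 * 2 bytes (fp16)                  = 2097152 (on disk)
# model.wte.bin: 50304 * 1024 * 4 = 206045184 (requested); * 2 = 103022592 (on disk)
# So the FP8 example's loader appears to read the files as fp32 regardless of
# the --weights_data_type used at conversion time.
ls -l ../models/345m/c-model/1-gpu/model.wpe.bin ../models/345m/c-model/1-gpu/model.wte.bin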

double-vin avatar Aug 10 '23 10:08 double-vin

Hi @byshiue, I ran into the same issue as mentioned by feihugis, using the example with batch_size=1. Any update on this issue? (screenshot attached)

@wohenniubi May I ask how you set the parameters and commands to distinguish the FP16 results from the FP8 results here? First, fp8 or fp16 is selected through the --data_type parameter. Second, for both fp8 and fp16, is --weights_data_type set to fp16? Third, how do you set --ft_model_location for fp8 and fp16? For fp8 you use the ../models/345m/c-model/ path, but for fp16, which model path do you use: the fp16 model converted by megatron_ckpt_convert.py, or by megatron_fp8_ckpt_convert.py?
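For reference, a sketch of the two pipelines being contrasted (flag spellings for megatron_ckpt_convert.py follow the FasterTransformer GPT guide but should be verified against the script's --help; whether that converter is the right one for the fp16 baseline is exactly the open question):

# FP16 baseline: regular Megatron converter, then --data_type fp16.
python3 examples/pytorch/gpt/utils/megatron_ckpt_convert.py \
        -i ./models/345m/release -o ./models/345m/c-model-fp16/ \
        -i_g 1 -t_g 1 -head_num 16
python3 examples/pytorch/gpt/gpt_summarization.py --data_type fp16 \
        --lib_path ./build/lib/libth_transformer.so --summarize \
        --ft_model_location ./models/345m/c-model-fp16/ \
        --hf_model_location ./gpt2-xl/

# FP8 run: fp8 converter, then --data_type fp8 (as earlier in this thread).
python3 examples/pytorch/gpt/utils/megatron_fp8_ckpt_convert.py \
        -i ./models/345m/release -o ./models/345m/c-model-fp8/ \
        -i_g 1 -head_num 16 -trained_tensor_parallel_size 1
python3 examples/pytorch/gpt/gpt_summarization.py --data_type fp8 \
        --lib_path ./build/lib/libth_transformer.so --summarize \
        --ft_model_location ./models/345m/c-model-fp8/ \
        --hf_model_location ./gpt2-xl/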

double-vin avatar Aug 14 '23 08:08 double-vin