
GPT2 FP8 doesn't work on the L4/L40 platform

Open • champson opened this issue 2 years ago • 3 comments

Branch/Tag/Commit

main

Docker Image Version

nvcr.io/nvidia/pytorch:23.04-py3

GPU name

L4

CUDA Driver

525.105.17

Reproduced Steps

1. docker run --gpus all -it --rm --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/pytorch:23.04-py3 bash
2. Run the following commands inside the container:

#!/bin/bash
git clone https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
cmake -DSM=89 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -DENABLE_FP8=ON ..
make -j

git clone https://huggingface.co/gpt2-xl
python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o ./models/huggingface-models/c-model/gpt2-xl -i_g 1

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir models/345m/ -p
unzip megatron_lm_345m_v0.0.zip -d ./models/345m

export PYTHONPATH=$PWD/..:${PYTHONPATH}
python3 ../examples/pytorch/gpt/utils/megatron_fp8_ckpt_convert.py \
      -i ./models/345m/release \
      -o ./models/345m/c-model/ \
      -i_g 1 \
      -head_num 16 \
      -trained_tensor_parallel_size 1

python3 ../examples/pytorch/gpt/gpt_summarization.py \
        --data_type fp8 \
        --lib_path ./lib/libth_transformer.so \
        --summarize \
        --ft_model_location ./models/345m/c-model/ \
        --hf_model_location ./gpt2-xl/
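
Before running the summarization script, a quick way to confirm that the build produced a loadable PyTorch extension is to load it directly. This is only a minimal sketch, run from the build directory; the library path is taken from the steps above and uses no FasterTransformer-specific API beyond loading the custom-op library.

# Sanity check (sketch): load the custom-op library produced by `make` above.
# If this raises, the FP8 build itself is the problem rather than the runtime path.
import torch

torch.classes.load_library("./lib/libth_transformer.so")
print("libth_transformer.so loaded")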

champson · Jun 01 '23 06:06

I get the following error:

top_k: 1 top_p: 0.0 int8_mode: 0 random_seed: 5 temperature: 1 max_seq_len: 1024 max_batch_size: 1 repetition_penalty: 1 vocab_size: 50304 tensor_para_size: 1 pipeline_para_size: 1 lib_path: ./lib/libth_transformer.so ckpt_path: ./models/345m/c-model/1-gpu
hf_config: {'activation_function': 'gelu_new', 'architectures': ['GPT2LMHeadModel'], 'attn_pdrop': 0.1, 'bos_token_id': 50256, 'embd_pdrop': 0.1, 'eos_token_id': 50256, 'initializer_range': 0.02, 'layer_norm_epsilon': 1e-05, 'model_type': 'gpt2', 'n_ctx': 1024, 'n_embd': 1600, 'n_head': 25, 'n_layer': 48, 'n_positions': 1024, 'output_past': True, 'resid_pdrop': 0.1, 'summary_activation': None, 'summary_first_dropout': 0.1, 'summary_proj_to_labels': True, 'summary_type': 'cls_index', 'summary_use_proj': True, 'task_specific_params': {'text-generation': {'do_sample': True, 'max_length': 50}}, 'vocab_size': 50257}
[FT][WARNING] Skip NCCL initialization since requested tensor/pipeline parallel sizes are equals to 1.
[FT][ERROR] CUDA runtime error: CUBLAS_STATUS_INVALID_VALUE /root/FasterTransformer/src/fastertransformer/utils/cublasFP8MMWrapper.cu:277
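
For context, a minimal environment check (a sketch using PyTorch, which the 23.04 container already provides; it does not touch any FasterTransformer API) can confirm the compute capability and CUDA version the failure occurs with. L4/L40 are Ada parts (SM 8.9), while FasterTransformer's FP8 GPT examples appear to target Hopper (SM 9.0), so a CUBLAS_STATUS_INVALID_VALUE raised from cublasFP8MMWrapper may simply mean cuBLASLt rejects the requested FP8 GEMM configuration on this architecture/CUDA combination.

# Environment sanity check (sketch): report the GPU, its compute capability,
# and the CUDA version the PyTorch build in the container was compiled against.
import torch

dev = torch.cuda.current_device()
major, minor = torch.cuda.get_device_capability(dev)
print("GPU:", torch.cuda.get_device_name(dev))    # e.g. "NVIDIA L4"
print("Compute capability:", f"{major}.{minor}")  # L4/L40 report 8.9 (Ada); H100 reports 9.0
print("CUDA (torch build):", torch.version.cuda)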

champson · Jun 01 '23 06:06

@champson were you able to get FasterTransformer working on an L4 GPU?

shannonphu · Sep 15 '23 22:09

FasterTransformer development has transitioned to TensorRT-LLM.

TensorRT-LLM supports GPT2 FP8 on L40.

byshiue · Oct 20 '23 07:10