Extremely slow generation speed for llama 2 70B chat model
I benchmarked Llama 2 7B chat (int8) and got ~600 tokens in about 12 s on an A100 GPU, whereas the HF pipeline takes about 25 s for the same input and parameters.
However, when I try the Llama 2 70B chat model (int8), it is extremely slow (~90 s for 500 tokens) versus the HF pipeline, which takes ~32 s (although the pipeline uses multiple GPUs, so it's not a fair comparison?). Is this expected, or am I doing something wrong?
Here's my code:
import time

import ctranslate2
import transformers

CT2_INT8_MODEL_CKPT_LLAMA_7B = "llama-2-7b-chat-ct2"
CT2_INT8_MODEL_CKPT_LLAMA_70B = "llama-2-70b-chat-ct2"

generator = ctranslate2.Generator(CT2_INT8_MODEL_CKPT_LLAMA_70B, device="cuda")
# LLAMA_PATH_7B is the local HF Llama 2 checkpoint path, used only for its tokenizer.
tokenizer = transformers.AutoTokenizer.from_pretrained(LLAMA_PATH_7B)


def predict(prompt: str):
    "Generate text given a prompt"
    start = time.perf_counter()
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
    results = generator.generate_batch(
        [tokens],
        sampling_temperature=0.8,
        sampling_topk=0,
        sampling_topp=1,
        max_length=1000,
        include_prompt_in_result=False,
    )
    tokens = results[0].sequences_ids[0]
    output = tokenizer.decode(tokens)
    request_time = time.perf_counter() - start
    return {'tok_count': len(tokens),
            'time': request_time,
            'question': prompt,
            'answer': output,
            'note': 'CTranslate2 int8 quantization'}


print('benchmarking ctranslate2...\n')
time_taken = []
results = []
for _ in range(10):
    start = time.perf_counter()
    out = predict("explain rotary positional embeddings")
    print(out)
    results.append(out)
    request_time = time.perf_counter() - start
    time_taken.append(request_time)
although the pipeline uses multiple GPUs, so it's not a fair comparison?
Well yes, using multiple GPUs will be faster.
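For reference, here is a minimal sketch of loading on more than one GPU, assuming CTranslate2's device_index option (which places one full replica of the model on each listed GPU rather than sharding a single model, so it mainly helps when several requests run concurrently):

import ctranslate2

# Hypothetical multi-GPU load: one full copy of the converted model per listed GPU.
# This is data parallelism, not tensor parallelism, so a single generate_batch call
# still executes on one device; the extra GPUs help with concurrent requests.
generator = ctranslate2.Generator(
    "llama-2-70b-chat-ct2",
    device="cuda",
    device_index=[0, 1],
)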
For CTranslate2 you might also want to use int8_float16 instead of int8.
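A minimal sketch of that suggestion (assuming the model was converted with ct2-transformers-converter and that the GPU supports FP16): either reconvert with int8_float16 quantization, or keep the existing int8 weights and select the compute type when loading.

# Option 1 (shell): reconvert the checkpoint with int8_float16 quantization.
#   ct2-transformers-converter --model meta-llama/Llama-2-70b-chat-hf \
#       --quantization int8_float16 --output_dir llama-2-70b-chat-ct2
#
# Option 2 (Python): keep the existing int8 conversion and pick the compute type at load time.
import ctranslate2

generator = ctranslate2.Generator(
    "llama-2-70b-chat-ct2",
    device="cuda",
    compute_type="int8_float16",
)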