Multi-GPU demo failed on two A6000s
Thanks for the great work! When going through the tutorial, I can successfully run Vicuna on a single A6000:

However, it crashes when I try to accelerate things with 2 GPUs. With CUDA_LAUNCH_BLOCKING=1 python3 -m fastchat.serve.cli --model-path /path/to/7b-model --num-gpus 2 on the two A6000s, it fails with:

Curious whether I misconfigured something on my side. Thanks!
$ pip show fschat
Name: fschat
Version: 0.2.1
Summary: An open platform for training, serving, and evaluating large language model based chatbots.
Home-page:
Author:
Author-email:
License:
Location: /home/jiawei/.conda/envs/g/lib/python3.8/site-packages
Requires: accelerate, fastapi, gradio, markdown2, numpy, prompt-toolkit, requests, rich, sentencepiece, tokenizers, torch, uvicorn, wandb
$ pip show transformers
Name: transformers
Version: 4.29.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors (https://github.com/huggingface/transformers/graphs/contributors)
Author-email: [email protected]
License: Apache 2.0 License
Location: /home/jiawei/.conda/envs/g/lib/python3.8/site-packages
Requires: filelock, huggingface-hub, numpy, packaging, pyyaml, regex, requests, tokenizers, tqdm
What is the VRAM on your GPUs? By default, FastChat caps max memory at 13GiB per GPU when using multiple GPUs, which comes to only 26GiB total across two GPUs and may not be enough.
The A6000 has 48GB of VRAM.
Can you try changing this line to "40GiB"? https://github.com/lm-sys/FastChat/blob/e112299e3c3529de0c430b98fd24743296d273f4/fastchat/serve/inference.py#L29
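For context, here is roughly what that cap amounts to once the model is loaded through transformers/accelerate. This is an illustrative sketch, not FastChat's exact code; the variable names and the placeholder path are assumptions:

```python
# Illustrative sketch only -- not the exact code in fastchat/serve/inference.py.
# It shows how a per-GPU memory cap such as "13GiB" or "40GiB" is typically
# passed through transformers' `max_memory` argument when sharding across GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/7b-model"  # placeholder, as in the command above
num_gpus = 2

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",  # let accelerate place layers across the visible GPUs
    max_memory={i: "40GiB" for i in range(num_gpus)},  # the cap the linked line sets to "13GiB"
)
```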
> What is the VRAM on your GPUs? By default, FastChat caps max memory at 13GiB per GPU when using multiple GPUs, which comes to only 26GiB total across two GPUs and may not be enough.

I have been running the 13B model on 2 GPUs for some time, and that could explain the random (but rare) crashes. How much memory is enough?
For me it usually takes around 40GiB. It never worked with 2 GPUs @ 13GiB for me 🤔 what am I doing wrong?

> For me it usually takes around 40GiB. It never worked with 2 GPUs @ 13GiB for me 🤔 what am I doing wrong?
With two A30s, nvidia-smi reports around 14GiB used on each GPU. I'm using CUDA 11.7 and PyTorch 2.0.0 on Linux.
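As a rough back-of-the-envelope check: a 13B-parameter model in fp16 needs about 13e9 × 2 bytes ≈ 26GB for the weights alone, i.e. roughly 13GB per GPU when split across two. That lines up with the ~14GiB per GPU reported above and leaves very little headroom under the default 2 × 13GiB cap, which would explain why raising the cap to 40GiB helps.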
> Can you try changing this line to "40GiB"?
> https://github.com/lm-sys/FastChat/blob/e112299e3c3529de0c430b98fd24743296d273f4/fastchat/serve/inference.py#L29
Unfortunately, it does not work for the 7B model. It keeps failing with: ../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2,0,0], thread: [104,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Experiencing the same issue here.
@ganler @weiddeng It is a tokenizer version issue.
https://github.com/lm-sys/FastChat/issues/199#issuecomment-1537618299
Please refer to the comment linked above for the solution. Let us know if it resolves the problem. Feel free to re-open if not.
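For anyone who lands here later: one quick, generic way to check for this kind of tokenizer/model mismatch is to compare the tokenizer's vocabulary size against the model's embedding table. This is an illustrative sketch (it assumes the assert comes from token IDs past the end of the embedding table, and uses the placeholder path from the report), not the specific fix described in the linked comment:

```python
# Generic diagnostic sketch (assumption: the CUDA "index out of bounds" assert
# is triggered by token IDs that exceed the model's embedding table, e.g. due
# to a mismatched tokenizer version).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/7b-model"  # placeholder, as in the original report

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path)

vocab_size = len(tokenizer)
embed_rows = model.get_input_embeddings().weight.shape[0]
print(f"tokenizer vocab size: {vocab_size}, embedding rows: {embed_rows}")
# If vocab_size > embed_rows, the tokenizer can emit IDs the model cannot look
# up, which surfaces on CUDA as the ScatterGatherKernel assertion seen above.
```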