
RuntimeError: Timed out

Open spongxin opened this issue 1 year ago • 1 comment

When I run Meta-Llama-3-8B-Instruct or Meta-Llama-3.1-8B-Instruct with

  1. python 3.12.5
  2. scalellm 0.1.9+cu118torch2.2.2
  3. torch 2.2.2+cu118
  4. torchaudio 2.2.2+cu118
  5. torchvision 0.17.2+cu118

it fails with:

```
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240816 14:10:56.999786 104262 llm_handler.cpp:171] Creating engine with devices: cuda:0,cuda:1,cuda:2,cuda:3,cuda:4,cuda:5,cuda:6,cuda:7
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/xintianle/software/miniconda3/envs/scalellm/lib/python3.12/site-packages/scalellm/serve/api_server.py", line 125, in <module>
    llm_engine = AsyncLLMEngine(
                 ^^^^^^^^^^^^^^^
  File "/home/xintianle/software/miniconda3/envs/scalellm/lib/python3.12/site-packages/scalellm/llm_engine.py", line 168, in __init__
    self._handler = LLMHandler(options)
                    ^^^^^^^^^^^^^^^^^^^
RuntimeError: Timed out
```

spongxin avatar Aug 16 '24 06:08 spongxin

Thanks for reporting the issue. This looks like an NCCL communication timeout. To investigate further, could you provide the following context:

1. Environment info: `python -m scalellm.utils.collect_env`
2. NCCL logs, enabled with `NCCL_DEBUG=INFO` (see the NCCL Environment Variables documentation)
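A minimal sketch of gathering both pieces of information. The `collect_env` module and the `NCCL_DEBUG` variable come from the thread above; the `api_server` invocation and its `--model` flag are assumptions based on the traceback (substitute however you normally launch the server), and the model path is a placeholder:

```shell
# 1. Dump environment info to attach to the issue
python -m scalellm.utils.collect_env > scalellm_env.txt

# 2. Re-run the server with NCCL debug logging; NCCL_DEBUG_SUBSYS=INIT,NET
#    limits the output to setup and transport, which is usually where a
#    communication timeout shows up.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,NET \
  python -m scalellm.serve.api_server --model Meta-Llama-3-8B-Instruct \
  2>&1 | tee nccl_debug.log
```

Since the timeout happens during engine construction across 8 GPUs, the NCCL log from the failing startup is usually enough; the full run need not complete.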

Meanwhile, is it possible for you to try scalellm with torch 2.3.0?

guocuimi avatar Aug 17 '24 15:08 guocuimi