ScaleLLM
RuntimeError: Timed out
When I run Meta-Llama-3-8B-Instruct or Meta-Llama-3.1-8B-Instruct with
- python 3.12.5
- scalellm 0.1.9+cu118torch2.2.2
- torch 2.2.2+cu118
- torchaudio 2.2.2+cu118
- torchvision 0.17.2+cu118

the following error occurred:
```
WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240816 14:10:56.999786 104262 llm_handler.cpp:171] Creating engine with devices: cuda:0,cuda:1,cuda:2,cuda:3,cuda:4,cuda:5,cuda:6,cuda:7
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/xintianle/software/miniconda3/envs/scalellm/lib/python3.12/site-packages/scalellm/serve/api_server.py", line 125, in <module>
    llm_engine = AsyncLLMEngine(
                 ^^^^^^^^^^^^^^^
  File "/home/xintianle/software/miniconda3/envs/scalellm/lib/python3.12/site-packages/scalellm/llm_engine.py", line 168, in __init__
    self._handler = LLMHandler(options)
                    ^^^^^^^^^^^^^^^^^^^
RuntimeError: Timed out
```
Thanks for reporting the issue. It looks like an NCCL communication timeout error. To investigate further, could you provide the following context:

1. Environment info: `python -m scalellm.utils.collect_env`
2. NCCL logs with `NCCL_DEBUG=INFO` enabled (see the NCCL Environment Variables documentation)
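For step 2, a minimal sketch of enabling the NCCL logs before launching the server (the `NCCL_DEBUG_SUBSYS` filter is optional, and the launch command should be whatever you normally use):

```shell
# Turn on verbose NCCL logging for this shell session.
export NCCL_DEBUG=INFO
# Optional: restrict output to the init and network subsystems to reduce noise.
export NCCL_DEBUG_SUBSYS=INIT,NET

# Then start the server as usual, e.g.:
# python -m scalellm.serve.api_server <your usual arguments>
```

The NCCL log lines (prefixed with the hostname and rank) will appear on stderr alongside the existing glog output, which should show where the inter-GPU communication stalls.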
Meanwhile, would it be possible for you to try scalellm with torch 2.3.0?
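For reference, a sketch of upgrading the torch stack in the existing conda environment (the cu118 build and the companion torchvision/torchaudio versions below are assumptions; verify them against the PyTorch release compatibility matrix, and install the matching scalellm wheel afterwards):

```shell
# Install the torch 2.3.0 cu118 stack from the official PyTorch wheel index.
pip install torch==2.3.0+cu118 torchvision==0.18.0+cu118 torchaudio==2.3.0+cu118 \
    --index-url https://download.pytorch.org/whl/cu118
```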