Roman Koshkin
@youkaichao Could you please share a minimal working example for offline inference with tensor-parallel-size > 1?
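In case it helps others landing on this thread, a minimal offline-inference sketch with tensor parallelism might look like the following. This is just a sketch under assumptions: it assumes vLLM is installed, 4 GPUs are visible, and the model name is only an example, not the one from the issue.

```python
from vllm import LLM, SamplingParams

# Shard the model across 4 GPUs with tensor parallelism.
# (Requires 4 visible GPUs; model name is an example.)
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    tensor_parallel_size=4,
)

sampling = SamplingParams(temperature=0.8, max_tokens=64)

# Offline (batch) generation, no API server involved.
outputs = llm.generate(["What is tensor parallelism?"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```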
@youkaichao let me try and see if it works for me. By the way, can you check whether llama3-8b works? And what hardware / CUDA version are you using?
Similar problem here:

```bash
singularity run \
  --nv \
  --env HF_HOME=/workspace/huggingface/hub \
  --writable-tmpfs \
  --bind $volume:/workspace/huggingface/hub \
  --env HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  docker://vllm/vllm-openai:v0.4.1 \
  --model casperhansen/llama-3-70b-instruct-awq \
  --tensor-parallel-size 4
```

Everything just...
Have you fixed the issue? I can't run any model with TP > 1
@chrisbraddock Could you post minimal working code, please? Also, are you running in the official vLLM Docker container? If not, how did you install vLLM (from source, from PyPI)? Are...
@chrisbraddock I got it working in a very similar way (I described it [here](https://github.com/vllm-project/vllm/issues/4431#issuecomment-2095138681)). The trick was to run `ray` in a separate terminal session and specify `LD_LIBRARY_PATH` correctly.
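For reference, the shape of that workaround was roughly as below. The `LD_LIBRARY_PATH` value is environment-specific, so the path here is a placeholder, and the model flag is taken from the command earlier in this thread.

```bash
# Terminal 1: start a ray head node before launching vLLM.
export LD_LIBRARY_PATH=/path/to/your/cuda/libs:$LD_LIBRARY_PATH  # placeholder path
ray start --head

# Terminal 2: launch the vLLM server with tensor parallelism;
# it attaches to the already-running ray cluster.
export LD_LIBRARY_PATH=/path/to/your/cuda/libs:$LD_LIBRARY_PATH
python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/llama-3-70b-instruct-awq \
  --tensor-parallel-size 4
```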
Has anyone solved this? I'm new to JAX/FLAX, so I have no idea why it's taking so much memory. Though I'm quite happy with the speed.
> I highly suggest your guys to use kuberay, launch a ray cluster and submit vLLM worker. That's the most easiest way I found and kuberay will reduce your chance...
@MikeBirdTech AFAIK, LMStudio is not designed to handle simultaneous requests from many clients. I have a 4xA100 box on which I run (sharded) models with tensor parallelism (for speed). That's...
Same problem here. Posted (almost the same) [errors](https://github.com/Chainlit/chainlit/issues/745#issuecomment-2009465549).