Roman Koshkin

Results 11 comments of Roman Koshkin

@youkaichao Could you please share a minimal working example for offline inference with tensor-parallel-size > 1?

@youkaichao let me try and see if it works for me. By the way, can you check if llama3-8b works? And what hardware / CUDA are you using?

Similar problem here:

```bash
singularity run \
  --nv \
  --env HF_HOME=/workspace/huggingface/hub \
  --writable-tmpfs \
  --bind $volume:/workspace/huggingface/hub \
  --env HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
  docker://vllm/vllm-openai:v0.4.1 \
  --model casperhansen/llama-3-70b-instruct-awq \
  --tensor-parallel-size 4
```

Everything just...

Have you fixed the issue? I can't run any model with TP > 1.

@chrisbraddock Could you post minimal working code, please? Also, are you running in the official vLLM docker container? If not, how did you install vLLM (from source, from PyPI)? Are...

@chrisbraddock I got it working in a very similar way (I described it [here](https://github.com/vllm-project/vllm/issues/4431#issuecomment-2095138681)). The trick was to run `ray` in a separate terminal session and specify `LD_LIBRARY_PATH` correctly.
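For anyone hitting the same thing, here's a rough sketch of that setup. The CUDA path is just an example — adjust it for wherever your toolkit is installed:

```bash
# Terminal 1: point the dynamic loader at your CUDA libraries (example path),
# then start a Ray head node for the vLLM workers to attach to.
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
ray start --head

# Terminal 2: with the same LD_LIBRARY_PATH exported, launch vLLM
# with --tensor-parallel-size > 1 so it picks up the running Ray cluster.
```

The key point is that `ray` stays running in its own session and both sessions see the same `LD_LIBRARY_PATH`.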

Has anyone solved this? I'm new to JAX/FLAX, so I have no idea why it's taking so much memory. Though I'm quite happy with the speed.

> I highly suggest you guys use KubeRay: launch a Ray cluster and submit vLLM workers. That's the easiest way I've found, and KubeRay will reduce your chance...

@MikeBirdTech AFAIK, LMStudio is not designed to handle simultaneous requests from many clients. I have a 4×A100 box on which I run (sharded) models with tensor parallelism (for speed). That's...

Same problem here. Posted (almost the same) [errors](https://github.com/Chainlit/chainlit/issues/745#issuecomment-2009465549).