Alex Chen
Oh, I have a similar issue here:

INFO 10-12 22:21:19 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 27.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs,...
> This is probably related: #9032
>
> The guided decoding is super slow, and seems to block up the engine so that it can't report its health status

Yes,...
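For context, a guided-decoding request of the kind discussed above can be sent to vLLM's OpenAI-compatible server roughly as follows. This is only a minimal sketch; the base URL, model name, and JSON schema are assumptions for illustration, not anyone's actual setup:

```python
# Minimal sketch of a guided-decoding request against a vLLM
# OpenAI-compatible server. The base_url, model name, and schema
# below are assumptions for illustration only.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical JSON schema the output must conform to.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Reply with a short answer."}],
    extra_body={"guided_json": schema},  # vLLM-specific guided decoding option
)
print(response.choices[0].message.content)
```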
I'm running into a similar situation here: the smartness and accuracy of Groq's llama3-70b-8192 model is much better than my llama3:70b-instruct-fp16 powered by Ollama. I don't have any clue why. I...
> @alexchenyu How large are your prompts? Ours are around 3.5K.

My prompts are quite long, over 4K. I think maybe that's the reason, and after I switched to vLLM,...
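For anyone following along, here is a minimal sketch of running a long prompt through vLLM's offline Python API. The model name, GPU count, and context length are assumptions for illustration, not the original poster's configuration:

```python
# Minimal sketch of running a long (~4K-token) prompt with vLLM's
# offline Python API. Model name, tensor_parallel_size, and
# max_model_len are assumptions for illustration only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    tensor_parallel_size=4,   # assumed number of GPUs
    max_model_len=8192,       # leaves room for prompts over 4K tokens
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["<your long ~4K-token prompt here>"], params)
print(outputs[0].outputs[0].text)
```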