Mark McLoughlin
There are multiple cases where it may make sense to not require a gateway owner to specify a listener port: 1. Where a port number is irrelevant or nonsensical -...
The Gateway vs Route separation allows a cluster operator (the gateway owner) to share a gateway and its resources with many application developers. However, in the L4 routing model (i.e. TCPRoute...
We should consider adding the port number to listener status in anticipation of future support for auto-assigning a port when no port is specified. By adding it in advance of...
Related to #10582. Some notes I took on the v0 metrics implementation, along with v1 design details.
Part of #10582 and discussed in #12745. The current `vllm:lora_requests_info` Gauge is somewhat similar to an Info metric (like cache_config_info) except the value is the current wall-clock time, and is...
Part of #10582. Add a core engine `PREEMPTED` event. Add the `num_preemptions_total` counter from v0. Also, make preemptions reset the scheduled and first token timestamps, resulting in: ``` > [...
Part of #10582 and discussed in #12745. Add some infrastructure to help us deprecate and remove metrics in a less user-hostile way. Our deprecation process will now be: 1) Deprecate...
Part of #10582. prometheus_client has support for Info metrics, which are equivalent to a Gauge whose value is permanently set to 1 but which expose interesting key/value pair information via labels....
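To make the Info-metric idea concrete, here is a minimal stdlib-only sketch of how such a sample might be rendered in the Prometheus text format: the labels carry the key/value information and the value is fixed at 1. It does not use prometheus_client itself; the helper names `format_sample` and `info_sample`, the metric name `demo_config_info`, and its labels are all hypothetical and chosen only for illustration.

```python
def _escape_label_value(value: str) -> str:
    # Escape backslash, double quote and newline inside a label value.
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name: str, labels: dict, value) -> str:
    # Render one sample line: name{key="value",...} value
    label_str = ",".join(
        f'{key}="{_escape_label_value(val)}"' for key, val in labels.items()
    )
    return f"{name}{{{label_str}}} {value}"


def info_sample(name: str, labels: dict) -> str:
    # Info-style sample: the information lives in the labels and the
    # value is permanently 1 (hypothetical helper, not a library API).
    return format_sample(name, labels, 1)


# Hypothetical metric name and labels, for illustration only.
line = info_sample("demo_config_info", {"block_size": "16", "dtype": "auto"})
print(line)
```

A Gauge whose value is a timestamp would go through the same `format_sample` path with a different value; only the fixed value of 1 distinguishes the Info-style sample in this sketch.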
```
vllm:lora_requests_info{max_lora="1",running_lora_adapters="",waiting_lora_adapters=""} 1.7395575657589855e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters=""} 1.7395575723949368e+09
vllm:lora_requests_info{max_lora="1",running_lora_adapters="test-lora",waiting_lora_adapters="test-lora"} 1.7395575717647147e+09
```
As discussed in #13303, this metric perhaps isn't the most ideal solution for the use case but, given there is an existing...
The initial implementation in #10980 went to great efforts to add parallel sampling as a wrapper at the highest layer of abstraction possible. This resulted in a lot of tricky...