Mark McLoughlin
`vllm:num_requests_swapped`, `vllm:cpu_cache_usage_perc`, and `vllm:cpu_prefix_cache_hit_rate` are no longer relevant in V1, since V1 does not implement KV cache offloading, so these metrics should be considered deprecated. And as agreed in...
`vllm:time_in_queue_requests` appears to be an exact duplicate of `vllm:request_queue_time_seconds`. Both record `first_scheduled_time - arrival_time`:

```
if seq_group.is_finished():
    time_queue_requests.append(
        seq_group.metrics.first_scheduled_time -
        seq_group.metrics.arrival_time)
```

```
def maybe_set_first_scheduled_time(self, time: float) -> None:
    if self.metrics.first_scheduled_time is...
```
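To make the duplication concrete, here is a minimal sketch (the `Metrics` dataclass and both helper functions are hypothetical stand-ins, not vLLM's actual code) showing that both metrics derive from the same two timestamps and therefore always agree:

```python
from dataclasses import dataclass

# Hypothetical stand-in for the per-request metrics object; the two
# timestamp fields mirror the ones used in the snippets above.
@dataclass
class Metrics:
    arrival_time: float
    first_scheduled_time: float

def time_in_queue(m: Metrics) -> float:
    # What vllm:time_in_queue_requests records.
    return m.first_scheduled_time - m.arrival_time

def request_queue_time(m: Metrics) -> float:
    # What vllm:request_queue_time_seconds records -- the same difference.
    return m.first_scheduled_time - m.arrival_time

m = Metrics(arrival_time=100.0, first_scheduled_time=100.25)
assert time_in_queue(m) == request_queue_time(m) == 0.25
```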
It looks like #4464 intended to add this alongside the `vllm:iteration_tokens_total` histogram, but didn't actually hook it up, so it would never have appeared in `/metrics`. Since it's clearly not critical...
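To illustrate the failure mode, here is a toy sketch (the `Registry` class is hypothetical, not vLLM's or Prometheus client's real API): a metric only shows up in the `/metrics` scrape output once it is actually registered with the exporter, which is the step that was missing.

```python
# Toy stand-in for a metrics registry: only registered metrics are
# serialized when /metrics is scraped, so a metric that is created
# but never registered can never appear in the output.
class Registry:
    def __init__(self):
        self._metrics = {}

    def register(self, name):
        self._metrics[name] = []

    def observe(self, name, value):
        self._metrics[name].append(value)

    def render(self):
        # What a /metrics scrape would serialize (count samples only).
        return "\n".join(
            f"{name}_count {len(vals)}" for name, vals in self._metrics.items()
        )

reg = Registry()
reg.register("vllm:iteration_tokens_total")
reg.observe("vllm:iteration_tokens_total", 128)
assert "vllm:iteration_tokens_total_count 1" in reg.render()
```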
(WIP until #13774 merges) Part of #10582. This metric tracks the maximum of `num_generation_tokens` across the set of identical requests under a parallel sampling parent. It is the last remaining...
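As a sketch of the aggregation (the helper function is hypothetical): with parallel sampling (`n > 1`), one parent request fans out into `n` child requests, and this metric reports the largest `num_generation_tokens` among those children.

```python
# Hypothetical helper: given the per-child generation token counts of
# one parallel-sampling parent, report the maximum, as the metric does.
def max_generation_tokens(child_token_counts: list[int]) -> int:
    # e.g. n=3 children generated 12, 40, and 7 tokens -> report 40
    return max(child_token_counts)

assert max_generation_tokens([12, 40, 7]) == 40
```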
The V0 LLM offline inference API exposes per-request metrics via `RequestOutput.RequestMetrics`. In V1 we have so far chosen not to track per-request metrics or implement this API. All recent work...
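For context, here is a minimal sketch of the shape of such a per-request metrics object; the dataclass fields and the `queue_time` helper are assumptions for illustration, not V0's exact definitions:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mirror of a per-request metrics record: timestamps are
# filled in as the request progresses, so later ones may still be None.
@dataclass
class RequestMetrics:
    arrival_time: float
    first_scheduled_time: Optional[float] = None
    first_token_time: Optional[float] = None
    finished_time: Optional[float] = None

def queue_time(m: RequestMetrics) -> Optional[float]:
    # Derived per-request stat: time spent queued before first scheduling.
    if m.first_scheduled_time is None:
        return None
    return m.first_scheduled_time - m.arrival_time

m = RequestMetrics(arrival_time=10.0, first_scheduled_time=10.5)
assert queue_time(m) == 0.5
```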