fix(deps): update dependency vllm to ^0.8.0 [security]
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
| vllm | extras | minor | ^0.5.0 -> ^0.8.0 |
GitHub Vulnerability Alerts
CVE-2025-24357
Description
vllm/model_executor/weight_utils.py implements hf_model_weights_iterator to load model checkpoints downloaded from Hugging Face. It uses the torch.load function with the weights_only parameter left at its default value of False. As the security warning at https://pytorch.org/docs/stable/generated/torch.load.html notes, torch.load will execute arbitrary code during unpickling if it loads malicious pickle data.
Impact
This vulnerability can be exploited to execute arbitrary code and OS commands on the victim machine that fetches the pretrained repo remotely.
Note that most models now use the safetensors format, which is not vulnerable to this issue.
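For illustration, here is a minimal sketch of the safer loading pattern described above. This is not vLLM's actual loader; `load_checkpoint` is a hypothetical helper that prefers safetensors and only falls back to `torch.load` with `weights_only=True`.

```python
# Hypothetical helper, for illustration only -- not vLLM's hf_model_weights_iterator.
import torch
from safetensors.torch import load_file  # assumes the safetensors package is installed

def load_checkpoint(path: str):
    if path.endswith(".safetensors"):
        # safetensors is a plain tensor container: no pickle, no code execution.
        return load_file(path)
    # weights_only=True restricts unpickling to tensor data instead of arbitrary
    # objects (note: per GHSA-ggpf-24jw-3fcw below, PyTorch >= 2.6.0 is needed
    # for this to be a reliable boundary).
    return torch.load(path, map_location="cpu", weights_only=True)
```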
References
- https://pytorch.org/docs/stable/generated/torch.load.html
- Fix: https://github.com/vllm-project/vllm/pull/12366
CVE-2025-25183
Summary
Maliciously constructed prompts can lead to hash collisions, resulting in prefix cache reuse, which can interfere with subsequent responses and cause unintended behavior.
Details
vLLM's prefix caching makes use of Python's built-in hash() function. As of Python 3.12, the behavior of hash(None) has changed to be a predictable constant value. This makes it more feasible that someone could try to exploit hash collisions.
Impact
The impact of a collision would be using cache that was generated using different content. Given knowledge of prompts in use and predictable hashing behavior, someone could intentionally populate the cache using a prompt known to collide with another prompt in use.
Solution
We address this problem by initializing hashes in vLLM with a value that is no longer constant and predictable; it is different each time vLLM runs. This restores the behavior we had in Python versions prior to 3.12.
Using a hashing algorithm that is less prone to collisions (such as sha256) would be the best way to avoid the possibility of a collision. However, it would have an impact on both performance and memory footprint. Hash collisions may still occur, though they are no longer straightforward to predict.
To give an idea of the likelihood of a collision, for randomly generated hash values (assuming the hash generation built into Python is uniformly distributed), with a cache capacity of 50,000 messages and an average prompt length of 300, a collision will occur on average once every 1 trillion requests.
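A minimal sketch of the mitigation idea follows (hypothetical names, not vLLM's exact code): the hash chain is seeded with a per-process random value so colliding prompts cannot be precomputed across runs, while the built-in hash() is kept for speed.

```python
import secrets

# Per-process random root; before the fix, hash(None) served as the root,
# which Python 3.12 turned into a predictable constant.
NONE_HASH = secrets.randbits(64)

def hash_block(parent_hash: int, token_ids: tuple) -> int:
    # Keep Python's built-in hash() (as the fix does) but chain it from the
    # random root so the resulting cache keys are no longer predictable.
    return hash((parent_hash, token_ids))

# The first block of a prompt hashes from the random root, e.g.:
#   h0 = hash_block(NONE_HASH, (1, 2, 3))
```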
References
- https://github.com/vllm-project/vllm/pull/12621
- https://github.com/python/cpython/commit/432117cd1f59c76d97da2eaff55a7d758301dbc7
- https://github.com/python/cpython/pull/99541
CVE-2025-29770
Impact
The outlines library is one of the backends used by vLLM to support structured output (a.k.a. guided decoding). Outlines provides an optional cache for its compiled grammars on the local filesystem. This cache has been on by default in vLLM. Outlines is also available by default through the OpenAI compatible API server.
The affected code in vLLM is vllm/model_executor/guided_decoding/outlines_logits_processors.py, which unconditionally uses the cache from outlines. vLLM should have this off by default and allow administrators to opt in, due to the potential for abuse.
A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service if the filesystem runs out of space.
Note that even if vLLM was configured to use a different backend by default, it is still possible to choose outlines on a per-request basis using the guided_decoding_backend key of the extra_body field of the request.
This issue applies to the V0 engine only. The V1 engine is not affected.
Patches
The fix is to disable this cache by default since it does not provide an option to limit its size. If you want to use this cache anyway, you may set the VLLM_V0_USE_OUTLINES_CACHE environment variable to 1.
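Sketched below, under the assumption that the opt-in works exactly as described above (the helper name is hypothetical), is what enabling the cache looks like for an operator who accepts unbounded on-disk growth:

```python
import os

# Hypothetical helper mirroring the opt-in check described above; the cache is
# only used when the environment variable is explicitly set to "1".
def outlines_cache_enabled() -> bool:
    return os.environ.get("VLLM_V0_USE_OUTLINES_CACHE", "0") == "1"

# Operators who accept the unbounded on-disk cache can opt in by exporting
# VLLM_V0_USE_OUTLINES_CACHE=1 in the environment before starting the server.
```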
Workarounds
There is no way to work around this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI compatible API server.
References
GHSA-ggpf-24jw-3fcw
Description
https://github.com/vllm-project/vllm/security/advisories/GHSA-rh4j-5rhw-hr54 reported a vulnerability where loading a malicious model could result in code execution on the vLLM host. The fix, which specified weights_only=True in calls to torch.load(), did not solve the problem on PyTorch versions prior to 2.6.0.
PyTorch has issued a new CVE about this problem: https://github.com/advisories/GHSA-53q9-r3pm-6pq6
This means that versions of vLLM using PyTorch before 2.6.0 are vulnerable to this problem.
Background Knowledge
When users install vLLM by following the official manual, the PyTorch version is pinned in the requirements.txt file, so a default installation pulls in PyTorch 2.5.1.
In CVE-2025-24357, weights_only=True was used as the patch, but this is not sufficient: using weights_only=True with PyTorch 2.5.1 and earlier is still unsafe, and this interface can be used to demonstrate that it is not safe.
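As a hedged illustration (not part of vLLM), a deployment can guard against silently relying on weights_only=True with an old PyTorch; the helper name is hypothetical and the `packaging` library is assumed to be available:

```python
import torch
from packaging.version import Version  # assumes the packaging library is installed

def assert_safe_torch_load() -> None:
    # Per GHSA-53q9-r3pm-6pq6, weights_only=True is not a reliable sandbox for
    # untrusted checkpoints on PyTorch older than 2.6.0.
    installed = Version(torch.__version__.split("+")[0])  # drop local tag, e.g. "+cu124"
    if installed < Version("2.6.0"):
        raise RuntimeError(
            f"PyTorch {installed} < 2.6.0: do not load untrusted pickle "
            "checkpoints; upgrade PyTorch or use safetensors."
        )
```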
Fix
Update the PyTorch version to 2.6.0.
Credit
This vulnerability was found by Ji'an Zhou and Li'shuo Song.
CVE-2025-30202
Impact
In a multi-node vLLM deployment, vLLM uses ZeroMQ for some multi-node communication purposes. The primary vLLM host opens an XPUB ZeroMQ socket and binds it to ALL interfaces. While the socket is always opened for a multi-node deployment, it is only used when doing tensor parallelism across multiple hosts.
Any client with network access to this host can connect to this XPUB socket unless its port is blocked by a firewall. Once connected, these arbitrary clients will receive all of the same data broadcasted to all of the secondary vLLM hosts. This data is internal vLLM state information that is not useful to an attacker.
By connecting to this socket many times and never reading the data published to those connections, an attacker can also cause a denial of service by slowing down or potentially blocking the publisher.
Detailed Analysis
The XPUB socket in question is created here:
https://github.com/vllm-project/vllm/blob/c21b99b91241409c2fdf9f3f8c542e8748b317be/vllm/distributed/device_communicators/shm_broadcast.py#L236-L237
Data is published over this socket via MessageQueue.enqueue() which is called by MessageQueue.broadcast_object():
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L452-L453
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L475-L478
The MessageQueue.broadcast_object() method is called by the GroupCoordinator.broadcast_object() method in parallel_state.py:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L364-L366
The broadcast over ZeroMQ is only done if the GroupCoordinator was created with use_message_queue_broadcaster set to True:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L216-L219
The only case where GroupCoordinator is created with use_message_queue_broadcaster is the coordinator for the tensor parallelism group:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L931-L936
To determine what data is broadcasted to the tensor parallelism group, we must continue tracing. GroupCoordinator.broadcast_object() is called by GroupCoordinator.broadcast_tensor_dict():
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L489
which is called by broadcast_tensor_dict() in communication_op.py:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/communication_op.py#L29-L34
If we look at _get_driver_input_and_broadcast() in the V0 worker_base.py, we'll see how this tensor dict is formed:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/worker/worker_base.py#L332-L352
but the data actually sent over ZeroMQ is the metadata_list portion that is split from this tensor_dict. The tensor parts are sent via torch.distributed and only metadata about those tensors is sent via ZeroMQ.
https://github.com/vllm-project/vllm/blob/54a66e5fee4a1ea62f1e4c79a078b20668e408c6/vllm/distributed/parallel_state.py#L61-L83
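To illustrate the shape of the eventual fix (see "Don't bind tcp zmq socket to all interfaces", #17197, in the release notes below), here is a minimal pyzmq sketch, not vLLM's actual code, that binds the XPUB socket to a specific driver address instead of all interfaces:

```python
import zmq

def open_broadcast_socket(driver_ip: str, port: int) -> zmq.Socket:
    # Hypothetical helper. Binding to a concrete interface rather than
    # "tcp://*:<port>" means only hosts that can route to that address can
    # subscribe; firewalling the (random) port is still advisable.
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.XPUB)
    sock.bind(f"tcp://{driver_ip}:{port}")
    return sock
```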
Patches
Workarounds
Prior to the fix, your options include:
- Do not expose the vLLM host to a network where any untrusted connections may reach the host.
- Ensure that only the other vLLM hosts are able to connect to the TCP port used for the XPUB socket. Note that the port used is random.
References
- Relevant code first introduced in https://github.com/vllm-project/vllm/pull/6183
Release Notes
vllm-project/vllm (vllm)
v0.8.5
This release contains 310 commits from 143 contributors (55 new contributors!).
Highlights
This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling.
Model Support
- Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
- Add ModernBERT (#16648)
- Add Granite Speech Support (#16246)
- Add PLaMo2 (#14323)
- Add Kimi-VL model support (#16387)
- Add Qwen2.5-Omni model support (thinker only) (#15130)
- Snowflake Arctic Embed (Family) (#16649)
- Accuracy fixes for Llama4 Int4 (#16801), chat template for Llama 4 models (#16428), enhanced AMD support (#16674, #16847)
V1 Engine
- Add `structural_tag` support using xgrammar (#17085)
- Disaggregated serving:
- Clean up: Remove Sampler from Model Code (#17084)
- MLA: Simplification to batch P/D reordering (#16673)
- Move usage stats to worker and start logging TPU hardware (#16211)
- Support FlashInfer Attention (#16684)
- Faster incremental detokenization (#15137)
- EAGLE-3 Support (#16937)
Features
- Validate urls object for multimodal content parts (#16990)
- Prototype support sequence parallelism using compilation pass (#16155)
- Add sampling params to `v1/audio/transcriptions` endpoint (#16591)
- Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546)
- Add `vllm bench [latency, throughput]` CLI commands (#16508)
Performance
- Attention:
- MoE:
- Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
- Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)
Hardwares
- TPU:
- AMD:
  - AITER Fused MOE V1 Support (#16752)
  - Integrate Paged Attention Kernel from AITER (#15001)
  - Support AITER MLA (#15893)
  - Upstream prefix prefill speed up for vLLM V1 (#13305)
  - Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
  - Add skinny gemms for unquantized linear on ROCm (#15830)
  - Follow-ups for Skinny Gemms on ROCm (#17011)
Documentation
- Add open-webui example (#16747)
- Document Matryoshka Representation Learning support (#16770)
- Add a security guide (#17230)
- Add example to run DeepSeek with Ray Serve LLM (#17134)
- Benchmarks for audio models (#16505)
Security and Dependency Updates
- Don't bind tcp zmq socket to all interfaces (#17197)
- Use safe serialization and fix zmq setup for mooncake pipe (#17192)
- Bump Transformers to 4.51.3 (#17116)
Build and testing
- Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)
Breaking changes 🚨
- `--enable-chunked-prefill`, `--multi-step-stream-outputs`, and `--disable-chunked-mm-input` can no longer be explicitly set to `False`. Instead, add `no-` to the start of the argument (i.e. `--enable-chunked-prefill` and `--no-enable-chunked-prefill`) (https://github.com/vllm-project/vllm/pull/16533)
What's Changed
- Improve configs - `SchedulerConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16533
- [Misc] remove warning if triton>=3.2.0 by @DefTruth in https://github.com/vllm-project/vllm/pull/16553
- [Misc] refactor examples by @reidliu41 in https://github.com/vllm-project/vllm/pull/16563
- [Misc] Update usage with mooncake lib for kv transfer by @ShangmingCai in https://github.com/vllm-project/vllm/pull/16523
- [fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet by @Shafi-Hussain in https://github.com/vllm-project/vllm/pull/16048
- [Bugfix] Multi-modal caches not acting like LRU caches by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16593
- [TPU][V1] Fix exponential padding when `max-num-batched-tokens` is not a power of 2 by @NickLucche in https://github.com/vllm-project/vllm/pull/16596
- Fix triton install condition on CPU by @hmellor in https://github.com/vllm-project/vllm/pull/16600
- s390x: Fix PyArrow build and add CPU test script for Buildkite CI by @Nash-123 in https://github.com/vllm-project/vllm/pull/16036
- [Model][VLM] Add Kimi-VL model support by @courage17340 in https://github.com/vllm-project/vllm/pull/16387
- [Hardware][TPU] Add torchvision to tpu dependency file by @lsy323 in https://github.com/vllm-project/vllm/pull/16616
- [DOC][TPU] Add core idea about avoiding recompilation after warmup by @yaochengji in https://github.com/vllm-project/vllm/pull/16614
- config check sleep mode support oot platforms by @celestialli in https://github.com/vllm-project/vllm/pull/16562
- [Core][Bugfix] Fix Offline MM Beam Search by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/16390
- [Kernel] moe wna16 marlin kernel by @jinzhen-lin in https://github.com/vllm-project/vllm/pull/14447
- [BugFix]: Update minimum `pyzmq` version by @taneem-ibrahim in https://github.com/vllm-project/vllm/pull/16549
- [Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/16623
- [Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) by @pooyadavoodi in https://github.com/vllm-project/vllm/pull/16631
- Add `vllm bench [latency, throughput]` CLI commands by @mgoin in https://github.com/vllm-project/vllm/pull/16508
- Fix vLLM x torch.compile config caching by @zou3519 in https://github.com/vllm-project/vllm/pull/16491
- [Misc] refactor argument parsing in examples by @reidliu41 in https://github.com/vllm-project/vllm/pull/16635
- [CI/Build] Fix LoRA OOM by @jeejeelee in https://github.com/vllm-project/vllm/pull/16624
- Add "/server_info" endpoint in api_server to retrieve the vllm_config. by @Cangxihui in https://github.com/vllm-project/vllm/pull/16572
- [Kernel] Remove redundant Exp calculations by @DefTruth in https://github.com/vllm-project/vllm/pull/16123
- [Misc] Update `compressed-tensors` WNA16 to support zero-points by @dsikka in https://github.com/vllm-project/vllm/pull/14211
- [Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server by @angkywilliam in https://github.com/vllm-project/vllm/pull/10546
- [Model] Add PLaMo2 by @Alnusjaponica in https://github.com/vllm-project/vllm/pull/14323
- [Bugfix] fix gpu docker image mis benchmarks dir by @lengrongfu in https://github.com/vllm-project/vllm/pull/16628
- [Misc] Modify LRUCache touch by @jeejeelee in https://github.com/vllm-project/vllm/pull/16689
- Disable remote caching when calling compile_fx by @zou3519 in https://github.com/vllm-project/vllm/pull/16611
- [Feature] add model aware kv ops helper by @billishyahao in https://github.com/vllm-project/vllm/pull/16020
- [ROCM] Bind triton version to 3.2 in requirements-built.txt by @SageMoore in https://github.com/vllm-project/vllm/pull/16664
- [V1][Structured Output] Move xgrammar related utils to `backend_xgrammar.py` by @shen-shanshan in https://github.com/vllm-project/vllm/pull/16578
- [CI] Cleanup `additional_dependencies: [toml]` for pre-commit yapf hook by @yankay in https://github.com/vllm-project/vllm/pull/16405
- [Misc] refactor examples series by @reidliu41 in https://github.com/vllm-project/vllm/pull/16708
- [Doc] Improve OOM troubleshooting by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16704
- [Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel by @DefTruth in https://github.com/vllm-project/vllm/pull/16693
- [Model] support modernbert by @xsank in https://github.com/vllm-project/vllm/pull/16648
- [Hardware] Add processor inputs to platform validation by @joerunde in https://github.com/vllm-project/vllm/pull/16680
- Improve error for structured output backend selection by @hmellor in https://github.com/vllm-project/vllm/pull/16717
- [Misc] Remove redundant comment by @jianzs in https://github.com/vllm-project/vllm/pull/16703
- Help user create custom model for Transformers backend remote code models by @hmellor in https://github.com/vllm-project/vllm/pull/16719
- [V1][Performance] Implement custom serializaton for MultiModalKwargs [Rebased] by @p88h in https://github.com/vllm-project/vllm/pull/16432
- [V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification by @luyuzhe111 in https://github.com/vllm-project/vllm/pull/16636
- Adding vllm buildkite job for IBM Power by @AaruniAggarwal in https://github.com/vllm-project/vllm/pull/16679
- [V1][Frontend] Improve Shutdown And Logs by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/11737
- [rocm][V0] fix selection logic for custom PA in V0 by @divakar-amd in https://github.com/vllm-project/vllm/pull/16426
- [Bugfix] Update Florence-2 tokenizer to make grounding tasks work by @Isotr0py in https://github.com/vllm-project/vllm/pull/16734
- [Bugfix] Revert max_prompt_len validation for decoder-only models. by @davidheineman in https://github.com/vllm-project/vllm/pull/16741
- [V1] Remove log noise when idle by @russellb in https://github.com/vllm-project/vllm/pull/16735
- [Ray] Improve documentation on batch inference by @richardliaw in https://github.com/vllm-project/vllm/pull/16609
- [misc] ignore marlin_moe_wna16 local gen codes by @DefTruth in https://github.com/vllm-project/vllm/pull/16760
- [Doc] Add more tips to avoid OOM by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16765
- [doc] add open-webui example by @reidliu41 in https://github.com/vllm-project/vllm/pull/16747
- [Bugfix] Fix GLM4 model by @intervitens in https://github.com/vllm-project/vllm/pull/16618
- [Doc] Fix a 404 link in installation/cpu.md by @windsonsea in https://github.com/vllm-project/vllm/pull/16773
- [Misc] refactor examples series - lmcache by @reidliu41 in https://github.com/vllm-project/vllm/pull/16758
- Improve configs - `TokenizerPoolConfig` + `DeviceConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16603
- fix: hyperlink by @reidliu41 in https://github.com/vllm-project/vllm/pull/16778
- [Doc] Make sure to update vLLM when installing latest code by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16781
- [Doc] Document Matryoshka Representation Learning support by @noooop in https://github.com/vllm-project/vllm/pull/16770
- [Doc] Changed explanation of generation_tokens_total and prompt_tokens_total counter type metrics to avoid confusion by @insukim1994 in https://github.com/vllm-project/vllm/pull/16784
- [V1][Perf] Faster incremental detokenization by @njhill in https://github.com/vllm-project/vllm/pull/15137
- [Bugfix]Fix index out of range error in api server log by @WangErXiao in https://github.com/vllm-project/vllm/pull/16787
- [Kernel] Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 by @Ximingwang-09 in https://github.com/vllm-project/vllm/pull/16753
- [Model] use AutoWeightsLoader for olmoe,opt,orion,persimmon,phi3_small by @lengrongfu in https://github.com/vllm-project/vllm/pull/16548
- [TPU][V1] Fix padding recompilation when `max-num-batched-tokens` is not even by @NickLucche in https://github.com/vllm-project/vllm/pull/16726
- [V1][TPU] Enable Top K by @NickLucche in https://github.com/vllm-project/vllm/pull/15489
- [ROCM] enable aiter fused moe kernel for llama4 bf16 checkpoints by @sijiac in https://github.com/vllm-project/vllm/pull/16674
- [V1][Metrics] Fix http metrics middleware by @markmc in https://github.com/vllm-project/vllm/pull/15894
- [MLA] Simplification to batch P/D reordering by @njhill in https://github.com/vllm-project/vllm/pull/16673
- [P/D][V1] KV Connector API V1 by @ApostaC in https://github.com/vllm-project/vllm/pull/15960
- [Attention] Update to latest FA3 code by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/13111
- Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema by @tarukumar in https://github.com/vllm-project/vllm/pull/16721
- [Doc] Improve help examples for `--compilation-config` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16729
- [Misc] Update outdated note: LMCache now supports chunked prefill by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/16697
- [V1][Structured Output] Minor modification to `_validate_structured_output()` by @shen-shanshan in https://github.com/vllm-project/vllm/pull/16748
- Add hardware print to TPU V1 test by @mgoin in https://github.com/vllm-project/vllm/pull/16792
- [BugFix] Accuracy fix for llama4 int4 - improperly casted scales by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/16801
- Improve configs - `MultiModalConfig` + `PoolerConfig` + `DecodingConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16789
- [Misc] add collect_env to cli and docker image by @lengrongfu in https://github.com/vllm-project/vllm/pull/16759
- [ROCm] [Attention] Cleanup ROCm output passing by @ProExpertProg in https://github.com/vllm-project/vllm/pull/16431
- [Bugfix] fix pp for llama4 by @luccafong in https://github.com/vllm-project/vllm/pull/16746
- [Doc] add podman setup instructions for official image by @nathan-weinberg in https://github.com/vllm-project/vllm/pull/16796
- [Docs] Fix a link and grammar issue in production-stack.md by @windsonsea in https://github.com/vllm-project/vllm/pull/16809
- [Model] use AutoWeightsLoader for BigCode, GPT-J by @jonghyunchoe in https://github.com/vllm-project/vllm/pull/16823
- [Misc] Clean up Kimi-VL by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16833
- Fix `nullable_kvs` fallback by @hmellor in https://github.com/vllm-project/vllm/pull/16837
- [New Model]: Snowflake Arctic Embed (Family) by @noooop in https://github.com/vllm-project/vllm/pull/16649
- [Misc] refactor examples series - Chat Completion Client With Tools by @reidliu41 in https://github.com/vllm-project/vllm/pull/16829
- [Doc] Updated Llama section in tool calling docs to have llama 3.2 config info by @jmho in https://github.com/vllm-project/vllm/pull/16857
- publish neuron docker image by @omrishiv in https://github.com/vllm-project/vllm/pull/16733
- [Model][VLM] Add Qwen2.5-Omni model support (thinker only) by @fyabc in https://github.com/vllm-project/vllm/pull/15130
- [rocm][MI300] llama4 maverick fp8 moe config tp8 by @divakar-amd in https://github.com/vllm-project/vllm/pull/16847
- [Frontend] Add sampling params to `v1/audio/transcriptions` endpoint by @NickLucche in https://github.com/vllm-project/vllm/pull/16591
- [Misc] Benchmarks for audio models by @NickLucche in https://github.com/vllm-project/vllm/pull/16505
- [V1][Misc] stop update prefix cache stats when logs_stats is disabled by @vie-serendipity in https://github.com/vllm-project/vllm/pull/16460
- [Model] Refactor Phi-4-multimodal to use merged processor and support V1 by @Isotr0py in https://github.com/vllm-project/vllm/pull/15477
- [Model] Qwen2.5-Omni Cleanup by @ywang96 in https://github.com/vllm-project/vllm/pull/16872
- [VLM] Clean up models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16873
- [doc] update hyperlink by @reidliu41 in https://github.com/vllm-project/vllm/pull/16877
- Log how much time loading a compiled artifact takes by @zou3519 in https://github.com/vllm-project/vllm/pull/16848
- Serialize tensors using int8 views by @p88h in https://github.com/vllm-project/vllm/pull/16866
- Improve configs - `CacheConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16835
- [easy] Pass compile_fx only the config patches by @zou3519 in https://github.com/vllm-project/vllm/pull/16845
- [Bugfix] Fix v1/spec_decode/test_ngram.py by @zixi-qi in https://github.com/vllm-project/vllm/pull/16895
- [CI/CD][V1] Add spec decode tests to CI by @WoosukKwon in https://github.com/vllm-project/vllm/pull/16900
- [Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni by @fyabc in https://github.com/vllm-project/vllm/pull/16907
- [Doc] Split dummy_processor_inputs() in Multimodal Docs by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/16915
- Restore buffers when wake up from level 2 sleep (#16564) by @fingertap in https://github.com/vllm-project/vllm/pull/16889
- [Misc] fix collect_env version parse by @wangxiyuan in https://github.com/vllm-project/vllm/pull/15267
- [Misc] Refactor platform to get device specific stream and event by @shen-shanshan in https://github.com/vllm-project/vllm/pull/14411
- [Bugfix] Fix GLM rotary_dim issue and support v1 by @Isotr0py in https://github.com/vllm-project/vllm/pull/16912
- Raise error for data-parallel with benchmark_throughput by @kartikx in https://github.com/vllm-project/vllm/pull/16737
- [XPU][Bugfix] minor fix for XPU by @yma11 in https://github.com/vllm-project/vllm/pull/15591
- [doc] install required python3-dev apt package by @davidxia in https://github.com/vllm-project/vllm/pull/16888
- [Doc] mention how to install in CPU editable mode by @davidxia in https://github.com/vllm-project/vllm/pull/16923
- [Core] Speed up decode by remove synchronizing operation in sampler by @chanh in https://github.com/vllm-project/vllm/pull/16436
- [V1][Spec Decode] Handle draft tokens beyond max_model_len by @WoosukKwon in https://github.com/vllm-project/vllm/pull/16087
- [TPU][V1] Implicitly adjust page size when there's SMEM OOM by @yaochengji in https://github.com/vllm-project/vllm/pull/16871
- Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml by @mgoin in https://github.com/vllm-project/vllm/pull/16946
- [TPU][V1] Capture multimodal encoder during model compilation by @NickLucche in https://github.com/vllm-project/vllm/pull/15051
- [V1] V1 FlashInfer Attention by @mgoin in https://github.com/vllm-project/vllm/pull/16684
- [TPU][V1] Enable Top-P by @NickLucche in https://github.com/vllm-project/vllm/pull/16843
- [Doc] Remove unnecessary V1 flag by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16924
- [BugFix][Spec Decode] No in-place update to draft probs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/16952
- [Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other by @jeffrey-dot-li in https://github.com/vllm-project/vllm/pull/16863
- [ROCm] Add aiter tkw1 kernel for Llama4 fp8 by @kliuae in https://github.com/vllm-project/vllm/pull/16727
- [Misc] Remove the chunked prefill warning for LoRA by @jeejeelee in https://github.com/vllm-project/vllm/pull/16925
- [Kernel] Add expert_map support to Cutlass FP8 MOE by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/16861
- [V1] Remove additional_config check by @wangxiyuan in https://github.com/vllm-project/vllm/pull/16710
- [Performance][ROCm] Add skinny gemms for unquantized linear on ROCm by @charlifu in https://github.com/vllm-project/vllm/pull/15830
- Support S3 Sharded loading with RunAI Model Streamer by @omer-dayan in https://github.com/vllm-project/vllm/pull/16317
- [Bugfix] Fix f-string for Python 3.9-3.11 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16962
- [Doc] Update ai_accelerator/hpu-gaudi.inc.md by @windsonsea in https://github.com/vllm-project/vllm/pull/16956
- [Perf] Optimize `_update_states` for GPU model runner by @SnowCharmQ in https://github.com/vllm-project/vllm/pull/16910
- [Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/16767
- [Model] Use autoweightloader for mamba by @sfeng33 in https://github.com/vllm-project/vllm/pull/16950
- [V1] Remove pre-allocation for KV cache by @WoosukKwon in https://github.com/vllm-project/vllm/pull/16941
- [Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS by @LeiWang1999 in https://github.com/vllm-project/vllm/pull/6036
- [BugFix] Fix incremental detokenization perf issue by @njhill in https://github.com/vllm-project/vllm/pull/16963
- [Doc] Improve documentation for multimodal CLI args by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16960
- [FEAT][ROCm] Integrate Paged Attention Kernel from AITER by @vllmellm in https://github.com/vllm-project/vllm/pull/15001
- [Misc] refactor example series by @reidliu41 in https://github.com/vllm-project/vllm/pull/16972
- [Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni by @fyabc in https://github.com/vllm-project/vllm/pull/16974
- Improve configs - `SpeculativeConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16971
- [BugFix] Pass in correct VLLM config in FlashInfer backend (#13207) by @timzsu in https://github.com/vllm-project/vllm/pull/16973
- [Misc] Add S3 environment variables for better support of MinIO. by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/16977
- [frontend] enhance tool_calls type check by @reidliu41 in https://github.com/vllm-project/vllm/pull/16882
- [FEAT][ROCm]: Support AITER MLA by @vllmellm in https://github.com/vllm-project/vllm/pull/15893
- Add assertion for no objects while hashing hf_config by @zou3519 in https://github.com/vllm-project/vllm/pull/16930
- Fencing Kernels Tests for enabling on AMD by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/16929
- [BugFix] Remove default multiproc executor `collective_rpc` timeout by @njhill in https://github.com/vllm-project/vllm/pull/17000
- [Core][V1][TPU] Enable structured decoding on TPU V1 by @Chenyaaang in https://github.com/vllm-project/vllm/pull/16499
- [Bugfix] validate urls object for multimodal content parts by @gcalmettes in https://github.com/vllm-project/vllm/pull/16990
- add Dockerfile build vllm against torch nightly by @yangw-dev in https://github.com/vllm-project/vllm/pull/16936
- [Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 by @maleksan85 in https://github.com/vllm-project/vllm/pull/13305
- [V1][DP] More robust DP/EP dummy request coordination by @njhill in https://github.com/vllm-project/vllm/pull/16277
- [BugFix] Revert ROCm Custom Paged Attention Env Flag Check by @vllmellm in https://github.com/vllm-project/vllm/pull/17022
- Revert "[Misc] Add S3 environment variables for better support of MinIO." by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/17021
- [misc] tune some env vars for GB200 by @youkaichao in https://github.com/vllm-project/vllm/pull/16992
- [INTEL-HPU][v0] Port delayed sampling to upstream by @xuechendi in https://github.com/vllm-project/vllm/pull/16949
- [doc] add download path tips by @reidliu41 in https://github.com/vllm-project/vllm/pull/17013
- [Bugfix] Triton FA function takes no keyword arguments by @vllmellm in https://github.com/vllm-project/vllm/pull/16902
- [V1] Avoid socket errors during shutdown when requests are in flight by @njhill in https://github.com/vllm-project/vllm/pull/16807
- [BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/16998
- [Misc] Improve readability of get_open_port function. by @gitover22 in https://github.com/vllm-project/vllm/pull/17024
- [Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/16964
- [CI] Run v1/test_serial_utils.py in CI by @russellb in https://github.com/vllm-project/vllm/pull/16996
- Mistral-format support for compressed-tensors by @mgoin in https://github.com/vllm-project/vllm/pull/16803
- Categorize `tests/kernels/` based on kernel type by @mgoin in https://github.com/vllm-project/vllm/pull/16799
- [Doc] Add top anchor and a note to quantization/bitblas.md by @windsonsea in https://github.com/vllm-project/vllm/pull/17042
- Ensure that `pid` passed to `kill_process_tree` is `int` for `mypy` by @hmellor in https://github.com/vllm-project/vllm/pull/17051
- [CI] Update structured-output label automation by @russellb in https://github.com/vllm-project/vllm/pull/17055
- Improve Transformers backend model loading QoL by @hmellor in https://github.com/vllm-project/vllm/pull/17039
- `CacheConfig.block_size` should always be `int` when used by @hmellor in https://github.com/vllm-project/vllm/pull/17052
- Use `@property` and private field for `data_parallel_rank_local` by @hmellor in https://github.com/vllm-project/vllm/pull/17053
Configuration
📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
- [ ] If you want to rebase/retry this PR, check this box
This PR has been generated by Renovate Bot.
Edited/Blocked Notification
Renovate will not automatically rebase this PR, because it does not recognize the last commit author and assumes somebody else may have edited the PR.
You can manually request rebase by checking the rebase/retry box above.
⚠️ Warning: custom changes will be lost.