fix(deps): update dependency vllm to ^0.8.0 [security]
This PR contains the following updates:
| Package | Type | Update | Change |
|---|---|---|---|
| vllm | extras | minor | ^0.5.0 -> ^0.8.0 |
GitHub Vulnerability Alerts
CVE-2025-24357
Description
vllm/model_executor/weight_utils.py implements hf_model_weights_iterator to load model checkpoints downloaded from Hugging Face. It uses the torch.load function with the weights_only parameter left at its default value of False. As the security warning at https://pytorch.org/docs/stable/generated/torch.load.html notes, torch.load will execute arbitrary code during unpickling if it loads malicious pickle data.
Impact
This vulnerability can be exploited to execute arbitrary code and OS commands on the victim machine that fetches the pretrained repo remotely.
Note that most models now use the safetensors format, which is not vulnerable to this issue.
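For illustration, here is a minimal sketch of the safer loading pattern described above. This is not vLLM's actual loader; `load_checkpoint` is a hypothetical helper that prefers safetensors and only falls back to `torch.load` with `weights_only=True`.

```python
# Hypothetical helper, for illustration only -- not vLLM's hf_model_weights_iterator.
import torch
from safetensors.torch import load_file  # assumes the safetensors package is installed

def load_checkpoint(path: str):
    if path.endswith(".safetensors"):
        # safetensors is a plain tensor container: no pickle, no code execution.
        return load_file(path)
    # weights_only=True restricts unpickling to tensor data instead of arbitrary
    # objects (note: per GHSA-ggpf-24jw-3fcw below, PyTorch >= 2.6.0 is needed
    # for this to be a reliable boundary).
    return torch.load(path, map_location="cpu", weights_only=True)
```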
References
- https://pytorch.org/docs/stable/generated/torch.load.html
- Fix: https://github.com/vllm-project/vllm/pull/12366
CVE-2025-25183
Summary
Maliciously constructed prompts can lead to hash collisions, resulting in prefix cache reuse, which can interfere with subsequent responses and cause unintended behavior.
Details
vLLM's prefix caching makes use of Python's built-in hash() function. As of Python 3.12, the behavior of hash(None) has changed to be a predictable constant value. This makes it more feasible that someone could try to exploit hash collisions.
Impact
The impact of a collision would be using cache that was generated using different content. Given knowledge of prompts in use and predictable hashing behavior, someone could intentionally populate the cache using a prompt known to collide with another prompt in use.
Solution
We address this problem by initializing hashes in vLLM with a value that is no longer constant and predictable; it is different each time vLLM runs. This restores the behavior we had in Python versions prior to 3.12.
Using a hashing algorithm that is less prone to collisions (such as sha256) would be the best way to avoid the possibility of a collision. However, it would have an impact on both performance and memory footprint. Hash collisions may still occur, though they are no longer straightforward to predict.
To give an idea of the likelihood of a collision, for randomly generated hash values (assuming the hash generation built into Python is uniformly distributed), with a cache capacity of 50,000 messages and an average prompt length of 300, a collision will occur on average once every 1 trillion requests.
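A minimal sketch of the mitigation idea follows (hypothetical names, not vLLM's exact code): the hash chain is seeded with a per-process random value so colliding prompts cannot be precomputed across runs, while the built-in hash() is kept for speed.

```python
import secrets

# Per-process random root; before the fix, hash(None) served as the root,
# which Python 3.12 turned into a predictable constant.
NONE_HASH = secrets.randbits(64)

def hash_block(parent_hash: int, token_ids: tuple) -> int:
    # Keep Python's built-in hash() (as the fix does) but chain it from the
    # random root so the resulting cache keys are no longer predictable.
    return hash((parent_hash, token_ids))

# The first block of a prompt hashes from the random root, e.g.:
#   h0 = hash_block(NONE_HASH, (1, 2, 3))
```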
References
- https://github.com/vllm-project/vllm/pull/12621
- https://github.com/python/cpython/commit/432117cd1f59c76d97da2eaff55a7d758301dbc7
- https://github.com/python/cpython/pull/99541
CVE-2025-29770
Impact
The outlines library is one of the backends used by vLLM to support structured output (a.k.a. guided decoding). Outlines provides an optional cache for its compiled grammars on the local filesystem. This cache has been on by default in vLLM. Outlines is also available by default through the OpenAI compatible API server.
The affected code in vLLM is vllm/model_executor/guided_decoding/outlines_logits_processors.py, which unconditionally uses the cache from outlines. vLLM should have this off by default and allow administrators to opt in, due to the potential for abuse.
A malicious user can send a stream of very short decoding requests with unique schemas, resulting in an addition to the cache for each request. This can result in a Denial of Service if the filesystem runs out of space.
Note that even if vLLM was configured to use a different backend by default, it is still possible to choose outlines on a per-request basis using the guided_decoding_backend key of the extra_body field of the request.
This issue applies to the V0 engine only. The V1 engine is not affected.
Patches
The fix is to disable this cache by default since it does not provide an option to limit its size. If you want to use this cache anyway, you may set the VLLM_V0_USE_OUTLINES_CACHE environment variable to 1.
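Sketched below, under the assumption that the opt-in works exactly as described above (the helper name is hypothetical), is what enabling the cache looks like for an operator who accepts unbounded on-disk growth:

```python
import os

# Hypothetical helper mirroring the opt-in check described above; the cache is
# only used when the environment variable is explicitly set to "1".
def outlines_cache_enabled() -> bool:
    return os.environ.get("VLLM_V0_USE_OUTLINES_CACHE", "0") == "1"

# Operators who accept the unbounded on-disk cache can opt in by exporting
# VLLM_V0_USE_OUTLINES_CACHE=1 in the environment before starting the server.
```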
Workarounds
There is no way to work around this issue in existing versions of vLLM other than preventing untrusted access to the OpenAI compatible API server.
References
GHSA-ggpf-24jw-3fcw
Description
https://github.com/vllm-project/vllm/security/advisories/GHSA-rh4j-5rhw-hr54 reported a vulnerability where loading a malicious model could result in code execution on the vLLM host. The fix, which specified weights_only=True in calls to torch.load(), did not solve the problem on PyTorch versions prior to 2.6.0.
PyTorch has issued a new CVE about this problem: https://github.com/advisories/GHSA-53q9-r3pm-6pq6
This means that versions of vLLM using PyTorch before 2.6.0 are vulnerable to this problem.
Background Knowledge
When users install vLLM by following the official manual, the PyTorch version is pinned in the requirements.txt file, so a default installation pulls in PyTorch 2.5.1.
In CVE-2025-24357, weights_only=True was used as the patch, but this is not sufficient: using weights_only=True with PyTorch 2.5.1 and earlier is still unsafe, and this interface can be used to demonstrate that it is not safe.
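As a hedged illustration (not part of vLLM), a deployment can guard against silently relying on weights_only=True with an old PyTorch; the helper name is hypothetical and the `packaging` library is assumed to be available:

```python
import torch
from packaging.version import Version  # assumes the packaging library is installed

def assert_safe_torch_load() -> None:
    # Per GHSA-53q9-r3pm-6pq6, weights_only=True is not a reliable sandbox for
    # untrusted checkpoints on PyTorch older than 2.6.0.
    installed = Version(torch.__version__.split("+")[0])  # drop local tag, e.g. "+cu124"
    if installed < Version("2.6.0"):
        raise RuntimeError(
            f"PyTorch {installed} < 2.6.0: do not load untrusted pickle "
            "checkpoints; upgrade PyTorch or use safetensors."
        )
```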
Fix
Update the PyTorch version to 2.6.0.
Credit
This vulnerability was found by Ji'an Zhou and Li'shuo Song.
CVE-2025-30202
Impact
In a multi-node vLLM deployment, vLLM uses ZeroMQ for some multi-node communication purposes. The primary vLLM host opens an XPUB ZeroMQ socket and binds it to ALL interfaces. While the socket is always opened for a multi-node deployment, it is only used when doing tensor parallelism across multiple hosts.
Any client with network access to this host can connect to this XPUB socket unless its port is blocked by a firewall. Once connected, these arbitrary clients will receive all of the same data broadcasted to all of the secondary vLLM hosts. This data is internal vLLM state information that is not useful to an attacker.
By connecting to this socket many times and never reading the data published to those connections, an attacker can also cause a denial of service by slowing down or potentially blocking the publisher.
Detailed Analysis
The XPUB socket in question is created here:
https://github.com/vllm-project/vllm/blob/c21b99b91241409c2fdf9f3f8c542e8748b317be/vllm/distributed/device_communicators/shm_broadcast.py#L236-L237
Data is published over this socket via MessageQueue.enqueue() which is called by MessageQueue.broadcast_object():
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L452-L453
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/device_communicators/shm_broadcast.py#L475-L478
The MessageQueue.broadcast_object() method is called by the GroupCoordinator.broadcast_object() method in parallel_state.py:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L364-L366
The broadcast over ZeroMQ is only done if the GroupCoordinator was created with use_message_queue_broadcaster set to True:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L216-L219
The only case where GroupCoordinator is created with use_message_queue_broadcaster is the coordinator for the tensor parallelism group:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L931-L936
To determine what data is broadcasted to the tensor parallelism group, we must continue tracing. GroupCoordinator.broadcast_object() is called by GroupCoordinator.broadcast_tensor_dict():
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/parallel_state.py#L489
which is called by broadcast_tensor_dict() in communication_op.py:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/distributed/communication_op.py#L29-L34
If we look at _get_driver_input_and_broadcast() in the V0 worker_base.py, we'll see how this tensor dict is formed:
https://github.com/vllm-project/vllm/blob/790b79750b596043036b9fcbee885827fdd2ef3d/vllm/worker/worker_base.py#L332-L352
but the data actually sent over ZeroMQ is the metadata_list portion that is split from this tensor_dict. The tensor parts are sent via torch.distributed and only metadata about those tensors is sent via ZeroMQ.
https://github.com/vllm-project/vllm/blob/54a66e5fee4a1ea62f1e4c79a078b20668e408c6/vllm/distributed/parallel_state.py#L61-L83
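To illustrate the shape of the eventual fix (see "Don't bind tcp zmq socket to all interfaces", #17197, in the release notes below), here is a minimal pyzmq sketch, not vLLM's actual code, that binds the XPUB socket to a specific driver address instead of all interfaces:

```python
import zmq

def open_broadcast_socket(driver_ip: str, port: int) -> zmq.Socket:
    # Hypothetical helper. Binding to a concrete interface rather than
    # "tcp://*:<port>" means only hosts that can route to that address can
    # subscribe; firewalling the (random) port is still advisable.
    ctx = zmq.Context.instance()
    sock = ctx.socket(zmq.XPUB)
    sock.bind(f"tcp://{driver_ip}:{port}")
    return sock
```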
Patches
Workarounds
Prior to the fix, your options include:
- Do not expose the vLLM host to a network where any untrusted connections may reach the host.
- Ensure that only the other vLLM hosts are able to connect to the TCP port used for the XPUB socket. Note that the port used is random.
References
- Relevant code first introduced in https://github.com/vllm-project/vllm/pull/6183
Release Notes
vllm-project/vllm (vllm)
v0.8.5
This release contains 310 commits from 143 contributors (55 new contributors!).
Highlights
This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling.
Model Support
- Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
- Add ModernBERT (#16648)
- Add Granite Speech Support (#16246)
- Add PLaMo2 (#14323)
- Add Kimi-VL model support (#16387)
- Add Qwen2.5-Omni model support (thinker only) (#15130)
- Snowflake Arctic Embed (Family) (#16649)
- Accuracy fixes for Llama4 Int4 (#16801), chat template for Llama 4 models (#16428), enhanced AMD support (#16674, #16847)
V1 Engine
- Add `structural_tag` support using xgrammar (#17085)
- Disaggregated serving:
- Clean up: Remove Sampler from Model Code (#17084)
- MLA: Simplification to batch P/D reordering (#16673)
- Move usage stats to worker and start logging TPU hardware (#16211)
- Support FlashInfer Attention (#16684)
- Faster incremental detokenization (#15137)
- EAGLE-3 Support (#16937)
Features
- Validate urls object for multimodal content parts (#16990)
- Prototype support sequence parallelism using compilation pass (#16155)
- Add sampling params to `v1/audio/transcriptions` endpoint (#16591)
- Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546)
- Add `vllm bench [latency, throughput]` CLI commands (#16508)
Performance
- Attention:
- MoE:
- Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
- Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)
Hardwares
- TPU:
- AMD:
  - AITER Fused MOE V1 Support (#16752)
  - Integrate Paged Attention Kernel from AITER (#15001)
  - Support AITER MLA (#15893)
  - Upstream prefix prefill speed up for vLLM V1 (#13305)
  - Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
  - Add skinny gemms for unquantized linear on ROCm (#15830)
  - Follow-ups for Skinny Gemms on ROCm (#17011)
Documentation
- Add open-webui example (#16747)
- Document Matryoshka Representation Learning support (#16770)
- Add a security guide (#17230)
- Add example to run DeepSeek with Ray Serve LLM (#17134)
- Benchmarks for audio models (#16505)
Security and Dependency Updates
- Don't bind tcp zmq socket to all interfaces (#17197)
- Use safe serialization and fix zmq setup for mooncake pipe (#17192)
- Bump Transformers to 4.51.3 (#17116)
Build and testing
- Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)
Breaking changes 🚨
- `--enable-chunked-prefill`, `--multi-step-stream-outputs`, and `--disable-chunked-mm-input` can no longer be explicitly set to `False`. Instead, add `no-` to the start of the argument (i.e. `--enable-chunked-prefill` and `--no-enable-chunked-prefill`) (https://github.com/vllm-project/vllm/pull/16533)
What's Changed
- Improve configs - `SchedulerConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16533
- [Misc] remove warning if triton>=3.2.0 by @DefTruth in https://github.com/vllm-project/vllm/pull/16553
- [Misc] refactor examples by @reidliu41 in https://github.com/vllm-project/vllm/pull/16563
- [Misc] Update usage with mooncake lib for kv transfer by @ShangmingCai in https://github.com/vllm-project/vllm/pull/16523
- [fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet by @Shafi-Hussain in https://github.com/vllm-project/vllm/pull/16048
- [Bugfix] Multi-modal caches not acting like LRU caches by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16593
- [TPU][V1] Fix exponential padding when `max-num-batched-tokens` is not a power of 2 by @NickLucche in https://github.com/vllm-project/vllm/pull/16596
- Fix triton install condition on CPU by @hmellor in https://github.com/vllm-project/vllm/pull/16600
- s390x: Fix PyArrow build and add CPU test script for Buildkite CI by @Nash-123 in https://github.com/vllm-project/vllm/pull/16036
- [Model][VLM] Add Kimi-VL model support by @courage17340 in https://github.com/vllm-project/vllm/pull/16387
- [Hardware][TPU] Add torchvision to tpu dependency file by @lsy323 in https://github.com/vllm-project/vllm/pull/16616
- [DOC][TPU] Add core idea about avoiding recompilation after warmup by @yaochengji in https://github.com/vllm-project/vllm/pull/16614
- config check sleep mode support oot platforms by @celestialli in https://github.com/vllm-project/vllm/pull/16562
- [Core][Bugfix] Fix Offline MM Beam Search by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/16390
- [Kernel] moe wna16 marlin kernel by @jinzhen-lin in https://github.com/vllm-project/vllm/pull/14447
- [BugFix]: Update minimum `pyzmq` version by @taneem-ibrahim in https://github.com/vllm-project/vllm/pull/16549
- [Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py by @tlrmchlsmth in https://github.com/vllm-project/vllm/pull/16623
- [Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) by @pooyadavoodi in https://github.com/vllm-project/vllm/pull/16631
- Add `vllm bench [latency, throughput]` CLI commands by @mgoin in https://github.com/vllm-project/vllm/pull/16508
- Fix vLLM x torch.compile config caching by @zou3519 in https://github.com/vllm-project/vllm/pull/16491
- [Misc] refactor argument parsing in examples by @reidliu41 in https://github.com/vllm-project/vllm/pull/16635
- [CI/Build] Fix LoRA OOM by @jeejeelee in https://github.com/vllm-project/vllm/pull/16624
- Add "/server_info" endpoint in api_server to retrieve the vllm_config. by @Cangxihui in https://github.com/vllm-project/vllm/pull/16572
- [Kernel] Remove redundant Exp calculations by @DefTruth in https://github.com/vllm-project/vllm/pull/16123
- [Misc] Update `compressed-tensors` WNA16 to support zero-points by @dsikka in https://github.com/vllm-project/vllm/pull/14211
- [Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server by @angkywilliam in https://github.com/vllm-project/vllm/pull/10546
- [Model] Add PLaMo2 by @Alnusjaponica in https://github.com/vllm-project/vllm/pull/14323
- [Bugfix] fix gpu docker image mis benchmarks dir by @lengrongfu in https://github.com/vllm-project/vllm/pull/16628
- [Misc] Modify LRUCache touch by @jeejeelee in https://github.com/vllm-project/vllm/pull/16689
- Disable remote caching when calling compile_fx by @zou3519 in https://github.com/vllm-project/vllm/pull/16611
- [Feature] add model aware kv ops helper by @billishyahao in https://github.com/vllm-project/vllm/pull/16020
- [ROCM] Bind triton version to 3.2 in requirements-built.txt by @SageMoore in https://github.com/vllm-project/vllm/pull/16664
- [V1][Structured Output] Move xgrammar related utils to `backend_xgrammar.py` by @shen-shanshan in https://github.com/vllm-project/vllm/pull/16578
- [CI] Cleanup `additional_dependencies: [toml]` for pre-commit yapf hook by @yankay in https://github.com/vllm-project/vllm/pull/16405
- [Misc] refactor examples series by @reidliu41 in https://github.com/vllm-project/vllm/pull/16708
- [Doc] Improve OOM troubleshooting by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16704
- [Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel by @DefTruth in https://github.com/vllm-project/vllm/pull/16693
- [Model] support modernbert by @xsank in https://github.com/vllm-project/vllm/pull/16648
- [Hardware] Add processor inputs to platform validation by @joerunde in https://github.com/vllm-project/vllm/pull/16680
- Improve error for structured output backend selection by @hmellor in https://github.com/vllm-project/vllm/pull/16717
- [Misc] Remove redundant comment by @jianzs in https://github.com/vllm-project/vllm/pull/16703
- Help user create custom model for Transformers backend remote code models by @hmellor in https://github.com/vllm-project/vllm/pull/16719
- [V1][Performance] Implement custom serializaton for MultiModalKwargs [Rebased] by @p88h in https://github.com/vllm-project/vllm/pull/16432
- [V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification by @luyuzhe111 in https://github.com/vllm-project/vllm/pull/16636
- Adding vllm buildkite job for IBM Power by @AaruniAggarwal in https://github.com/vllm-project/vllm/pull/16679
- [V1][Frontend] Improve Shutdown And Logs by @robertgshaw2-redhat in https://github.com/vllm-project/vllm/pull/11737
- [rocm][V0] fix selection logic for custom PA in V0 by @divakar-amd in https://github.com/vllm-project/vllm/pull/16426
- [Bugfix] Update Florence-2 tokenizer to make grounding tasks work by @Isotr0py in https://github.com/vllm-project/vllm/pull/16734
- [Bugfix] Revert max_prompt_len validation for decoder-only models. by @davidheineman in https://github.com/vllm-project/vllm/pull/16741
- [V1] Remove log noise when idle by @russellb in https://github.com/vllm-project/vllm/pull/16735
- [Ray] Improve documentation on batch inference by @richardliaw in https://github.com/vllm-project/vllm/pull/16609
- [misc] ignore marlin_moe_wna16 local gen codes by @DefTruth in https://github.com/vllm-project/vllm/pull/16760
- [Doc] Add more tips to avoid OOM by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16765
- [doc] add open-webui example by @reidliu41 in https://github.com/vllm-project/vllm/pull/16747
- [Bugfix] Fix GLM4 model by @intervitens in https://github.com/vllm-project/vllm/pull/16618
- [Doc] Fix a 404 link in installation/cpu.md by @windsonsea in https://github.com/vllm-project/vllm/pull/16773
- [Misc] refactor examples series - lmcache by @reidliu41 in https://github.com/vllm-project/vllm/pull/16758
- Improve configs - `TokenizerPoolConfig` + `DeviceConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16603
- fix: hyperlink by @reidliu41 in https://github.com/vllm-project/vllm/pull/16778
- [Doc] Make sure to update vLLM when installing latest code by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16781
- [Doc] Document Matryoshka Representation Learning support by @noooop in https://github.com/vllm-project/vllm/pull/16770
- [Doc] Changed explanation of generation_tokens_total and prompt_tokens_total counter type metrics to avoid confusion by @insukim1994 in https://github.com/vllm-project/vllm/pull/16784
- [V1][Perf] Faster incremental detokenization by @njhill in https://github.com/vllm-project/vllm/pull/15137
- [Bugfix]Fix index out of range error in api server log by @WangErXiao in https://github.com/vllm-project/vllm/pull/16787
- [Kernel] Add fp8_w8a8 fused MoE kernel tuning configs for DeepSeek V3/R1 on NVIDIA H20 by @Ximingwang-09 in https://github.com/vllm-project/vllm/pull/16753
- [Model] use AutoWeightsLoader for olmoe,opt,orion,persimmon,phi3_small by @lengrongfu in https://github.com/vllm-project/vllm/pull/16548
- [TPU][V1] Fix padding recompilation when `max-num-batched-tokens` is not even by @NickLucche in https://github.com/vllm-project/vllm/pull/16726
- [V1][TPU] Enable Top K by @NickLucche in https://github.com/vllm-project/vllm/pull/15489
- [ROCM] enable aiter fused moe kernel for llama4 bf16 checkpoints by @sijiac in https://github.com/vllm-project/vllm/pull/16674
- [V1][Metrics] Fix http metrics middleware by @markmc in https://github.com/vllm-project/vllm/pull/15894
- [MLA] Simplification to batch P/D reordering by @njhill in https://github.com/vllm-project/vllm/pull/16673
- [P/D][V1] KV Connector API V1 by @ApostaC in https://github.com/vllm-project/vllm/pull/15960
- [Attention] Update to latest FA3 code by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/13111
- Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema by @tarukumar in https://github.com/vllm-project/vllm/pull/16721
- [Doc] Improve help examples for `--compilation-config` by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16729
- [Misc] Update outdated note: LMCache now supports chunked prefill by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/16697
- [V1][Structured Output] Minor modification to `_validate_structured_output()` by @shen-shanshan in https://github.com/vllm-project/vllm/pull/16748
- Add hardware print to TPU V1 test by @mgoin in https://github.com/vllm-project/vllm/pull/16792
- [BugFix] Accuracy fix for llama4 int4 - improperly casted scales by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/16801
- Improve configs - `MultiModalConfig` + `PoolerConfig` + `DecodingConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16789
- [Misc] add collect_env to cli and docker image by @lengrongfu in https://github.com/vllm-project/vllm/pull/16759
- [ROCm] [Attention] Cleanup ROCm output passing by @ProExpertProg in https://github.com/vllm-project/vllm/pull/16431
- [Bugfix] fix pp for llama4 by @luccafong in https://github.com/vllm-project/vllm/pull/16746
- [Doc] add podman setup instructions for official image by @nathan-weinberg in https://github.com/vllm-project/vllm/pull/16796
- [Docs] Fix a link and grammar issue in production-stack.md by @windsonsea in https://github.com/vllm-project/vllm/pull/16809
- [Model] use AutoWeightsLoader for BigCode, GPT-J by @jonghyunchoe in https://github.com/vllm-project/vllm/pull/16823
- [Misc] Clean up Kimi-VL by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16833
- Fix `nullable_kvs` fallback by @hmellor in https://github.com/vllm-project/vllm/pull/16837
- [New Model]: Snowflake Arctic Embed (Family) by @noooop in https://github.com/vllm-project/vllm/pull/16649
- [Misc] refactor examples series - Chat Completion Client With Tools by @reidliu41 in https://github.com/vllm-project/vllm/pull/16829
- [Doc] Updated Llama section in tool calling docs to have llama 3.2 config info by @jmho in https://github.com/vllm-project/vllm/pull/16857
- publish neuron docker image by @omrishiv in https://github.com/vllm-project/vllm/pull/16733
- [Model][VLM] Add Qwen2.5-Omni model support (thinker only) by @fyabc in https://github.com/vllm-project/vllm/pull/15130
- [rocm][MI300] llama4 maverick fp8 moe config tp8 by @divakar-amd in https://github.com/vllm-project/vllm/pull/16847
- [Frontend] Add sampling params to `v1/audio/transcriptions` endpoint by @NickLucche in https://github.com/vllm-project/vllm/pull/16591
- [Misc] Benchmarks for audio models by @NickLucche in https://github.com/vllm-project/vllm/pull/16505
- [V1][Misc] stop update prefix cache stats when logs_stats is disabled by @vie-serendipity in https://github.com/vllm-project/vllm/pull/16460
- [Model] Refactor Phi-4-multimodal to use merged processor and support V1 by @Isotr0py in https://github.com/vllm-project/vllm/pull/15477
- [Model] Qwen2.5-Omni Cleanup by @ywang96 in https://github.com/vllm-project/vllm/pull/16872
- [VLM] Clean up models by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16873
- [doc] update hyperlink by @reidliu41 in https://github.com/vllm-project/vllm/pull/16877
- Log how much time loading a compiled artifact takes by @zou3519 in https://github.com/vllm-project/vllm/pull/16848
- Serialize tensors using int8 views by @p88h in https://github.com/vllm-project/vllm/pull/16866
- Improve configs - `CacheConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16835
- [easy] Pass compile_fx only the config patches by @zou3519 in https://github.com/vllm-project/vllm/pull/16845
- [Bugfix] Fix v1/spec_decode/test_ngram.py by @zixi-qi in https://github.com/vllm-project/vllm/pull/16895
- [CI/CD][V1] Add spec decode tests to CI by @WoosukKwon in https://github.com/vllm-project/vllm/pull/16900
- [Bugfix] Fix distributed bug in Qwen2.5-VL & Qwen2.5-Omni by @fyabc in https://github.com/vllm-project/vllm/pull/16907
- [Doc] Split dummy_processor_inputs() in Multimodal Docs by @alex-jw-brooks in https://github.com/vllm-project/vllm/pull/16915
- Restore buffers when wake up from level 2 sleep (#16564) by @fingertap in https://github.com/vllm-project/vllm/pull/16889
- [Misc] fix collect_env version parse by @wangxiyuan in https://github.com/vllm-project/vllm/pull/15267
- [Misc] Refactor platform to get device specific stream and event by @shen-shanshan in https://github.com/vllm-project/vllm/pull/14411
- [Bugfix] Fix GLM rotary_dim issue and support v1 by @Isotr0py in https://github.com/vllm-project/vllm/pull/16912
- Raise error for data-parallel with benchmark_throughput by @kartikx in https://github.com/vllm-project/vllm/pull/16737
- [XPU][Bugfix] minor fix for XPU by @yma11 in https://github.com/vllm-project/vllm/pull/15591
- [doc] install required python3-dev apt package by @davidxia in https://github.com/vllm-project/vllm/pull/16888
- [Doc] mention how to install in CPU editable mode by @davidxia in https://github.com/vllm-project/vllm/pull/16923
- [Core] Speed up decode by remove synchronizing operation in sampler by @chanh in https://github.com/vllm-project/vllm/pull/16436
- [V1][Spec Decode] Handle draft tokens beyond max_model_len by @WoosukKwon in https://github.com/vllm-project/vllm/pull/16087
- [TPU][V1] Implicitly adjust page size when there's SMEM OOM by @yaochengji in https://github.com/vllm-project/vllm/pull/16871
- Update Qwen1.5-MoE-W4A16-compressed-tensors.yaml by @mgoin in https://github.com/vllm-project/vllm/pull/16946
- [TPU][V1] Capture multimodal encoder during model compilation by @NickLucche in https://github.com/vllm-project/vllm/pull/15051
- [V1] V1 FlashInfer Attention by @mgoin in https://github.com/vllm-project/vllm/pull/16684
- [TPU][V1] Enable Top-P by @NickLucche in https://github.com/vllm-project/vllm/pull/16843
- [Doc] Remove unnecessary V1 flag by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16924
- [BugFix][Spec Decode] No in-place update to draft probs by @WoosukKwon in https://github.com/vllm-project/vllm/pull/16952
- [Bugfix]: fix issue with n>1 sampling on v1 requests overriding each other by @jeffrey-dot-li in https://github.com/vllm-project/vllm/pull/16863
- [ROCm] Add aiter tkw1 kernel for Llama4 fp8 by @kliuae in https://github.com/vllm-project/vllm/pull/16727
- [Misc] Remove the chunked prefill warning for LoRA by @jeejeelee in https://github.com/vllm-project/vllm/pull/16925
- [Kernel] Add expert_map support to Cutlass FP8 MOE by @varun-sundar-rabindranath in https://github.com/vllm-project/vllm/pull/16861
- [V1] Remove additional_config check by @wangxiyuan in https://github.com/vllm-project/vllm/pull/16710
- [Performance][ROCm] Add skinny gemms for unquantized linear on ROCm by @charlifu in https://github.com/vllm-project/vllm/pull/15830
- Support S3 Sharded loading with RunAI Model Streamer by @omer-dayan in https://github.com/vllm-project/vllm/pull/16317
- [Bugfix] Fix f-string for Python 3.9-3.11 by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16962
- [Doc] Update ai_accelerator/hpu-gaudi.inc.md by @windsonsea in https://github.com/vllm-project/vllm/pull/16956
- [Perf] Optimize `_update_states` for GPU model runner by @SnowCharmQ in https://github.com/vllm-project/vllm/pull/16910
- [Bugfix] Fix the issue where llm.generate cannot be called repeatedly after setting GuidedDecodingParams by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/16767
- [Model] Use autoweightloader for mamba by @sfeng33 in https://github.com/vllm-project/vllm/pull/16950
- [V1] Remove pre-allocation for KV cache by @WoosukKwon in https://github.com/vllm-project/vllm/pull/16941
- [Kernel] Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS by @LeiWang1999 in https://github.com/vllm-project/vllm/pull/6036
- [BugFix] Fix incremental detokenization perf issue by @njhill in https://github.com/vllm-project/vllm/pull/16963
- [Doc] Improve documentation for multimodal CLI args by @DarkLight1337 in https://github.com/vllm-project/vllm/pull/16960
- [FEAT][ROCm] Integrate Paged Attention Kernel from AITER by @vllmellm in https://github.com/vllm-project/vllm/pull/15001
- [Misc] refactor example series by @reidliu41 in https://github.com/vllm-project/vllm/pull/16972
- [Bugfix] Fix distributed bug again in Qwen2.5-VL & Qwen2.5-Omni by @fyabc in https://github.com/vllm-project/vllm/pull/16974
- Improve configs - `SpeculativeConfig` by @hmellor in https://github.com/vllm-project/vllm/pull/16971
- [BugFix] Pass in correct VLLM config in FlashInfer backend (#13207) by @timzsu in https://github.com/vllm-project/vllm/pull/16973
- [Misc] Add S3 environment variables for better support of MinIO. by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/16977
- [frontend] enhance tool_calls type check by @reidliu41 in https://github.com/vllm-project/vllm/pull/16882
- [FEAT][ROCm]: Support AITER MLA by @vllmellm in https://github.com/vllm-project/vllm/pull/15893
- Add assertion for no objects while hashing hf_config by @zou3519 in https://github.com/vllm-project/vllm/pull/16930
- Fencing Kernels Tests for enabling on AMD by @Alexei-V-Ivanov-AMD in https://github.com/vllm-project/vllm/pull/16929
- [BugFix] Remove default multiproc executor `collective_rpc` timeout by @njhill in https://github.com/vllm-project/vllm/pull/17000
- [Core][V1][TPU] Enable structured decoding on TPU V1 by @Chenyaaang in https://github.com/vllm-project/vllm/pull/16499
- [Bugfix] validate urls object for multimodal content parts by @gcalmettes in https://github.com/vllm-project/vllm/pull/16990
- add Dockerfile build vllm against torch nightly by @yangw-dev in https://github.com/vllm-project/vllm/pull/16936
- [Kernel][ROCM] Upstream prefix prefill speed up for vLLM V1 by @maleksan85 in https://github.com/vllm-project/vllm/pull/13305
- [V1][DP] More robust DP/EP dummy request coordination by @njhill in https://github.com/vllm-project/vllm/pull/16277
- [BugFix] Revert ROCm Custom Paged Attention Env Flag Check by @vllmellm in https://github.com/vllm-project/vllm/pull/17022
- Revert "[Misc] Add S3 environment variables for better support of MinIO." by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/17021
- [misc] tune some env vars for GB200 by @youkaichao in https://github.com/vllm-project/vllm/pull/16992
- [INTEL-HPU][v0] Port delayed sampling to upstream by @xuechendi in https://github.com/vllm-project/vllm/pull/16949
- [doc] add download path tips by @reidliu41 in https://github.com/vllm-project/vllm/pull/17013
- [Bugfix] Triton FA function takes no keyword arguments by @vllmellm in https://github.com/vllm-project/vllm/pull/16902
- [V1] Avoid socket errors during shutdown when requests are in flight by @njhill in https://github.com/vllm-project/vllm/pull/16807
- [BugFix] llama4 fa3 fix - RuntimeError: scheduler_metadata must have shape (metadata_size) by @LucasWilkinson in https://github.com/vllm-project/vllm/pull/16998
- [Misc] Improve readability of get_open_port function. by @gitover22 in https://github.com/vllm-project/vllm/pull/17024
- [Bugfix] Fix AssertionError: skip_special_tokens=False is not supported for Mistral tokenizers by @chaunceyjiang in https://github.com/vllm-project/vllm/pull/16964
- [CI] Run v1/test_serial_utils.py in CI by @russellb in https://github.com/vllm-project/vllm/pull/16996
- Mistral-format support for compressed-tensors by @mgoin in https://github.com/vllm-project/vllm/pull/16803
- Categorize `tests/kernels/` based on kernel type by @mgoin in https://github.com/vllm-project/vllm/pull/16799
- [Doc] Add top anchor and a note to quantization/bitblas.md by @windsonsea in https://github.com/vllm-project/vllm/pull/17042
- Ensure that `pid` passed to `kill_process_tree` is `int` for `mypy` by @hmellor in https://github.com/vllm-project/vllm/pull/17051
- [CI] Update structured-output label automation by @russellb in https://github.com/vllm-project/vllm/pull/17055
- Improve Transformers backend model loading QoL by @hmellor in https://github.com/vllm-project/vllm/pull/17039
- `CacheConfig.block_size` should always be `int` when used by @hmellor in https://github.com/vllm-project/vllm/pull/17052
- Use `@property` and private field for `data_parallel_rank_local` by @hmellor in https://github.com/vllm-project/vllm/pull/17053
Configuration
📅 Schedule: Branch creation - "" (UTC), Automerge - At any time (no schedule defined).
🚦 Automerge: Enabled.
♻ Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
- [ ] If you want to rebase/retry this PR, check this box
This PR has been generated by Renovate Bot.
Edited/Blocked Notification
Renovate will not automatically rebase this PR, because it does not recognize the last commit author and assumes somebody else may have edited the PR.
You can manually request rebase by checking the rebase/retry box above.
⚠️ Warning: custom changes will be lost.