Luka Govedič
Added support for bfloat16, now that we can detect it on the target architecture.
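A minimal sketch of the kind of capability check involved; whether this change uses the PyTorch helper below is an assumption, not something the commit states:

```python
import torch

def pick_dtype() -> torch.dtype:
    # Prefer bfloat16 when the hardware reports support for it, and
    # fall back to float16 otherwise. torch.cuda.is_bf16_supported()
    # is a real PyTorch helper; using it here is purely illustrative.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16
```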
Added getQuaternion(), a function that returns the quaternion. This is necessary for advanced use cases (Kalman filtering).
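To illustrate why direct quaternion access matters: a filter wants the unprocessed orientation estimate as its measurement, not a derived value. Below, a normalized-lerp blend stands in for a full Kalman update; all names and values are invented for illustration:

```python
import math

def nlerp(q_pred, q_meas, alpha):
    """Blend a predicted and a measured quaternion (w, x, y, z),
    then renormalize so the result stays a unit quaternion."""
    blended = [(1 - alpha) * p + alpha * m for p, m in zip(q_pred, q_meas)]
    norm = math.sqrt(sum(c * c for c in blended))
    return tuple(c / norm for c in blended)

# q_meas would come from the new getQuaternion() accessor; the values
# here are placeholders.
q_pred = (1.0, 0.0, 0.0, 0.0)
q_meas = (0.9990, 0.0436, 0.0, 0.0)  # small rotation about x
print(nlerp(q_pred, q_meas, alpha=0.1))
```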
If creating the V2 pipeline fails, I think it's helpful to print the error.
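A minimal sketch of the pattern; `create_v2_pipeline` and the surrounding code are hypothetical names, not the project's API:

```python
def create_v2_pipeline():
    # Stand-in for the real constructor; raises to show the error path.
    raise RuntimeError("unsupported backend")

def build_pipeline():
    try:
        return create_v2_pipeline()
    except Exception as exc:
        # Print why V2 pipeline creation failed instead of failing silently.
        print(f"Failed to create V2 pipeline: {exc}")
        raise
```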
This is currently a hack, but it would be great to get a version of this into production so that we can use debug_analysis on the pipeline and pass real...
Add a unit test for AQ AZP folding and add an epilogue that supports per-token azp.
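The identity behind AZP folding: when activations are stored zero-point shifted, the integer GEMM of the shifted values differs from the true product by the zero point times the column sums of the weight matrix, so the correction can be applied in the epilogue. A numeric check of that identity, with illustrative shapes and values (not the PR's actual test):

```python
import torch

torch.manual_seed(0)
q   = torch.randint(-128, 128, (4, 8))  # stored activations, zero-point shifted
B_q = torch.randint(-128, 128, (8, 3))  # symmetric int8 weights
azp = torch.randint(-5, 5, (4, 1))      # per-token zero points

# The integer GEMM runs on the shifted values q. Subtracting
# azp * (column sums of B_q) in the epilogue recovers the product of
# the true values (q - azp) -- the folding the unit test exercises.
gemm = q @ B_q
folded = gemm - azp * B_q.sum(dim=0, keepdim=True)
assert torch.equal(folded, (q - azp) @ B_q)
```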
This PR adds kernels for asymmetric quantization of activations. Tests are included.
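For reference, per-token asymmetric quantization picks one scale and one zero point (azp) per row so the full int8 range covers that row's min and max. A sketch of the semantics in plain PyTorch (the kernels themselves are CUDA; this is not the PR's code):

```python
import torch

def quantize_per_token_asym(x: torch.Tensor):
    """Reference per-token asymmetric int8 quantization: one scale and
    one zero point per row of x."""
    qmin, qmax = -128, 127
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    # Guard against all-constant rows producing a zero scale.
    scale = torch.clamp((x_max - x_min) / (qmax - qmin), min=1e-8)
    azp = qmin - torch.round(x_min / scale)
    q = torch.clamp(torch.round(x / scale) + azp, qmin, qmax).to(torch.int8)
    return q, scale, azp

x = torch.randn(4, 16)
q, scale, azp = quantize_per_token_asym(x)
# Dequantize to check the round trip stays close to the input.
x_hat = (q.float() - azp) * scale
print((x - x_hat).abs().max())
```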
This PR resolves #8002 and builds vllm-flash-attn from source. This is required for using torch nightly. This PR relies on the new CMake-based build system in vllm-flash-attn. To make installation...
This PR enables fusing rms_norm and quant ops in the torch.compile backend. It adds all required infrastructure and new fused rms_norm_quant kernels. Only static FP8 quantization is supported in this...
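For context, the unfused sequence the pass targets looks roughly like the sketch below: an RMSNorm followed by a static-scale FP8 quantization, which the fusion replaces with a single kernel. This assumes the standard RMSNorm definition and PyTorch's `float8_e4m3fn`; it is illustrative, not the PR's pattern code:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # Standard RMSNorm: scale by the reciprocal root-mean-square.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

def static_fp8_quant(x: torch.Tensor, scale: torch.Tensor):
    # Static quantization: the scale is precomputed, not derived from x.
    finfo = torch.finfo(torch.float8_e4m3fn)
    return torch.clamp(x / scale, finfo.min, finfo.max).to(torch.float8_e4m3fn)

x = torch.randn(4, 16)
weight = torch.ones(16)
scale = torch.tensor(0.5)
# Unfused: two kernels plus an intermediate tensor. The pass
# pattern-matches this sequence and emits one fused rms_norm_quant kernel.
y = static_fp8_quant(rms_norm(x, weight), scale)
```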
Do not pad the `fp8` operations in the non-cutlass case when compiling, as branch specialization might not work correctly and padding makes fusion difficult. This is a follow-on PR to...
This PR replaces `apply_fp8_linear` and `apply_fp8_linear_generic` with objects so that VllmConfig can be accessed in their `__init__` method as opposed to the `forward` method.
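The shape of that refactor, sketched with illustrative names (the real classes and config fields in the PR may differ):

```python
import torch

class Fp8LinearOp:
    """Callable replacement for a free function like apply_fp8_linear:
    configuration is read once in __init__ rather than on every
    forward() call. Names and config fields here are illustrative."""

    def __init__(self, config: dict):
        # Freeze config-dependent decisions at construction time so the
        # hot path does no config lookups (and traces cleanly).
        self.out_dtype = config["out_dtype"]

    def forward(self, x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        return (x @ weight.t()).to(self.out_dtype)

op = Fp8LinearOp({"out_dtype": torch.bfloat16})
y = op.forward(torch.randn(2, 8), torch.randn(4, 8))
```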