Luka Govedič

Results: 20 issues by Luka Govedič

Added support for bfloat16, as we can now detect it on the architecture.

Added getQuaternion(), a function that returns the quaternion. This is necessary for advanced use cases such as Kalman filtering.

In case someone fails to create the V2 pipeline, I think it's helpful to print the error.

This is currently a hack but it would be great to get a version of this into production so that we can use debug_analysis on the pipeline and pass real...

Add unit test for AQ AZP folding and add epilogue that supports per-token azp.

This PR adds kernels for asymmetric quantization of activations. Tests are included.

ready
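
For readers unfamiliar with the asymmetric activation quantization mentioned above, here is a minimal pure-Python sketch of the general idea: a scale plus a zero point maps an uneven float range onto signed 8-bit integers. The function names are hypothetical and this is not the kernel code from the PR.

```python
def quantize_asymmetric(values, num_bits=8):
    """Quantize floats to signed integers using a scale and a zero point."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    lo, hi = min(values), max(values)
    # Fall back to a scale of 1.0 for constant input, which would give a zero scale.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    # The zero point shifts the range so that lo lands on qmin.
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]
```

With a symmetric scheme the zero point would be fixed at 0; the asymmetric variant spends the full integer range on inputs whose minimum and maximum are not centered around zero.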

This PR resolves #8002 and builds vllm-flash-attn from source. This is required for using torch nightly. This PR relies on the new CMake-based build system in vllm-flash-attn. To make installation...

This PR enables fusing rms_norm and quant ops in the torch.compile backend. It adds all required infrastructure and new fused rms_norm_quant kernels. Only static FP8 quantization is supported in this...
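
To illustrate what fusing these two operations means, the sketch below compares an unfused pipeline (normalize, then scale) with a single-pass version. It is a pure-Python illustration under stated assumptions: all function names are hypothetical, `eps` is an assumed default, and the quantization step only divides by a static scale and clamps to the finite range of the float8 e4m3fn format, without modeling rounding to actual FP8 values.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value of the float8 e4m3fn format


def rms_norm(x, weight, eps=1e-6):
    """Reference RMS norm: divide by the root mean square, then apply weights."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]


def static_scaled_clamp(x, scale):
    """Stand-in for static quantization: divide by a fixed scale and clamp."""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in x]


def fused_rms_norm_quant(x, weight, scale, eps=1e-6):
    """Same math in one pass, without materializing the normalized list."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    inv = 1.0 / (rms * scale)
    return [
        max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v * w * inv))
        for v, w in zip(x, weight)
    ]
```

The fused version avoids writing and re-reading the intermediate normalized values, which is the usual motivation for this kind of fusion.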

Do not pad the `fp8` operations in the non-CUTLASS case when compiling, as branch specialization might not work correctly and padding makes fusion difficult. This is a follow-on PR to...

needs-rebase

This PR replaces `apply_fp8_linear` and `apply_fp8_linear_generic` with objects so that VllmConfig can be accessed in their `__init__` method as opposed to the `forward` method.

ready
v1
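
The function-to-object refactor described in the last entry can be sketched generically as follows. This is an illustration only: `ToyConfig`, `apply_linear_fn`, and `LinearOp` are hypothetical names, not the classes introduced by the PR, and the padding option is an assumed example of a config-dependent decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical stand-in for a global configuration object."""
    pad_output: bool = False


def _matvec(x, w):
    # Each row of w yields one output element.
    return [sum(a * b for a, b in zip(x, row)) for row in w]


def apply_linear_fn(x, w, config):
    """Function style: the config is consulted on every call."""
    out = _matvec(x, w)
    if config.pad_output:
        out += [0.0] * (-len(out) % 4)
    return out


class LinearOp:
    """Object style: the config is read once, at construction time."""

    def __init__(self, config):
        self.pad_output = config.pad_output

    def apply(self, x, w):
        out = _matvec(x, w)
        if self.pad_output:
            # Pad with zeros up to the next multiple of 4 (assumed policy).
            out += [0.0] * (-len(out) % 4)
        return out
```

Resolving configuration in `__init__` keeps the per-call path free of config lookups, so the decision is fixed before the forward pass runs.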