Luka Govedič
Added support for bfloat16, now that we can detect it on the target architecture.
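A minimal sketch of the kind of capability check involved; whether this change uses the PyTorch helper below is an assumption, not something the commit states:

```python
import torch

def pick_dtype() -> torch.dtype:
    # Prefer bfloat16 when the hardware reports support for it, and
    # fall back to float16 otherwise. torch.cuda.is_bf16_supported()
    # is a real PyTorch helper; using it here is purely illustrative.
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
        return torch.bfloat16
    return torch.float16
```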
Added getQuaternion(), a function that returns the quaternion. This is necessary for advanced use cases (Kalman filtering).
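To illustrate why direct quaternion access matters: a filter wants the unprocessed orientation estimate as its measurement, not a derived value. Below, a normalized-lerp blend stands in for a full Kalman update; all names and values are invented for illustration:

```python
import math

def nlerp(q_pred, q_meas, alpha):
    """Blend a predicted and a measured quaternion (w, x, y, z),
    then renormalize so the result stays a unit quaternion."""
    blended = [(1 - alpha) * p + alpha * m for p, m in zip(q_pred, q_meas)]
    norm = math.sqrt(sum(c * c for c in blended))
    return tuple(c / norm for c in blended)

# q_meas would come from the new getQuaternion() accessor; the values
# here are placeholders.
q_pred = (1.0, 0.0, 0.0, 0.0)
q_meas = (0.9990, 0.0436, 0.0, 0.0)  # small rotation about x
print(nlerp(q_pred, q_meas, alpha=0.1))
```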
If creating the V2 pipeline fails, I think it's helpful to print the error.
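A minimal sketch of the pattern; `create_v2_pipeline` and the surrounding code are hypothetical names, not the project's API:

```python
def create_v2_pipeline():
    # Stand-in for the real constructor; raises to show the error path.
    raise RuntimeError("unsupported backend")

def build_pipeline():
    try:
        return create_v2_pipeline()
    except Exception as exc:
        # Print why V2 pipeline creation failed instead of failing silently.
        print(f"Failed to create V2 pipeline: {exc}")
        raise
```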
This is currently a hack, but it would be great to get a version of this into production so that we can use debug_analysis on the pipeline and pass real...
Add a unit test for AQ AZP folding and add an epilogue that supports per-token azp.
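The identity behind AZP folding: when activations are stored zero-point shifted, the integer GEMM of the shifted values differs from the true product by the zero point times the column sums of the weight matrix, so the correction can be applied in the epilogue. A numeric check of that identity, with illustrative shapes and values (not the PR's actual test):

```python
import torch

torch.manual_seed(0)
q   = torch.randint(-128, 128, (4, 8))  # stored activations, zero-point shifted
B_q = torch.randint(-128, 128, (8, 3))  # symmetric int8 weights
azp = torch.randint(-5, 5, (4, 1))      # per-token zero points

# The integer GEMM runs on the shifted values q. Subtracting
# azp * (column sums of B_q) in the epilogue recovers the product of
# the true values (q - azp) -- the folding the unit test exercises.
gemm = q @ B_q
folded = gemm - azp * B_q.sum(dim=0, keepdim=True)
assert torch.equal(folded, (q - azp) @ B_q)
```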
This PR adds kernels for asymmetric quantization of activations. Tests are included.
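For reference, per-token asymmetric quantization picks one scale and one zero point (azp) per row so the full int8 range covers that row's min and max. A sketch of the semantics in plain PyTorch (the kernels themselves are CUDA; this is not the PR's code):

```python
import torch

def quantize_per_token_asym(x: torch.Tensor):
    """Reference per-token asymmetric int8 quantization: one scale and
    one zero point per row of x."""
    qmin, qmax = -128, 127
    x_min = x.amin(dim=-1, keepdim=True)
    x_max = x.amax(dim=-1, keepdim=True)
    # Guard against all-constant rows producing a zero scale.
    scale = torch.clamp((x_max - x_min) / (qmax - qmin), min=1e-8)
    azp = qmin - torch.round(x_min / scale)
    q = torch.clamp(torch.round(x / scale) + azp, qmin, qmax).to(torch.int8)
    return q, scale, azp

x = torch.randn(4, 16)
q, scale, azp = quantize_per_token_asym(x)
# Dequantize to check the round trip stays close to the input.
x_hat = (q.float() - azp) * scale
print((x - x_hat).abs().max())
```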
This PR resolves #8002 and builds vllm-flash-attn from source. This is required for using torch nightly. This PR relies on the new CMake-based build system in vllm-flash-attn. To make installation...
This PR enables fusing rms_norm and quant ops in the torch.compile backend. It adds all required infrastructure and new fused rms_norm_quant kernels. Only static FP8 quantization is supported in this...
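For context, the unfused sequence the pass targets looks roughly like the sketch below: an RMSNorm followed by a static-scale FP8 quantization, which the fusion replaces with a single kernel. This assumes the standard RMSNorm definition and PyTorch's `float8_e4m3fn`; it is illustrative, not the PR's pattern code:

```python
import torch

def rms_norm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # Standard RMSNorm: scale by the reciprocal root-mean-square.
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    return x * torch.rsqrt(variance + eps) * weight

def static_fp8_quant(x: torch.Tensor, scale: torch.Tensor):
    # Static quantization: the scale is precomputed, not derived from x.
    finfo = torch.finfo(torch.float8_e4m3fn)
    return torch.clamp(x / scale, finfo.min, finfo.max).to(torch.float8_e4m3fn)

x = torch.randn(4, 16)
weight = torch.ones(16)
scale = torch.tensor(0.5)
# Unfused: two kernels plus an intermediate tensor. The pass
# pattern-matches this sequence and emits one fused rms_norm_quant kernel.
y = static_fp8_quant(rms_norm(x, weight), scale)
```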
Do not pad the `fp8` operations in the non-cutlass case when compiling, as branch specialization might not work correctly and padding makes fusion difficult. This is a follow-on PR to...
This PR replaces `apply_fp8_linear` and `apply_fp8_linear_generic` with objects so that VllmConfig can be accessed in their `__init__` method as opposed to the `forward` method.
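The shape of that refactor, sketched with illustrative names (the real classes and config fields in the PR may differ):

```python
import torch

class Fp8LinearOp:
    """Callable replacement for a free function like apply_fp8_linear:
    configuration is read once in __init__ rather than on every
    forward() call. Names and config fields here are illustrative."""

    def __init__(self, config: dict):
        # Freeze config-dependent decisions at construction time so the
        # hot path does no config lookups (and traces cleanly).
        self.out_dtype = config["out_dtype"]

    def forward(self, x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
        return (x @ weight.t()).to(self.out_dtype)

op = Fp8LinearOp({"out_dtype": torch.bfloat16})
y = op.forward(torch.randn(2, 8), torch.randn(4, 8))
```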