Luka Govedič

95 comments by Luka Govedič

~~Btw - this broke spec_decoding tests on main because the `EngineArgs` interface changed. Pushing a fix.~~ Fixed in #17754

> Maybe call it input_scale since that coincides with the actual parameter name the values are coming from.

I think that might be confusing just looking at a call to...

> > output_scale=scale, quant_config=(True, True, torch.fp8_e4m3fn)
> > if these are static (not changing for every batch of input), you can store it in the attention class.

I guess that...

I guess my question is how we would access the `forward_context` during `torch.compile`. Is it available?
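To make the question concrete, here is a minimal sketch of the kind of access I mean (the import path and field name are assumptions from memory and may differ across vLLM versions): since custom ops stay opaque to `torch.compile`, the context could in principle be read inside the op body rather than inside the traced graph.

```python
import torch
# Assumed import path; may differ across vLLM versions.
from vllm.forward_context import get_forward_context

@torch.library.custom_op("demo::reads_context", mutates_args=())
def reads_context(x: torch.Tensor) -> torch.Tensor:
    # Custom ops are opaque to torch.compile, so this body runs eagerly at
    # execution time and can read whatever the model runner set for the step.
    ctx = get_forward_context()
    _ = ctx.attn_metadata  # field name assumed
    return x.clone()
```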

> Why store in the attention object when the value is already loaded into o_proj.input_scale which is RowParallelLinear?

The attention object doesn't have access to `o_proj`.

@youkaichao I'm realizing that returning both 16-bit (before fusion) and 8-bit values (after quant fusion) from `unified_attention` might cause trouble, because the fake op (used in tracing) will be wrong...
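A toy illustration of the concern (hypothetical op name and registration, not the real `unified_attention`):

```python
import torch

# Hypothetical toy op, just to illustrate the mismatch: suppose the fusion
# pass makes the runtime implementation return quantized fp8, while the fake
# op registered for tracing still reports the original 16-bit dtype.
@torch.library.custom_op("demo::attn_out", mutates_args=())
def attn_out(q: torch.Tensor) -> torch.Tensor:
    out = q * 2.0                       # stand-in for the attention math
    return out.to(torch.float8_e4m3fn)  # 8-bit after quant fusion

@attn_out.register_fake
def _(q: torch.Tensor) -> torch.Tensor:
    # Tracing plans every downstream meta tensor with q.dtype (e.g. fp16),
    # which no longer matches what the op actually returns at runtime.
    return torch.empty_like(q)
```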

@youkaichao I don't think we have access to the `scale` object in the `fx.Graph` (it's a parameter to the graph), so I think we should pass it to `unified_attention_with_output`. The...
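A rough sketch of what I mean by passing it in (the argument name and signature are illustrative assumptions, not vLLM's actual op): the scale becomes an explicit op input, i.e. a node in the `fx.Graph`, rather than a value a compile-time pass would have to dig out of a module attribute.

```python
from typing import Optional
import torch

# Illustrative signature only; the real unified_attention_with_output has
# more parameters and lives inside vLLM.
@torch.library.custom_op("demo::unified_attention_with_output",
                         mutates_args=("output",))
def unified_attention_with_output(
    query: torch.Tensor,
    output: torch.Tensor,
    output_scale: Optional[torch.Tensor] = None,  # arrives as a graph input
) -> None:
    attn = query * 2.0  # stand-in for the attention computation
    if output_scale is not None:
        # With the scale available at runtime, the op can produce quantized
        # output directly instead of leaving a separate quant node behind.
        attn = (attn / output_scale).to(output.dtype)
    output.copy_(attn)
```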

With full CUDAGraph support, we can set `splitting_ops=[]` and make this fusion work in V1. This RFC has been implemented. If in the future we want to make piecewise compilation...
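For reference, a hedged sketch of what that configuration would look like (field names such as `full_cuda_graph` and the exact import path depend on the vLLM version):

```python
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any model, just an example
    compilation_config=CompilationConfig(
        splitting_ops=[],      # no piecewise splits around attention
        full_cuda_graph=True,  # capture the whole forward, attention included
    ),
)
```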

Yes, #15734 was supposed to be merged after #12591. It got merged first, so attention was briefly broken.

@bigPYJ1151 could you take a look at #7270? I changed the ops bindings for quantization that you added to the CU backend in this PR.