Luka Govedič

95 comments by Luka Govedič

~~Btw - this broke spec_decoding tests on main because the `EngineArgs` interface changed. Pushing a fix.~~ Fixed in #17754

> Maybe call it input_scale since that coincides with the actual parameter name the values are coming from.

I think that might be confusing just looking at a call to...

> > output_scale=scale, quant_config=(True, True, torch.fp8_e4m3fn)
> > if these are static (not changing for every batch of input), you can store it in the attention class.

I guess that...

I guess my question is how we would access the `forward_context` during `torch.compile`. Is it available?
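To make the question concrete, here is a minimal sketch of the kind of access I mean (the import path and field name are assumptions from memory and may differ across vLLM versions): since custom ops stay opaque to `torch.compile`, the context could in principle be read inside the op body rather than inside the traced graph.

```python
import torch
# Assumed import path; may differ across vLLM versions.
from vllm.forward_context import get_forward_context

@torch.library.custom_op("demo::reads_context", mutates_args=())
def reads_context(x: torch.Tensor) -> torch.Tensor:
    # Custom ops are opaque to torch.compile, so this body runs eagerly at
    # execution time and can read whatever the model runner set for the step.
    ctx = get_forward_context()
    _ = ctx.attn_metadata  # field name assumed
    return x.clone()
```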

> Why store in the attention object when the value is already loaded into o_proj.input_scale which is RowParallelLinear?

The attention object doesn't have access to `o_proj`.

@youkaichao I'm realizing that returning both 16-bit (before fusion) and 8-bit values (after quant fusion) from `unified_attention` might cause trouble, because the fake op (used in tracing) will be wrong...
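A toy illustration of the concern (hypothetical op name and registration, not the real `unified_attention`):

```python
import torch

# Hypothetical toy op, just to illustrate the mismatch: suppose the fusion
# pass makes the runtime implementation return quantized fp8, while the fake
# op registered for tracing still reports the original 16-bit dtype.
@torch.library.custom_op("demo::attn_out", mutates_args=())
def attn_out(q: torch.Tensor) -> torch.Tensor:
    out = q * 2.0                       # stand-in for the attention math
    return out.to(torch.float8_e4m3fn)  # 8-bit after quant fusion

@attn_out.register_fake
def _(q: torch.Tensor) -> torch.Tensor:
    # Tracing plans every downstream meta tensor with q.dtype (e.g. fp16),
    # which no longer matches what the op actually returns at runtime.
    return torch.empty_like(q)
```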

@youkaichao I don't think we have access to the `scale` object in the `fx.Graph` (it's a parameter to the graph), so I think we should pass it to `unified_attention_with_output`. The...
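A rough sketch of what I mean by passing it in (the argument name and signature are illustrative assumptions, not vLLM's actual op): the scale becomes an explicit op input, i.e. a node in the `fx.Graph`, rather than a value a compile-time pass would have to dig out of a module attribute.

```python
from typing import Optional
import torch

# Illustrative signature only; the real unified_attention_with_output has
# more parameters and lives inside vLLM.
@torch.library.custom_op("demo::unified_attention_with_output",
                         mutates_args=("output",))
def unified_attention_with_output(
    query: torch.Tensor,
    output: torch.Tensor,
    output_scale: Optional[torch.Tensor] = None,  # arrives as a graph input
) -> None:
    attn = query * 2.0  # stand-in for the attention computation
    if output_scale is not None:
        # With the scale available at runtime, the op can produce quantized
        # output directly instead of leaving a separate quant node behind.
        attn = (attn / output_scale).to(output.dtype)
    output.copy_(attn)
```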

With full CUDAGraph support, we can set `splitting_ops=[]` and make this fusion work in V1. This RFC has been implemented. If in the future we want to make piecewise compilation...
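For reference, a hedged sketch of what that configuration would look like (field names such as `full_cuda_graph` and the exact import path depend on the vLLM version):

```python
from vllm import LLM
from vllm.config import CompilationConfig

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any model, just an example
    compilation_config=CompilationConfig(
        splitting_ops=[],      # no piecewise splits around attention
        full_cuda_graph=True,  # capture the whole forward, attention included
    ),
)
```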

Yes, #15734 was supposed to be merged after #12591. It got merged first, so attention was briefly broken.

@bigPYJ1151 could you take a look at #7270? I changed the ops bindings for quantization that you added to the CU backend in this PR.