shivadbhavsar

Results 15 comments of shivadbhavsar

You can run the PyTorch SDXL on its own on your system, right? In general, we try to avoid duplicating weights when compiling, but sometimes the compilation steps in migraphx...

Initial work resulted in no perf difference. Rocprofiler results on trimmed unet: 1. using `MIGRAPHX_MLIR_USE_SPECIFIC_OPS=attention` - Cache hits are mostly the same with and without reversal (with some being considerably...

Next steps: understand cache hits with even smaller graphs. 1. Performed a test with a `mul -> dot -> add` program, which is compiled as `mul -> dot_add`, where mlir_dot_add is reverse...
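For context, the `mul -> dot -> add` to `mul -> dot_add` rewrite mentioned above rests on a simple algebraic equivalence: the dot and the trailing add can be collapsed into one fused kernel without changing the result. The following is a hypothetical NumPy sketch of that equivalence, not MIGraphX code; the array names, shapes, and the `dot_add` helper are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical operands for a mul -> dot -> add chain.
x = rng.standard_normal((4, 8))
scale = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 16))
bias = rng.standard_normal((4, 16))

# Unfused: three separate ops (mul, then dot, then add).
unfused = (x * scale) @ w + bias

def dot_add(a, b, c):
    # Stand-in for a fused GEMM + elementwise-add kernel.
    return a @ b + c

# Fused form: mul feeds a single dot_add-style op.
fused = dot_add(x * scale, w, bias)

assert np.allclose(unfused, fused)
```

The fusion is safe precisely because the two forms are numerically equivalent; any perf difference comes from kernel launch and memory-traffic savings, not from changing the math.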

SDXL perf results for reference: Torch-MIGraphX (end to end): before PR: 2850 ms, with PR: 2801 ms. ONNX UNet (4x attn trim): before PR: 5.54 ms, after PR: 5.52 ms...

Even with #3659, the flux model doesn't give a proper output when using `MIGRAPHX_DISABLE_LAYERNORM_FUSION=1`. Need to resolve that before we can remove this.
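For reference, disabling the layernorm fusion presumably falls back to the decomposed sequence of reduce and elementwise ops, which the fused kernel is supposed to match. A minimal NumPy sketch of that reference computation (the shapes and epsilon here are illustrative, not taken from the flux model):

```python
import numpy as np

def layernorm_reference(x, eps=1e-5):
    # Decomposed layernorm: mean/variance reductions plus
    # elementwise normalize, over the last axis.
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))
y = layernorm_reference(x)

# Each row of the output should have ~zero mean.
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
```

Comparing the model's output against a decomposed reference like this is one way to localize whether the bad output comes from the fused kernel or from elsewhere in the graph.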

Agreed. Is that a change in migraphx, or is that something mlir needs to support first?

Small repro:
```
p = migraphx.program()
mm = p.get_main_module()
s1 = migraphx.shape(lens=[4096, 768], type="float_type")
in1 = mm.add_parameter("x", s1)
in1 = mm.add_instruction(migraphx.op("reshape", dims=[2, 2048, 768]), [in1])
in1 = mm.add_instruction(migraphx.op("reshape", dims=[2, -1,...
```

Here's when the issue starts:
```
Pass: fuse_reduce
Pass: dead_code_elimination
x2 = @param:x2 -> float_type, {2, 12, 2048, 2048}, {50331648, 4194304, 2048, 1}
x = @param:x -> float_type, {4096, 768},...
```

Here is some example code used for the experiment. 1. MLIR multi-output fusion (in the fuse_mlir pass) (this needs to be refactored to account for incoming changes: #3569 and #3752...