Results: 7 issues of Jingyue Wu

The description of the added compile option explains what this optimization does. The optimization is disabled by default for now. I'll try to enable it by default or even always...

#57's description and https://github.com/Lightning-AI/lit-thunder-LEGACY/pull/2480#issuecomment-2013537240 provide the context. The proposed improvement is to properly propagate graph-not-supported errors from the cudnn backend to the frontend as distinguishable exceptions. This way, we can...
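A minimal sketch of what "distinguishable exceptions" could look like, assuming a dedicated exception type (the class name `CudnnGraphNotSupportedError` and the helper `run_with_fallback` are hypothetical, not Thunder's actual API):

```python
class CudnnGraphNotSupportedError(RuntimeError):
    """Hypothetical error raised when the cudnn backend cannot build a graph."""


def run_with_fallback(run_cudnn, run_fallback):
    try:
        return run_cudnn()
    except CudnnGraphNotSupportedError:
        # Only the distinguishable "graph not supported" case falls back;
        # genuine cudnn bugs still propagate as ordinary errors.
        return run_fallback()
```

The point of a distinct type is that the frontend can catch exactly the unsupported-graph case instead of matching on error-message strings.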

enhancement
cudnn

[Benchmark results](https://gist.github.com/wujingyue/ef92da74ba519987a4a4c764865dd481) don't look good enough at this moment to merge.

Highlights:
- test_nanogpt_layer_norm[forward-thunder]
- test_litgpt_qkv_split_rope for phi-2

Lowlights:
- test_nanogpt_gpt2[inference-thunder]
- test_llama_2_7b_hf[inference-thunder]
- test_llama_2_7b_hf[forward-thunder]
- test_llama2_causal_self_attention_7b[inference-thunder]
- test_llama2_causal_self_attention_7b[forward-thunder]
- ...

nvfuser

Instead, check whether the script is running under nsys via the `NSYS_PROFILING_SESSION_ID` environment variable. Note that it's still possible to profile warmup iterations -- just don't specify `--capture-range cudaProfilerStart` in the `nsys` command. This...
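The environment-variable check described above can be sketched as follows (the helper name `running_under_nsys` is an assumption for illustration; nsys sets `NSYS_PROFILING_SESSION_ID` in the environment of the process it launches):

```python
import os


def running_under_nsys() -> bool:
    # nsys exports NSYS_PROFILING_SESSION_ID into the profiled process's
    # environment, so its presence indicates an active nsys session.
    return "NSYS_PROFILING_SESSION_ID" in os.environ
```

A benchmark script can then gate profiler-only work (e.g. NVTX range markers or `torch.cuda.cudart().cudaProfilerStart()`) behind this check instead of requiring an explicit flag.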

This gives a fair comparison between eager mode and the other modes. The constraints mentioned in the comment appear to have been fixed by https://github.com/pytorch/pytorch/pull/161407. `python thunder/benchmarks/benchmark_inference.py` at head runs fine on...