Christian Sarofeen
Good point on the whitespace. I'll start working on an example.
You're correct @ngimel. We can end up generating many kernels; if we want to limit the number of kernels we generate, we would need to implement coarser-grained heuristics (definitely possible...
The highest level that Eddie was working on, or the heuristics? Recompilations triggered by heuristics depend entirely on heuristic changes and are subject to change from one release to the next. Practically...
FYI I intend to review (can't set myself as a reviewer)
> cc @csarofeen for regressions in backward, my understanding was that (at least for not-channels-last) aot is a win.

I haven't seen significant regressions in backwards except in channels last...
We explicitly tested on the 1.12 release, CC @ptrblck and @kevinstephano in case we were testing something slightly different. Definitely keep us posted, we're highly motivated to get our codegen in...
I did a sweep of LayerNorm FWD and BWD on the sizes I generally use for my "TIMM micro benchmarks": Product of: N [8, 16, 32, 64, 128, 256] C...
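A minimal sketch of what such a sweep could look like, assuming 2D (N, C) inputs on CUDA; the C values and iteration counts below are placeholders, since the original size list is truncated:

```python
import itertools
import torch

# N values from the comment above; C values are illustrative placeholders.
N_sizes = [8, 16, 32, 64, 128, 256]
C_sizes = [256, 512, 1024, 2048]

def bench_layernorm(N, C, iters=100):
    x = torch.randn(N, C, device="cuda", requires_grad=True)
    ln = torch.nn.LayerNorm(C).cuda()
    grad = torch.randn_like(x)

    # Warm-up so timing excludes compilation / autotuning overhead.
    for _ in range(10):
        ln(x).backward(grad)
        x.grad = None
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        y = ln(x)          # FWD
        y.backward(grad)   # BWD
        x.grad = None
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per FWD+BWD

for N, C in itertools.product(N_sizes, C_sizes):
    print(f"N={N:4d} C={C:5d}: {bench_layernorm(N, C):.3f} ms")
```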
Benchmark changes for what it's worth: https://github.com/csarofeen/pytorch/pull/1833
I think it would yield marginal gains, but yeah, I'm trying to figure out what XLA is doing that's so amazing, or what we're doing that's so bad, that nvFuser...
If Apex LN is working for you, go for it. It's disappointing because the big perf difference is highly unlikely to come from the generated code itself, but rather from the integration in...
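For reference, a minimal sketch of dropping in Apex's fused LayerNorm where it's available; the fallback path and the `1024` hidden size are illustrative assumptions, not part of the original comment:

```python
try:
    # Use Apex's fused LayerNorm kernel when Apex is installed.
    from apex.normalization import FusedLayerNorm as LayerNorm
except ImportError:
    # Fall back to the stock PyTorch implementation otherwise.
    from torch.nn import LayerNorm

norm = LayerNorm(1024).cuda()
```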