[WIP] Use sync-free cuda event timing in benchmark
This PR is not the final form of the torchbench code changes. It is meant to start a discussion on how we should implement a sync-free CUDA event timing mechanism.
This PR uses sync-free CUDA event timing, as suggested in https://github.com/pytorch/pytorch/issues/93767
I've created a snippet to show the idea: https://gist.github.com/xwang233/f00433a7826f485858ff0eaa59b3bd59
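For discussion, here is a minimal sketch of what I understand "sync-free" event timing to mean: record a pair of CUDA events around each iteration, let the CPU keep enqueueing work, and synchronize only once at the end before reading the elapsed times. This is not the code from the gist or from this PR. The helper name `time_sync_free` and its `make_event` / `synchronize` parameters are hypothetical and exist only so the timing loop can be exercised without a GPU. The defaults assume `torch.cuda.Event(enable_timing=True)` and `torch.cuda.synchronize`.

```python
from typing import Any, Callable, List, Optional


def time_sync_free(
    fn: Callable[[], Any],
    iters: int = 10,
    warmup: int = 3,
    make_event: Optional[Callable[[], Any]] = None,   # hypothetical hook
    synchronize: Optional[Callable[[], None]] = None,  # hypothetical hook
) -> List[float]:
    """Return per-iteration times in ms with a single sync at the end."""
    if make_event is None or synchronize is None:
        # Import lazily so the helper can be tested with fake events.
        import torch

        if make_event is None:
            make_event = lambda: torch.cuda.Event(enable_timing=True)
        if synchronize is None:
            synchronize = torch.cuda.synchronize

    # Warm-up runs are not timed.
    for _ in range(warmup):
        fn()

    # Allocate all events up front so the timed loop only records them.
    starts = [make_event() for _ in range(iters)]
    ends = [make_event() for _ in range(iters)]

    for start, end in zip(starts, ends):
        start.record()  # enqueued on the current stream, no host sync
        fn()
        end.record()

    # One synchronization for the whole run, instead of one per iteration.
    synchronize()

    # elapsed_time() is only valid once both events have completed.
    return [s.elapsed_time(e) for s, e in zip(starts, ends)]
```

On a CUDA machine this would be called as `times_ms = time_sync_free(lambda: model(x))`, where `model` and `x` are placeholders for whatever the benchmark runs. The point of the design is that the CPU never waits for the GPU inside the timed loop, so the measurement does not include a per-iteration synchronization stall.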