Use __shfl_down for CUDA affine interp backward

Open jacobhinkle opened this issue 7 years ago • 0 comments

cf. https://devblogs.nvidia.com/faster-parallel-reductions-kepler/

Shuffle intrinsics on NVIDIA GPUs can dramatically speed up custom reductions. The method I currently use relies on a lot of thread synchronization, so there is likely substantial room for improvement. This should probably wait until we have a benchmarking suite in place.
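For reference, here is a minimal sketch of the warp-level pattern from the linked post. It is not the existing backward kernel: `warp_reduce_sum`, `sum_kernel`, and `sum_on_device` are hypothetical names, and the kernel just sums a flat array to show the idea. It uses `__shfl_down_sync`, which replaces the deprecated `__shfl_down` on CUDA 9 and later, and it assumes the block size is a multiple of 32 so the full-warp mask is valid.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Sum `val` across the lanes of a warp using register shuffles.
// After the loop, lane 0 holds the total; no shared memory or
// __syncthreads() is needed within the warp.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}

// Hypothetical kernel: each thread accumulates a grid-stride partial sum,
// each warp reduces its partials with shuffles, and lane 0 of every warp
// adds the warp total to the global result.
__global__ void sum_kernel(const float* in, float* out, int n) {
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        acc += in[i];
    }
    acc = warp_reduce_sum(acc);
    if ((threadIdx.x & (warpSize - 1)) == 0) {
        atomicAdd(out, acc);
    }
}

// Host helper (assumption, for illustration only): copies the input to the
// device, runs the kernel, and returns the sum. Error checking is omitted.
float sum_on_device(const std::vector<float>& host) {
    const int n = static_cast<int>(host.size());
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));

    const int threads = 256;  // multiple of 32, so every warp is full
    int blocks = (n + threads - 1) / threads;
    if (blocks < 1) blocks = 1;
    sum_kernel<<<blocks, threads>>>(d_in, d_out, n);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Whether this actually beats the current synchronization-heavy approach for the affine interp backward pass would need to be measured, which is another reason to have benchmarks first.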

Nov 20 '18 17:11 jacobhinkle