lagomorph
Use __shfl_down for cuda affine interp backward
cf. https://devblogs.nvidia.com/faster-parallel-reductions-kepler/
Shuffle intrinsics on NVIDIA GPUs can dramatically speed up custom reductions. The method I currently use relies on a lot of thread synchronization, so there is probably significant room for improvement. This should probably wait until we have a benchmarking suite in place, so that any speedup can be measured.
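For reference, here is a minimal sketch of the shuffle-based reduction pattern described in the linked post. It is not lagomorph's actual kernel: `warp_reduce_sum`, `sum_kernel`, and `sum_on_device` are hypothetical names, and a plain sum over a float array stands in for whatever the affine interp backward pass accumulates. It uses `__shfl_down_sync`, since the unsuffixed `__shfl_down` is deprecated as of CUDA 9, and it assumes the block size is a multiple of 32 so that every warp is full.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Sum `val` across the 32 lanes of a warp using register shuffles.
// No shared memory and no __syncthreads(); lane 0 ends up with the total.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2) {
        // Full mask: assumes all 32 lanes of the warp reach this call.
        val += __shfl_down_sync(0xffffffffu, val, offset);
    }
    return val;
}

// Hypothetical kernel: each thread accumulates a grid-stride partial sum,
// each warp reduces its partials with shuffles, and lane 0 of every warp
// adds the warp total to the output with a single atomicAdd.
__global__ void sum_kernel(const float* in, float* out, int n) {
    float partial = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        partial += in[i];
    }
    partial = warp_reduce_sum(partial);
    if ((threadIdx.x & (warpSize - 1)) == 0) {
        atomicAdd(out, partial);
    }
}

// Host helper (hypothetical): copies data to the device, launches the
// kernel, and returns the sum. Error checking is omitted for brevity.
float sum_on_device(const std::vector<float>& host) {
    const int n = static_cast<int>(host.size());
    float* d_in = nullptr;
    float* d_out = nullptr;
    float result = 0.0f;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(float));
    // Block size must be a multiple of 32 for the full-mask shuffle above.
    sum_kernel<<<64, 256>>>(d_in, d_out, n);
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

The only cross-thread communication here is the shuffle inside each warp plus one atomic per warp, so there are no block-wide barriers. Whether this actually beats the current approach is exactly what the benchmarking suite would need to show.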