softmax kernel
@SimonDanisch @MikeInnes What was the conclusion regarding this PR?
I have a working, reasonably fast, but not very generic CUDA softmax in https://github.com/jekbradbury/Transformer.jl/blob/master/src/kernels.jl
Yeah looks like it’s relatively CUDA-specific.
I wonder if it would be easier to port James's kernel to OpenCL than to write an OpenCL softmax kernel from scratch.
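For reference, the operation the kernel implements is a numerically stable softmax over one dimension. A plain-CPU Julia sketch of what the GPU kernel computes (illustrative only; the name `softmax_ref` is not from the linked code):

```julia
# Numerically stable softmax over the first dimension of a matrix,
# as a CPU reference for what the CUDA kernel computes.
function softmax_ref(x::AbstractMatrix)
    m = maximum(x, dims=1)   # subtract the per-column max for numerical stability
    e = exp.(x .- m)
    return e ./ sum(e, dims=1)
end
```

The GPU version fuses the max, exp, and sum passes into one kernel using shared memory for the reductions, which is where the static vs. dynamic shared-memory question below comes in.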
> I wonder if it would be easier to port James's kernel to OpenCL versus
We can just port it to Julia in a way that works with both CLArrays and CuArrays.
I already took a look at it; the only thing holding us back is that @jekbradbury used dynamic shared memory, which behaves a bit peculiarly compared to CuStaticSharedMem (which is also supported by CLArrays when you use the GPUArrays version). I had a stab at supporting dynamic shared memory in GPUArrays in a vendor-independent way, but couldn't implement it in the time frame I set myself... In theory it's quite straightforward, and I should make a PR out of what I had ;)
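For context, the difference in CUDAnative terms looks roughly like this (a sketch; the exact launch syntax depends on the CUDAnative version, and the kernel bodies are placeholders):

```julia
using CUDAnative

function kernel_static()
    # Size fixed at compile time — this is the form GPUArrays/CLArrays can map
    # to their vendor-independent shared/local memory.
    tmp = @cuStaticSharedMem(Float32, 256)
    # ... reduction over tmp ...
    return nothing
end

function kernel_dynamic(n)
    # Size chosen at launch time; the byte count must be passed to the launch
    # via the shmem argument, which GPUArrays has no portable equivalent for yet.
    tmp = @cuDynamicSharedMem(Float32, n)
    # ... reduction over tmp ...
    return nothing
end

# Launches would look roughly like (syntax varies by CUDAnative version):
# @cuda threads=256 kernel_static()
# @cuda threads=256 shmem=256*sizeof(Float32) kernel_dynamic(256)
```

Since the softmax kernel's shared-memory size is just a function of the block size, it could plausibly be rewritten with a static allocation, which would sidestep the portability issue.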
I don't know if there's any particular reason Marian-NMT used dynamic shared memory for this rather than static. (Also, this kernel contains a reasonably fast mapreducedim implementation for reductions over the inner dimension, so it would be useful to include that separately if someone works on porting.)
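For anyone picking this up, the operation that mapreducedim-style kernel computes is, in base-Julia terms (illustrative; modern `dims` keyword syntax):

```julia
x = rand(Float32, 4, 8)

# Reduce over the first (column-major "inner") dimension:
s = sum(x, dims=1)                  # plain sum reduction
r = mapreduce(abs, max, x, dims=1)  # the general map + reduce form
```

The GPU kernel's value is doing this reduction with one thread block per output element and a shared-memory tree reduction, rather than the generic fallback.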