
softmax kernel

SimonDanisch opened this issue 7 years ago · 5 comments

SimonDanisch avatar Feb 26 '18 22:02 SimonDanisch

@SimonDanisch @MikeInnes What was the conclusion regarding this PR?

DilumAluthge avatar Jul 16 '18 22:07 DilumAluthge

I have a working, reasonably fast, but not very generic CUDA softmax in https://github.com/jekbradbury/Transformer.jl/blob/master/src/kernels.jl
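
For reference, the operation itself is the usual numerically stable softmax; a minimal back-end-agnostic sketch in plain broadcasts and dims-reductions (Julia 1.x syntax, not the actual Transformer.jl kernel) looks like:

```julia
# Column-wise softmax: subtract the per-column maximum for numerical stability,
# exponentiate, then normalize. Works for any array type that supports
# broadcasting and dims-reductions.
function softmax_cols(x::AbstractMatrix)
    m = maximum(x; dims = 1)       # per-column maxima
    e = exp.(x .- m)
    return e ./ sum(e; dims = 1)   # per-column normalization
end
```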

jekbradbury avatar Jul 17 '18 07:07 jekbradbury

Yeah looks like it’s relatively CUDA-specific.

I wonder if it would be easier to port James’s kernel to OpenCL versus writing the OpenCL softmax kernel from scratch.



DilumAluthge avatar Jul 17 '18 16:07 DilumAluthge

I wonder if it would be easier to port James’s kernel to OpenCL versus

We can just port it to Julia in a way that works with both CLArrays + CuArrays.

I already took a look at it - the only thing holding us back is that @jekbradbury used dynamic shared memory, which behaves a bit peculiarly compared to CuStaticSharedMem (which is also supported by CLArrays when you use the GPUArrays version). I had a stab at supporting dynamic shared memory in GPUArrays in a vendor-independent way, but couldn't implement it in the time frame I set myself... In theory it's quite straightforward, and I should make a PR out of what I had ;)
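
To make the static/dynamic distinction concrete, here is a rough CUDAnative-style sketch of a per-column reduction using static shared memory; the intrinsic names, the fixed 256-thread block size, and the launch syntax are assumptions of this sketch, not the vendor-independent GPUArrays API being discussed:

```julia
using CUDAnative  # newer versions expose equivalent intrinsics via CUDA.jl

# One block per column, fixed 256-thread blocks, Float32 for simplicity.
# The key point: the shared-memory size is a compile-time constant, which is
# what makes the static variant easy to abstract over back-ends.
function colmax_kernel(out, x)
    col = blockIdx().x
    tid = threadIdx().x
    shared = @cuStaticSharedMem(Float32, 256)  # static: size known at compile time

    # Each thread folds a strided slice of its column into a private accumulator.
    acc = -Inf32
    i = tid
    while i <= size(x, 1)
        acc = max(acc, x[i, col])
        i += blockDim().x
    end
    shared[tid] = acc
    sync_threads()

    # Block-wide tree reduction in shared memory.
    stride = blockDim().x ÷ 2
    while stride >= 1
        if tid <= stride
            shared[tid] = max(shared[tid], shared[tid + stride])
        end
        sync_threads()
        stride ÷= 2
    end

    if tid == 1
        out[col] = shared[1]
    end
    return nothing
end

# Launch sketch (exact @cuda keyword syntax varies across CUDAnative versions):
# @cuda blocks=size(x, 2) threads=256 colmax_kernel(out, x)
```

The dynamic variant would use @cuDynamicSharedMem with the allocation size supplied at launch time instead, which is exactly the part that is awkward to expose through a vendor-neutral API.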

SimonDanisch avatar Jul 17 '18 18:07 SimonDanisch

I don't know if there's any particular reason Marian-NMT used dynamic shared memory for this rather than static. (Also, this kernel contains a reasonably fast mapreducedim implementation for reductions over the inner dim, so it would be useful to pull that out separately if someone works on porting.)
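
For anyone picking that up: the two reductions a softmax needs are precisely such inner-dim mapreduces, so a generic version of that piece already covers, e.g. (Julia 1.x spelling of the old mapreducedim; dims = 1 assumes the reduced dimension is the first, contiguous one):

```julia
m = mapreduce(identity, max, x; dims = 1)   # per-column maxima
s = mapreduce(exp, +, x .- m; dims = 1)     # per-column normalizers
y = exp.(x .- m) ./ s                       # softmax from the two reductions
```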

jekbradbury avatar Jul 17 '18 19:07 jekbradbury