Atomic attempts 3
This re-implements the `atomic_...` functions from #282 so I can keep using them if I need them for testing down the road. Also, I can't quite get Atomix to work for some of my use cases, so I still need the "primitives". The Atomix tests have been moved to #308.
I split off the Atomix tests because they are not passing yet. This PR currently passes all the tests locally and most of the tests here (I don't yet know the root cause of the Buildkite errors). A few users have specifically requested Core.Intrinsics / CUDA atomic functionality, and this PR serves that role.
More than that, it's currently working. I feel it's worth merging now even if the core functionality is eventually replaced by Atomix.
For those who are struggling with Atomix in KA, you can use this PR with:

```
] add https://github.com/leios/KernelAbstractions.jl#atomic_attempts_3
] add https://github.com/leios/KernelAbstractions.jl:lib/CUDAKernels#atomic_attempts_3
```

The first line adds KA; the second adds CUDAKernels.
Also: examples of how to use each function can be found in the tests (https://github.com/leios/KernelAbstractions.jl/blob/atomic_attempts_3/test/atomic_test.jl). There are doc strings as well, but they are not built anywhere and essentially mirror the CUDA doc strings.
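As a rough sketch of what usage looks like, here is a histogram-style kernel using `atomic_add` (the function name mentioned later in this thread; the exact signature follows the CUDA doc strings, so check the linked tests before copying this):

```julia
using KernelAbstractions

# Sketch only: many work-items may increment the same bin, so the
# read-modify-write must be atomic. `atomic_add` is assumed to take a
# pointer and a value, mirroring CUDA.atomic_add! — verify against the
# tests in this branch.
@kernel function histogram_kernel!(hist, input)
    i = @index(Global, Linear)
    atomic_add(pointer(hist, input[i]), one(eltype(hist)))
end

hist  = zeros(Int32, 16)
input = rand(1:16, 1024)
histogram_kernel!(CPU(), 64)(hist, input; ndrange = length(input))
```

This is the classic case where a plain `hist[input[i]] += 1` silently drops counts under contention.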
Just FYI, I'm almost sure Atomix + AtomixCUDA.jl, instead of Atomix + UnsafeAtomicsLLVM.jl, already covers what this PR does. I designed Atomix.jl to provide various dispatch points and to be flexible enough to cover this. Just load AtomixCUDA.jl before using KA + Atomix.jl + CUDA.jl.
As mentioned in https://github.com/JuliaConcurrent/Atomix.jl/issues/33, the UnsafeAtomicsLLVM.jl approach needs some @device_override in CUDA.jl to work around some LLVM codegen limitations. I suggested Atomix + UnsafeAtomicsLLVM.jl since it covers histogram-like use cases and I thought that's what people needed at the moment. If you want to invoke other atomic operations but prefer to avoid a PR to CUDA.jl, I feel it would be cleaner to resurrect AtomixCUDA.jl and use Atomix as the default API. That way, we can gradually drop "behind the scenes" hacks like AtomixCUDA.jl and UnsafeAtomicsLLVM.jl while minimizing the effect on user code.
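For comparison, the Atomix route described above looks roughly like this (a sketch: `Atomix.@atomic` is the documented entry point, and loading AtomixCUDA.jl first is assumed to route the operations to `CUDA.atomic_*` on the CUDA backend):

```julia
using AtomixCUDA               # load first so CUDA dispatch is in place
using Atomix, KernelAbstractions, CUDA

# Same histogram pattern, expressed through the Atomix macro API.
# Atomix.@atomic lowers the += to a backend-appropriate atomic RMW.
@kernel function histogram!(hist, input)
    i = @index(Global, Linear)
    Atomix.@atomic hist[input[i]] += 1
end
```

The appeal of this route is that the kernel body is backend-agnostic; the atomic lowering is chosen by whichever Atomix backend package is loaded.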
I have no problem with Atomix being the default API in the long run (though I have a somewhat strong preference for just using `atomic_add(...)` instead of a macro for the same thing). The truth is that there are a few edge cases I want to test with Atomix before I feel comfortable using it, which is why I wanted to package it as a separate PR in #282 and said #308 was not ready to be merged.
My main issue with Atomix is that it currently fails on somewhat basic tasks, like adding floats on the CPU or subtracting numbers on the GPU (#308). I get that there are ways around this, which is why I separated out #308. The hope is that we can figure out how to get the tests to pass so we can have proper atomic functionality in KA.
Also: I squashed this PR so it's a single commit. Once Atomix is actually working, we can just revert this one. I understand that in the long run we shouldn't have multiple atomic APIs, but Julia currently has a bunch of different APIs, so I don't see the problem in giving people some choice while Atomix stabilizes.
> though I have a somewhat strong preference to just using `atomic_add(...)` instead of using a macro for the same thing
I totally agree that there should always be a non-macro-based interface. This is why Atomix.jl has non-macro interfaces: https://juliaconcurrent.github.io/Atomix.jl/dev/interface/. UnsafeAtomics.jl could also potentially become a non-macro "API" that covers what this PR does and more (i.e., it can specify ordering), although it's not documented (and hence not technically a public interface).
> like adding floats on the CPU
As of https://github.com/JuliaConcurrent/UnsafeAtomics.jl/pull/9, more operations support floats.
> subtracting numbers on the GPU
AtomixCUDA.jl should be able to cover this, since it simply dispatches to `CUDA.atomic_*`, like this PR does.
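Assuming that dispatch works as described, the GPU subtraction case would look something like this sketch (inside a device kernel; the `CUDA.atomic_sub!` call is the intrinsic the Atomix form is expected to lower to):

```julia
using AtomixCUDA, Atomix, CUDA

# High-level form: expected to become an atomic subtract on the device.
Atomix.@atomic x[i] -= one(eltype(x))

# Roughly equivalent low-level CUDA.jl intrinsic (pointer-based):
CUDA.atomic_sub!(pointer(x, i), one(eltype(x)))
```

If the Atomix form above fails while the intrinsic succeeds (the situation reported in #308), that would point at the dispatch layer rather than the CUDA intrinsics themselves.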