DaggerGPU.jl
GPU integrations for Dagger.jl
Currently we synchronize on the host after each kernel, which is unlikely to provide competitive performance vs. standard task-based GPU programming. We should consider some way to track the streams...
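The stream-tracking idea can be sketched with plain CUDA.jl (this is not DaggerGPU code; `async_scale!` is a hypothetical helper): work is launched on a non-blocking stream, and the host synchronizes once at the point of use instead of after every kernel.

```julia
# Sketch only; requires a CUDA-capable GPU to run.
using CUDA

function async_scale!(y::CuArray, x::CuArray, a)
    s = CuStream(; flags=CUDA.STREAM_NON_BLOCKING)
    stream!(s) do
        y .= a .* x          # kernel is launched asynchronously on `s`
    end
    return s                  # hand the stream back instead of synchronizing here
end

x = CUDA.rand(Float32, 1024)
y = similar(x)
s = async_scale!(y, x, 2f0)
synchronize(s)                # one sync at the point of use, not per kernel
```

Tracking which stream produced each array would let dependent tasks wait on (or chain to) that stream rather than blocking the host.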
We have these for CUDA, but we should also have them for ROCm. The APIs are likely nearly identical, so this should be straightforward.
Currently we only convert between plain, dense array types, and don't handle wrappers like `Adjoint`, which is unfortunate. We should use Adapt more reliably for arbitrary objects so that...
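Adapt already knows how to recurse through LinearAlgebra wrappers such as `Adjoint`, so a backend only needs an `adapt_storage` rule for its array type. A minimal CPU-only sketch with a toy array type (`MyArray` is hypothetical, standing in for `CuArray`/`ROCArray`):

```julia
using Adapt, LinearAlgebra

# Toy array type standing in for a GPU array (illustration only).
struct MyArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end
Base.size(A::MyArray) = size(A.data)
Base.getindex(A::MyArray, i...) = A.data[i...]

# The one rule a backend provides: how to convert raw storage.
Adapt.adapt_storage(::Type{MyArray}, A::Array) = MyArray(A)

A = rand(2, 2)
B = adapt(MyArray, A')        # Adapt recurses through the Adjoint wrapper
@assert B isa Adjoint && parent(B) isa MyArray
```

With a rule like this, `adapt` converts the parent storage while preserving the `Adjoint` wrapper, rather than failing on (or densifying) the wrapped array.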
I tried to create a small benchmark to see if `DaggerGPU.jl` can help me asynchronously move data to and from the GPU. However, it seems that the memory on the...
Closes #28.
Hello, would it be possible to add a usage example? I couldn't find one here, nor in the Dagger.jl docs. For example, let's say I have the following task: ```julia...
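In the absence of documented examples, a usage sketch might look like the following. This is untested and the scope keyword is an assumption based on Dagger.jl's scope system, not confirmed DaggerGPU documentation:

```julia
# Hypothetical sketch; requires a CUDA GPU and assumes Dagger's
# `cuda_gpu` scope keyword is available with DaggerGPU loaded.
using Dagger, DaggerGPU, CUDA

f(x) = sum(abs2, x)

x = rand(Float32, 1024)
# Restrict the task to a CUDA GPU; DaggerGPU is expected to move `x`
# to the device (as a CuArray) before `f` runs.
t = Dagger.@spawn scope=Dagger.scope(cuda_gpu=1) f(x)
fetch(t)
```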
Within Datadeps, we try to synchronize the stream of incoming arrays (which should probably be the same as the current task-local stream), but we don't necessarily synchronize back to the...
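For context, the Datadeps pattern in question looks like this on plain CPU arrays (a minimal sketch of Dagger's `spawn_datadeps` API; with GPU arrays, the stream-synchronization question above applies to the writeback of `Out`/`InOut` arguments):

```julia
using Dagger

A = rand(4, 4)
B = zeros(4, 4)

# Declare data dependencies explicitly; Dagger orders tasks accordingly
# and waits for all of them before the region returns.
Dagger.spawn_datadeps() do
    Dagger.@spawn copyto!(Out(B), In(A))
end
@assert B == A
```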