Hugo Phibbs
Is it also necessary to remove the space in the cmake expressions on lines 46-47? ``` $ $ ```
@edwinsolisf @willyborn I added both of these changes and the perf issue still remains. As a side note, this is taking 30 seconds for nothing to happen. So I...
@edwinsolisf Sure, X has shape (70_000, 784). A has shape (70_000, 10), B has shape (2048, 50). The distances array has shape (70_000, 500). `YBatch` comes out to (20, 500,...
Hi @luitjens, I'm using batches because otherwise my GPU quickly runs out of memory (I tried no batching with CuPy and this was the result). Batching is used to control...
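Batching bounds peak memory because only one slice of the intermediate arrays exists at a time. Below is a minimal CPU-only NumPy sketch of the idea (CuPy exposes a largely NumPy-compatible API, so a similar loop could run on the GPU). It assumes a hypothetical pairwise squared-distance computation; `batched_sq_distances` and `batch_size` are illustrative names, not taken from the code in this thread.

```python
import numpy as np


def batched_sq_distances(X, Y, batch_size):
    """Squared Euclidean distances between every row of X and every row of Y.

    Hypothetical example: processes batch_size rows of X at a time, so the
    temporary (batch, m, d) difference array never spans all n rows of X.
    """
    n = X.shape[0]
    out = np.empty((n, Y.shape[0]), dtype=X.dtype)
    for start in range(0, n, batch_size):
        stop = min(start + batch_size, n)
        # Broadcast (b, 1, d) - (1, m, d) -> (b, m, d); this is the large temporary.
        diff = X[start:stop, None, :] - Y[None, :, :]
        # Sum of squares over the feature axis -> (b, m) block of the output.
        out[start:stop] = np.einsum("bmd,bmd->bm", diff, diff)
    return out
```

The temporary scales with `batch_size * m * d` rather than `n * m * d`. For illustration only, with n = 70,000, m = 500, d = 784 in float32, a single unbatched difference array would need about 110 GB, whereas a batch of 1,000 rows needs about 1.6 GB.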
Hi @luitjens, thx for getting back to me. I'm timing the complete function runtime, i.e. how long it takes to run the function from start to finish. The timing...
Hi again, I've done some more testing, and I've found that the CUDA synchronise step takes the lion's share of the runtime. I added some hacky profiling to the function...
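A likely explanation worth checking: CUDA kernel launches are asynchronous with respect to the host, so a host-side timer around the launch records little time and the synchronize call absorbs the GPU's actual execution time. The sketch below is a CPU-only analogy (not CUDA code) using Python's `concurrent.futures`: `submit` returns immediately like a kernel launch, and `result()` blocks like a device synchronize. `fake_kernel` and `time_launch_and_wait` are hypothetical stand-ins.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_kernel(seconds):
    # Hypothetical stand-in for GPU work: just sleeps for the given duration.
    time.sleep(seconds)
    return seconds


def time_launch_and_wait(seconds):
    """Return (launch_time, wait_time) for one asynchronously submitted task."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        t0 = time.perf_counter()
        future = pool.submit(fake_kernel, seconds)  # returns at once, like an async launch
        t1 = time.perf_counter()
        future.result()  # blocks until the work is done, like a device synchronize
        t2 = time.perf_counter()
    return t1 - t0, t2 - t1


launch, wait = time_launch_and_wait(0.2)
print(f"launch: {launch:.4f}s  wait: {wait:.4f}s")
```

In this analogy almost all of the time shows up in the wait, not the launch. On a real GPU, per-kernel timing is better measured with CUDA events or a profiler such as nsys than inferred from where the host blocks.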
Ok thx, pls see the gist: https://gist.github.com/HugoPhibbs/a2ce2c75b70c6737f1094f32b15af3ea It contains the source files needed to run it, along with an nsys profile.
Ok thanks. Honestly, I'm a little skeptical that it would take just a fraction of a second. But yes, the error still reproduces on my machine: ```shell make repro...
Thx @cliffburdick and @luitjens. @luitjens, re `ncu`: I'm currently waiting for admin permissions to run `sudo ncu ...`; I'll send results once I can. As for upgrading CUDA,...
@luitjens Yep, I added the macro and no errors occur.