
cudaMalloc/cudaMallocManaged support

Open hazevt04 opened this issue 5 years ago • 6 comments

This is pretty awesome, as-is. Thank you so much for this class.

However, I wonder if it would be possible to update this to use cudaMalloc() OR cudaMallocManaged(). Also, in the cudaMallocManaged case, one could choose cudaMemAttachGlobal or cudaMemAttachHost.

I guess if it could handle non-managed allocation, then that would go against the name of the class, though.

hazevt04 avatar Jun 30 '20 17:06 hazevt04

Hi @hazevt04, thanks for the feedback.

I think you're right that the choice between cudaMalloc and cudaMallocManaged is probably better represented by separate allocator types. I've never looked into the difference between cudaMemAttachGlobal and cudaMemAttachHost. How do you imagine their interaction with an allocator?
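
For illustration, a minimal sketch of what separate allocator types could look like (hypothetical code, not from this repository; error handling reduced to throwing std::bad_alloc):

```cpp
#include <cstddef>
#include <new>
#include <cuda_runtime.h>

// Sketch: one allocator per allocation API. Note that std::vector needs
// host-accessible elements, so device_allocator only suits containers
// that never touch their elements on the host.
template<class T>
struct device_allocator
{
  using value_type = T;

  T* allocate(std::size_t n)
  {
    T* ptr = nullptr;
    if(cudaMalloc(&ptr, n * sizeof(T)) != cudaSuccess) throw std::bad_alloc();
    return ptr;
  }

  void deallocate(T* ptr, std::size_t) { cudaFree(ptr); }
};

template<class T>
struct managed_allocator
{
  using value_type = T;

  T* allocate(std::size_t n)
  {
    T* ptr = nullptr;
    if(cudaMallocManaged(&ptr, n * sizeof(T)) != cudaSuccess) throw std::bad_alloc();
    return ptr;
  }

  void deallocate(T* ptr, std::size_t) { cudaFree(ptr); }
};
```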

You may also be interested in the more complete work going on in this repository: https://github.com/jaredhoberock/cumem

jaredhoberock avatar Jun 30 '20 17:06 jaredhoberock

I'm developing on a Jetson Xavier (CC 7.2), which has an integrated Volta GPU that shares memory with the CPU (no cudaMemcpy() needed!). I got the idea here:

https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html

I've been following that example for my development.

I'm using a stream with managed memory and want to indicate whether the memory can be accessed by any stream ('Global') or not. It is possible to change the memory's stream association later with cudaStreamAttachMemAsync(), which then triggers prefetching of data to or from the GPU.

I have some post-processing for the data from the GPU, where I want to use some std::algorithm functions. Before I found your allocator class, I was using plain old arrays and had to copy into temporary std::vectors (yuck!) during post-processing. I wanted to use a std::vector with the managed_allocator ('man_vec') from the start, pass man_vec.data() to the kernel launches, and still be able to use the std::algorithm functions in my post-processing.
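
For reference, that workflow looks roughly like this (a hedged sketch; the kernel, sizes, and header path are hypothetical):

```cpp
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>
#include "managed_allocator.hpp" // wherever managed_allocator is defined

// Hypothetical post-processing kernel.
__global__ void postprocess(float* data, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if(i < n) data[i] *= 2.0f;
}

int main()
{
  std::vector<float, managed_allocator<float>> man_vec(1 << 20, 1.0f);
  int n = static_cast<int>(man_vec.size());

  // The managed pointer is directly usable by the kernel -- no cudaMemcpy.
  postprocess<<<(n + 255) / 256, 256>>>(man_vec.data(), n);
  cudaDeviceSynchronize();

  // ...and by host-side std::algorithm calls afterward.
  float max_val = *std::max_element(man_vec.begin(), man_vec.end());
  return max_val == 2.0f ? 0 : 1;
}
```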

Unfortunately I've run into a bit of trouble with the cudaFree() in deallocate() failing. I have a while loop copying values from a synchronized queue into a man_vec. I call man_vec.reserve(num_vals) to make sure it only allocates at the start; I don't want man_vec to allocate/free throughout the runtime, since cudaMalloc(), cudaMallocManaged(), and cudaFree() are slower than the rest of my app. I hope to figure this out today.

I noticed that the managed_allocator demo code also shows how to use Thrust, and Thrust includes some std::algorithm-like functions. Maybe I'll switch to that.

I'll check out cumem, especially the test code you have. Thanks!

hazevt04 avatar Jul 01 '20 14:07 hazevt04

Interesting, thanks for the info. Sounds like cudaMemAttachGlobal/cudaMemAttachHost could be constructor parameters for a managed_allocator. Once you've created your std::vector, do you have any reason to change the attachment during its lifetime? I ask because I don't think it's possible to mutate a container's allocator once it's been established.
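
A minimal sketch of that constructor parameter (hypothetical, not the class as it exists today):

```cpp
#include <cstddef>
#include <new>
#include <cuda_runtime.h>

template<class T>
class managed_allocator
{
  unsigned int flags_;

public:
  using value_type = T;

  // cudaMemAttachGlobal: accessible from any stream;
  // cudaMemAttachHost: initially accessible from the host only.
  explicit managed_allocator(unsigned int flags = cudaMemAttachGlobal)
    : flags_(flags) {}

  T* allocate(std::size_t n)
  {
    T* ptr = nullptr;
    if(cudaMallocManaged(&ptr, n * sizeof(T), flags_) != cudaSuccess)
      throw std::bad_alloc();
    return ptr;
  }

  void deallocate(T* ptr, std::size_t) { cudaFree(ptr); }

  // Stateful allocators should compare equal only when interchangeable.
  bool operator==(const managed_allocator& other) const { return flags_ == other.flags_; }
  bool operator!=(const managed_allocator& other) const { return !(*this == other); }
};

// Usage: a vector whose allocations are initially host-attached.
// std::vector<float, managed_allocator<float>> v{managed_allocator<float>(cudaMemAttachHost)};
```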

> Unfortunately I've run into a bit of trouble with the cudaFree() in deallocate() failing. I have a while loop copying values from a synchronized queue into a man_vec. I call man_vec.reserve(num_vals) to make sure it only allocates at the start; I don't want man_vec to allocate/free throughout the runtime, since cudaMalloc(), cudaMallocManaged(), and cudaFree() are slower than the rest of my app. I hope to figure this out today.

If you're able to get to the bottom of the failures, please let me know.

> cudaMalloc(), cudaMallocManaged(), and cudaFree() are slower than the rest of my app.

CUDA memory allocation is a notorious performance bottleneck. If you need to allocate memory during your app's lifetime, you should consider some sort of suballocation scheme.
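
As a rough illustration of the suballocation idea (a toy monotonic arena, hypothetical and deliberately simple):

```cpp
#include <cstddef>
#include <new>
#include <cuda_runtime.h>

// One upfront cudaMallocManaged() call, then pointer-bumping for every
// suballocation. Nothing is reclaimed until the arena itself is destroyed,
// which suits buffers whose lifetimes all end together.
class managed_arena
{
  char* base_ = nullptr;
  std::size_t capacity_ = 0;
  std::size_t offset_ = 0;

public:
  explicit managed_arena(std::size_t bytes) : capacity_(bytes)
  {
    if(cudaMallocManaged(&base_, bytes) != cudaSuccess) throw std::bad_alloc();
  }

  ~managed_arena() { cudaFree(base_); }

  void* allocate(std::size_t bytes, std::size_t align = 256)
  {
    std::size_t aligned = (offset_ + align - 1) / align * align;
    if(aligned + bytes > capacity_) throw std::bad_alloc();
    offset_ = aligned + bytes;
    return base_ + aligned;
  }
};
```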

jaredhoberock avatar Jul 01 '20 16:07 jaredhoberock

I don't expect the allocator to change the attachment of the memory it allocates during the lifetime of a std::vector instance. I assume that once I set the flag on the allocator for a std::vector instance, all future allocations for that instance will have that attachment. My goal is to allocate only early in the std::vector's lifetime.

I think I can use cudaStreamAttachMemAsync() to change the attachment for memory already allocated in a std::vector instance.
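
For example (a hedged sketch; `stream` and `man_vec` are as described above):

```cpp
// Associate the vector's managed memory with one stream only.
// The change takes effect once prior work in the stream has completed.
cudaStreamAttachMemAsync(stream, man_vec.data(), 0, cudaMemAttachSingle);
cudaStreamSynchronize(stream);

// ...kernel launches on `stream` using man_vec.data()...

// Later, make the memory host-accessible again.
cudaStreamAttachMemAsync(stream, man_vec.data(), 0, cudaMemAttachHost);
cudaStreamSynchronize(stream);
```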

You bring up a good point: what happens if my std::vector instance needs an allocation with one attachment after I have changed the attachment of the rest of the vector?

I have separate std::vectors with different attachments, like in the example code here:

https://docs.nvidia.com/cuda/cuda-for-tegra-appnote/index.html#pinned-memory

I wound up not calling cudaFree() in deallocate(), and instead calling cudaFree(man_vec.data()) at the same level where I create the man_vecs. (Actually, I have multiple man_vecs in a class, "fun times," and I call cudaFree(man_vec.data()) in the class's destructor.) I'm not sure this is a good solution, though. Normally the vector's destructor would trigger cudaFree() through the allocator's deallocate() (clear() alone doesn't release the storage).

Maybe I need to rethink my class structure to make sure that only the highest-level scope actually triggers cudaFree(). My best guess about the cudaFree() failure is that it was being called on memory I was still using.
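
If the failure really is a free racing with in-flight work, one option is to synchronize explicitly before the vector is destroyed. A hedged sketch, with a hypothetical owning class:

```cpp
// Hypothetical class that owns the stream and the managed vectors.
// Synchronizing in the destructor guarantees no kernel is still touching
// the managed memory when man_vec_'s destructor (and thus cudaFree) runs.
~Pipeline()
{
  cudaStreamSynchronize(stream_); // or cudaDeviceSynchronize()
  // man_vec_ is destroyed after this body finishes, so its memory is
  // released only once all work on stream_ has completed.
}
```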

hazevt04 avatar Jul 01 '20 21:07 hazevt04

This isn't really an issue per se, but a gotcha I noticed. When I launched a CUDA kernel using a vector.data() pointer for output and the kernel succeeded, vector.size() was still 0 afterward!

But the data was actually there! For example, I could access vector[N-1] (via printf or whatever) and see the expected output! Whoa.

I guess that because I didn't modify the std::vector's data via any of the std::vector's member functions (e.g. push_back(), pop_back()), the std::vector didn't know that its data had been modified by the CUDA kernel!

hazevt04 avatar Jul 08 '20 21:07 hazevt04

> I guess that because I didn't modify the std::vector's data via any of the std::vector's member functions (e.g. push_back(), pop_back()), the std::vector didn't know that its data had been modified by the CUDA kernel!

Yes, that is what happened. You should ensure that the vector's elements exist before mutating them via side effects in a CUDA kernel.
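
Concretely (a minimal sketch of the distinction; N, fill_kernel, blocks, and threads are hypothetical):

```cpp
std::vector<float, managed_allocator<float>> man_vec;

man_vec.reserve(N); // capacity() == N, but size() == 0: no elements exist yet
man_vec.resize(N);  // size() == N: elements are constructed and may be mutated

// Now it is safe for a kernel to write through the data pointer:
fill_kernel<<<blocks, threads>>>(man_vec.data(), N);
cudaDeviceSynchronize();
// man_vec.size() is N, and man_vec[N-1] is valid by the container's own rules.
```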

jaredhoberock avatar Jul 09 '20 18:07 jaredhoberock