more flexible memory handling
- You chose a managed memory scheme. This is good for entry-level use, but for real systems it would be nice to have more allocators implemented within the library (or just support the ones from RAPIDS).
- In the absence of managed memory, it would be nice to have a pinned-memory CPU variation of the tensor that handles the GPU/CPU copy. Also with allocators...
- Vector type support would be nice (uchar4...)
This sort of basic data structure is very much needed, and the lazy execution model looks compact and useful...
hi @trporn, thanks for filing this. we currently support managed memory allocations inside the tensor, but we also support custom pointers in each variant of the make_tensor functions. this would allow you to also use pinned memory if you wish. are you instead looking for a way to have the library use pinned memory, but without you allocating it first?
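for reference, the custom-pointer path looks roughly like this (just a sketch; the pinned allocation and shape here are illustrative, and it assumes the pointer-taking make_tensor overload):

#include <cuda_runtime.h>
#include <matx.h>

float *pinned = nullptr;
cudaMallocHost(&pinned, 10 * 20 * sizeof(float));      // pinned (page-locked) host allocation
auto t = matx::make_tensor<float>(pinned, {10, 20});   // tensor wraps the user-owned pointer
// ... use t ...
cudaFreeHost(pinned);                                  // the user manages the pointer's lifetime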
can you elaborate on the vector support? we have experimented with adding vector load and store support internally with elementwise kernels, but there was not a noticeable performance increase. I'm not sure if this is what you're asking for, or if you would like the actual data type of the tensor to be a vector type.
@trporn I should also mention that having vector types as input types should already work if you define all the operators needed. For example, if all you want is to add two uchar4, you could define operator+ on uchar4, and MatX will use that operator if it exists.
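as a rough sketch (the wrapping behavior on overflow is just illustrative, and the expression below assumes the usual lazy run() model):

#include <cuda_runtime.h>
#include <matx.h>

// elementwise add on each component; adjust if you need saturation instead of wrapping
__host__ __device__ inline uchar4 operator+(const uchar4 &a, const uchar4 &b) {
  return make_uchar4(static_cast<unsigned char>(a.x + b.x),
                     static_cast<unsigned char>(a.y + b.y),
                     static_cast<unsigned char>(a.z + b.z),
                     static_cast<unsigned char>(a.w + b.w));
}

// with that defined, an elementwise add of two uchar4 tensors should compile:
auto a = matx::make_tensor<uchar4>({1024});
auto b = matx::make_tensor<uchar4>({1024});
auto c = matx::make_tensor<uchar4>({1024});
(c = a + b).run();   // lazy expression, executed on run()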
If MatX can be instantiated with vector types then this is good enough, thanks! It was not clear from the type list in the docs. Of course the users have to implement the operators... These types are useful not just for performance but for their inherent layout properties (for example, the RGBA format of images).
As for the allocators, I would expect the library to come with a few useful ones (cached device/pinned, per thread/stream, to start with). RAPIDS made an effort to introduce a uniform interface across a few domains; see https://github.com/rapidsai/rmm. I have been doing CUDA projects for a long time, and the first thing a client often needs is a container that supports configurable allocators with ease.
thanks for clarifying. we can add it to the backlog to implement the vector operators for all standard CUDA types. there may be some limitations, since these are unlikely to work in backend libraries like cuBLAS. for example, if you transpose a matrix of float4 as a view, we cannot currently tell cuBLAS about the new layout since it doesn't support these types.
for allocators, we did start to add a custom allocator interface in #48. we paused the work because there's concurrent work going on in libcudacxx to add a standardized custom allocator interface, similar to RMM. adopting RMM's PMR interface alone doesn't give the user enough control, since we need an object that carries both a stream and a PMR interface, rather than just having the stream be a parameter to allocate and deallocate. we will continue this work once that gets a bit more finalized.
however, if all you're asking for is to have the internal memory type be different (pinned), that's certainly not hard to do and doesn't require a custom allocator interface.
transposing a float4 is the same as transposing a cuDoubleComplex, because only the element size matters, not the content, but I see your general point. thanks for the clarifications.
That's right, but as you pointed out, all of these types are in the CUDA C headers, not C++, and none of them have operator overloads defined. It would be trivial to add most of them, but for complex types you'd still need to stick with cuda::std::complex. The reason is that calling cuda::std::abs on a float4 should give you the absolute value of each of the four elements independently, while calling it on a complex number should give you the complex magnitude back. The latter case actually changes the output data type to a real, while the former doesn't.
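To make the asymmetry concrete (the float4 overload here is purely hypothetical):

#include <cuda/std/complex>
#include <cuda_runtime.h>
#include <cmath>

// hypothetical elementwise overload: float4 in, float4 out (type preserved)
__host__ __device__ inline float4 abs(const float4 &v) {
  return make_float4(fabsf(v.x), fabsf(v.y), fabsf(v.z), fabsf(v.w));
}

// complex magnitude: complex<float> in, float out (type changes)
// cuda::std::abs(cuda::std::complex<float>{3.f, 4.f}) == 5.f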
@trporn just to add a bit more detail on the allocator question, you can see here that we can take a custom allocator: https://github.com/NVIDIA/MatX/blob/main/include/matx_storage.h#L48
You'll also notice that all of the make_ functions just use matx_allocator for now: https://github.com/NVIDIA/MatX/blob/main/include/matx_make.h#L215
This could easily be adapted to use whatever allocator you want as another default template parameter. The problem is that a PMR-style interface (like RMM uses) takes a stream as a parameter to allocate and deallocate:
https://github.com/rapidsai/rmm/blob/branch-22.04/include/rmm/mr/device/polymorphic_allocator.hpp#L82
This is an extension of the std PMR interface, since they obviously don't take streams: https://en.cppreference.com/w/cpp/memory/polymorphic_allocator/allocate
This is fine if we're only calling allocate once at construction, but that's not the case in general. We have situations where we need to do small asynchronous memory allocations as part of normal operation (padded FFTs, for example). To be stream-ordered, we would need the allocator object to encapsulate the stream itself rather than just take it as a parameter to allocate. This can, of course, be done already by wrapping RMM or any other allocator in a type with a stream member, but this hasn't been standardized within the CUDA libraries as far as I'm aware. We didn't want to jump the gun and implement something that may later change once there's a standard interface we can use.
In the meantime, you can still provide your own custom allocator to MatX with something like:
#include <cuda_runtime.h>
// Sketch only: the exact signatures should mirror the allocator interface in matx_storage.h.
// The stream is captured in the allocator object rather than passed to allocate/deallocate.
class my_allocator {
public:
  explicit my_allocator(cudaStream_t stream) : stream_(stream) {}
  void *allocate(size_t bytes) {
    void *ptr = nullptr;
    cudaMallocAsync(&ptr, bytes, stream_);   // allocate using the wrapped stream
    return ptr;
  }
  void deallocate(void *ptr, size_t) { cudaFreeAsync(ptr, stream_); }   // free on the same stream
private:
  cudaStream_t stream_;
};
I believe this will get you what you want, albeit with a bit more work than it should require in the future.
Thank you for the detailed reply. I will give it a try in an upcoming project.
Closing this for now unless @trporn thinks it needs to be reopened.