
Redefine transform_reduce's scratch & result memory

AidanBeltonS opened this pull request · 3 comments

Rationale

Algorithms built on transform_reduce (std::reduce, std::max_element, etc.) return a single-value result. Except for the smallest inputs, they also require intermediate scratch memory on the device to hold partial results.
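
For context, the call pattern under discussion looks roughly like this; a minimal sketch using a oneDPL device policy (the queue setup and input size are illustrative, not from the PR):

```cpp
#include <cstddef>
#include <oneapi/dpl/execution>
#include <oneapi/dpl/numeric>
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue q;
    const std::size_t n = 1 << 20;
    float* data = sycl::malloc_device<float>(n, q);
    q.fill(data, 1.0f, n).wait();

    // The returned float is the single-value result; the partial sums
    // computed on the device live in the intermediate scratch memory
    // whose layout this PR changes.
    float sum = oneapi::dpl::reduce(oneapi::dpl::execution::make_device_policy(q),
                                    data, data + n, 0.0f);

    sycl::free(data, q);
    return sum == static_cast<float>(n) ? 0 : 1;
}
```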

For the Level Zero (L0) backend, it is faster to make a 1-element USM host allocation and have the device write the final reduction result directly to it.

For Nvidia, host USM is expensive, so it is faster to use a single USM device allocation for both the result and the intermediate scratch.
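
To make the two layouts concrete, here is a plain-SYCL sketch of the allocation patterns (the partial-result count and the fill are illustrative stand-ins, not values from the PR):

```cpp
#include <cstddef>
#include <sycl/sycl.hpp>

int main()
{
    sycl::queue q;
    const std::size_t n_partials = 256; // number of partial results (assumed)

    // Strategy A (Level Zero): device USM for intermediate scratch plus a
    // 1-element host USM allocation; the final reduction kernel writes its
    // result directly into host-visible memory, so no copy is needed.
    float* scratch = sycl::malloc_device<float>(n_partials, q);
    float* result_host = sycl::malloc_host<float>(1, q);

    // Strategy B (e.g. Nvidia, where host USM is expensive): one device USM
    // allocation holds both the scratch and the result slot; a single
    // device-to-host memcpy retrieves the final value afterwards.
    float* combined = sycl::malloc_device<float>(n_partials + 1, q);
    q.fill(combined, 0.0f, n_partials + 1).wait(); // stand-in for real kernels
    float result = 0.0f;
    q.memcpy(&result, combined + n_partials, sizeof(float)).wait();

    sycl::free(scratch, q);
    sycl::free(result_host, q);
    sycl::free(combined, q);
    return static_cast<int>(result);
}
```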

Approach

This PR combines the two approaches into a struct, __usm_host_or_unified_storage, based on the previous __usm_host_or_buffer_storage. When host USM is supported and the backend is L0, the struct holds two allocations: device USM for the intermediate scratch and host USM for the final result. In all other cases, it holds a single device USM allocation large enough for both the intermediate scratch and the final result; a device-to-host memcpy is then needed to return the final result.
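
A minimal sketch of what such a struct could look like; the name and members mirror the description above but are illustrative, not the actual oneDPL code:

```cpp
#include <cstddef>
#include <sycl/sycl.hpp>
#include <utility>

template <typename T>
struct usm_host_or_unified_storage_sketch {
    sycl::queue q;
    T* scratch = nullptr;       // device USM for intermediate partial results
    T* result = nullptr;        // where the final reduction result is written
    bool separate_host_result;  // true when host USM is cheap (e.g. L0 backend)

    usm_host_or_unified_storage_sketch(sycl::queue queue, std::size_t scratch_n,
                                       bool use_host_result)
        : q(std::move(queue)), separate_host_result(use_host_result)
    {
        if (separate_host_result) {
            // Two allocations: the kernel writes the result straight to host USM.
            scratch = sycl::malloc_device<T>(scratch_n, q);
            result = sycl::malloc_host<T>(1, q);
        } else {
            // One device allocation sized for scratch plus the result slot.
            scratch = sycl::malloc_device<T>(scratch_n + 1, q);
            result = scratch + scratch_n;
        }
    }

    // Assumes the submitted kernels have completed. Host USM can be read
    // directly; otherwise one device-to-host memcpy is needed.
    T get_result() {
        if (separate_host_result)
            return *result;
        T value;
        q.memcpy(&value, result, sizeof(T)).wait();
        return value;
    }

    ~usm_host_or_unified_storage_sketch() {
        sycl::free(scratch, q);
        if (separate_host_result)
            sycl::free(result, q);
    }
};
```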

AidanBeltonS · Jan 18 '24

I copied my feedback from the original PR (AidanBeltonS#30 (comment)) since it is still relevant:

I think this is a good addition. We had a similar approach in an older PR #1106 that was superseded by the unified-memory design currently in mainline. Tests on this PR showed reduced host overheads, especially for small input arrays.

An issue I see with this approach is compatibility for devices without USM memory support. This could be solved by adding a third case that uses the existing buffer-based approach as a fallback.

This is an interesting problem. The current approach passes memory as pointers; adding a buffer option would complicate it, because you would have to query whether the device supports USM allocations (a runtime check) and then launch either a pointer or a buffer version of the kernel. That would require two instantiations of each kernel to support both memory types at runtime.

Do you have some existing infrastructure in oneDPL to handle these kinds of cases or propose another solution?
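
To illustrate the runtime dispatch this would entail, here is a hedged sketch using the standard SYCL 2020 aspect query; the function name is hypothetical and the kernel bodies are placeholders standing in for the real reduction work:

```cpp
#include <cstddef>
#include <sycl/sycl.hpp>

template <typename T>
void run_reduce_step(sycl::queue& q, sycl::buffer<T>& fallback_buf, std::size_t n)
{
    // Runtime check: does this device support USM device allocations?
    if (q.get_device().has(sycl::aspect::usm_device_allocations)) {
        // Pointer path. This kernel is instantiated at compile time even
        // for devices that will never take this branch.
        T* scratch = sycl::malloc_device<T>(n, q);
        q.submit([&](sycl::handler& h) {
            h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
                scratch[i] = T{}; // placeholder for the real reduction step
            });
        }).wait();
        sycl::free(scratch, q);
    } else {
        // Buffer fallback path: a second instantiation of essentially the
        // same kernel, this time writing through an accessor.
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc{fallback_buf, h, sycl::write_only};
            h.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
                acc[i] = T{}; // placeholder for the real reduction step
            });
        }).wait();
    }
}
```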

AidanBeltonS · Feb 06 '24

Could we use __usm_host_or_buffer_accessor instead of the current pointer solution?

julianmi · Feb 06 '24

@AidanBeltonS Please rebase this since https://github.com/oneapi-src/oneDPL/pull/1410 has been merged. I think we can merge this after the rebase and testing.

julianmi · Mar 15 '24