cuCollections
cuCollections copied to clipboard
[FEATURE]: Add multiset host-bulk retrieve APIs
Is your feature request related to a problem? Please describe.
Add multiset host-bulk retrieve APIs
Describe the solution you'd like
The basic API to add:
/**
* @brief Retrieves the matched key in the multiset corresponding to all probe keys in the range
* `[first, last)`
*
* For key `k = *(first + i)` has matches in the multiset, copies `k` and each of its associated values to
* unspecified locations in `output_probe + *` and `output_match + *`. Else, does nothing.
*
* @note This function synchronizes the given stream. For asynchronous execution use
* `retrieve_async`.
* @note Behavior is undefined if the size of the output range exceeds
* `std::distance(output_*_begin, output_*_end)`.
*
* @tparam InputIt Device accessible input iterator
* @tparam OutputIt1 Device accessible output iterator whose `value_type` can be constructed from
* probe key type
* @tparam OutputIt2 Device accessible output iterator whose `value_type` can be constructed from
* `Key`
*
* @param first Beginning of the sequence of probe keys
* @param last End of the sequence of probe keys
* @param output_probe Beginning of the sequence of the probe keys that have a match
* @param output_match Beginning of the sequence of the matched keys
* @param stream CUDA stream used to retrieve
*
* @return Pair of the iterators indicating the last elements in the output
*/
template <typename InputIt, typename OutputIt1, typename OutputIt2>
std::pair<OutputIt1, OutputIt2> retrieve(InputIt first,
InputIt last,
OutputIt1 output_probe,
OutputIt2 output_match,
cuda_stream_ref stream = {}) const;
Two APIs desired by libcudf:
/**
* @brief Asynchronously retrieves the matched key in the set corresponding to all probe keys in
* the range `[first, last)`
*
* For key `k = *(first + i)` has matches in the multiset, copies `k` and each of its associated values to
* unspecified locations in `output_probe + *` and `output_match + *`. Else, does nothing.
*
* @note Behavior is undefined if the size of the output range exceeds
* `std::distance(output_*_begin, output_*_end)`.
*
* @tparam InputIt Device accessible input iterator
* @tparam OutputIt1 Device accessible output iterator whose `value_type` can be constructed from
* probe key type
* @tparam OutputIt2 Device accessible output iterator whose `value_type` can be constructed from
* `Key`
* @tparam ProbeKeyEqual Binary callable equal type
* @tparam ProbeHash Unary callable hasher type that can be constructed from
* an integer value
*
* @param first Beginning of the sequence of probe keys
* @param last End of the sequence of probe keys
* @param probe_key_equal The binary function to compare set keys and probe keys for equality
* @param probe_hash The unary function to hash probe keys
* @param output_probe Beginning of the sequence of the probe keys that have a match
* @param output_match Beginning of the sequence of the matched keys
* @param stream CUDA stream used to retrieve
*
* @return Pair of the iterators indicating the last elements in the output
*/
template <typename InputIt, typename OutputIt, typename ProbeKeyEqual, typename ProbeHash>
std::pair<OutputIt1, OutputIt2> retrieve(InputIt first,
InputIt last,
ProbeKeyEqual const& probe_key_equal,
ProbeHash const& probe_hash,
OutputIt1 output_probe,
OutputIt2 output_match,
cuda_stream_ref stream = {}) const;
/**
* @brief Asynchronously retrieves the matched key in the set corresponding to all probe keys in
* the range `[first, last)`
*
* For key `k = *(first + i)` has matches in the multiset, copies `k` and each of its associated
* values to unspecified locations in `output_probe + *` and `output_match + *`. Otherwise,
* copies `key` and `empty_key_sentinel`.
*
* @note Behavior is undefined if the size of the output range exceeds
* `std::distance(output_*_begin, output_*_end)`.
*
* @tparam InputIt Device accessible input iterator
* @tparam OutputIt1 Device accessible output iterator whose `value_type` can be constructed from
* probe key type
* @tparam OutputIt2 Device accessible output iterator whose `value_type` can be constructed from
* `Key`
* @tparam ProbeKeyEqual Binary callable equal type
* @tparam ProbeHash Unary callable hasher type that can be constructed from
* an integer value
*
* @param first Beginning of the sequence of probe keys
* @param last End of the sequence of probe keys
* @param probe_key_equal The binary function to compare set keys and probe keys for equality
* @param probe_hash The unary function to hash probe keys
* @param output_probe Beginning of the sequence of the probe keys that have a match
* @param output_match Beginning of the sequence of the matched keys
* @param stream CUDA stream used to retrieve
*
* @return Pair of the iterators indicating the last elements in the output
*/
template <typename InputIt, typename OutputIt, typename ProbeKeyEqual, typename ProbeHash>
std::pair<OutputIt1, OutputIt2> retrieve_outer(InputIt first,
InputIt last,
ProbeKeyEqual const& probe_key_equal,
ProbeHash const& probe_hash,
OutputIt1 output_probe,
OutputIt2 output_match,
cuda_stream_ref stream = {}) const;
Describe alternatives you've considered
- Can the
_outervariant be represented in a more natural expression?
Can we use cub::DeviceSelect in this case?
Additional context
-
retrieveAPIs are always synchronous - The most challenging PR requires extensive profiling and benchmarking work: launching parameter (choosing optimal block size, shared memory sizes, etc), buffer size, probing/flushing CG sizes to tune
The return pair should be std::pair instead of cuda::std::pair since they host APIs.