cuCollections icon indicating copy to clipboard operation
cuCollections copied to clipboard

[FEATURE]: Add multiset host-bulk retrieve APIs

Open PointKernel opened this issue 1 year ago • 1 comments

Is your feature request related to a problem? Please describe.

Add multiset host-bulk retrieve APIs

Describe the solution you'd like

The basic API to add:

  /**
   * @brief Retrieves the matched key in the multiset corresponding to all probe keys in the range
   * `[first, last)`
   *
   * For key `k = *(first + i)` has matches in the multiset, copies `k` and each of its associated values to
   * unspecified locations in `output_probe + *` and `output_match + *`. Else, does nothing.
   *
   * @note This function synchronizes the given stream. For asynchronous execution use
   * `retrieve_async`.
   * @note Behavior is undefined if the size of the output range exceeds
   * `std::distance(output_*_begin, output_*_end)`.
   *
   * @tparam InputIt Device accessible input iterator
   * @tparam OutputIt1 Device accessible output iterator whose `value_type` can be constructed from
   * probe key type
   * @tparam OutputIt2 Device accessible output iterator whose `value_type` can be constructed from
   * `Key`
   *
   * @param first Beginning of the sequence of probe keys
   * @param last End of the sequence of probe keys
   * @param output_probe Beginning of the sequence of the probe keys that have a match
   * @param output_match Beginning of the sequence of the matched keys
   * @param stream CUDA stream used to retrieve
   *
   * @return Pair of the iterators indicating the last elements in the output
   */
  template <typename InputIt, typename OutputIt1, typename OutputIt2>
  std::pair<OutputIt1, OutputIt2> retrieve(InputIt first,
                                           InputIt last,
                                           OutputIt1 output_probe,
                                           OutputIt2 output_match,
                                           cuda_stream_ref stream = {}) const;

Two APIs desired by libcudf:

  /**
   * @brief Asynchronously retrieves the matched key in the set corresponding to all probe keys in
   * the range `[first, last)`
   *
   * For key `k = *(first + i)` has matches in the multiset, copies `k` and each of its associated values to
   * unspecified locations in `output_probe + *` and `output_match + *`. Else, does nothing.
   *
   * @note Behavior is undefined if the size of the output range exceeds
   * `std::distance(output_*_begin, output_*_end)`.
   *
   * @tparam InputIt Device accessible input iterator
   * @tparam OutputIt1 Device accessible output iterator whose `value_type` can be constructed from
   * probe key type
   * @tparam OutputIt2 Device accessible output iterator whose `value_type` can be constructed from
   * `Key`
   * @tparam ProbeKeyEqual Binary callable equal type
   * @tparam ProbeHash Unary callable hasher type that can be constructed from
   * an integer value
   *
   * @param first Beginning of the sequence of probe keys
   * @param last End of the sequence of probe keys
   * @param probe_key_equal The binary function to compare set keys and probe keys for equality
   * @param probe_hash The unary function to hash probe keys
   * @param output_probe Beginning of the sequence of the probe keys that have a match
   * @param output_match Beginning of the sequence of the matched keys
   * @param stream CUDA stream used to retrieve
   *
   * @return Pair of the iterators indicating the last elements in the output
   */
  template <typename InputIt, typename OutputIt, typename ProbeKeyEqual, typename ProbeHash>
  std::pair<OutputIt1, OutputIt2>  retrieve(InputIt first,
                                            InputIt last,
                                            ProbeKeyEqual const& probe_key_equal,
                                            ProbeHash const& probe_hash,
                                            OutputIt1 output_probe,
                                            OutputIt2 output_match,
                                            cuda_stream_ref stream = {}) const;

  /**
   * @brief Asynchronously retrieves the matched key in the set corresponding to all probe keys in
   * the range `[first, last)`
   *
   * For key `k = *(first + i)` has matches in the multiset, copies `k` and each of its associated
   * values to unspecified locations in `output_probe + *` and `output_match + *`. Otherwise,
   * copies `key` and `empty_key_sentinel`.
   *
   * @note Behavior is undefined if the size of the output range exceeds
   * `std::distance(output_*_begin, output_*_end)`.
   *
   * @tparam InputIt Device accessible input iterator
   * @tparam OutputIt1 Device accessible output iterator whose `value_type` can be constructed from
   * probe key type
   * @tparam OutputIt2 Device accessible output iterator whose `value_type` can be constructed from
   * `Key`
   * @tparam ProbeKeyEqual Binary callable equal type
   * @tparam ProbeHash Unary callable hasher type that can be constructed from
   * an integer value
   *
   * @param first Beginning of the sequence of probe keys
   * @param last End of the sequence of probe keys
   * @param probe_key_equal The binary function to compare set keys and probe keys for equality
   * @param probe_hash The unary function to hash probe keys
   * @param output_probe Beginning of the sequence of the probe keys that have a match
   * @param output_match Beginning of the sequence of the matched keys
   * @param stream CUDA stream used to retrieve
   *
   * @return Pair of the iterators indicating the last elements in the output
   */
  template <typename InputIt, typename OutputIt, typename ProbeKeyEqual, typename ProbeHash>
  std::pair<OutputIt1, OutputIt2>  retrieve_outer(InputIt first,
                                                  InputIt last,
                                                  ProbeKeyEqual const& probe_key_equal,
                                                  ProbeHash const& probe_hash,
                                                  OutputIt1 output_probe,
                                                  OutputIt2 output_match,
                                                  cuda_stream_ref stream = {}) const;

Describe alternatives you've considered

  • Can the _outer variant be represented in a more natural expression?

Can we use cub::DeviceSelect in this case?

Additional context

  • retrieve APIs are always synchronous
  • The most challenging PR requires extensive profiling and benchmarking work: launching parameter (choosing optimal block size, shared memory sizes, etc), buffer size, probing/flushing CG sizes to tune

PointKernel avatar Apr 17 '24 20:04 PointKernel

The return pair should be std::pair instead of cuda::std::pair since they host APIs.

PointKernel avatar Apr 17 '24 20:04 PointKernel