cub icon indicating copy to clipboard operation
cub copied to clipboard

Abstract the decoupled-lookback pattern into a maintainable, stable, public API

Open cwharris opened this issue 5 years ago • 9 comments

Prefix sums are incredibly useful, and CUB provides both inclusive and exclusive variants of device-wide and block-wide public APIs for these tools. However, the need for device-wide scans is not limited to basic inclusive and exclusive scans. Other algorithms, such as syntax parsing, regex matching/finite state machines, copying ranges of inputs dependent up some state (ascending/descending), etc, would also benefit from device-wide scan building blocks. Some, if not all, of these use cases can be handled using the current public facing device-wide APIs. However, consuming these APIs all but requires the consumer to allocate N * sizeof(OutputT) bytes for N inputs, which is potentially a very large amount of memory.

The block-level scan APIs (BlockScan) already have a powerful callback parameter which can be (and is) used to implement device-wide scans. However, utilizing this callback is non-trivial, at best. Internally, CUB has an excellent implementation of this callback, and an associated tile state type which makes using the callback variants of BlockScan's APIs very easy. It would be, at the very least, nice to have building blocks such as these on the public facing APIs, as utilizing multiple device-level scans can consume large amounts of memory.

To be specific, public-API versions of TilePrefixCallbackOp and ScanTileState would be very useful.

cwharris avatar Oct 27 '20 23:10 cwharris

My concern is that these are implementation details for how the current scan algorithms are written. If we make them part of CUB's public API and our implementation changes, we'd be committed to maintaining them or putting them through deprecation. Deprecating wouldn't be too much of a burden for us, but still -- exposing our implementation details as public API doesn't seem like the right move here.

There's also the issue of where they'd live. We don't have an existing way to expose these sorts of algorithm-specific helpers, so that'd take some extra consideration.

So it's not impossible, but it's unusual and wouldn't be a straightforward fit into CUB's public APIs.

Alternative idea -- these utilities are about ~500 lines of code combined and only use public APIs. Why not just include a copy of them with your algorithm? That way you won't be tied to our implementation details, and you won't have to do anything if we need to change them.

alliepiper avatar Oct 28 '20 02:10 alliepiper

Copy-paste isn't ideal, but sgtm. :)

cwharris avatar Oct 28 '20 03:10 cwharris

Not sure if I understand this correctly. But if this is about abstracting the mechanism of the single pass prefix scan [1] and exposing the functionality as a generic function, then I would agree that this indeed would be a useful building block. I have had the need for this a few times before as well.

Basically, it would present itself in many cases as a more efficient mechanism to "forward-propagate" information, all in a single pass, rather than materializing intermediate results (the prefix scan results), just due to lacking this functionality being provided.

[1] Single-pass Parallel Prefix Scan with Decoupled Look-back

elstehle avatar Oct 28 '20 07:10 elstehle

A generic interface for this sort of thing would be useful, as the decoupled lookback technique is generally useful.

cc @canonizer, who recently used this technique in #204 and may have some input.

Let's discuss what that API would look like, if that's something ya'll would like to pursue. I can't fit this into my current workload, but I'd be happy to discuss it and land a PR.

alliepiper avatar Oct 28 '20 16:10 alliepiper

In my case, the decoupled look-back implementation is different in at least 2 aspects from the generic one.

  • Instead of a single decoupled look-back, a large number of decoupled look-backs (currently 256) are performed, which are distributed among threads in the block.
  • My implementation uses 2 higher-order bits to indicate the value status (not ready / local sum ready / global sum ready), and as a result, doesn't cover the full range of values. However, it is fine for the sorting use case (multiple kernels are used if the input is too large, so there is no overflow), and may also be useful in other applications if the prefix sum is used for element counts.

Currently, my decoupled look-back implementation is part of the onesweep kernel agent. However, I'm considering making it standalone (within CUB).

canonizer avatar Oct 28 '20 17:10 canonizer

However, I'm considering making it standalone (within CUB).

This would be cool.

Do you think the existing decoupled-lookback scan implementation would benefit from these changes?

alliepiper avatar Oct 28 '20 18:10 alliepiper

No, I don't expect much benefit for the case where only a single decoupled look-back is performed.

canonizer avatar Oct 28 '20 21:10 canonizer

Probably the easiest would be to just expose TilePrefixCallbackOp and ScanTileState and provide documentation and examples on how to use them together with the BlockScan; then keep BlockScan compatible with it. I believe their usage is likely already as simple as it can be and I don't see a simpler way to support generic kernel fusion without more invasive changes to CUB. I could maybe take this on during the holidays towards the end of the year.

@canonizer do you have any insights from the sort? E.g., could this be extended to cover your use case as well? I'll link @cwharris's PR to illustrate example usage: https://github.com/rapidsai/cudf/pull/6593/files#diff-078c8ec60b1b0477dfebb86e00565fdcba5b56095aeb1cdf54337c49e3d84a2cR99-R124

elstehle avatar Nov 27 '20 06:11 elstehle

At the very least, my implementation could use a similar interface.

Regarding using TilePrefixCallbackOp and ScanTileState (or versions thereof) in my case, it's possible provided that it doesn't negatively affect performance.

As described in one of the previous comments, there are 2 differences in my implementation of decoupled look-back:

  • supporting many independent decoupled look-backs per thread block and distributing them among threads;
  • using 2 bits of a 32-bit integer to indicate the status, with 30 bits used for the value.

canonizer avatar Nov 27 '20 22:11 canonizer

#699 and #700 may be of interest to folks in this conversation.

(@miscco, is your assignment here leftover from the triage marathon, or are you working on this?)

alliepiper avatar May 24 '23 14:05 alliepiper

That is definitely a leftover

miscco avatar May 24 '23 14:05 miscco

@cwharris the API is public now, please take a look at the example.

gevtushenko avatar May 26 '23 05:05 gevtushenko