
Auto-tuner infrastructure: maximal use of compute units

hgaspar opened this issue 4 years ago · 5 comments

Different SKUs have different numbers of compute units. It is highly desirable to fully utilize the compute hardware by ensuring that each compute unit is occupied during the execution of an inference pipeline. To that end, we need to ensure that we launch at least as many workgroups as there are compute units. Moreover, depending on the compute/memory demands of the kernels, it is often desirable to have more than one (typically 2 to 4) wavefronts in flight on a compute unit. Furthermore, each workgroup should launch at least as many threads as the hw wavefront size (typically 64, but there are Navi SKUs that can be configured to a WF size of 32). For example, on a SKU with 120 CUs and WF size 64, we would want at least 120 workgroups of at least 64 threads each, and ideally 240-480 workgroups.

Therefore:

Either (a) compile on the target system and query its number of CUs, or (b) if cross-compiling, set the target number of CUs on the command line.

Either way, assume that the following is known: the number of CUs and the WF size.
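
For option (a), HIP already exposes both values at runtime; a minimal sketch:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

int main()
{
    hipDeviceProp_t props{};
    hipGetDeviceProperties(&props, /*deviceId=*/0);

    // Compute-unit count and hardware wavefront size for this device
    // (WF is 64 on CDNA; 32 or 64 on RDNA/Navi).
    std::printf("CUs: %d, WF size: %d\n",
                props.multiProcessorCount, props.warpSize);
    return 0;
}
```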

Case I: linear pipeline: In this case, we can launch a single stream, and the kernels enqueued in that stream should spread out and occupy all the CUs. Therefore the number of workgroups should be an integer multiple of the number of CUs, and each WG should launch a number of threads that is an integer multiple of the WF size.
The two integer multipliers specified above are the per-layer tuning parameters of this problem.
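
A sketch of that rule, with illustrative names x and p for the two multipliers (not MIGraphX API):

```cpp
// Case I launch configuration:
// x = integer multiple of CUs for the grid,
// p = integer multiple of the WF size for the workgroup;
// both are tuned per layer.
struct launch_dims
{
    int num_wgs; // numWG  = x * numCUs
    int wg_size; // sizeWG = p * WF
};

launch_dims case1_dims(int num_cus, int wf_size, int x, int p)
{
    return {x * num_cus, p * wf_size};
}
```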

Case II: pipeline with independent branches:

consider:

```
              +----> ker(B) ----+
ker(A) ------>+                 +----> merge
              +----> ker(C) ----+
```

where ker(B) and ker(C) can be launched independently, and assume that they both could be launched simultaneously, e.g. on different streams once ker(A) completes. However, what is optimal will depend on many factors, including the relative compute and memory requirements of the B and C branches, and the eventual merge point downstream, which will perform best if its inputs are in cache.

In that case, the following possibilities should be considered while autotuning:

1a. ker(B) and ker(C) concurrently on different streams: numWG(B)/x + numWG(C)/y = numCUs
1b. First ker(B), then ker(C): numWG(B) = x * numCUs, numWG(C) = y * numCUs
1c. First ker(C), then ker(B): numWG(B) = x * numCUs, numWG(C) = y * numCUs

where, as always, sizeWG(B) = p * WF and sizeWG(C) = q * WF,

and x, y, p, q above are the autotuning parameters we need to optimize for.
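
A minimal sketch of the resulting search space (the enum/struct names are illustrative, not MIGraphX API):

```cpp
#include <initializer_list>
#include <vector>

// One point in the tuning space for the two-branch case above.
enum class schedule { concurrent_streams, b_then_c, c_then_b }; // 1a, 1b, 1c

struct tuning_point
{
    schedule order; // which of 1a/1b/1c
    int x, y;       // workgroup waves per CU for ker(B) and ker(C)
    int p, q;       // sizeWG(B) = p * WF, sizeWG(C) = q * WF
};

// Enumerate a small grid around the heuristic defaults for the
// autotuner to benchmark.
std::vector<tuning_point> candidate_points()
{
    std::vector<tuning_point> pts;
    for(auto s : {schedule::concurrent_streams, schedule::b_then_c, schedule::c_then_b})
        for(int x = 1; x <= 4; ++x)
            for(int y = 1; y <= 4; ++y)
                for(int p = 1; p <= 4; ++p)
                    for(int q = 1; q <= 4; ++q)
                        pts.push_back({s, x, y, p, q});
    return pts;
}
```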

The point is that we need to keep all the CUs busy across all kernels that can co-execute.

Note that 1a, 1b and 1c are NOT equivalent! When the execution branches merge, the optimum is likely attained when the merge kernel's input data are still in cache. This could be a powerful heuristic.

Another powerful heuristic is to benchmark the compute and memory requirements of a WF of each of the kernels. That would give a very strong hint about what the autotuning parameters should be, and then autotuning can just search around that theoretical configuration. For example:

If ker(B) and ker(C) act on the same data and produce same-sized outputs that are immediately merged, but ker(B) takes 2x as long as ker(C), then we should likely launch them on different streams and give ker(B) 2x as many CUs as ker(C), so that they both finish at the same time.
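
Concretely, for that 2x example (illustrative numbers only):

```cpp
#include <cstdio>

int main()
{
    // ker(B) costs twice as much per wavefront as ker(C),
    // so give it two thirds of the CUs.
    int num_cus   = 120;                 // e.g. queried as shown earlier
    int cus_for_b = 2 * num_cus / 3;     // 80 CUs -> numWG(B) = x * 80
    int cus_for_c = num_cus - cus_for_b; // 40 CUs -> numWG(C) = y * 40
    std::printf("B: %d CUs, C: %d CUs\n", cus_for_b, cus_for_c);
    // With equal waves per CU (x == y), both branches finish together.
    return 0;
}
```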

hgaspar · Nov 22 '21

@lakhinderwalia, work with @hgaspar to determine if this is still an issue and, now that it is 3 years old, whether additional insights have been learned.

causten · Apr 15 '24

Time flies! Yes, hopefully we can close it soon!

hgaspar · Apr 26 '24

Based on an email thread from @hgaspar, this issue could include the collection of data w.r.t. LDS usage, occupancy, register spillage, WF/WG sizes, etc. Thanks.
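
A minimal HIP sketch of gathering that data for a kernel (shown with the runtime-API queries; since MIGraphX JIT-compiles its kernels, the equivalent module-based queries would apply there):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void my_kernel() {} // stand-in for the kernel under test

// Report per-kernel LDS, register and spill usage, plus how many
// workgroups of a given size fit per CU (occupancy).
void report(const void* kernel, int wg_size)
{
    hipFuncAttributes attr{};
    hipFuncGetAttributes(&attr, kernel);

    int wgs_per_cu = 0; // resident workgroups per CU at this WG size
    hipOccupancyMaxActiveBlocksPerMultiprocessor(&wgs_per_cu, kernel, wg_size, 0);

    std::printf("LDS: %zu B, regs/thread: %d, spill: %zu B, WGs/CU @ %d threads: %d\n",
                attr.sharedSizeBytes, attr.numRegs, attr.localSizeBytes,
                wg_size, wgs_per_cu);
}

int main()
{
    report(reinterpret_cast<const void*>(my_kernel), 256);
    return 0;
}
```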

lakhinderwalia · Jun 11 '24

While one of the PRs above might fix the default WG size of the Layernorm kernel, it is important to also enhance our general approach to allow a non-hard-coded WG size, since no single WG size fits all tensor shapes, i.e. {type, dimensions[]}. (A sketch of one possible approach follows the list below.)

  • It shouldn't be a large code-change to make the WG size a tunable parameter of a kernel.
  • Tuning shouldn't be repeated unless shapes change.
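
A minimal sketch of both points, assuming a hypothetical shape-keyed tuning cache (names are illustrative, not the MIGraphX API):

```cpp
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical shape key: {type, dimensions[]}.
struct shape_key
{
    int type;                      // element type id
    std::vector<std::size_t> dims; // tensor dimensions
    bool operator<(const shape_key& o) const
    {
        return std::tie(type, dims) < std::tie(o.type, o.dims);
    }
};

struct wg_config
{
    int wg_size; // tuned workgroup size, a multiple of the WF size
};

// Tune once per distinct shape; later lookups reuse the cached result,
// so tuning is not repeated unless the shape changes.
class tuning_cache
{
    std::map<shape_key, wg_config> cache;

    public:
    wg_config get(const shape_key& k)
    {
        auto it = cache.find(k);
        if(it != cache.end())
            return it->second;
        wg_config c = tune(k); // run the autotuner for this shape
        cache[k]    = c;
        return c;
    }

    private:
    static wg_config tune(const shape_key&)
    {
        // Placeholder: benchmark candidate WG sizes and return the best.
        return {256};
    }
};
```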

lakhinderwalia · Jun 24 '24