[Issue]: Why does reducing stack size decrease kernel launch overhead?
Problem Description
When we use an indirect function call in the kernel function, as in the following:
https://github.com/ROCm/rccl/blob/53dcfcc5e0592fa186da38a41d19082723711043/src/device/common.h#L348
RCCL limits the stack size to 512:
https://github.com/ROCm/rccl/blob/53dcfcc5e0592fa186da38a41d19082723711043/src/init.cc#L1876
whereas the default is 1024 if you do not modify it via the hipDeviceSetLimit API.
At first I thought the modification was not that important, but it does have an impact on performance. With a stack size of 512, collective communication on small data takes around 10 us. With a stack size of 1024, however, it takes around 244 us, more than 20x the latency.
The code line was introduced in https://github.com/ROCm/rccl/pull/684, but there is no explanation of the situation. Does anybody know why the stack size has an impact on kernel launch overhead?
I also made the following minimal working example, which you can use to reproduce the issue.
#include <cstdio>
#include <hip/amd_detail/amd_hip_runtime.h>
#include <hip/hip_runtime.h>
#include <chrono>

#define CHECK_HIP(res) \
  do { \
    hipError_t err = (res); \
    if (err != hipSuccess) { \
      fprintf(stderr, "HIP Error (%s:%d): %s (%s)\n", __FILE__, __LINE__, \
              hipGetErrorName(err), hipGetErrorString(err)); \
      exit(EXIT_FAILURE); \
    } \
  } while (0)

__device__ void subkernel0(int *a) { *a = 0xdeadbee0; }
__device__ void subkernel1(int *a) { *a = 0xdeadbee1; }

// Function-pointer table: calling through it is an indirect call in device code.
__device__ void (*subkernels[])(int *a){subkernel0, subkernel1};

__global__ void mainkernel(int *a) {
  subkernels[0](a);
}

// Times a single-thread kernel launch (launch + synchronize), averaged over niter runs.
size_t measure() {
  int warmup = 30, niter = 100;
  size_t elapsed = 0;
  for (int i = -warmup; i < niter; ++i) {
    int *a;
    CHECK_HIP(hipMalloc(&a, sizeof(int)));
    CHECK_HIP(hipMemset(a, 0, sizeof(int)));
    CHECK_HIP(hipDeviceSynchronize());
    auto start = std::chrono::high_resolution_clock::now();
    mainkernel<<<1, 1>>>(a);
    CHECK_HIP(hipDeviceSynchronize());
    auto end = std::chrono::high_resolution_clock::now();
    int b;
    CHECK_HIP(hipMemcpy(&b, a, sizeof(int), hipMemcpyDeviceToHost));
    if (b != 0xdeadbee0) {
      printf("Error: b = %x\n", b);
      exit(1);
    }
    CHECK_HIP(hipFree(a));
    if (i >= 0) {
      elapsed += std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
    }
  }
  return elapsed / niter;
}

size_t getStackSize() {
  size_t curStackSize;
  CHECK_HIP(hipDeviceGetLimit(&curStackSize, hipLimitStackSize));
  return curStackSize;
}

int main() {
  // Sweep stack sizes around the threshold and report average launch latency.
  for (int i = 500; i <= 700; ++i) {
    CHECK_HIP(hipDeviceSetLimit(hipLimitStackSize, i));
    printf("Stack size: %zu, Time: %zu ns\n", getStackSize(), measure());
  }
  return 0;
}
Operating System
Ubuntu 20.04.5 LTS (Focal Fossa)
CPU
AMD EPYC 7413 24-Core Processor
GPU
AMD Instinct MI100
ROCm Version
ROCm 6.0.0
Hi @csehydrogen , can you still reproduce this with the latest ROCm and RCCL?
I took the reproducer and was unable to reproduce it on MI300X (similar times for different stack sizes). I am trying to find an MI100 machine to run it on.
Reproduced with MI250X:
Stack size: 684, Time: 12183 ns
Stack size: 685, Time: 12438 ns
Stack size: 686, Time: 12286 ns
Stack size: 687, Time: 12272 ns
Stack size: 688, Time: 12223 ns
Stack size: 689, Time: 218785 ns
Stack size: 690, Time: 218671 ns
Stack size: 691, Time: 218574 ns
Stack size: 692, Time: 218499 ns
Will create an internal ticket for this.
I no longer have access to MI100. I was unable to reproduce it on MI350X with rocm-6.1.2. Seems like it's a pre-MI300 problem.
Thank you for the confirmation. We are aware of this issue on the MI100/MI200 series and are working on a fix. In the meantime, we recommend not using indirect function calls in RCCL.