Opaque-typed persistent arrays in GPU programs fail to compile due to missing headers
Describe the bug
Kind of a multifaceted bug. Applying the auto_optimize pass to an SDFG after GPU transformation makes many arrays in `sdfg.arrays` persistent, which are then declared in state structs in both the .cpp and .cu files, even if the corresponding source file does not use them. This becomes an issue if the SDFG has arrays of opaque types that are declared externally (and likely other external types too), as the headers they need will not be included in the generated code, causing a compile error.
To Reproduce
Consider the following MPI program using `MPI_Request`:
```python
import dace as dc
import numpy as np
from dace.transformation.auto import auto_optimize as aopt

MPI_Request = dc.opaque("MPI_Request")

@dc.program()
def distr(A: dc.float64[10]):
    req = np.empty((2,), dtype=MPI_Request)
    dc.comm.Isend(A[1], 0, 1, req[0])
    dc.comm.Irecv(A[-1], 0, 1, req[1])
    dc.comm.Waitall(req)

if __name__ == '__main__':
    A = np.random.rand(10)
    sdfg = distr.to_sdfg()
    sdfg.apply_gpu_transformations()
    sdfg = aopt.auto_optimize(sdfg, dc.dtypes.DeviceType.GPU)
    sdfg(A=A)
```
This generates the following state struct in both `distr.cpp` and `distr.cu`:
```cpp
struct distr_state_t {
    dace::cuda::Context *gpu_context;
    int __0___tmp1;
    int __0___tmp2;
    int __0___tmp3;
    int __0___tmp4;
    MPI_Request * __restrict__ __0_req;
    double * __restrict__ __0_gpu_A;
    double * __restrict__ __0_gpu_A_0;
};
```
The latter fails to compile because `mpi.h` is not included:

```
.dacecache/distr/src/cuda/distr_cuda.cu(12): error: identifier "MPI_Request" is undefined
```
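Until this is fixed, one user-side workaround is to patch the generated source before recompiling. A minimal sketch, assuming the cache layout shown in the error message above (`prepend_include` is a hypothetical helper, not DaCe API):

```python
from pathlib import Path

def prepend_include(src_path, header):
    """Prepend an #include line to a generated source file, if not already present."""
    src = Path(src_path)
    text = src.read_text()
    include_line = f'#include <{header}>\n'
    if include_line not in text:
        src.write_text(include_line + text)

# Usage (path mirrors the error message above):
# prepend_include('.dacecache/distr/src/cuda/distr_cuda.cu', 'mpi.h')
```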
Additional context and Discussion
A couple of things:
- Symbols declared with `dc.opaque` have no way to list their (header) dependencies. Such symbols hinge on headers being included by library nodes used in the DaCe program: `MPI_Request` in tests works by chance because it's used alongside MPI library nodes and only in CPU programs. Maybe an optional argument to `dc.opaque` would be appropriate, or we could have DaCe libraries export types such as `MPI_Request`.
- Most of the generated state struct isn't relevant to GPU code; the sample code above only accesses `state->gpu_context`, and arrays in the `.cu` file are passed around as direct pointers. Maybe it would be better to only declare this struct in CPU code and decouple it from `dace::cuda::Context *gpu_context`.
- Another option would be to declare the struct in a separate header file that includes every needed header, but I don't know how clean of a solution that would be.
- C++20 modules?
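To make the first suggestion concrete, here is a stdlib-only sketch of how an opaque type could carry its header dependencies so codegen can emit the includes a state struct needs. This is not DaCe API; `OpaqueType`, `headers`, and `emit_includes` are hypothetical names illustrating the proposed `dc.opaque(..., headers=...)` argument:

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable

@dataclass(frozen=True)
class OpaqueType:
    """An externally-defined C/C++ type plus the headers that declare it."""
    name: str
    headers: FrozenSet[str] = field(default_factory=frozenset)

def emit_includes(types: Iterable[OpaqueType]) -> str:
    """Collect the deduplicated #include lines needed by the given opaque types."""
    headers = sorted({h for t in types for h in t.headers})
    return ''.join(f'#include <{h}>\n' for h in headers)

# Hypothetical equivalent of dc.opaque("MPI_Request", headers=["mpi.h"]):
mpi_request = OpaqueType('MPI_Request', frozenset({'mpi.h'}))
preamble = emit_includes([mpi_request])  # '#include <mpi.h>\n'
```

Codegen could then prepend `emit_includes` over all opaque types referenced by a state struct to whichever file declares it, which would also cover the `.cu` case above.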
Side point, but I also noticed the generated code sets the storage type of `MPI_Request req` to `GPU_Global` even though it's only used in CPU code, and allocates it with `cudaMalloc`. I'll create a separate issue on this later.