Opaque-typed persistent arrays in GPU programs fail to compile due to missing headers
Describe the bug
Kind of a multifaceted bug. Applying the auto_optimize pass to an SDFG after GPU transformation makes many arrays in `sdfg.arrays` persistent, which are then declared in state structs in both the .cpp and .cu files, even if the corresponding source file does not use them. This becomes an issue if the SDFG has arrays of opaque types that are declared externally (and likely other external types too), as the headers they need will not be included in the generated code, causing a compile error.
To Reproduce
Consider the following MPI program using `MPI_Request`:
```python
import dace as dc
import numpy as np
from dace.transformation.auto import auto_optimize as aopt

MPI_Request = dc.opaque("MPI_Request")

@dc.program()
def distr(A: dc.float64[10]):
    req = np.empty((2,), dtype=MPI_Request)
    dc.comm.Isend(A[1], 0, 1, req[0])
    dc.comm.Irecv(A[-1], 0, 1, req[1])
    dc.comm.Waitall(req)

if __name__ == '__main__':
    A = np.random.rand(10)
    sdfg = distr.to_sdfg()
    sdfg.apply_gpu_transformations()
    sdfg = aopt.auto_optimize(sdfg, dc.dtypes.DeviceType.GPU)
    sdfg(A=A)
```
This generates the following state struct in both `distr.cpp` and `distr.cu`:
```cpp
struct distr_state_t {
    dace::cuda::Context *gpu_context;
    int __0___tmp1;
    int __0___tmp2;
    int __0___tmp3;
    int __0___tmp4;
    MPI_Request * __restrict__ __0_req;
    double * __restrict__ __0_gpu_A;
    double * __restrict__ __0_gpu_A_0;
};
```
The latter fails to compile because `mpi.h` is not included:

```
.dacecache/distr/src/cuda/distr_cuda.cu(12): error: identifier "MPI_Request" is undefined
```
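Until this is fixed, one user-side workaround is to patch the generated source before recompiling. A minimal sketch, assuming the cache layout shown in the error message above (`prepend_include` is a hypothetical helper, not DaCe API):

```python
from pathlib import Path

def prepend_include(src_path, header):
    """Prepend an #include line to a generated source file, if not already present."""
    src = Path(src_path)
    text = src.read_text()
    include_line = f'#include <{header}>\n'
    if include_line not in text:
        src.write_text(include_line + text)

# Usage (path mirrors the error message above):
# prepend_include('.dacecache/distr/src/cuda/distr_cuda.cu', 'mpi.h')
```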
Additional context and Discussion
A couple of things:
- Symbols declared with `dc.opaque` have no way to list their (header) dependencies. Such symbols hinge on headers being included by library nodes used in the DaCe program: `MPI_Request` in tests works by chance because it's used alongside MPI library nodes and only in CPU programs. Maybe an optional argument to `dc.opaque` would be appropriate, or we could have DaCe libraries export types such as `MPI_Request`.
- Most of the generated state struct isn't relevant to GPU code; the sample code above only accesses `state->gpu_context`, and arrays in the `.cu` file are passed around as direct pointers. Maybe it would be better to only declare this struct in CPU code and decouple it from `dace::cuda::Context *gpu_context`.
- Another option would be to declare the struct in a separate header file that includes every needed header, but I don't know how clean of a solution that would be.
- C++20 modules?
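To make the first suggestion concrete, here is a stdlib-only sketch of how an opaque type could carry its header dependencies so codegen can emit the includes a state struct needs. This is not DaCe API; `OpaqueType`, `headers`, and `emit_includes` are hypothetical names illustrating the proposed `dc.opaque(..., headers=...)` argument:

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable

@dataclass(frozen=True)
class OpaqueType:
    """An externally-defined C/C++ type plus the headers that declare it."""
    name: str
    headers: FrozenSet[str] = field(default_factory=frozenset)

def emit_includes(types: Iterable[OpaqueType]) -> str:
    """Collect the deduplicated #include lines needed by the given opaque types."""
    headers = sorted({h for t in types for h in t.headers})
    return ''.join(f'#include <{h}>\n' for h in headers)

# Hypothetical equivalent of dc.opaque("MPI_Request", headers=["mpi.h"]):
mpi_request = OpaqueType('MPI_Request', frozenset({'mpi.h'}))
preamble = emit_includes([mpi_request])  # '#include <mpi.h>\n'
```

Codegen could then prepend `emit_includes` over all opaque types referenced by a state struct to whichever file declares it, which would also cover the `.cu` case above.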
Side point, but I also noticed the generated code sets the storage type of `MPI_Request req` to `GPU_Global` even though it's only used in CPU code, and allocates it with `cudaMalloc`. I'll create a separate issue on this later.