llvm [SYCL-MLIR] Use GPUModuleOp to host SYCL device code

SYCL kernels are GPUFuncOps residing in the new GPUModuleOp, keeping host code in the regular module. Functions to be used by kernels must be cloned to the GPU module.

When a kernel is found, the host code is dropped and only the GPU module is used to produce an output. The host module will be used as usual if no kernel is found.

Signed-off-by: Victor Perez [email protected]

Sep 22 '22 09:09 victor-eds

Question: why do we need to clone the kernel and device functions into a new module? I do not fully understand the motivation for this PR. I am thinking of the current scenario, where the compiler is a two-pass compiler (device compilation pass + host compilation pass). During the device compilation pass, the compiler FE is presented with the entire program. It needs to lower the SYCL kernel and any function that may be transitively called from it to a "device" module. That module can be a GPUModuleOp, so the MLIR code for the kernel can be generated directly into that module. No need to clone anything.

Can you explain the mechanics proposed in this PR to help me understand the scenario you are working on, please?

Sep 22 '22 14:09 etiotto

Also, this is a very large PR. Would it be possible to split it to make code review easier, please?

Sep 22 '22 14:09 etiotto

> Question: why do we need to clone the kernel and device functions into a new module? I do not fully understand the motivation for this PR. I am thinking of the current scenario, where the compiler is a two-pass compiler (device compilation pass + host compilation pass). During the device compilation pass, the compiler FE is presented with the entire program. It needs to lower the SYCL kernel and any function that may be transitively called from it to a "device" module. That module can be a GPUModuleOp, so the MLIR code for the kernel can be generated directly into that module. No need to clone anything.

> Can you explain the mechanics proposed in this PR to help me understand the scenario you are working on, please?

Functions are output to the host module by default; if we find that a function is also called from a device context, we clone it to the device module.

I think by tracking the context where each call is made and just generating each function in the required context, we could avoid this cloning boilerplate. It would imply a major revision of the PR, of course, but we could just reuse what we currently have. If that sounds better to you, I'll make the change.
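A minimal sketch of the context-tracking idea, not the PR's actual code: given a call graph, compute every function reachable from a kernel, so that each one can be generated directly in the device module instead of being cloned afterwards. The `CallGraph` alias and `collectDeviceFunctions` are hypothetical names introduced only for illustration.

```cpp
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

// Hypothetical call graph: function name -> names of its direct callees.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Returns every function reachable from the given kernels (kernels included).
// In this sketch, these are the functions that would be generated in the
// device module; everything else stays in the host module.
std::set<std::string> collectDeviceFunctions(
    const CallGraph &graph, const std::vector<std::string> &kernels) {
  std::set<std::string> device;
  std::queue<std::string> worklist;

  // Seed the worklist with the kernels themselves.
  for (const auto &kernel : kernels)
    if (device.insert(kernel).second)
      worklist.push(kernel);

  // Breadth-first traversal over callees; each function is visited once.
  while (!worklist.empty()) {
    std::string fn = worklist.front();
    worklist.pop();
    auto it = graph.find(fn);
    if (it == graph.end())
      continue; // No recorded callees (e.g. a declaration).
    for (const auto &callee : it->second)
      if (device.insert(callee).second)
        worklist.push(callee);
  }
  return device;
}
```

With a graph where `kernel` calls `helper` and `host_main` also calls `helper`, only `kernel`, `helper`, and anything `helper` calls end up in the device set; `host_main` stays host-only.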

> Also, this is a very large PR. Would it be possible to split it to make code review easier, please?

Will do. Thanks!

Sep 22 '22 15:09 victor-eds

After this PR, test cases with SYCL kernels fail to generate LLVM IR, with the following error:

LLVM ERROR: Can't add pass 'ConvertPolygeistToLLVM' restricted to 'builtin.module' on a PassManager intended to run on 'gpu.module', did you intend to nest?

Sep 28 '22 21:09 whitneywhtsang