Add GpuSharedMemory Tests for Intel GPU

Open lmontigny opened this issue 2 years ago • 10 comments

Problem: We need dedicated shared-memory tests for Intel GPUs. Currently the only shared-memory test targets Nvidia GPUs with CUDA.

Solution: This PR adds the ./Tests/GpuSharedMemoryTestIntel executable. I took the GpuSharedMemoryTest.cpp code and adapted it from Nvidia CUDA to Level Zero (L0).

Impact: It will allow us to test our SM (shared memory) implementation, which is a work in progress: https://github.com/intel-ai/hdk/pull/534

lmontigny avatar Jun 21 '23 13:06 lmontigny

This is mostly the same as the original smem test, right? The only real difference is the kernel submission. Can we get rid of the code duplication for all the codegen and helper methods?

kurapov-peter avatar Jun 22 '23 14:06 kurapov-peter

  • Fixed style-check for CI
  • Need to fix ./Tests/GpuSharedMemoryTestIntel

lmontigny avatar Jul 04 '23 15:07 lmontigny

Current issue with GpuSharedMemoryTestIntel executable:

$ ./Tests/GpuSharedMemoryTestIntel 
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from SingleColumn
[ RUN      ] SingleColumn.VariableEntries_CountQuery_4B_Group
Adding global slm.buf.i64
PerformReductionTest - after linking
PerformReductionTest - before calling conv
PerformReductionTest - after calling conv
PerformReductionTest - before linking
compile_and_link_gpu_code - before writeSpirv
compile_and_link_gpu_code - before spv_to_bin
unknown file: Failure
C++ exception with description "L0 error: error occurred when building module, see build log for details" thrown in the test body.
[  FAILED  ] SingleColumn.VariableEntries_CountQuery_4B_Group (70 ms)

hdk_log:

 2023-07-05T10:12:06.622687 I 3915859 0 0 L0Mgr.cpp:76 Discovered 1 driver(s) for L0 platform.
 2023-07-05T10:12:06.661611 I 3915859 0 0 L0Mgr.cpp:229 L0 module build log: error: IGC SPIRV Consumer does not support the following opcode: 65534

IR: https://gist.github.com/lmontigny/3663842110065557e18df09ea5f658d5

@kurapov-peter Can you have a look at the IR and at GpuSharedMemoryTestIntel.cpp around init_smem_func (L134)? I took GpuSharedMemoryTest.cpp (CUDA) and modified it to support L0; I'm not sure it's 100% correct.

lmontigny avatar Jul 05 '23 15:07 lmontigny

I see that none of the functions except for wrapper_kernel have a proper calling convention set. If that's the final representation of the kernel you convert to SPIR-V, I'd assume the error you are seeing is related: e.g., it tries to generate a call instruction with the default calling convention and doesn't find one in the ISA.
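
One way to check this before the SPIR-V conversion is to scan the textual IR for `define` lines that carry neither `spir_func` nor `spir_kernel`, which are the keywords LLVM IR uses for the SPIR calling conventions. The sketch below is a crude, text-based diagnostic; `functions_missing_spir_cc` is a hypothetical helper, not part of HDK, and it assumes the IR has been dumped as text.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical diagnostic helper (not part of HDK): returns the names of
// functions defined in textual LLVM IR whose `define` line mentions neither
// the spir_func nor the spir_kernel calling convention.
// Crude by design: a function whose *name* contains "spir_func" would be
// treated as if it had the calling convention set.
std::vector<std::string> functions_missing_spir_cc(const std::string& ir_text) {
  std::vector<std::string> missing;
  std::istringstream in(ir_text);
  std::string line;
  while (std::getline(in, line)) {
    if (line.rfind("define ", 0) != 0) continue;  // only function definitions
    const bool has_cc = line.find("spir_func") != std::string::npos ||
                        line.find("spir_kernel") != std::string::npos;
    if (has_cc) continue;
    const auto at = line.find('@');         // start of the function name
    if (at == std::string::npos) continue;
    const auto paren = line.find('(', at);  // start of the parameter list
    if (paren == std::string::npos) continue;
    missing.push_back(line.substr(at + 1, paren - at - 1));
  }
  return missing;
}
```

Any name it reports is a candidate for getting the SPIR function calling convention set in the codegen before the module is handed to the SPIR-V writer.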

kurapov-peter avatar Jul 05 '23 17:07 kurapov-peter

Another thing to check is that your group-by buffer size does not produce indexes that fall outside the allocated SLM.
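
A minimal sketch of that bounds check, assuming the group-by buffer is a flat array of fixed-size entries. The function and parameter names are illustrative and not HDK APIs; the actual SLM size has to come from the device properties.

```cpp
#include <cstddef>

// Hypothetical check (names are illustrative, not HDK APIs): returns true
// when a group-by buffer of `entry_count` entries, each `row_size_bytes`
// wide, fits inside an SLM allocation of `slm_size_bytes`, so that every
// index in [0, entry_count) addresses allocated memory.
bool group_by_buffer_fits_in_slm(std::size_t entry_count,
                                 std::size_t row_size_bytes,
                                 std::size_t slm_size_bytes) {
  if (entry_count == 0 || row_size_bytes == 0) {
    return true;  // nothing to store, nothing can be out of range
  }
  // Divide instead of multiplying so the check itself cannot overflow.
  return entry_count <= slm_size_bytes / row_size_bytes;
}
```

For example, with an assumed 64 KiB of SLM and 8-byte entries, 8192 entries fit and 8193 do not.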

kurapov-peter avatar Jul 05 '23 17:07 kurapov-peter

The last thing that caught my eye is that there's a declaration of the init_shared_mem that doesn't match the definition, so you might be calling a non-existent function.

kurapov-peter avatar Jul 05 '23 17:07 kurapov-peter

Oh, and I think init_smem_func uses global address space instead of the shared one. This is the most probable reason you get the error.
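
For reference, LLVM's SPIR targets number the address spaces as 0 = private, 1 = global, 2 = constant, 3 = local (the one backing SLM) and 4 = generic. The sketch below pulls the address space number out of a line of textual IR so the pointers used in init_smem_func can be inspected; both helper names are hypothetical and the parsing assumes well-formed IR.

```cpp
#include <string>

// Hypothetical helper: returns N from the first "addrspace(N)" found in a
// line of textual LLVM IR, 0 when the line has no addrspace qualifier (the
// default address space), or -1 when the closing parenthesis is missing.
int first_addrspace(const std::string& ir_line) {
  const std::string key = "addrspace(";
  const auto pos = ir_line.find(key);
  if (pos == std::string::npos) return 0;
  const auto close = ir_line.find(')', pos);
  if (close == std::string::npos) return -1;
  const auto begin = pos + key.size();
  return std::stoi(ir_line.substr(begin, close - begin));
}

// Address space names as used by LLVM's SPIR targets.
const char* spir_addrspace_name(int as) {
  switch (as) {
    case 0: return "private";
    case 1: return "global";
    case 2: return "constant";
    case 3: return "local (SLM)";
    case 4: return "generic";
    default: return "unknown";
  }
}
```

If the shared buffer or the pointers derived from it report `global` here, the buffer is not being placed in SLM.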

kurapov-peter avatar Jul 05 '23 18:07 kurapov-peter

The previous SPIR-V IGC opcode issue has been fixed internally. Now looking at: C++ exception with description "L0 error: kernel name is not found in the module" thrown in the test body.

lmontigny avatar Jul 19 '23 15:07 lmontigny

> The last thing that caught my eye is that there's a declaration of the init_shared_mem that doesn't match the definition, so you might be calling a non-existent function.

In the latest IR with the IGC fix, I don't have any declaration for init_shared_mem. Is that expected? (IR attached.)

lmontigny avatar Jul 20 '23 15:07 lmontigny

Fixed a large portion of the codegen at different levels; the executable now runs.

The next step is to fix the computation so that the results are correct (currently the CPU and GPU results differ).

hdk/omniscidb/Tests/GpuSharedMemoryTestIntel.cpp:416: Failure
Expected equality of these values:
  cmp_result
    Which is: 1
  0
[  FAILED  ] SingleColumn.VariableEntries_CountQuery_4B_Group (247 ms)
[----------] 1 test from SingleColumn (247 ms total)
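
A non-zero cmp_result only says that the buffers differ. To narrow it down, a slot-by-slot comparison that reports the first differing index can help. The sketch below is a hypothetical debugging helper, not code from the test; it assumes both result buffers have been copied to the host as vectors of 64-bit slots.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical debugging helper (not from the HDK test): returns the index
// of the first 64-bit slot where the CPU and GPU result buffers differ, or
// -1 when they are identical. If one buffer is a strict prefix of the other,
// the length of the shorter buffer is returned.
long long first_mismatch(const std::vector<std::int64_t>& cpu,
                         const std::vector<std::int64_t>& gpu) {
  const std::size_t n = std::min(cpu.size(), gpu.size());
  for (std::size_t i = 0; i < n; ++i) {
    if (cpu[i] != gpu[i]) {
      return static_cast<long long>(i);
    }
  }
  if (cpu.size() != gpu.size()) {
    return static_cast<long long>(n);
  }
  return -1;
}
```

Printing the slot values at the returned index for both buffers shows whether the GPU side holds uninitialized data, a partial count, or a value written to the wrong slot.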

lmontigny avatar Jul 21 '23 16:07 lmontigny