oroOccupancyMaxActiveBlocksPerMultiprocessor returns hipErrorInvalidDeviceFunction always

Open AtsushiYoshimura0302 opened this issue 1 year ago • 4 comments

Hi, the API oroOccupancyMaxActiveBlocksPerMultiprocessor always fails. I tried both the "main" branch and "release/hip6.0_cuda12.2" with an RX7900XTX and an RTX4090, and it failed every time.

This can be reproduced by adding the following to "SimpleDemo":

			int numBlocks = 0;
			oroError_t error = oroOccupancyMaxActiveBlocksPerMultiprocessor( &numBlocks, function, 128, 0 );
			printf( "occupancy api %d %d\n", error, numBlocks ); // shows occupancy api 98 0

Can anyone help?

AtsushiYoshimura0302 avatar Jul 23 '24 01:07 AtsushiYoshimura0302

The function is correctly bound to HIP ( hipOccupancyMaxActiveBlocksPerMultiprocessor ) so I don't think the bug is related to Orochi. I confirm I'm reproducing the same error code: hipErrorInvalidDeviceFunction (98).

RichardGe avatar Jul 23 '24 13:07 RichardGe

@RichardGe Thank you for checking, but I found the reason. The Orochi API has to use hipModuleOccupancyMaxActiveBlocksPerMultiprocessor / cuOccupancyMaxActiveBlocksPerMultiprocessor instead of hipOccupancyMaxActiveBlocksPerMultiprocessor / cudaOccupancyMaxActiveBlocksPerMultiprocessor, since Orochi uses runtime compilation. The driver API and the runtime API treat the function pointer differently, and the current binding is for the runtime API.

note: https://forums.developer.nvidia.com/t/using-cudaoccupancymaxactiveblockspermultiprocessor-with-function-acquired-with-cumodulegetfunction/184191

AtsushiYoshimura0302 avatar Jul 24 '24 02:07 AtsushiYoshimura0302

I think there are some more incorrect bindings, e.g. oroFuncGetAttributes().

AtsushiYoshimura0302 avatar Jul 24 '24 02:07 AtsushiYoshimura0302

I confirmed the behavior with HIP SDK6.1 and https://github.com/ROCm/rocm-examples.git (92786e2 - Add source format linting to the GitHub workflows (#140)) and https://github.com/NVIDIA/cuda-samples

AtsushiYoshimura0302 avatar Jul 24 '24 02:07 AtsushiYoshimura0302

Some clarification here:

Which of these two functions to use depends on where the function pointer passed as one of the parameters comes from.

For example, in the CUDA case:

If the function pointer originally comes from something like cuModuleGetFunction(), the other calls that use it should also be bound to "cu" functions instead of "cuda" functions; the two cannot be mixed.

Note: In CUDA, runtime API functions start with "cuda" and driver API functions start with "cu"

The same applies to HIP.

KaoCC avatar Oct 17 '24 11:10 KaoCC

Hi @AtsushiYoshimura0302, we investigated with @KaoCC. In SimpleDemo, the function is obtained from oroModuleGetFunction, so you need to use oroModuleOccupancyMaxActiveBlocksPerMultiprocessor instead of oroOccupancyMaxActiveBlocksPerMultiprocessor. I tested this and it worked, so I think we can close this ticket.

RichardGe avatar Oct 17 '24 14:10 RichardGe

Ah, thank you for finding it out and checking.

AtsushiYoshimura0302 avatar Oct 17 '24 14:10 AtsushiYoshimura0302
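
For reference, the resolution discussed in this thread can be sketched as the original reproduction snippet with the module variant substituted. This is a fragment meant to be pasted into "SimpleDemo", not a standalone program: `function` is the handle SimpleDemo already obtains through oroModuleGetFunction. The argument list is an assumption that the Orochi binding mirrors hipModuleOccupancyMaxActiveBlocksPerMultiprocessor; please check it against Orochi.h for your branch.

```cpp
// Fragment for SimpleDemo: `function` comes from oroModuleGetFunction(),
// so the module (driver-API style) occupancy query is used.
int numBlocks = 0;

// Assumed parameters, mirroring the HIP module API:
// ( int* numBlocks, oroFunction f, int blockSize, size_t dynSharedMemPerBlk )
auto error = oroModuleOccupancyMaxActiveBlocksPerMultiprocessor( &numBlocks, function, 128, 0 );

// Same print as the original repro; the cast keeps the format string valid
// regardless of the exact error enum type.
printf( "occupancy api %d %d\n", (int)error, numBlocks );
```

The block size of 128 and the dynamic shared memory size of 0 are taken unchanged from the original reproduction.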