Results 14 comments of kahakuka

MMCV_WITH_OPS=1 python3 setup.py -v bdist_wheel

@teamwong111 Yes, just as you said. But I want to provide a ".whl" package. Is there any other way? The solution is to change "include_package_data" to False in setup. The compilation...

The template is as follows: using ALayout = Row; using BLayout = Row; using DLayout = Row; using ELayout = Row; using DeviceOpInstance = ck::tensor_operation::device::DeviceGemmMultipleABD_Xdl_CShuffle< ck::Tuple, ck::Tuple, ck::Tuple, ELayout, ck::Tuple,...

@zjing14 Thank you for your answer. GEMM case: 60_gemm_multi_ABD. I referred to the modifications in your PR (https://github.com/ROCm/composable_kernel/pull/978) and added the function TransposeFromElmToDst to implement it. The layout of B is row, which...

@zjing14 Yes. However, when B takes 8 elements at a time, the performance is very poor. Change 1, 2, 8 here to 1, 8, 8. ![image](https://github.com/ROCm/composable_kernel/assets/40659418/3ecd25f6-c614-428f-9ffe-5b81640726fb)

@zjing14 Thank you for your answer. It does not use int4x2. It quantizes fp16 to int4 according to a certain pattern and stores the result in a uint32. The paper describes it this way. ![image](https://github.com/ROCm/composable_kernel/assets/40659418/3823dafb-d2d4-444d-acd6-7edd5bf104f6)
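
To make the idea of storing int4 values in a uint32 concrete, here is a minimal Python sketch. It is not the layout from the paper or from llm-awq: the nibble order (value i in bits 4*i to 4*i+3) and the scale/zero-point quantization are assumptions chosen for illustration, and the names `quantize`, `dequantize`, `pack_int4`, and `unpack_int4` are hypothetical.

```python
def quantize(x, scale, zero_point):
    """Map a float to an unsigned 4-bit code (assumed affine scheme)."""
    q = round(x / scale) + zero_point
    return max(0, min(15, q))  # clamp to the int4 range 0..15


def dequantize(q, scale, zero_point):
    """Inverse of quantize, up to rounding and clamping error."""
    return (q - zero_point) * scale


def pack_int4(values):
    """Pack eight 4-bit codes into one 32-bit word.

    Assumed order: values[i] occupies bits 4*i .. 4*i+3. A real kernel
    may interleave the nibbles differently.
    """
    if len(values) != 8:
        raise ValueError("expected exactly 8 values")
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v <= 15:
            raise ValueError("each value must fit in 4 bits")
        word |= v << (4 * i)
    return word


def unpack_int4(word):
    """Extract the eight 4-bit codes from a 32-bit word, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]
```

With this ordering, `unpack_int4(pack_int4(codes))` returns the original eight codes, so a different nibble permutation only changes the index arithmetic in the two packing functions.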

@zjing14 Would it be easy to integrate int4 + GEMM into composable_kernel by following the MMA approach in llm-awq? llm-awq: for the int4 dequantization, you can refer to this...

@ilmarkov Hello, I am on ROCm 5.7. The 'hipIpcMemLazyEnablePeerAccess' argument of the following function needs to be '0'. After changing it to '0', the accuracy is incorrect. Excuse...

> @kahakuka Thank you for your comment! Are you using an MI200 GPU? Yes, I also tried it on an MI250, and the result is the same.