
Switch to hipblaslt from rocblas

ahsan-ca opened this issue Apr 24 '24 · 3 comments

This issue tracks adding the functionality to switch from rocBLAS to hipBLASLt.

ahsan-ca · Apr 24 '24 18:04

Talked to Jimmy from the hipBLASLt team about the issue where hipBLASLt does not find a solution for FP8 inputs.

There are only a few solutions for TT FP8-input/FP8-output GEMMs, and those solutions only support N >= 16. With N = 16 a solution can be found:
./hipblaslt-bench -m 8 -n 16 -k 8 --stride_a 0 --stride_b 0 --stride_c 0 --stride_d 0 --stride_e 0 --alpha 1 --beta 0 --transA T --transB T --batch_count 1 --a_type f8_r --b_type f8_r --c_type f8_r --d_type f8_r --compute_type f32_r --scale_type f32_r

He asked:

Do you really need to run sizes this small?

Edit: Discussed with the team; the feedback was that we need hipBLASLt to work for this case. Opened a JIRA ticket for it.

ahsan-ca · Jul 31 '24 19:07

Did some performance runs for bert_base_cased_1 to compare the performance of rocblas vs hipblaslt: Performance_hipBLASLt_bert_base_cased.xlsx

~The numbers are quite similar, with the worst degradation at 8% for batch size 1.~

Edit: It turns out MLIR did not get disabled during the runs, so the performance numbers do not reflect the rocblas vs hipblaslt change.

ahsan-ca · Aug 02 '24 16:08

Looking at make check failures for the hipblaslt branch...

The following tests FAILED:
         94 - test_gpu_gemm_tune (Failed)
        348 - test_verify_general (Failed)
        350 - test_verify_conv (Failed)
        351 - test_verify_gemm (Failed)

One failure of interest:

[   RUN    ] test_conv<migraphx::shape::fp8e4m3fnuz_type>
[Warning] : MIGraphX has BETA support for FP8. Using FP8 may result in incorrect final outputs

module: "main"
w = @param:w -> fp8e4m3fnuz_type, {4, 3, 3, 3}, {27, 9, 3, 1}
x = @param:x -> fp8e4m3fnuz_type, {4, 3, 3, 3}, {27, 9, 3, 1}
@2 = allocate[shape=fp8e4m3fnuz_type, {4, 4, 1, 1}, {4, 1, 1, 1},buf_type=nullopt] -> fp8e4m3fnuz_type, {4, 4, 1, 1}, {4, 1, 1, 1}
@3 = gpu::miopen_op[op=gpu::convolution[op={padding={0, 0, 0, 0},stride={1, 1},dilation={1, 1},group=1,padding_mode=0},solution_object={binary_object: 0},algo=0,solution_id=0]](x,w,@2) -> fp8e4m3fnuz_type, {4, 4, 1, 1}, {4, 1, 1, 1}


FAILED: test_conv<migraphx::shape::fp8e4m3fnuz_type>
    what(): /code/AMDMIGraphX/AMDMIGraphX/src/targets/gpu/include/migraphx/gpu/miopen.hpp:147: make_tensor: MAKE_TENSOR: unsupported type

CMake Error at gdb/test_test_verify_conv/run.cmake:16 (message):
  Test failed

ahsan-ca · Aug 02 '24 16:08

Addressing comments on the PR.

ahsan-ca · Aug 21 '24 18:08

Did some more performance runs for bert_base_cased_1 to compare rocblas vs hipblaslt for FP32 and FP16. FP16 shows a perf improvement of ~19% to 47% depending on batch size. FP32 shows a degradation of about 1.32% to 4.15% for batch sizes 1, 2, and 8, while batch size 4 shows an improvement of ~4%. I also collected total times and GPU GEMM times in the runs.

Performance_hipBLASLt_vs_rocblas_bert_base_cased.xlsx

ahsan-ca · Aug 26 '24 15:08