Switch to hipblaslt from rocblas
This issue tracks adding functionality for switching from rocBLAS to hipBLASLt.
Talked to Jimmy from the hipBLASLt team about an issue where hipBLASLt fails to find a solution for FP8 inputs.
There are only a few solutions for TT FP8-input/FP8-output, and those solutions only support N >= 16. With N = 16, a solution can be found:
```shell
./hipblaslt-bench -m 8 -n 16 -k 8 --stride_a 0 --stride_b 0 --stride_c 0 --stride_d 0 --stride_e 0 --alpha 1 --beta 0 --transA T --transB T --batch_count 1 --a_type f8_r --b_type f8_r --c_type f8_r --d_type f8_r --compute_type f32_r --scale_type f32_r
```
He asked: "Do you really need to run this kind of small size?"
Edit: Discussed with the team; the feedback was that we need hipBLASLt to work for this case. Opened a JIRA ticket to track it.
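The N >= 16 constraint above implies that, until the JIRA ticket is resolved, a dispatcher would need to detect small FP8 TT GEMMs and fall back (e.g. to rocBLAS). A minimal sketch of that guard; the function name and the fallback policy are hypothetical, not MIGraphX code:

```cpp
// Hypothetical guard for FP8 GEMM dispatch. Per the hipBLASLt team,
// TT FP8-in/FP8-out solutions only exist for N >= 16; smaller N finds
// no solution, so the caller should fall back to another backend.
bool can_use_hipblaslt_fp8(bool trans_a, bool trans_b, int n)
{
    if(trans_a && trans_b)
        return n >= 16; // constraint reported for the TT FP8 case
    return true; // other layouts were not flagged in the discussion
}
```

With the failing benchmark shape above (TT, n = 8) this returns false, while the working n = 16 shape passes.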
Did some performance runs for bert_base_cased_1 to compare rocBLAS vs hipBLASLt performance: Performance_hipBLASLt_bert_base_cased.xlsx
~The numbers are quite similar with the worst degradation for batch size 1 at 8%~.
Edit: Turns out MLIR did not get disabled during the runs, so the performance numbers are not reflective of the rocBLAS vs hipBLASLt change.
Looking at the make check failures for the hipblaslt branch...
```
The following tests FAILED:
 94 - test_gpu_gemm_tune (Failed)
348 - test_verify_general (Failed)
350 - test_verify_conv (Failed)
351 - test_verify_gemm (Failed)
```
One failure of interest:
```
[ RUN ] test_conv<migraphx::shape::fp8e4m3fnuz_type>
[Warning] : MIGraphX has BETA support for FP8. Using FP8 may result in incorrect final outputs
module: "main"
w = @param:w -> fp8e4m3fnuz_type, {4, 3, 3, 3}, {27, 9, 3, 1}
x = @param:x -> fp8e4m3fnuz_type, {4, 3, 3, 3}, {27, 9, 3, 1}
@2 = allocate[shape=fp8e4m3fnuz_type, {4, 4, 1, 1}, {4, 1, 1, 1},buf_type=nullopt] -> fp8e4m3fnuz_type, {4, 4, 1, 1}, {4, 1, 1, 1}
@3 = gpu::miopen_op[op=gpu::convolution[op={padding={0, 0, 0, 0},stride={1, 1},dilation={1, 1},group=1,padding_mode=0},solution_object={binary_object: 0},algo=0,solution_id=0]](x,w,@2) -> fp8e4m3fnuz_type, {4, 4, 1, 1}, {4, 1, 1, 1}
FAILED: test_conv<migraphx::shape::fp8e4m3fnuz_type>
what(): /code/AMDMIGraphX/AMDMIGraphX/src/targets/gpu/include/migraphx/gpu/miopen.hpp:147: make_tensor: MAKE_TENSOR: unsupported type
CMake Error at gdb/test_test_verify_conv/run.cmake:16 (message):
Test failed
```
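The `make_tensor: unsupported type` error suggests the type-to-MIOpen mapping has no case for `fp8e4m3fnuz_type`, so the conversion throws. A simplified sketch of that kind of mapping, using stand-in enums rather than the real MIGraphX/MIOpen types, to illustrate where the failure originates:

```cpp
#include <stdexcept>

// Stand-in enums for illustration only; the real code maps
// migraphx::shape types to miopenDataType_t.
enum class shape_type { float_type, half_type, int8_type, fp8e4m3fnuz_type };
enum class miopen_dtype { Float, Half, Int8 };

// Sketch of a make_tensor-style type mapping: any element type without a
// backend equivalent falls through to the "unsupported type" error seen
// in the test log above.
miopen_dtype to_miopen_dtype(shape_type t)
{
    switch(t)
    {
    case shape_type::float_type: return miopen_dtype::Float;
    case shape_type::half_type: return miopen_dtype::Half;
    case shape_type::int8_type: return miopen_dtype::Int8;
    default: throw std::runtime_error("make_tensor: unsupported type");
    }
}
```

Under this reading, the fix is either to add an FP8 mapping once MIOpen supports it or to keep FP8 convolutions off the MIOpen path in this branch.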
Addressing comments on the PR.
Did some more performance runs for bert_base_cased_1 to compare rocBLAS vs hipBLASLt for FP32 and FP16. FP16 numbers show a perf improvement of ~19% to 47% depending on batch size. FP32 numbers show a degradation of about 1.32% to 4.15% for batch sizes 1, 2, and 8, while batch size 4 shows an improvement of ~4%. I also collected total times and GPU GEMM times in these runs.
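For reference, the improvement/degradation percentages quoted above are plain relative deltas between the two backends' runtimes. A trivial helper showing the convention assumed here (name and sign convention are mine, not from the spreadsheet):

```cpp
// Relative runtime change when switching from rocBLAS to hipBLASLt.
// Negative => hipBLASLt is faster (improvement); positive => degradation.
double perf_delta_pct(double rocblas_ms, double hipblaslt_ms)
{
    return (hipblaslt_ms - rocblas_ms) / rocblas_ms * 100.0;
}
```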