Tests fail on POWER machines
- testsuite-run-fast fails on POWER9 and POWER10 with error message:
4440 Segmentation fault ./test_libblis.x -g ./testsuite/input.general.fast -o ./testsuite/input.operations.fast > output.testsuite - POWER7 dgemm microkernel includes altivec.h, which defines type
bool. Because of that, edge-case handling macros here and here use a conflicting type bool from stdbool. A possible solution is to check for POWER architecture in edge_case_macros and define _bool as int for POWER and bool for other architectures. - POWER10 build fails because POWER10 microkernels use the old microkernel API without edge handling (addressed in #620). Also, type nibble is defined in the sandbox but used for i4 microkernel so any POWER10 user has to configure BLIS with
-s power10. Is it intentional?
POWER9 uses slow reference implementations for sgemm, cgemm, zgemm by default, are there plans to support a fast sgemm microkernel for POWER9?
@ivan23kor can QEMU be used to test POWER 7/8/9/10? If so what are the necessary flags so that we can add Travis CI tests?
@devinamatthews unfortunately I don't know about QEMU as I am running on actual hardware.
Looks like this is happening on POWER9 after https://github.com/flame/blis/commit/ee9ff988c49f16696679d4c6cd3dcfcac7295be7
This is the stack trace :
Program received signal SIGSEGV, Segmentation fault.
bli_sgemmtrsmbb_l_power9_ref (k=4, alpha=0x10, a1x=0x64, a11=0x7fffffffccf0, bx1=0x7ffff6ef10c0, b11=0x7ffff6ef1700, c11=0x7fffebd51098, rs_c=140737150003608, cs_c=270180544,
data=0x10, cntx=0x1) at ref_kernels/3/bb/bli_gemmtrsmbb_ref.c:138
138 INSERT_GENTFUNC_BASIC3( gemmtrsmbb_l, BLIS_CNAME_INFIX, BLIS_REF_SUFFIX, BLIS_TRSM_L_UKR )
(gdb) bt
#0 bli_sgemmtrsmbb_l_power9_ref (k=4, alpha=0x10, a1x=0x64, a11=0x7fffffffccf0, bx1=0x7ffff6ef10c0, b11=0x7ffff6ef1700, c11=0x7fffebd51098, rs_c=140737150003608,
cs_c=270180544, data=0x10, cntx=0x1) at ref_kernels/3/bb/bli_gemmtrsmbb_ref.c:138
#1 0x000000001010b2a8 in bli_sgemmtrsm_l_ukernel (m=4, n=16, k=100, alpha=0x7fffffffccf0, a1x=0x7ffff6ef10c0, a11=0x7ffff6ef1700, bx1=0x7fffebd51098, b11=0x7fffebd52998,
c11=0x101aa0c0, rs_c=16, cs_c=1, data=0x7fffffffc200, cntx=0x101a08c0) at frame/3/bli_l3_ukr_tapi.c:124
#2 0x000000001003da60 in bli_gemmtrsm_ukernel (alpha=
So there is no special TRSM kernel in the power9 directory. I could see only specialized SGEMM and DGEMM kernel for POWER9.
Looks like this is happening on POWER9 after https://github.com/flame/blis/commit/ee9ff988c49f16696679d4c6cd3dcfcac7295be7
Having done a git bisect on a POWER9 system I can confirm this too.
Further investigation points at missing modification of ref_kernels/3/bb/bli_gemmtrsmbb_ref.c (parameters m&n are not introduced)
After changing that file similar to https://github.com/flame/blis/commit/ee9ff988c49f16696679d4c6cd3dcfcac7295be7#diff-512c6a50b6244efae7b93e599fb1b295377718d298cac465e3f83db3718c4b03 I see many failures of the kind
% blis_<dt><op>_<params>_<stor> m n k gflops resid result
blis_cgemm1m_nn_ccc 100 100 100 1.34 1.97e-08 PASS
blis_cgemm_nn_ccc 100 100 100 3.64 1.09e-08 PASS
% blis_<dt><op>_<params>_<stor> m n k gflops resid result
blis_zgemm1m_nn_ccc 100 100 100 17.96 3.82e-03 FAILURE
blis_zgemm_nn_ccc 100 100 100 4.16 2.25e-17 PASS
I.e. Only the "z" datatype on the "*1m" is affected.
Using the "generic" configuration works.
@fgvanzee the issue seems to be missing edge case handling in gemmtrsmbb_ref. Since you fixed the other trsm/gemmtrsm kernels can you fix this?
I'll take a look at it.
Uhh, am I missing something, @devinamatthews?
$ ls ref_kernels/3
bli_gemm_ref.c bli_gemmsup_ref.c bli_gemmtrsm_ref.c bli_trsm_ref.c old
It's not clear what I need to fix.
The bb kernels live in some strangebplace. On Jul 21, 2022, at 6:08 PM, "Field G. Van Zee" @.@.>> wrote:
[EXTERNAL SENDER]
Uhh, am I missing something, @devinamatthewshttps://github.com/devinamatthews?
$ ls ref_kernels/3 bli_gemm_ref.c bli_gemmsup_ref.c bli_gemmtrsm_ref.c bli_trsm_ref.c old
— Reply to this email directly, view it on GitHubhttps://github.com/flame/blis/issues/621#issuecomment-1192016426, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABIAZIODBD5UQXDEIJ52MHDVVHJ6HANCNFSM5QM5ICKQ. You are receiving this because you were mentioned.Message ID: @.***>
The bb kernels live in some strangebplace.
Oh wait, didn't we fold them into the conventional reference microkernels?
@devinamatthews Ah, I think I see the problem now. It's not that the gemmtrsm ukernels lack edge-case handling; it's that they haven't been updated according to the latest way that the bb functionality has been folded in. (By contrast, it appears that the gemm ukernel has been updated. So I'll use that as my guide.)
I'll try to work on this more tomorrow.
@fgvanzee gemmtrsm reference kernels has been updated:
PASTEMAC(ch,bcastbbs_mxn) \
( \
m, \
n, \
b11, rs_b, cs_b \
); \
etc. @Flamefire did you test the tip of the master branch (currently af3a41e)?
@devinamatthews Ah, yes, I had previously overlooked this:
const inc_t rs_b = packnr; \
const inc_t cs_b = bli_cntx_get_blksz_def_dt( dt, BLIS_BBN, cntx ); \
I successfully tested af3a41e02534befdae026377592ce437bab83023 on Power9LE
Do you know which commit fixed that?
The Power9/10 kernels use a "broadcast packing" format, which messed up a lot of the older code, or rather, required some bespoke code which interacted poorly with [cz]gemm1m. I rewrote all of the reference packing kernels in ae10d9495486f589ed0320f0151b2d195574f1cf in part to streamline this issue (note that that commit has a bug which was later fixed in 667f201b7871da68622027d02bd6b7da3262f8e8.
I guess we can close the issue then?
Yes, thanks!
Thanks for fixing this. make check now shows only these failures on POWER9.
Filename: out.zblat3
******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY ******* ******* ZHEMM FAILED ON CALL NUMBER: 274: ZHEMM ('R','U', 1, 1,( 1.0, .0), A, 2, B, 2,( .0, .0), C, 2) .
ZSYMM PASSED THE TESTS OF ERROR-EXITS
******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY ******* ******* ZSYMM FAILED ON CALL NUMBER: 274: ZSYMM ('R','U', 1, 1,( 1.0, .0), A, 2, B, 2,( .0, .0), C, 2) .
@RajalakshmiSR Maybe try compiling with -fstack-protector-strong when using GCC > 7. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 which may be the issue. But just a wild guess.
@RajalakshmiSR there might be a bug where data is written off the end of the output matrix. Please try the following test program:
#include "blis.h"
#include <stdio.h>
#include <string.h>
int main(int argc, char** argv)
{
obj_t A, B, C, a, b, c;
bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &A );
bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &B );
bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &C );
bli_setm( &BLIS_ONE, &A );
bli_setm( &BLIS_ONE, &B );
bli_setm( &BLIS_ZERO, &C );
bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &A ), 1, 2, &a );
bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &B ), 1, 2, &b );
bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &C ), 1, 2, &c );
bli_obj_set_struc( BLIS_HERMITIAN, &a );
bli_obj_set_uplo( BLIS_UPPER, &a );
bli_printm( "before:", &C, "%4.1f", "" );
bli_hemm( BLIS_RIGHT, &BLIS_ONE, &a, &b, &BLIS_ZERO, &c );
bli_printm( "after:", &C, "%4.1f", "" );
return 0;
}
It should print:
before:
0.0 + 0.0 0.0 + 0.0
0.0 + 0.0 0.0 + 0.0
after:
1.0 + 0.0 0.0 + 0.0
0.0 + 0.0 0.0 + 0.0
Any change in the zero values in the "after" matrix would be a problem.
Not sure if this is relevant, but I just realized this morning that unless @devinamatthews's recent changes (vis-a-vis merging the bb packm functionality into a single reference kernel) extends bb support to triangular packing, any bb-utilizing subconfiguration needs to #define BLIS_DISABLE_TRMM_RIGHT, as is done for power9:
$ grep -R BLIS_DISABLE_TRMM_RIGHT config
./config/old/haswellbb/bli_family_haswell.h:#define BLIS_DISABLE_TRMM_RIGHT
./config/power9/bli_family_power9.h:#define BLIS_DISABLE_TRMM_RIGHT
This macro, as its name suggests, disables right-side trmm, expressing it instead in terms of left-side trmm. This forces the triangular matrix to always be on the left, and therefore a column-preferential gemm microkernel will always be guaranteed to assume broadcast values to exist in the right-hand matrix.
Like power9, power10 also seems to use explicit broadcasting of B during packing, but it fails to set the above macro. That could cause issues for trmm, and possibly non-deterministic behavior elsewhere.
And hemm/symm too right?
Yes, that's right. hemm/symm also require their own macros (because the Hermitian and symmetric packing formats also would presumably not support explicit broadcast). And trmm3 (for same reasons as trmm).
Also just noticed that power10's subconfig registers preferences for complex gemm ukernels without registering any actual complex ukernels. :thinking:
void bli_cntx_init_power10( cntx_t* cntx )
{
blksz_t blkszs[ BLIS_NUM_BLKSZS ];
// Set default kernel blocksizes and functions.
bli_cntx_init_power10_ref( cntx );
// -------------------------------------------------------------------------
// Update the context with optimized native gemm micro-kernels.
bli_cntx_set_ukrs
(
cntx,
// level-3
BLIS_GEMM_UKR, BLIS_FLOAT, bli_sgemm_power10_mma_8x16,
BLIS_GEMM_UKR, BLIS_DOUBLE, bli_dgemm_power10_mma_8x8,
BLIS_VA_END
);
// Update the context with storage preferences.
bli_cntx_set_ukr_prefs
(
cntx,
// level-3
BLIS_GEMM_UKR_ROW_PREF, BLIS_FLOAT, TRUE,
BLIS_GEMM_UKR_ROW_PREF, BLIS_DOUBLE, TRUE,
BLIS_GEMM_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
BLIS_GEMM_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,
BLIS_TRSM_L_UKR_ROW_PREF, BLIS_FLOAT, FALSE,
BLIS_TRSM_U_UKR_ROW_PREF, BLIS_FLOAT, FALSE,
BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DOUBLE, FALSE,
BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DOUBLE, FALSE,
BLIS_TRSM_L_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
BLIS_TRSM_U_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,
BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,
BLIS_VA_END
);
Could be harmless since 1m method code should always read the preference of the real domain ukernel. But nonetheless I think those extra entries should be nixed unless there's something I missing.
EDIT: It would appear that the gemm1m virtual microkernel is doing The Right Thing:tm: by grabbing the preference of the real domain ukernel, so the above is likely not a correctness bug:
PASTECH(chr,gemm_ukr_ft) \
rgemm_ukr = bli_cntx_get_ukr_dt( dt_r, BLIS_GEMM_UKR, cntx ); \
const bool col_pref = bli_cntx_ukr_prefers_cols_dt( dt_r, BLIS_GEMM_UKR, cntx ); \
Although it's still confusing AF to anyone who isn't ~~me and @devinamatthews~~ a core BLIS developer and therefore should be cleaned up.
Another observation: power10 uses row-preferential ukernels (which use broadcast values of A, not B) but has broadcast-B values defined in config/power10/bli_kernel_defs_power10.h. :thinking:
It's a wonder any of this power10 code works at all.
Okay, I'm going to consult with @nicholaiTukanov, the original author of the power10 subconfig and kernel set, to make sure I understand how things should be registered before I begin suggesting any bugfixes.
@RajalakshmiSR there might be a bug where data is written off the end of the output matrix. Please try the following test program:
It should print:
before: 0.0 + 0.0 0.0 + 0.0 0.0 + 0.0 0.0 + 0.0 after: 1.0 + 0.0 0.0 + 0.0 0.0 + 0.0 0.0 + 0.0Any change in the zero values in the "after" matrix would be a problem.
Yes, I could see the same result.
@RajalakshmiSR Maybe try compiling with
-fstack-protector-strongwhen using GCC > 7. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 which may be the issue. But just a wild guess.
Still the same issue. Tried it like CFLAGS="-fstack-protector-strong" ./configure power9
Yes, I could see the same result.
@RajalakshmiSR strange, that is exactly the test that is supposed to be failing in zblat3. I might write a Fortran version just in case that makes a difference.
Here's a Fortran version. It can't get much more similar to the failing test.
program main
double complex A(2,2), B(2,2), C(2,2)
double complex alpha, beta
A(1,1) = (1.0,0.0)
A(2,1) = (1.0,0.0)
A(1,2) = (1.0,0.0)
A(2,2) = (1.0,0.0)
B(1,1) = (1.0,0.0)
B(2,1) = (1.0,0.0)
B(1,2) = (1.0,0.0)
B(2,2) = (1.0,0.0)
C(1,1) = (0.0,0.0)
C(2,1) = (0.0,0.0)
C(1,2) = (0.0,0.0)
C(2,2) = (0.0,0.0)
alpha = (1.0,0.0)
beta = (0.0,0.0)
WRITE(*,*)'before:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)
CALL ZHEMM('R','U',1,1,alpha,A,2,B,2,beta,C,2)
WRITE(*,*)'after:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)
END PROGRAM MAIN
@ivan23kor I fixed what was probably a power10 bug in 5b29893, although I can't be sure how it would have manifested. If you can, please give this commit a try. (We don't have easy access to any Power hardware.)
Sorry these initial questions weren't answered in a timely manner.
POWER10 build fails because POWER10 microkernels use the old microkernel API without edge handling (addressed in https://github.com/flame/blis/pull/620).
@ivan23kor Thank you for catching that, and submitting the PR fix.
Also, type nibble is defined in the sandbox but used for i4 microkernel so any POWER10 user has to configure BLIS with -s power10. Is it intentional?
For now, yes. That code could be transitioned from being a sandbox to being an addon, but even then you'd have to explicitly request it via configure.
POWER9 uses slow reference implementations for sgemm, cgemm, zgemm by default, are there plans to support a fast sgemm microkernel for POWER9?
Not at this time. We don't have any Power expertise or hardware access in our core development group. The collaborator who wrote the power9 dgemm microkernel has moved on to other adventures, although he still stays involved in our community. Care to comment, @nicholaiTukanov?