blis icon indicating copy to clipboard operation
blis copied to clipboard

Tests fail on POWER machines

Open ivan23kor opened this issue 3 years ago • 32 comments

  • testsuite-run-fast fails on POWER9 and POWER10 with error message: 4440 Segmentation fault ./test_libblis.x -g ./testsuite/input.general.fast -o ./testsuite/input.operations.fast > output.testsuite
  • POWER7 dgemm microkernel includes altivec.h, which defines type bool. Because of that, edge-case handling macros here and here use a conflicting type bool from stdbool. A possible solution is to check for POWER architecture in edge_case_macros and define _bool as int for POWER and bool for other architectures.
  • POWER10 build fails because POWER10 microkernels use the old microkernel API without edge handling (addressed in #620). Also, type nibble is defined in the sandbox but used for i4 microkernel so any POWER10 user has to configure BLIS with -s power10. Is it intentional?

POWER9 uses slow reference implementations for sgemm, cgemm, zgemm by default, are there plans to support a fast sgemm microkernel for POWER9?

ivan23kor avatar Mar 10 '22 15:03 ivan23kor

@ivan23kor can QEMU be used to test POWER 7/8/9/10? If so what are the necessary flags so that we can add Travis CI tests?

devinamatthews avatar Mar 10 '22 15:03 devinamatthews

@devinamatthews unfortunately I don't know about QEMU as I am running on actual hardware.

ivan23kor avatar Mar 18 '22 16:03 ivan23kor

Looks like this is happening on POWER9 after https://github.com/flame/blis/commit/ee9ff988c49f16696679d4c6cd3dcfcac7295be7

This is the stack trace : Program received signal SIGSEGV, Segmentation fault. bli_sgemmtrsmbb_l_power9_ref (k=4, alpha=0x10, a1x=0x64, a11=0x7fffffffccf0, bx1=0x7ffff6ef10c0, b11=0x7ffff6ef1700, c11=0x7fffebd51098, rs_c=140737150003608, cs_c=270180544, data=0x10, cntx=0x1) at ref_kernels/3/bb/bli_gemmtrsmbb_ref.c:138 138 INSERT_GENTFUNC_BASIC3( gemmtrsmbb_l, BLIS_CNAME_INFIX, BLIS_REF_SUFFIX, BLIS_TRSM_L_UKR ) (gdb) bt #0 bli_sgemmtrsmbb_l_power9_ref (k=4, alpha=0x10, a1x=0x64, a11=0x7fffffffccf0, bx1=0x7ffff6ef10c0, b11=0x7ffff6ef1700, c11=0x7fffebd51098, rs_c=140737150003608, cs_c=270180544, data=0x10, cntx=0x1) at ref_kernels/3/bb/bli_gemmtrsmbb_ref.c:138 #1 0x000000001010b2a8 in bli_sgemmtrsm_l_ukernel (m=4, n=16, k=100, alpha=0x7fffffffccf0, a1x=0x7ffff6ef10c0, a11=0x7ffff6ef1700, bx1=0x7fffebd51098, b11=0x7fffebd52998, c11=0x101aa0c0, rs_c=16, cs_c=1, data=0x7fffffffc200, cntx=0x101a08c0) at frame/3/bli_l3_ukr_tapi.c:124 #2 0x000000001003da60 in bli_gemmtrsm_ukernel (alpha=, a1x=, a11=0x7fffffffc5d0, bx1=, b11=, c11=, cntx=0x101a08c0) at frame/3/bli_l3_ukr_oapi.c:182 #3 0x000000001000bbec in libblis_test_gemmtrsm_ukr_impl (iface=, side=, alpha=, a1x=, a11=, bx1=, b11=, c11=, cntx=0x101a08c0) at testsuite/src/test_gemmtrsm_ukr.c:435 #4 0x000000001000c534 in libblis_test_gemmtrsm_ukr_experiment (params=0x7fffffffe130, op=, iface=, dc_str=, pc_str=, sc_str=, p_cur=, perf=0x7fffffffd040, resid=0x7fffffffd038) at testsuite/src/test_gemmtrsm_ukr.c:355 #5 0x0000000010017d94 in libblis_test_op_driver (tdata=0x101a3e00, params=0x7fffffffe130, op=, iface=, op_str=0x1014a810 "gemmtrsm_ukr", p_types=, o_types=, thresh=0x10190638 , f_exp=0x1000c0d0 <libblis_test_gemmtrsm_ukr_experiment>) at testsuite/src/test_libblis.c:2200 #6 0x000000001000bb7c in libblis_test_gemmtrsm_ukr (tdata=0x101a3e00, params=0x7fffffffe130, op=0x7fffffffde30) at testsuite/src/test_gemmtrsm_ukr.c:151 #7 0x0000000010011b9c in libblis_test_level3_ukrs (tdata=0x101a3e00, params=0x7fffffffe130, ops=0x7fffffffd3d0) at testsuite/src/test_libblis.c:311 #8 0x0000000010011d94 in libblis_test_all_ops (tdata=0x101a3e00, params=0x7fffffffe130, ops=0x7fffffffd3d0) at testsuite/src/test_libblis.c:230 #9 0x0000000010011e00 in libblis_test_thread_entry (tdata_void=) at testsuite/src/test_libblis.c:115 #10 0x0000000010011f44 in libblis_test_thread_decorator (params=0x7fffffffe130, ops=0x7fffffffd3d0) at testsuite/src/test_libblis.c:176 #11 0x00000000100019e8 in main (argc=, argv=0x7fffffffe5d8) at testsuite/src/test_libblis.c:84

So there is no special TRSM kernel in the power9 directory. I could see only specialized SGEMM and DGEMM kernel for POWER9.

RajalakshmiSR avatar Mar 24 '22 13:03 RajalakshmiSR

Looks like this is happening on POWER9 after https://github.com/flame/blis/commit/ee9ff988c49f16696679d4c6cd3dcfcac7295be7

Having done a git bisect on a POWER9 system I can confirm this too.

Further investigation points at missing modification of ref_kernels/3/bb/bli_gemmtrsmbb_ref.c (parameters m&n are not introduced)

After changing that file similar to https://github.com/flame/blis/commit/ee9ff988c49f16696679d4c6cd3dcfcac7295be7#diff-512c6a50b6244efae7b93e599fb1b295377718d298cac465e3f83db3718c4b03 I see many failures of the kind

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_cgemm1m_nn_ccc                100   100   100     1.34   1.97e-08   PASS
blis_cgemm_nn_ccc                  100   100   100     3.64   1.09e-08   PASS

% blis_<dt><op>_<params>_<stor>      m     n     k   gflops   resid      result
blis_zgemm1m_nn_ccc                100   100   100    17.96   3.82e-03   FAILURE
blis_zgemm_nn_ccc                  100   100   100     4.16   2.25e-17   PASS

I.e. Only the "z" datatype on the "*1m" is affected.

Using the "generic" configuration works.

Flamefire avatar Jul 21 '22 12:07 Flamefire

@fgvanzee the issue seems to be missing edge case handling in gemmtrsmbb_ref. Since you fixed the other trsm/gemmtrsm kernels can you fix this?

devinamatthews avatar Jul 21 '22 15:07 devinamatthews

I'll take a look at it.

fgvanzee avatar Jul 21 '22 15:07 fgvanzee

Uhh, am I missing something, @devinamatthews?

$ ls ref_kernels/3
bli_gemm_ref.c  bli_gemmsup_ref.c  bli_gemmtrsm_ref.c  bli_trsm_ref.c  old

It's not clear what I need to fix.

fgvanzee avatar Jul 21 '22 23:07 fgvanzee

The bb kernels live in some strangebplace. On Jul 21, 2022, at 6:08 PM, "Field G. Van Zee" @.@.>> wrote:

[EXTERNAL SENDER]

Uhh, am I missing something, @devinamatthewshttps://github.com/devinamatthews?

$ ls ref_kernels/3 bli_gemm_ref.c bli_gemmsup_ref.c bli_gemmtrsm_ref.c bli_trsm_ref.c old

— Reply to this email directly, view it on GitHubhttps://github.com/flame/blis/issues/621#issuecomment-1192016426, or unsubscribehttps://github.com/notifications/unsubscribe-auth/ABIAZIODBD5UQXDEIJ52MHDVVHJ6HANCNFSM5QM5ICKQ. You are receiving this because you were mentioned.Message ID: @.***>

devinamatthews avatar Jul 21 '22 23:07 devinamatthews

The bb kernels live in some strangebplace.

Oh wait, didn't we fold them into the conventional reference microkernels?

fgvanzee avatar Jul 21 '22 23:07 fgvanzee

@devinamatthews Ah, I think I see the problem now. It's not that the gemmtrsm ukernels lack edge-case handling; it's that they haven't been updated according to the latest way that the bb functionality has been folded in. (By contrast, it appears that the gemm ukernel has been updated. So I'll use that as my guide.)

I'll try to work on this more tomorrow.

fgvanzee avatar Jul 22 '22 00:07 fgvanzee

@fgvanzee gemmtrsm reference kernels has been updated:

	PASTEMAC(ch,bcastbbs_mxn) \
	( \
	  m, \
	  n, \
	  b11, rs_b, cs_b  \
	); \

etc. @Flamefire did you test the tip of the master branch (currently af3a41e)?

devinamatthews avatar Jul 22 '22 14:07 devinamatthews

@devinamatthews Ah, yes, I had previously overlooked this:

    const inc_t rs_b   = packnr; \
    const inc_t cs_b   = bli_cntx_get_blksz_def_dt( dt, BLIS_BBN, cntx ); \

fgvanzee avatar Jul 22 '22 14:07 fgvanzee

I successfully tested af3a41e02534befdae026377592ce437bab83023 on Power9LE

Do you know which commit fixed that?

Flamefire avatar Jul 25 '22 07:07 Flamefire

The Power9/10 kernels use a "broadcast packing" format, which messed up a lot of the older code, or rather, required some bespoke code which interacted poorly with [cz]gemm1m. I rewrote all of the reference packing kernels in ae10d9495486f589ed0320f0151b2d195574f1cf in part to streamline this issue (note that that commit has a bug which was later fixed in 667f201b7871da68622027d02bd6b7da3262f8e8.

I guess we can close the issue then?

devinamatthews avatar Jul 25 '22 15:07 devinamatthews

Yes, thanks!

Flamefire avatar Jul 26 '22 07:07 Flamefire

Thanks for fixing this. make check now shows only these failures on POWER9. Filename: out.zblat3

******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY ******* ******* ZHEMM FAILED ON CALL NUMBER: 274: ZHEMM ('R','U', 1, 1,( 1.0, .0), A, 2, B, 2,( .0, .0), C, 2) .

ZSYMM PASSED THE TESTS OF ERROR-EXITS

******* FATAL ERROR - PARAMETER NUMBER 11 WAS CHANGED INCORRECTLY ******* ******* ZSYMM FAILED ON CALL NUMBER: 274: ZSYMM ('R','U', 1, 1,( 1.0, .0), A, 2, B, 2,( .0, .0), C, 2) .

RajalakshmiSR avatar Jul 26 '22 15:07 RajalakshmiSR

@RajalakshmiSR Maybe try compiling with -fstack-protector-strong when using GCC > 7. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 which may be the issue. But just a wild guess.

Flamefire avatar Jul 26 '22 15:07 Flamefire

@RajalakshmiSR there might be a bug where data is written off the end of the output matrix. Please try the following test program:

#include "blis.h"
#include <stdio.h>
#include <string.h>

int main(int argc, char** argv)
{
    obj_t A, B, C, a, b, c;

    bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &A );
    bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &B );
    bli_obj_create( BLIS_DCOMPLEX, 2, 2, 1, 2, &C );

    bli_setm( &BLIS_ONE, &A );
    bli_setm( &BLIS_ONE, &B );
    bli_setm( &BLIS_ZERO, &C );

    bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &A ), 1, 2, &a );
    bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &B ), 1, 2, &b );
    bli_obj_create_with_attached_buffer( BLIS_DCOMPLEX, 1, 1, bli_obj_buffer( &C ), 1, 2, &c );
    bli_obj_set_struc( BLIS_HERMITIAN, &a );
    bli_obj_set_uplo( BLIS_UPPER, &a );

    bli_printm( "before:", &C, "%4.1f", "" );
    
    bli_hemm( BLIS_RIGHT, &BLIS_ONE, &a, &b, &BLIS_ZERO, &c );
    
    bli_printm( "after:", &C, "%4.1f", "" );

    return 0;
}

It should print:

before:
 0.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

after:
 1.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

Any change in the zero values in the "after" matrix would be a problem.

devinamatthews avatar Jul 26 '22 21:07 devinamatthews

Not sure if this is relevant, but I just realized this morning that unless @devinamatthews's recent changes (vis-a-vis merging the bb packm functionality into a single reference kernel) extends bb support to triangular packing, any bb-utilizing subconfiguration needs to #define BLIS_DISABLE_TRMM_RIGHT, as is done for power9:

$ grep -R BLIS_DISABLE_TRMM_RIGHT config
./config/old/haswellbb/bli_family_haswell.h:#define BLIS_DISABLE_TRMM_RIGHT
./config/power9/bli_family_power9.h:#define BLIS_DISABLE_TRMM_RIGHT

This macro, as its name suggests, disables right-side trmm, expressing it instead in terms of left-side trmm. This forces the triangular matrix to always be on the left, and therefore a column-preferential gemm microkernel will always be guaranteed to assume broadcast values to exist in the right-hand matrix.

Like power9, power10 also seems to use explicit broadcasting of B during packing, but it fails to set the above macro. That could cause issues for trmm, and possibly non-deterministic behavior elsewhere.

fgvanzee avatar Jul 27 '22 15:07 fgvanzee

And hemm/symm too right?

devinamatthews avatar Jul 27 '22 15:07 devinamatthews

Yes, that's right. hemm/symm also require their own macros (because the Hermitian and symmetric packing formats also would presumably not support explicit broadcast). And trmm3 (for same reasons as trmm).

fgvanzee avatar Jul 27 '22 15:07 fgvanzee

Also just noticed that power10's subconfig registers preferences for complex gemm ukernels without registering any actual complex ukernels. :thinking:

void bli_cntx_init_power10( cntx_t* cntx )
{
    blksz_t blkszs[ BLIS_NUM_BLKSZS ];

    // Set default kernel blocksizes and functions.
    bli_cntx_init_power10_ref( cntx );

    // -------------------------------------------------------------------------

    // Update the context with optimized native gemm micro-kernels.
    bli_cntx_set_ukrs
    (
      cntx,

      // level-3
      BLIS_GEMM_UKR, BLIS_FLOAT,  bli_sgemm_power10_mma_8x16,
      BLIS_GEMM_UKR, BLIS_DOUBLE, bli_dgemm_power10_mma_8x8,

      BLIS_VA_END
    );

    // Update the context with storage preferences.
    bli_cntx_set_ukr_prefs
    (
      cntx,

      // level-3
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_FLOAT,    TRUE,
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_DOUBLE,   TRUE,
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_SCOMPLEX, FALSE,
      BLIS_GEMM_UKR_ROW_PREF,   BLIS_DCOMPLEX, FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_FLOAT,    FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_FLOAT,    FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DOUBLE,   FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DOUBLE,   FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_SCOMPLEX, FALSE,
      BLIS_TRSM_L_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,
      BLIS_TRSM_U_UKR_ROW_PREF, BLIS_DCOMPLEX, FALSE,

      BLIS_VA_END
    );

Could be harmless since 1m method code should always read the preference of the real domain ukernel. But nonetheless I think those extra entries should be nixed unless there's something I missing.

EDIT: It would appear that the gemm1m virtual microkernel is doing The Right Thing:tm: by grabbing the preference of the real domain ukernel, so the above is likely not a correctness bug:

    PASTECH(chr,gemm_ukr_ft) \
                      rgemm_ukr = bli_cntx_get_ukr_dt( dt_r, BLIS_GEMM_UKR, cntx ); \
    const bool        col_pref  = bli_cntx_ukr_prefers_cols_dt( dt_r, BLIS_GEMM_UKR, cntx ); \

Although it's still confusing AF to anyone who isn't ~~me and @devinamatthews~~ a core BLIS developer and therefore should be cleaned up.

fgvanzee avatar Jul 27 '22 15:07 fgvanzee

Another observation: power10 uses row-preferential ukernels (which use broadcast values of A, not B) but has broadcast-B values defined in config/power10/bli_kernel_defs_power10.h. :thinking:

It's a wonder any of this power10 code works at all.

fgvanzee avatar Jul 27 '22 15:07 fgvanzee

Okay, I'm going to consult with @nicholaiTukanov, the original author of the power10 subconfig and kernel set, to make sure I understand how things should be registered before I begin suggesting any bugfixes.

fgvanzee avatar Jul 27 '22 16:07 fgvanzee

@RajalakshmiSR there might be a bug where data is written off the end of the output matrix. Please try the following test program:

It should print:

before:
 0.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

after:
 1.0 +  0.0   0.0 +  0.0  
 0.0 +  0.0   0.0 +  0.0  

Any change in the zero values in the "after" matrix would be a problem.

Yes, I could see the same result.

RajalakshmiSR avatar Jul 27 '22 21:07 RajalakshmiSR

@RajalakshmiSR Maybe try compiling with -fstack-protector-strong when using GCC > 7. See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100799 which may be the issue. But just a wild guess.

Still the same issue. Tried it like CFLAGS="-fstack-protector-strong" ./configure power9

RajalakshmiSR avatar Jul 27 '22 21:07 RajalakshmiSR

Yes, I could see the same result.

@RajalakshmiSR strange, that is exactly the test that is supposed to be failing in zblat3. I might write a Fortran version just in case that makes a difference.

devinamatthews avatar Jul 27 '22 23:07 devinamatthews

Here's a Fortran version. It can't get much more similar to the failing test.

program main

double complex A(2,2), B(2,2), C(2,2)
double complex alpha, beta

A(1,1) = (1.0,0.0)
A(2,1) = (1.0,0.0)
A(1,2) = (1.0,0.0)
A(2,2) = (1.0,0.0)

B(1,1) = (1.0,0.0)
B(2,1) = (1.0,0.0)
B(1,2) = (1.0,0.0)
B(2,2) = (1.0,0.0)

C(1,1) = (0.0,0.0)
C(2,1) = (0.0,0.0)
C(1,2) = (0.0,0.0)
C(2,2) = (0.0,0.0)

alpha = (1.0,0.0)
beta = (0.0,0.0)

WRITE(*,*)'before:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)

CALL ZHEMM('R','U',1,1,alpha,A,2,B,2,beta,C,2)

WRITE(*,*)'after:'
WRITE(*,'(2(A,F4.1,A,F4.1,A,X))')(('(',REAL(C(I,J)),',',AIMAG(C(I,J)),')',J=1,2),I=1,2)
WRITE(*,*)

END PROGRAM MAIN

devinamatthews avatar Jul 27 '22 23:07 devinamatthews

@ivan23kor I fixed what was probably a power10 bug in 5b29893, although I can't be sure how it would have manifested. If you can, please give this commit a try. (We don't have easy access to any Power hardware.)

fgvanzee avatar Jul 28 '22 00:07 fgvanzee

Sorry these initial questions weren't answered in a timely manner.

POWER10 build fails because POWER10 microkernels use the old microkernel API without edge handling (addressed in https://github.com/flame/blis/pull/620).

@ivan23kor Thank you for catching that, and submitting the PR fix.

Also, type nibble is defined in the sandbox but used for i4 microkernel so any POWER10 user has to configure BLIS with -s power10. Is it intentional?

For now, yes. That code could be transitioned from being a sandbox to being an addon, but even then you'd have to explicitly request it via configure.

POWER9 uses slow reference implementations for sgemm, cgemm, zgemm by default, are there plans to support a fast sgemm microkernel for POWER9?

Not at this time. We don't have any Power expertise or hardware access in our core development group. The collaborator who wrote the power9 dgemm microkernel has moved on to other adventures, although he still stays involved in our community. Care to comment, @nicholaiTukanov?

fgvanzee avatar Jul 29 '22 17:07 fgvanzee