Relatively poor performance of ZHEEV (ZLASR)
Some time ago in https://github.com/xianyi/OpenBLAS/issues/1560 it was observed that ZHEEV performs markedly worse than MKL's, and by now I am reasonably certain that OpenBLAS is not at fault. The primary culprit appears to be the ZLASR call, which, interestingly, does not show up at all in perf traces of a corresponding MKL-based testcase.
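For reference, a minimal driver of the kind that exposes this might look roughly as follows. This is only a sketch, not the testcase from the linked issue: the matrix size, the random input and the CPU_TIME-based timing are placeholders. Requesting eigenvectors (JOBZ = 'V') is what sends ZHEEV down the ZSTEQR/ZLASR path.

      PROGRAM time_zheev
      ! Sketch of a reproducer: random Hermitian matrix, full eigendecomposition.
      IMPLICIT NONE
      INTEGER, PARAMETER :: dp = KIND( 0.0D0 )
      INTEGER, PARAMETER :: n = 1000
      COMPLEX( dp ), ALLOCATABLE :: a( :, : ), work( : )
      REAL( dp ), ALLOCATABLE :: w( : ), rwork( : ), xr( :, : ), xi( :, : )
      COMPLEX( dp ) :: wkopt( 1 )
      REAL( dp ) :: t0, t1
      INTEGER :: lwork, info

      ALLOCATE( a( n, n ), w( n ), rwork( 3*n - 2 ), xr( n, n ), xi( n, n ) )
      CALL RANDOM_NUMBER( xr )
      CALL RANDOM_NUMBER( xi )
      a = CMPLX( xr, xi, dp )
      a = ( a + CONJG( TRANSPOSE( a ) ) ) / 2.0_dp     ! make it Hermitian

      ! Workspace query, then eigenvalues + eigenvectors.
      CALL ZHEEV( 'V', 'U', n, a, n, w, wkopt, -1, rwork, info )
      lwork = INT( wkopt( 1 ) )
      ALLOCATE( work( lwork ) )

      CALL CPU_TIME( t0 )
      CALL ZHEEV( 'V', 'U', n, a, n, w, work, lwork, rwork, info )
      CALL CPU_TIME( t1 )
      WRITE( *, * ) 'info =', info, '  time =', t1 - t0, ' s'
      END PROGRAM time_zheev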
Does anybody happen to be working on this topic currently? I know PR #83 added a 2stage implementation (apparently by someone from the Dongarra group) in late 2016, but it does "not yet" implement computation of the eigenvectors, and there have been few functional changes since then. The implementation of ZLASR, on the other hand, appears to predate LAPACK 3.2.1.
ZLASR applies one plane rotation at a time. A common optimization is loop blocking: applying several Givens rotations in a single pass over the matrix.
For example, let's look at the first case realized in ZLASR, pivot == 'V', direct == 'F'. For each rotation j = 1, ..., m-1 the code sweeps over all n columns and computes

    temp        = a( j+1, i )
    a( j+1, i ) = c( j )*temp - s( j )*a( j, i )
    a( j, i )   = s( j )*temp + c( j )*a( j, i )

With loop blocking by a factor of 2, rotations j and j+1 are applied within the same sweep, so rows j, j+1 and j+2 are updated together (the temporaries hold the values before they are overwritten):

    temp1       = a( j+1, i )
    a( j+1, i ) = c( j )*temp1 - s( j )*a( j, i )
    a( j, i )   = s( j )*temp1 + c( j )*a( j, i )
    temp2       = a( j+2, i )
    a( j+2, i ) = c( j+1 )*temp2 - s( j+1 )*a( j+1, i )
    a( j+1, i ) = s( j+1 )*temp2 + c( j+1 )*a( j+1, i )

The product of the two plane rotations and the matrix is equivalent to

    [ a( j+2, i ) ]   [ c(j+1)  -s(j+1)*c(j)   s(j+1)*s(j) ] [ a( j+2, i ) ]
    [ a( j+1, i ) ] = [ s(j+1)   c(j+1)*c(j)  -c(j+1)*s(j) ] [ a( j+1, i ) ]
    [ a( j,   i ) ]   [   0      s(j)           c(j)       ] [ a( j,   i ) ]
The products of the rotation components [for example, in the first row s(j+1)*c(j) and s(j+1)*s(j)] can be computed outside of the update loop and stored in temporary variables. The Fortran code could look like:
      ! Apply the rotations two at a time; rotations j and j+1 are
      ! fused into a single pass over the columns.
      DO 20 j = 1, m - 2, 2
         ! Products of rotation coefficients, hoisted out of the column loop.
         tmp1 = s( j+1 )*c( j )
         tmp2 = s( j+1 )*s( j )
         tmp3 = c( j+1 )*c( j )
         tmp4 = c( j+1 )*s( j )
         DO 10 i = 1, n
            a2 = a( j+2, i )
            a1 = a( j+1, i )
            a0 = a( j, i )
            a( j+2, i ) = c( j+1 )*a2 - tmp1*a1 + tmp2*a0
            a( j+1, i ) = s( j+1 )*a2 + tmp3*a1 - tmp4*a0
            a( j, i )   = s( j )*a1 + c( j )*a0
   10    CONTINUE
   20 CONTINUE
      ! If m-1 is odd, one rotation is left over; apply it on its own.
      IF( MOD( m-1, 2 ).NE.0 ) THEN
         j = m - 1
         DO 30 i = 1, n
            a1 = a( j+1, i )
            a0 = a( j, i )
            a( j+1, i ) = c( j )*a1 - s( j )*a0
            a( j, i )   = s( j )*a1 + c( j )*a0
   30    CONTINUE
      END IF
This saves traversals over the matrix and some flops: each pair of rotations is applied in a single pass over the columns, so row j+1 is loaded and stored only once.
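As a sanity check of the algebra, here is a small self-contained sketch (a standalone program, not the ZLASR replacement itself; names and sizes are arbitrary) that applies the same rotations once one at a time and once in blocked form and prints the maximum difference:

      PROGRAM check_blocked_rotations
      ! Sketch: verify that the blocked update matches the one-rotation-at-a-time
      ! update of the pivot = 'V', direct = 'F', side = 'L' case.
      IMPLICIT NONE
      INTEGER, PARAMETER :: dp = KIND( 0.0D0 )
      INTEGER, PARAMETER :: m = 8, n = 5
      COMPLEX( dp ) :: a( m, n ), aref( m, n ), a0, a1, a2, temp
      REAL( dp ) :: c( m-1 ), s( m-1 ), theta( m-1 ), xr( m, n ), xi( m, n )
      INTEGER :: i, j

      CALL RANDOM_NUMBER( xr )
      CALL RANDOM_NUMBER( xi )
      CALL RANDOM_NUMBER( theta )
      a = CMPLX( xr, xi, dp )
      aref = a
      c = COS( theta )
      s = SIN( theta )

      ! Reference: one plane rotation per pass over the columns (current ZLASR).
      DO j = 1, m - 1
         DO i = 1, n
            temp = aref( j+1, i )
            aref( j+1, i ) = c( j )*temp - s( j )*aref( j, i )
            aref( j, i )   = s( j )*temp + c( j )*aref( j, i )
         END DO
      END DO

      ! Blocked: two rotations fused into one pass over the columns.
      DO j = 1, m - 2, 2
         DO i = 1, n
            a2 = a( j+2, i )
            a1 = a( j+1, i )
            a0 = a( j, i )
            a( j+2, i ) = c( j+1 )*a2 - s( j+1 )*c( j )*a1 + s( j+1 )*s( j )*a0
            a( j+1, i ) = s( j+1 )*a2 + c( j+1 )*c( j )*a1 - c( j+1 )*s( j )*a0
            a( j, i )   = s( j )*a1 + c( j )*a0
         END DO
      END DO
      ! Leftover rotation when m-1 is odd.
      IF( MOD( m-1, 2 ).NE.0 ) THEN
         j = m - 1
         DO i = 1, n
            temp = a( j+1, i )
            a( j+1, i ) = c( j )*temp - s( j )*a( j, i )
            a( j, i )   = s( j )*temp + c( j )*a( j, i )
         END DO
      END IF

      WRITE( *, * ) 'max |blocked - reference| =', MAXVAL( ABS( a - aref ) )
      END PROGRAM check_blocked_rotations

Blocking by 3 or 4 rotations follows the same pattern and reduces the number of passes further, at the cost of a few more precomputed coefficient products.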
Thanks for the detailed explanation - will try to create a replacement zlasr for OpenBLAS when I find the time. (Sorry for the very late response)