lapack icon indicating copy to clipboard operation
lapack copied to clipboard

Relatively poor performance of ZHEEV (ZLASR)

Open martin-frbg opened this issue 3 years ago • 1 comments

Some time ago in https://github.com/xianyi/OpenBLAS/issues/1560 it was observed that the performance of ZHEEV is markedly poorer than in MKL, and by now I am reasonably certain that OpenBLAS is not at fault. The primary problem appears to be the slowness of the ZLASR call, which interestingly does not appear to feature in perf traces of a corresponding MKL-based testcase at all.

Does anybody happen to work on this topic currently ? I know PR #83 added a 2stage implementation (apparently by someone from the Dongarra group) in late 2016, but this does "not yet" implement computation of the eigenvectors and there have been little functional changes since then. The implementation of ZLASR on the other hand appears to predate LAPACK 3.2.1.

martin-frbg avatar Aug 14 '22 16:08 martin-frbg

ZLASR applies one plane rotation at a time. A common technique is to apply loop blocking and apply several Givens rotations simultaneously.

For example, let's look at the first case realized in ZLASR, pivot == 'V', direct == 'F'. The code computes image

With loop blocking by a factor of 2, this becomes image

The product of the two plane rotations and the matrix is equivalent to image

The products of the components of the plane rotations [for example, in the first row s(j+1)c(j) and s(j+1)s(j)] can be computed outside of the update loop and stored in a temporary variable. A Fortran code could look like

               DO 20 j = 1, m - 1, 2
                   ! Assume m-1 is even
                   tmp1 = s(j+1)*c(j)
                   tmp2 = s(j+1)*s(j)
                   tmp3 = c(j+1)*c(j)
                   tmp4 = c(j+1)*s(j)
                   stemp = s( j )
                      DO 10 i = 1, n
                         a2 = a( j+2, i )
                         a1 = a( j+1, i )
                         a0 = a( j, i )
                         a( j+2, i ) = c(j+1)*a2 - tmp1*a1 + tmp2*a0
                         a( j+1, i ) = s(j+1)*a2 + tmp3*a1 - tmp4*a0
                         a( j, i ) = s(j)*a1 + c(j)*a0
    10                CONTINUE
    20       CONTINUE

This saves traversals over the matrix and flops.

angsch avatar Aug 14 '22 18:08 angsch

Thanks for the detailed explanation - will try to create a replacement zlasr for OpenBLAS when I find the time. (Sorry for the very late response)

martin-frbg avatar Oct 07 '22 08:10 martin-frbg