Relatively poor performance of ZHEEV (ZLASR)
Some time ago in https://github.com/xianyi/OpenBLAS/issues/1560 it was observed that ZHEEV performs markedly worse than MKL's, and by now I am reasonably certain that OpenBLAS is not at fault. The primary culprit appears to be the ZLASR call, which, interestingly, does not show up at all in perf traces of a corresponding MKL-based testcase.
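For reference, a minimal driver of the kind that exposes this might look roughly as follows. This is only a sketch, not the testcase from the linked issue: the matrix size, the random input and the CPU_TIME-based timing are placeholders. Requesting eigenvectors (JOBZ = 'V') is what sends ZHEEV down the ZSTEQR/ZLASR path.

      PROGRAM time_zheev
      ! Sketch of a reproducer: random Hermitian matrix, full eigendecomposition.
      IMPLICIT NONE
      INTEGER, PARAMETER :: dp = KIND( 0.0D0 )
      INTEGER, PARAMETER :: n = 1000
      COMPLEX( dp ), ALLOCATABLE :: a( :, : ), work( : )
      REAL( dp ), ALLOCATABLE :: w( : ), rwork( : ), xr( :, : ), xi( :, : )
      COMPLEX( dp ) :: wkopt( 1 )
      REAL( dp ) :: t0, t1
      INTEGER :: lwork, info

      ALLOCATE( a( n, n ), w( n ), rwork( 3*n - 2 ), xr( n, n ), xi( n, n ) )
      CALL RANDOM_NUMBER( xr )
      CALL RANDOM_NUMBER( xi )
      a = CMPLX( xr, xi, dp )
      a = ( a + CONJG( TRANSPOSE( a ) ) ) / 2.0_dp     ! make it Hermitian

      ! Workspace query, then eigenvalues + eigenvectors.
      CALL ZHEEV( 'V', 'U', n, a, n, w, wkopt, -1, rwork, info )
      lwork = INT( wkopt( 1 ) )
      ALLOCATE( work( lwork ) )

      CALL CPU_TIME( t0 )
      CALL ZHEEV( 'V', 'U', n, a, n, w, work, lwork, rwork, info )
      CALL CPU_TIME( t1 )
      WRITE( *, * ) 'info =', info, '  time =', t1 - t0, ' s'
      END PROGRAM time_zheev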
Does anybody happen to be working on this topic currently? I know PR #83 added a 2stage implementation (apparently by someone from the Dongarra group) in late 2016, but it does "not yet" implement computation of the eigenvectors, and there have been few functional changes since then. The implementation of ZLASR, on the other hand, appears to predate LAPACK 3.2.1.
ZLASR applies one plane rotation at a time. A common optimization is loop blocking: applying several Givens rotations in a single pass over the matrix.
For example, let's look at the first case realized in ZLASR, pivot == 'V', direct == 'F'. For each rotation j = 1, ..., m-1 the code sweeps over all n columns and computes

    temp        = a( j+1, i )
    a( j+1, i ) = c( j )*temp - s( j )*a( j, i )
    a( j, i )   = s( j )*temp + c( j )*a( j, i )

With loop blocking by a factor of 2, rotations j and j+1 are applied within the same sweep, so rows j, j+1 and j+2 are updated together (the temporaries hold the values before they are overwritten):

    temp1       = a( j+1, i )
    a( j+1, i ) = c( j )*temp1 - s( j )*a( j, i )
    a( j, i )   = s( j )*temp1 + c( j )*a( j, i )
    temp2       = a( j+2, i )
    a( j+2, i ) = c( j+1 )*temp2 - s( j+1 )*a( j+1, i )
    a( j+1, i ) = s( j+1 )*temp2 + c( j+1 )*a( j+1, i )

The product of the two plane rotations and the matrix is equivalent to

    [ a( j+2, i ) ]   [ c(j+1)  -s(j+1)*c(j)   s(j+1)*s(j) ] [ a( j+2, i ) ]
    [ a( j+1, i ) ] = [ s(j+1)   c(j+1)*c(j)  -c(j+1)*s(j) ] [ a( j+1, i ) ]
    [ a( j,   i ) ]   [   0      s(j)           c(j)       ] [ a( j,   i ) ]
The products of the rotation components [for example, in the first row s(j+1)*c(j) and s(j+1)*s(j)] can be computed outside of the update loop and stored in temporary variables. The Fortran code could look like:
      ! Apply the rotations two at a time; rotations j and j+1 are
      ! fused into a single pass over the columns.
      DO 20 j = 1, m - 2, 2
         ! Products of rotation coefficients, hoisted out of the column loop.
         tmp1 = s( j+1 )*c( j )
         tmp2 = s( j+1 )*s( j )
         tmp3 = c( j+1 )*c( j )
         tmp4 = c( j+1 )*s( j )
         DO 10 i = 1, n
            a2 = a( j+2, i )
            a1 = a( j+1, i )
            a0 = a( j, i )
            a( j+2, i ) = c( j+1 )*a2 - tmp1*a1 + tmp2*a0
            a( j+1, i ) = s( j+1 )*a2 + tmp3*a1 - tmp4*a0
            a( j, i )   = s( j )*a1 + c( j )*a0
   10    CONTINUE
   20 CONTINUE
      ! If m-1 is odd, one rotation is left over; apply it on its own.
      IF( MOD( m-1, 2 ).NE.0 ) THEN
         j = m - 1
         DO 30 i = 1, n
            a1 = a( j+1, i )
            a0 = a( j, i )
            a( j+1, i ) = c( j )*a1 - s( j )*a0
            a( j, i )   = s( j )*a1 + c( j )*a0
   30    CONTINUE
      END IF
This saves traversals over the matrix and some flops: each pair of rotations is applied in a single pass over the columns, so row j+1 is loaded and stored only once.
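As a sanity check of the algebra, here is a small self-contained sketch (a standalone program, not the ZLASR replacement itself; names and sizes are arbitrary) that applies the same rotations once one at a time and once in blocked form and prints the maximum difference:

      PROGRAM check_blocked_rotations
      ! Sketch: verify that the blocked update matches the one-rotation-at-a-time
      ! update of the pivot = 'V', direct = 'F', side = 'L' case.
      IMPLICIT NONE
      INTEGER, PARAMETER :: dp = KIND( 0.0D0 )
      INTEGER, PARAMETER :: m = 8, n = 5
      COMPLEX( dp ) :: a( m, n ), aref( m, n ), a0, a1, a2, temp
      REAL( dp ) :: c( m-1 ), s( m-1 ), theta( m-1 ), xr( m, n ), xi( m, n )
      INTEGER :: i, j

      CALL RANDOM_NUMBER( xr )
      CALL RANDOM_NUMBER( xi )
      CALL RANDOM_NUMBER( theta )
      a = CMPLX( xr, xi, dp )
      aref = a
      c = COS( theta )
      s = SIN( theta )

      ! Reference: one plane rotation per pass over the columns (current ZLASR).
      DO j = 1, m - 1
         DO i = 1, n
            temp = aref( j+1, i )
            aref( j+1, i ) = c( j )*temp - s( j )*aref( j, i )
            aref( j, i )   = s( j )*temp + c( j )*aref( j, i )
         END DO
      END DO

      ! Blocked: two rotations fused into one pass over the columns.
      DO j = 1, m - 2, 2
         DO i = 1, n
            a2 = a( j+2, i )
            a1 = a( j+1, i )
            a0 = a( j, i )
            a( j+2, i ) = c( j+1 )*a2 - s( j+1 )*c( j )*a1 + s( j+1 )*s( j )*a0
            a( j+1, i ) = s( j+1 )*a2 + c( j+1 )*c( j )*a1 - c( j+1 )*s( j )*a0
            a( j, i )   = s( j )*a1 + c( j )*a0
         END DO
      END DO
      ! Leftover rotation when m-1 is odd.
      IF( MOD( m-1, 2 ).NE.0 ) THEN
         j = m - 1
         DO i = 1, n
            temp = a( j+1, i )
            a( j+1, i ) = c( j )*temp - s( j )*a( j, i )
            a( j, i )   = s( j )*temp + c( j )*a( j, i )
         END DO
      END IF

      WRITE( *, * ) 'max |blocked - reference| =', MAXVAL( ABS( a - aref ) )
      END PROGRAM check_blocked_rotations

Blocking by 3 or 4 rotations follows the same pattern and reduces the number of passes further, at the cost of a few more precomputed coefficient products.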
Thanks for the detailed explanation - will try to create a replacement zlasr for OpenBLAS when I find the time. (Sorry for the very late response)