
HWCAP_CPUID should not be used to detect CPU uarch on aarch64

Open yuyichao opened this issue 5 years ago • 19 comments

Ref https://github.com/xianyi/OpenBLAS/issues/2696#issuecomment-652762539 since the issue was closed.

This is because with big.LITTLE and similar configurations the CPUID result for MIDR_EL1 is unstable: it depends on which core the code happens to be running on. You can easily get the wrong (or at least unwanted) result if you happen to run on a little core.

The correct way is to read the /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 file.

This should replace https://github.com/xianyi/OpenBLAS/blob/a9aeb6745cb5bb044bea4a6f078458d749158fa6/driver/others/dynamic_arm64.c#L150 and should also be preferred over the /proc/cpuinfo reader in https://github.com/xianyi/OpenBLAS/blob/a9aeb6745cb5bb044bea4a6f078458d749158fa6/cpuid_arm64.c#L123
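As an illustration, here is a minimal sketch of reading that sysfs file for a single core. This is not the actual OpenBLAS code; the field offsets are simply the standard MIDR_EL1 layout (implementer in bits [31:24], part number in bits [15:4]), and the file only exists on Linux >= 4.7, so a fallback is still needed:

```c
/* Minimal sketch (not the actual OpenBLAS code) of the suggested approach:
 * read MIDR_EL1 for a given core from sysfs instead of executing mrs on
 * whatever core the detection thread happens to be scheduled on. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Returns 0 and fills *midr on success, -1 if the file is missing/unreadable. */
static int read_midr_el1(int cpu, uint64_t *midr)
{
    char path[128];
    snprintf(path, sizeof(path),
             "/sys/devices/system/cpu/cpu%d/regs/identification/midr_el1", cpu);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%" SCNx64, midr) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    uint64_t midr;
    if (read_midr_el1(0, &midr) != 0) {
        fprintf(stderr, "midr_el1 not available (pre-4.7 kernel?)\n");
        return 1;
    }
    /* MIDR_EL1 layout: bits [31:24] implementer, bits [15:4] part number. */
    printf("cpu0: implementer 0x%02x, part 0x%03x\n",
           (unsigned)((midr >> 24) & 0xff),
           (unsigned)((midr >> 4) & 0xfff));
    return 0;
}
```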

yuyichao avatar Jul 11 '20 14:07 yuyichao

Closed but not forgotten - I am unsure how to handle this at build time; perhaps it is best to do a DYNAMIC_ARCH build for at least the two core types when a big.LITTLE configuration is encountered? (Currently it can happen that a simple make builds for the weaker core, but a subsequent make lapack-test runs on a big core that got woken up by the compiler workload and goes looking for the wrong library name to link to.) In the dynamic_arch runtime situation you mention, it is not clear to me how the current solution is wrong, as I assume we'd want to identify the CPU we are currently running on rather than whatever else is hypothetically available? If the task gets rescheduled between big and little cores, I guess we'd need to find a way to catch this and ask anew whenever it happens (?)

martin-frbg avatar Jul 11 '20 15:07 martin-frbg

Closed but not forgotten

That issue isn't about this anyway, and I'm not trying to say it shouldn't have been closed. I just want to make sure there is one open issue tracking this.

I am unsure how to handle this at build time, perhaps best to do a DYNAMIC_ARCH build for at least the two types when a big.little configuration is encountered ?

I don't think there's any difference between build time and runtime here. I would either go for a build that's optimized for both big and LITTLE, or just one that's optimized for the big core.

In the dynamic_arch runtime situation you mention, it is not clear to me how the current solution is wrong, as I assume we'd want to identify the cpu we are currently running on rather than whatever else is hypothetically available as well ? If the task gets rescheduled between big and little cores I guess we'd need to find a way to catch this and ask anew whenever it happens (?)

AFAICT the moving between cores isn't something userspace should care about. Running on a LITTLE core means the OS doesn't think the process is doing much work, and running on a big core means the process is probably doing a lot of work. That's why I think it makes sense to simply optimize for the big core; performance matters much less when running on a LITTLE core. Now, with enough workload I assume both the big and the LITTLE cores are going to be used, so there's probably a slightly different target to optimize for (GCC seems to have some target like that, but I'm not sure there's enough gain to be worth the effort). However, I still think the big-core performance is more important in those cases, and my faith is that the paired cores are going to be "compatible" enough that code optimized for the big core isn't going to be that bad on the LITTLE one.

Also, AFAICT, the detection is done on only one thread, so the result is going to be wrong for some threads anyway if the decision is made based only on the current core. (And HWCAP_CPUID uses kernel emulation, so I don't think doing detection on worker threads every time is an option...)

yuyichao avatar Jul 11 '20 16:07 yuyichao

Ok, I was unsure whether code would simply crash when optimized for big and running on little, but that would certainly be an unhealthy platform then. OTOH it seems the /sys/devices path depends on kernel version or something - I do not have the path you suggested on a 3.10 kernel (termux on an Android 8 device).

martin-frbg avatar Jul 11 '20 16:07 martin-frbg

simply crash when optimized for big and running on little

The crash can only be caused by a feature mismatch, which should be detected with HWCAP_* instead. Even accessing the hardware feature registers directly should be fine, since I think the kernel normalizes them (not 100% sure).
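For reference, a minimal sketch of HWCAP-based feature detection via the auxiliary vector (aarch64 Linux only; the bit names come from <asm/hwcap.h>). As far as I understand, these bits describe capabilities common to all cores, so they are safe to query from any core:

```c
/* Minimal sketch: feature-level detection via the auxiliary vector.
 * Unlike reading MIDR_EL1, these bits are system-wide rather than per-core. */
#include <sys/auxv.h>
#include <asm/hwcap.h>
#include <stdio.h>

int main(void)
{
    unsigned long hwcap = getauxval(AT_HWCAP);

    /* HWCAP_ASIMD: Advanced SIMD (NEON); HWCAP_ATOMICS: ARMv8.1 LSE atomics;
     * HWCAP_CPUID: kernel emulation of mrs reads of the ID registers. */
    printf("asimd:   %s\n", (hwcap & HWCAP_ASIMD)   ? "yes" : "no");
    printf("atomics: %s\n", (hwcap & HWCAP_ATOMICS) ? "yes" : "no");
    printf("cpuid:   %s\n", (hwcap & HWCAP_CPUID)   ? "yes" : "no");
    return 0;
}
```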

Yes, that is a 4.7 kernel feature, and it was added for exactly this reason. If the file doesn't exist, the alternative should be parsing /proc/cpuinfo, which also contains info for all cores.
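A minimal sketch of such a /proc/cpuinfo fallback (not the parser OpenBLAS actually uses) could look like the following; note that, as mentioned further down in the thread, some older kernels print only a single "CPU part" entry:

```c
/* Minimal sketch of a /proc/cpuinfo fallback: print every "CPU part" value
 * that appears. The format is aimed at humans and has changed over kernel
 * versions, so this is strictly a fallback for kernels without the sysfs file. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    if (!f) {
        perror("/proc/cpuinfo");
        return 1;
    }

    char line[256];
    int core = 0;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, "CPU part", 8) != 0)
            continue;
        char *colon = strchr(line, ':');
        if (!colon)
            continue;
        /* Value is printed as hex, e.g. "0xd08"; base 0 handles the prefix. */
        unsigned long part = strtoul(colon + 1, NULL, 0);
        printf("core %d: CPU part 0x%03lx\n", core++, part);
    }
    fclose(f);
    return 0;
}
```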

yuyichao avatar Jul 11 '20 16:07 yuyichao

On 3.10, it enumerates the cores but prints out only one (variable) "CPU part" though.

martin-frbg avatar Jul 11 '20 17:07 martin-frbg

If the kernel doesn't provide anything better, then that'll have to be what you work with, of course... It's still better than the cpuid instruction since it should be more stable.

Also note that at least for mainline kernel, HWCAP_CPUID is a 4.11 feature.

yuyichao avatar Jul 11 '20 17:07 yuyichao

Ah, ok. Iterating over all cores to determine the most capable type sounds like a bit of a PITA though.

martin-frbg avatar Jul 11 '20 17:07 martin-frbg

REF: https://www.kernel.org/doc/Documentation/arm64/cpu-feature-registers.txt Mainline kernel 4.11 is only exceeded in Android 10; it will take 5-10 years for older kernels to exit mainstream usage.

brada4 avatar Jul 11 '20 18:07 brada4

Iterating over all cores to determine the most capable type sounds a bit PITA though.

That's the only way to get a stable result.
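As an illustration, a stable enumeration could loop over the per-core sysfs files and collect the distinct part numbers; which of them counts as "most capable" is a separate policy decision. This is a sketch, not OpenBLAS code:

```c
/* Minimal sketch: enumerate every core's MIDR_EL1 via sysfs so the result
 * does not depend on which core the detection thread happens to run on. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_CONF);   /* all configured cores */
    unsigned parts[16];
    int nparts = 0;

    for (long cpu = 0; cpu < ncpu; cpu++) {
        char path[128];
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%ld/regs/identification/midr_el1",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            continue;                            /* offline core or old kernel */

        uint64_t midr;
        if (fscanf(f, "%" SCNx64, &midr) == 1) {
            unsigned part = (midr >> 4) & 0xfff; /* MIDR_EL1 part number field */
            int seen = 0;
            for (int i = 0; i < nparts; i++)
                if (parts[i] == part)
                    seen = 1;
            if (!seen && nparts < 16)
                parts[nparts++] = part;          /* record each distinct core type */
        }
        fclose(f);
    }

    for (int i = 0; i < nparts; i++)
        printf("core type %d: CPU part 0x%03x\n", i, parts[i]);
    return 0;
}
```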

The mainline kernel 4.11 is exceeded only in Android 10, it will take 5-10 years for older kernels to exit mainstream usage.

Well, that's unrelated to this issue. FWIW, what I'm asking for here is to lower the kernel version requirement. The feature that is intended for this is in 4.7, and the fallback of /proc/cpuinfo is there for all versions.

yuyichao avatar Jul 11 '20 18:07 yuyichao

I wonder if this should be left to the developer to set instead of trying to automate it under the hood. Since we can already set thread affinity for BLAS threads, maybe the approach is to do this only for the relevant subset (little/big), where what counts as relevant would be the developer's choice based on the application.

So, the high-level workflow would be to determine which cores you'd want OpenBLAS to run on, set the affinity of the BLAS threads to those cores, and then pick the core type accordingly.
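A minimal sketch of the affinity part of that workflow, using plain sched_setaffinity; the core numbers here are hypothetical (it assumes cores 4-7 are the big cluster on the target machine):

```c
/* Minimal sketch: pin the calling (e.g. BLAS worker) thread to a chosen set
 * of cores, so the detected core type and the cores actually used agree.
 * Requires _GNU_SOURCE for CPU_SET/sched_setaffinity on glibc. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);

    /* Hypothetical: cores 4-7 are the big cluster on this machine. */
    for (int cpu = 4; cpu <= 7; cpu++)
        CPU_SET(cpu, &set);

    /* pid 0 = apply to the calling thread */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to cores 4-7\n");
    return 0;
}
```

One could then select the matching kernel set explicitly (for example via the OPENBLAS_CORETYPE override mentioned later in the thread) instead of relying on whichever core the detection happens to run on.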

It looks like https://github.com/google/cpu_features works mainly via /proc/cpuinfo (based on the README). @yuyichao is there a reason you prefer the approach via /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 ?

ognjentodic avatar Oct 21 '20 18:10 ognjentodic

I wonder if this should be left to the developer to set instead of trying to automate it under the hood

Allowing it via something better than an environment variable is good, but that must not replace autodetection. Most users will not have better knowledge than autodetection of which CPU architectures to feed into the library (I do, but that's the exception).

is there a reason you prefer the approach via /sys/devices/system/cpu/cpu<n>/regs/identification/midr_el1 ?

Yes, absolutely. That is the whole reason the sysfs file was added: /proc/cpuinfo was never meant to be machine-parsed; it uses a human-readable format that has changed over the years. It's much simpler and more reliable to use the newer interface. A fallback is certainly still OK.

yuyichao avatar Oct 21 '20 21:10 yuyichao

I'm not entirely sure if it is relevant to this discussion, but: it appears that the current code (after #2952) causes SIGILL on NVIDIA Jetson devices, because HWCAP_CPUID seems to have the checked bit set but get_cpu_ftr triggers SIGILL nonetheless. This is observed e.g. with the Debian testing/unstable packages of openblas. I'd be happy to investigate more, but from what I can see the /sys method would work.

t-vi avatar Apr 15 '21 08:04 t-vi

That is/was supposed to be fixed with #3060 but Debian has probably not updated beyond 0.3.13 yet.

martin-frbg avatar Apr 15 '21 09:04 martin-frbg

@martin-frbg Thank you for the pointer. When I try to look at their 0.3.14 version, it seems to be missing that patch, too?! I'll file a bug with Debian.

t-vi avatar Apr 15 '21 09:04 t-vi

That fix is definitely in 0.3.14 (which should be superseded by 0.3.15 "shortly"), and according to the numpy ticket linked from #3060 it solved the problem on Jetson (I do not have Jetson hardware). As a workaround you can set OPENBLAS_CORETYPE=ARMV8 to bypass the runtime autodetection mechanism.

martin-frbg avatar Apr 15 '21 09:04 martin-frbg

Yes, thank you. It'd be a shame if the upcoming Debian release would be missing the fix. I filed it as Debian Bug 986996 on their system.

t-vi avatar Apr 15 '21 10:04 t-vi

My fault - fix overwritten by a subsequent PR from a different branch. Restoring...

martin-frbg avatar Apr 16 '21 08:04 martin-frbg

I'm glad I didn't just waste your time on something that was completely and obviously fixed. Thank you again for your work on OpenBLAS and your help with this.

t-vi avatar Apr 16 '21 08:04 t-vi

Thanks for alerting me to my blunder. Pity this could not show up in the CI runs.

martin-frbg avatar Apr 16 '21 09:04 martin-frbg