ComplexF32 eigen can return `NaN` unexpectedly
Describe the bug
There seems to be a stability issue with `eigen` on ComplexF32 matrices on the GPU. Occasionally it returns NaN, which is inconsistent with the CPU decomposition.
To reproduce
The Minimal Working Example (MWE) for this bug:
using CUDA, HDF5, LinearAlgebra
fid = h5open("broken_eigen.h5", "r")
m = read(fid, "matrix")
m = Hermitian(m)
cm = Hermitian(cu(m))
D, V = eigen(m)
cuD, cuV = eigen(cm)
close(fid)
broken_eigen.h5.txt (Please note that this is an .h5 file. I saved it with a .txt extension because the upload would not accept it otherwise; just remove the .txt extension.)
Manifest.toml
Status `~/.julia/environments/v1.9/Project.toml`
[052768ef] CUDA v5.1.1
[34da2185] Compat v4.10.0
[f67ccb44] HDF5 v0.17.1
[33e6dc65] MKL v0.6.1
Expected behavior
I would expect cuD and cuV to be the eigenvalues and eigenvectors of the CuMatrix cm, whose values lie in the range [-4.6161222f-8, 0.8686561f0], with a minimum absolute value of 1.3966348f-25.
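To make "expected behavior" concrete, here is a minimal sketch of a check that could be run on any decomposition. `check_eigen` is a hypothetical helper written for this report, not part of CUDA.jl or LinearAlgebra, and the 2×2 matrix is a synthetic stand-in for the attached data, so it runs on the CPU only.

```julia
using LinearAlgebra

# Hypothetical helper: reports whether an eigendecomposition contains NaNs
# and how well it satisfies A * V ≈ V * Diagonal(D).
function check_eigen(A::AbstractMatrix, D::AbstractVector, V::AbstractMatrix)
    has_nan = any(isnan, D) || any(isnan, V)
    # Relative residual; a NaN in D or V propagates and makes this NaN too.
    residual = norm(A * V - V * Diagonal(D)) / max(norm(A), eps(real(eltype(A))))
    return (; has_nan, residual)
end

# Small synthetic Hermitian ComplexF32 matrix (assumption: stands in for the real input).
A = Hermitian(ComplexF32[2 1im; -1im 3])
D, V = eigen(A)
report = check_eigen(A, D, V)
println(report)
```

For the GPU result from the MWE, something like `check_eigen(m, Array(cuD), Array(cuV))` should flag the failing case via `has_nan`, assuming the outputs are copied back to the host first.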
Version info
Details on Julia:
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 32 × Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
Threads: 1 on 32 virtual cores
Details on CUDA:
CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 535.113.1, originally for CUDA 12.2
CUDA libraries:
- CUBLAS: 12.3.2
- CURAND: 10.3.4
- CUFFT: 11.0.11
- CUSOLVER: 11.5.3
- CUSPARSE: 12.1.3
- CUPTI: 21.0.0
- NVML: 12.0.0+535.113.1
Julia packages:
- CUDA: 5.1.0
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.0+1
Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6
1 device:
0: NVIDIA RTX A6000 (sm_86, 45.964 GiB / 47.988 GiB available)
I can reproduce, but I'm not familiar with eigen/heevd, so I'm pinging a couple of people who were involved with this code and may be able to say something useful: @albertomercurio @GVigne. It's possible that this is a bug in NVIDIA's libraries, but I want to make sure we're not doing anything wrong before filing an issue.
I can also reproduce the problem. With ComplexF64 everything works, so it seems to be something related to the ComplexF32 heevd path.
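Since ComplexF64 works, one possible stopgap is to promote to double precision, decompose, and convert back. The sketch below is only an assumption about how that could look: `eigen_via_f64` is a hypothetical helper, not a CUDA.jl API. It is shown on a small CPU matrix and has not been verified on a CuMatrix, where the promotion would also double the memory use.

```julia
using LinearAlgebra

# Hypothetical workaround: promote to ComplexF64, decompose, convert back.
function eigen_via_f64(A::Hermitian{ComplexF32})
    # Keep the same stored triangle (:U or :L) as the input wrapper.
    A64 = Hermitian(ComplexF64.(parent(A)), Symbol(A.uplo))
    D, V = eigen(A64)
    return Float32.(D), ComplexF32.(V)
end

# Synthetic stand-in for the real input (assumption, not the attached matrix).
A = Hermitian(ComplexF32[2 1im; -1im 3])
D, V = eigen_via_f64(A)
println(D)
```

If this carries over to the GPU, calling `eigen_via_f64(cm)` on the matrix from the MWE would avoid the ComplexF32 code path entirely, at the cost of extra memory and time.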