ComplexF32 eigen can return `NaN` unexpectedly

Open kmp5VT opened this issue 2 years ago • 2 comments

Describe the bug

There seems to be a stability issue with the `eigen` function for ComplexF32 matrices. Occasionally, `eigen` on the GPU returns `NaN`, which is inconsistent with the CPU decomposition of the same matrix.

To reproduce

The Minimal Working Example (MWE) for this bug:

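using CUDA, HDF5, LinearAlgebra
# Load the matrix from the attached file
fid = h5open("broken_eigen.h5", "r")
m = read(fid, "matrix")
m = Hermitian(m)
cm = Hermitian(cu(m))
# CPU reference decomposition
D, V = eigen(m)
# GPU decomposition (this is where the NaN values show up)
cuD, cuV = eigen(cm)
close(fid)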

broken_eigen.h5.txt (Please note that this is a .h5 file. I saved it with a .txt extension because I could not attach it here otherwise; just remove the .txt extension.)

Manifest.toml

Status `~/.julia/environments/v1.9/Project.toml`
  [052768ef] CUDA v5.1.1
  [34da2185] Compat v4.10.0
  [f67ccb44] HDF5 v0.17.1
  [33e6dc65] MKL v0.6.1

Expected behavior

I would expect cuD and cuV to be the eigenvalues and eigenvectors of the CuMatrix cm, which has values in the range [-4.6161222f-8, 0.8686561f0] with a minimum absolute value of 1.3966348f-25.
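
One way to make this expectation checkable is to compare both decompositions against the host matrix. The following is only a sketch: `eigen_report` is a hypothetical helper name (not part of CUDA.jl or LinearAlgebra), and it assumes the results fit in host memory, since everything is copied to the CPU before checking.

```julia
using LinearAlgebra

# Hypothetical helper (not part of CUDA.jl): summarize how well D and V
# satisfy A * V ≈ V * Diagonal(D), and whether they contain NaN.
# A is the host matrix; D and V may be CPU or GPU arrays.
function eigen_report(A::AbstractMatrix, D::AbstractVector, V::AbstractMatrix)
    # Copy to plain host arrays so the same checks work for Array and CuArray.
    Ah, Dh, Vh = Matrix(A), Array(D), Array(V)
    has_nan = any(isnan, Dh) || any(isnan, Vh)
    # Relative residual of the eigen-equation.
    residual = norm(Ah * Vh - Vh * Diagonal(Dh)) / norm(Ah)
    return (; has_nan, residual)
end

# Usage with the names from the reproducer above (requires the attached file
# and a GPU):
#   eigen_report(m, D, V)       # CPU result
#   eigen_report(m, cuD, cuV)   # GPU result
```

A small residual with `has_nan == false` for the CPU result and `has_nan == true` for the GPU result would match the behavior described in this issue.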

Version info

Details on Julia:

Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, cascadelake)
  Threads: 1 on 32 virtual cores

Details on CUDA:

CUDA runtime 12.3, artifact installation
CUDA driver 12.3
NVIDIA driver 535.113.1, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.3.2
- CURAND: 10.3.4
- CUFFT: 11.0.11
- CUSOLVER: 11.5.3
- CUSPARSE: 12.1.3
- CUPTI: 21.0.0
- NVML: 12.0.0+535.113.1

Julia packages: 
- CUDA: 5.1.0
- CUDA_Driver_jll: 0.7.0+0
- CUDA_Runtime_jll: 0.10.0+1

Toolchain:
- Julia: 1.9.4
- LLVM: 14.0.6

1 device:
  0: NVIDIA RTX A6000 (sm_86, 45.964 GiB / 47.988 GiB available)

kmp5VT, Nov 30 '23 19:11

I can reproduce this, but I'm not familiar with the eigen/heevd code, so I'm pinging a couple of people who were involved with it and may be able to say something useful: @albertomercurio @GVigne. It's possible that this is a bug in NVIDIA's libraries, but I want to make sure we're not doing anything wrong before filing an issue.

maleadt, Jan 02 '24 10:01

I can also reproduce the problem. With ComplexF64 everything works. It seems to be something related to heevd.

albertomercurio, Jan 13 '24 08:01
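
Given the observation that ComplexF64 works, a possible stopgap is to run the decomposition in double precision and convert the results back. This is only a sketch and has not been verified against the failing matrix from this issue: `eigen_via_f64` is a hypothetical helper name, and it assumes the extra memory and runtime of a ComplexF64 copy are acceptable.

```julia
using LinearAlgebra

# Hypothetical workaround helper (not part of CUDA.jl or LinearAlgebra):
# decompose a ComplexF32 Hermitian matrix in double precision, then convert
# the eigenvalues and eigenvectors back to single precision.
function eigen_via_f64(A::Hermitian{ComplexF32})
    # Broadcasting converts the element type and keeps the array type of the
    # underlying data; the stored triangle (uplo) is preserved.
    A64 = Hermitian(ComplexF64.(parent(A)), Symbol(A.uplo))
    D64, V64 = eigen(A64)
    return Float32.(D64), ComplexF32.(V64)
end
```

With the names from the reproducer, this would be called as `cuD, cuV = eigen_via_f64(cm)`. It sidesteps the single-precision path rather than explaining the `NaN` values, so it does not replace finding the root cause.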