
Getting NaNs

Open assafshocher opened this issue 3 years ago • 9 comments

Hi, when running inside network training, after a while I'm getting NaN for one instance in the batch, with some probability. All the eigenvalues and eigenvectors are NaN. The matrix batch in this case is 1024x64x64, and the matrices are real and symmetric. I'm using ed_plus. It seems to be related to the batch size: the same matrix that caused the NaN doesn't produce NaN when decomposed alone or within a smaller batch. Any idea where I should investigate? Thanks!

assafshocher avatar Mar 30 '23 23:03 assafshocher

Hi, Thanks for your interest!

Which solver are you using now, BatchedED or BatchedEDplus?

One likely reason is that the matrices are getting very ill-conditioned. In that case, some eps values in the code should be tuned.

Another suggestion is to make your matrices strictly positive definite, i.e., A = A + lambda*I, where I is the identity matrix and lambda is the regularization strength (e.g., 1e-3). This can greatly alleviate the ill-conditioning issue.
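
For concreteness, here is a minimal sketch of that regularization on a batched tensor (the function name and the default lam are just illustrative):

import torch

def regularize_batch(A: torch.Tensor, lam: float = 1e-3) -> torch.Tensor:
    # Add lam * I to every matrix in the batch [B, N, N]: this shifts every
    # eigenvalue up by lam, so the matrices become strictly positive definite.
    I = torch.eye(A.shape[-1], device=A.device, dtype=A.dtype)
    return A + lam * I  # I broadcasts over the batch dimension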

Please let me know if this can help.

KingJamesSong avatar Mar 31 '23 06:03 KingJamesSong

Thanks for replying! I'm using BatchedEDplus because of the size. Should I be using BatchedED?

Re ill-conditioning: when testing the same matrix outside the batch, or even when taking just half of the batch (a half that includes that matrix), it succeeds. The matrix is positive semi-definite; there is one eigenvalue that is zero.

It seems that the size of the batch influences it, which also means the batch size changes the eigendecomposition of a given matrix in the batch. I guess that is unwanted behavior.

Thanks in advance!

assafshocher avatar Mar 31 '23 21:03 assafshocher

Here is an example tensor that fails: mat.pt.zip. To reproduce, first extract it and then:

import torch

mat = torch.load('mat.pt')
eig_vecs, eig_diag = batched_ed_plus(mat)
eig_vecs.isnan().nonzero()
# (result shows all eigenvectors for instance 395 in the batch are NaN)
# tensor([[395,   0,   0],
#        [395,   0,   1],
#        [395,   0,   2],
#        ...,
#        [395,  63,  61],
#        [395,  63,  62],
#        [395,  63,  63]], device='cuda:0')

# next we take only the first half of the batch
eig_vecs, eig_diag = batched_ed_plus(mat[:512])
eig_vecs.isnan().nonzero()
# result shows no NaNs this time:
# tensor([], device='cuda:0', size=(0, 3), dtype=torch.int64)

assafshocher avatar Mar 31 '23 22:03 assafshocher

Thanks for reporting this issue!

It is indeed more appropriate to use BatchedEDPlus since the matrices are of dimension 64.

I agree it is the batch operation that causes the NaN, and I think it is due to the eps value in the following lines:

https://github.com/KingJamesSong/BatchED/blob/main/utils_ed_plus.py#L158
https://github.com/KingJamesSong/BatchED/blob/main/utils_ed_plus.py#L180
https://github.com/KingJamesSong/BatchED/blob/main/utils_ed_plus.py#L304

Could you please try lowering the eps value from 1e-5 to a smaller number such as 1e-10? These eps values control whether to continue decomposing the matrices: decomposition stops only when the maximal value across the whole batch falls below the tolerance, so they should be related to this issue (see the toy sketch below).
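
To illustrate why such a batch-wide tolerance can couple the instances (a toy sketch, not the actual code at the lines above): because the stopping test takes a single max over the whole batch, one hard instance keeps the iteration running for everyone.

import torch

def batched_newton_sqrt(x, eps=1e-5, max_iter=100):
    # Toy batched Newton iteration for elementwise square roots.
    y = torch.ones_like(x)
    for i in range(max_iter):
        if (y * y - x).abs().max() < eps:  # one max over the whole batch gates all instances
            break
        y = 0.5 * (y + x / y)
    return y, i

# The same inputs receive a different number of iterations depending on
# what else is batched with them:
_, iters_small = batched_newton_sqrt(torch.tensor([4.0, 9.0]))
_, iters_large = batched_newton_sqrt(torch.tensor([4.0, 9.0, 1e12]))
print(iters_small, iters_large)  # the larger batch runs many more steps

For these scalars the extra iterations are harmless, but in a deflation-style eigendecomposition, extra steps on matrices that have already converged could divide by near-zero quantities and produce NaN.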

Sorry for the problem, and please let me know if this helps.

KingJamesSong avatar Apr 01 '23 09:04 KingJamesSong

Thanks for your response! Decreasing epsilon in those lines only seemed to make things worse (when training, the model encountered NaN much earlier). I tried increasing it to 1e-3, which made it survive a bit longer, but it still got NaN an epoch later. Any ideas? I have provided above a batch for which this happens. Thanks!

assafshocher avatar Apr 10 '23 20:04 assafshocher

Thanks for the update.

This confirms my guess that the issue is indeed related to the batch-based operation. Another possible cause is the backward gradients of SVD/EIG, which are known to be unstable. Do you use the backward gradients implemented in the code to avoid the gradient clipping issue?

I will investigate this further and have a look at the data you provided, but I cannot say when I will be able to resolve it and get back to you.

If you are computing matrix functions such as the matrix square root, I suggest you look at the code in this repository (which should be fast and stable): https://github.com/KingJamesSong/FastDifferentiableMatSqrt. Otherwise I do not have an instant solution.
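
For reference, one classical way to get a fast, differentiable batched matrix square root is the Newton-Schulz iteration; here is a minimal self-contained sketch (this shows the general technique, not a copy of that repository's implementation):

import torch

def newton_schulz_sqrt(A: torch.Tensor, num_iters: int = 10) -> torch.Tensor:
    # Batched Newton-Schulz iteration for sqrt(A), A of shape [B, N, N], SPD.
    B, N, _ = A.shape
    norm = A.flatten(1).norm(dim=1).view(B, 1, 1)  # Frobenius norm per matrix
    Y = A / norm                                   # normalize so the iteration converges
    I = torch.eye(N, device=A.device, dtype=A.dtype).expand(B, N, N)
    Z = torch.eye(N, device=A.device, dtype=A.dtype).expand(B, N, N)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z                        # Y -> sqrt(A/norm), Z -> (A/norm)^(-1/2)
    return Y * norm.sqrt()                         # undo the normalization

Because the iteration uses only matrix products, autograd can differentiate through it without going through the unstable eigendecomposition backward.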

KingJamesSong avatar Apr 11 '23 09:04 KingJamesSong

Thank you for replying so quickly, and also thanks for looking into it! As far as I understand, the implemented backward gradients should automatically take effect, since you defined the backward function. Am I wrong? Anyway, if you have any other tips on what to investigate to improve stability, that would be helpful. Thanks!

assafshocher avatar Apr 11 '23 17:04 assafshocher

Yes, it would automatically take effect!
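
For reference, this is standard torch.autograd.Function behavior; a minimal illustration (the class here is made up, not the one in this repository):

import torch

class MySqrt(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        y = x.sqrt()
        ctx.save_for_backward(y)
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (y,) = ctx.saved_tensors
        return grad_out / (2 * y)  # this custom gradient is used automatically

x = torch.tensor([4.0], requires_grad=True)
MySqrt.apply(x).backward()
print(x.grad)  # tensor([0.2500])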

I will let you know when I solve this issue.

KingJamesSong avatar Apr 11 '23 17:04 KingJamesSong

Thanks! One more point: I don't think it's the gradients. As far as I can see, the NaN occurs in the forward pass. Just applying the decomposition to a certain batch, with no backward call at all, yields NaNs for all eigenvalues and eigenvectors of a certain instance.
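
For example, with the same mat.pt as above, this check runs forward-only (a small sketch, assuming batched_ed_plus is imported from the repo's utils_ed_plus):

import torch

mat = torch.load('mat.pt')
with torch.no_grad():                           # no autograd machinery at all
    eig_vecs, eig_diag = batched_ed_plus(mat)
bad = eig_vecs.isnan().flatten(1).any(dim=1)    # per-instance NaN flag, shape [1024]
print(bad.nonzero())                            # -> instance 395, as in the reproduction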

assafshocher avatar Apr 11 '23 17:04 assafshocher