PyTorch-Model-Compare

AssertionError: HSIC computation resulted in NANs

Open bryanbocao opened this issue 2 years ago • 16 comments

I tried comparing several EfficientNet variants with each other and with other models, but every run fails with this error: AssertionError: HSIC computation resulted in NANs. One example:

python3 eff_b0b2_compare.py

eff_b0b2_compare.py:

import torch
from torchvision.models import efficientnet_b0, efficientnet_b2 # edit
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
import torchvision.transforms as transforms
import numpy as np
import random
from torch_cka import CKA

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)
np.random.seed(0)
random.seed(0)

model1_name, model2_name = 'efficientnet_b0', 'efficientnet_b2' # edit
model1 = efficientnet_b0(pretrained=True) # edit
model2 = efficientnet_b2(pretrained=True) # edit

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))])

batch_size = 16 # 256

dataset = CIFAR10(root='../data/',
                  train=False,
                  download=True,
                  transform=transform)

dataloader = DataLoader(dataset,
                        batch_size=batch_size,
                        shuffle=False,
                        worker_init_fn=seed_worker,
                        generator=g,)

cka = CKA(model1, model2,
        model1_name=model1_name, model2_name=model2_name,
        device='cuda')

cka.compare(dataloader)

cka.plot_results(save_path="../exps/{}_{}.jpg".format(model1_name, model2_name))
/home/brcao/.local/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/brcao/.local/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=EfficientNet_B0_Weights.IMAGENET1K_V1`. You can also use `weights=EfficientNet_B0_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
/home/brcao/.local/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=EfficientNet_B2_Weights.IMAGENET1K_V1`. You can also use `weights=EfficientNet_B2_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Files already downloaded and verified
/home/brcao/.local/lib/python3.8/site-packages/torch_cka/cka.py:62: UserWarning: Model 1 seems to have a lot of layers. Consider giving a list of layers whose features you are concerned with through the 'model1_layers' parameter. Your CPU/GPU will thank you :)
  warn("Model 1 seems to have a lot of layers. " \
/home/brcao/.local/lib/python3.8/site-packages/torch_cka/cka.py:69: UserWarning: Model 2 seems to have a lot of layers. Consider giving a list of layers whose features you are concerned with through the 'model2_layers' parameter. Your CPU/GPU will thank you :)
  warn("Model 2 seems to have a lot of layers. " \
/home/brcao/.local/lib/python3.8/site-packages/torch_cka/cka.py:145: UserWarning: Dataloader for Model 2 is not given. Using the same dataloader for both models.
  warn("Dataloader for Model 2 is not given. Using the same dataloader for both models.")
| Comparing features |: 100%|██| 40/40 [3:43:19<00:00, 335.00s/it]
Traceback (most recent call last):
  File "eff_b0b2_compare.py", line 45, in <module>
    cka.compare(dataloader)
  File "/home/brcao/.local/lib/python3.8/site-packages/torch_cka/cka.py", line 183, in compare
    assert not torch.isnan(self.hsic_matrix).any(), "HSIC computation resulted in NANs"
AssertionError: HSIC computation resulted in NANs

Any help would be great. Thanks!

bryanbocao, Mar 11 '23 16:03

Not sure if this helps: for my case, I increased the batch_size from 1 to 16, and the error went away. Maybe you can try it? From 16 to, say, 32 or 64.
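One possible reason batch size matters, assuming the library uses the unbiased HSIC estimator from the minibatch CKA literature (an assumption, not confirmed in this thread): that estimator divides by (n - 1)(n - 2), by (n - 2) and by n(n - 3), where n is the batch size, so any batch with fewer than 4 samples divides by zero. A pure-Python sketch of that estimator follows; `linear_gram` and `unbiased_hsic` are illustrative names, not the library's API.

```python
def linear_gram(X):
    """Gram matrix K = X X^T for a batch of feature vectors (list of lists)."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def unbiased_hsic(K, L):
    """Unbiased HSIC estimate from two n x n Gram matrices."""
    n = len(K)
    if n < 4:
        # The denominators (n-1)(n-2), (n-2) and n(n-3) hit zero for n <= 3.
        raise ValueError("unbiased HSIC needs at least 4 samples per batch")
    # The estimator works on Gram matrices with zeroed diagonals.
    Kt = [[0.0 if i == j else K[i][j] for j in range(n)] for i in range(n)]
    Lt = [[0.0 if i == j else L[i][j] for j in range(n)] for i in range(n)]
    # tr(Kt Lt)
    trace_kl = sum(Kt[i][j] * Lt[j][i] for i in range(n) for j in range(n))
    # 1^T Kt 1 and 1^T Lt 1
    sum_k = sum(map(sum, Kt))
    sum_l = sum(map(sum, Lt))
    # 1^T Kt Lt 1 = sum_j (column sum of Kt at j) * (row sum of Lt at j)
    col_k = [sum(Kt[i][j] for i in range(n)) for j in range(n)]
    row_l = [sum(Lt[j]) for j in range(n)]
    cross = sum(c * r for c, r in zip(col_k, row_l))
    total = trace_kl + sum_k * sum_l / ((n - 1) * (n - 2)) - 2.0 * cross / (n - 2)
    return total / (n * (n - 3))


K = linear_gram([[1.0], [2.0], [3.0], [4.0]])
print(unbiased_hsic(K, K))
```

If this is indeed the estimator in use, a short final batch would trigger the same problem; passing `drop_last=True` to the DataLoader is one way to rule that out.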

Yash-10, May 06 '23 18:05

Hi @Yash-10, thanks for your reply! I used batch_size = 16 in the previous example.

bryanbocao, May 06 '23 18:05

I am sorry; I meant that for my own application (different from yours), I increased the batch size and the error disappeared. Since you used batch_size = 16, I wondered if increasing it to 32/64 might remove the error.

Yash-10, May 06 '23 19:05

No worries! Thanks for your help! Running with batch_size = 32, 64 and 128 now. Will post the results when finished.

BTW, each run seems to take hours to finish. I am using an RTX 3090, and the three runs take 19, 20 and 21 GB of GPU memory. I hope this time and memory usage is normal.
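A rough way to see why this can be slow, assuming (hypothetically) that every hooked layer of model 1 is compared with every hooked layer of model 2 on each batch: the number of layer pairs grows as the product of the two layer counts. The layer names and counts below are made-up placeholders, not measured values for EfficientNet.

```python
def layer_pairs(n_layers_1, n_layers_2):
    """Number of (model 1 layer, model 2 layer) pairs compared per batch."""
    return n_layers_1 * n_layers_2


def every_kth(names, k):
    """Keep every k-th layer name to thin out the comparison."""
    return names[::k]


# Hypothetical layer names; real names would come from model.named_modules().
names1 = [f"layer{i}" for i in range(200)]
names2 = [f"layer{i}" for i in range(200)]

subset1 = every_kth(names1, 10)
subset2 = every_kth(names2, 10)

print(layer_pairs(len(names1), len(names2)))    # all layers
print(layer_pairs(len(subset1), len(subset2)))  # thinned-out layers
```

The thinned lists are what the 'model1_layers' and 'model2_layers' parameters mentioned in the warnings above are meant to receive.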

bryanbocao, May 06 '23 19:05

@Yash-10 Sorry, I tried batch sizes 32, 64 and 128 but still got the same result:

python3 eff_b0b2_compare.py 
/home/brcao/.local/lib/python3.8/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/home/brcao/.local/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=EfficientNet_B0_Weights.IMAGENET1K_V1`. You can also use `weights=EfficientNet_B0_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
/home/brcao/.local/lib/python3.8/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=EfficientNet_B2_Weights.IMAGENET1K_V1`. You can also use `weights=EfficientNet_B2_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Files already downloaded and verified
/home/brcao/.local/lib/python3.8/site-packages/torch_cka/cka.py:62: UserWarning: Model 1 seems to have a lot of layers. Consider giving a list of layers whose features you are concerned with through the 'model1_layers' parameter. Your CPU/GPU will thank you :)
  warn("Model 1 seems to have a lot of layers. " \
/home/brcao/.local/lib/python3.8/site-packages/torch_cka/cka.py:69: UserWarning: Model 2 seems to have a lot of layers. Consider giving a list of layers whose features you are concerned with through the 'model2_layers' parameter. Your CPU/GPU will thank you :)
  warn("Model 2 seems to have a lot of layers. " \
/home/brcao/.local/lib/python3.8/site-packages/torch_cka/cka.py:145: UserWarning: Dataloader for Model 2 is not given. Using the same dataloader for both models.
  warn("Dataloader for Model 2 is not given. Using the same dataloader for both models.")
| Comparing features |: 100%|█| 79/79 [15:07:33<00:00, 689
Traceback (most recent call last):
  File "eff_b0b2_compare.py", line 45, in <module>
    cka.compare(dataloader)
  File "/home/brcao/.local/lib/python3.8/site-packages/torch_cka/cka.py", line 183, in compare
    assert not torch.isnan(self.hsic_matrix).any(), "HSIC computation resulted in NANs"
AssertionError: HSIC computation resulted in NANs

bryanbocao, May 09 '23 22:05

@Yash-10 Sorry, I still get the NaN error with batch_size = 32, 64 and 128.

PyTorch 1.13.1+cu117, NVIDIA-SMI 470.161.03, Driver Version: 470.161.03, CUDA Version: 11.4

bryanbocao, May 12 '23 01:05

I'm not sure what's causing it, but the problem seems to occur whenever I use EfficientNet, MobileNet or custom-implemented ResNets. It works fine with the torchvision.models.resnet models.

wmabebe, Jul 21 '23 01:07

I find that the HSIC computation sometimes produces negative values, which can make the final sqrt computation return NaN. I've tried L1 and L2 normalization, but I still get the error.
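To illustrate the mechanism described here, below is a minimal sketch that assumes CKA is formed as HSIC(K, L) / (sqrt(HSIC(K, K)) * sqrt(HSIC(L, L))). The helper names are hypothetical, and `sqrt_nan_on_negative` only mimics how a tensor sqrt returns NaN for negative input instead of raising. For simplicity the sketch also returns NaN when the denominator is zero.

```python
import math


def sqrt_nan_on_negative(x):
    """Mimic a tensor sqrt: negative input gives NaN instead of an exception."""
    return math.sqrt(x) if x >= 0 else math.nan


def cka_from_hsic(hsic_xy, hsic_xx, hsic_yy):
    """CKA from three HSIC values; NaN if either self-term is negative."""
    denom = sqrt_nan_on_negative(hsic_xx) * sqrt_nan_on_negative(hsic_yy)
    if denom == 0:
        return math.nan  # simplification: treat a zero denominator as NaN
    return hsic_xy / denom


print(cka_from_hsic(1.0, 4.0, 1.0))    # well-behaved case
print(cka_from_hsic(0.5, -1e-6, 2.0))  # slightly negative self-term
```

Even a tiny negative self-term is enough to turn the whole ratio into NaN, which is why normalizing the features alone may not help.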

ImmortalSdm, Sep 12 '23 17:09

I found a workaround for this problem using the model1_layers and model2_layers arguments. I temporarily removed the assert statement and plotted the figure, which showed that only some layers had NaN values. Then I excluded those layers from the computation.
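A small sketch of this workaround, assuming the CKA result is available as a 2D matrix (rows for model 1 layers, columns for model 2 layers) once the assert is removed. The function name, the layer names and the "whole row or column is NaN" heuristic are illustrative assumptions, not part of the library.

```python
import math


def layers_with_nan(cka_matrix, names1, names2):
    """Return layer names whose whole row (model 1) or column (model 2) is NaN."""
    bad1 = [
        name for name, row in zip(names1, cka_matrix)
        if all(math.isnan(v) for v in row)
    ]
    bad2 = [
        name for j, name in enumerate(names2)
        if all(math.isnan(row[j]) for row in cka_matrix)
    ]
    return bad1, bad2


nan = math.nan
# Hypothetical 3 x 3 result: model 1 layer "b" and model 2 layer "z" are broken.
matrix = [
    [0.9, 0.5, nan],
    [nan, nan, nan],
    [0.3, 0.8, nan],
]
names1 = ["a", "b", "c"]
names2 = ["x", "y", "z"]

bad1, bad2 = layers_with_nan(matrix, names1, names2)
keep1 = [n for n in names1 if n not in bad1]
keep2 = [n for n in names2 if n not in bad2]
print(keep1, keep2)
```

The filtered lists can then be passed as the layer lists for a second run.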

zjcqn, Nov 13 '23 08:11

In my case, when I turned off amp, the NaN issue did not occur. float32 was the solution.

dhkim0225, Feb 19 '24 01:02

How do I turn off amp? Thanks!

bryanbocao, Feb 19 '24 04:02

To be precise, I applied torch.no_grad and amp as follows:

with torch.no_grad(), torch.cuda.amp.autocast():
    cka = CKA(model1, model2,
              model1_name=model1_name,
              model2_name=model2_name,
              model1_layers=layer_names1,
              model2_layers=layer_names2,
              device='cuda')
    cka.compare(dataloader)

However, NaNs occurred, so I changed it to not apply amp, and all problems were resolved.
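One plausible mechanism for this, not confirmed in this thread: under amp, matrix products may run in float16, whose largest finite value is 65504. A Gram-matrix entry is a dot product over all features, so it can overflow to inf, and a later inf - inf gives NaN. The sketch below emulates only the overflow part of a float16 cast in pure Python (no rounding); `to_fp16_range`, `gram_entry` and the feature values are hypothetical.

```python
import math

FP16_MAX = 65504.0  # largest finite float16 value


def to_fp16_range(x):
    """Crude stand-in for a float16 cast: overflow to +/-inf, no rounding."""
    if math.isnan(x) or abs(x) <= FP16_MAX:
        return x
    return math.copysign(math.inf, x)


def gram_entry(u, v, cast=lambda x: x):
    """Dot product of two feature vectors, with the result passed through cast."""
    return cast(sum(a * b for a, b in zip(u, v)))


features = [20.0] * 256  # hypothetical flattened activations of one sample

k_full = gram_entry(features, features)                 # finite at full range
k_half = gram_entry(features, features, to_fp16_range)  # exceeds float16 range
diff = k_half - k_half                                  # inf - inf

print(k_full, k_half, diff)
```

This would also be consistent with later layers, which often have larger or higher-dimensional activations, being the ones that produce NaNs.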

dhkim0225, Feb 20 '24 03:02

@dhkim0225 Thank you for your input! So in the end it would look like this:

with torch.no_grad(), torch.cuda.amp.autocast(enabled=False):
    cka = CKA(model1, model2,
              model1_name=model1_name,
              model2_name=model2_name,
              model1_layers=layer_names1,
              model2_layers=layer_names2,
              device='cuda')
    cka.compare(dataloader)

I am still trying to understand the root cause.

bryanbocao, Feb 20 '24 03:02

Hello, I'd like to know if you have working code. I am hitting this problem too! Thank you.

HaomingX, Mar 20 '24 09:03

@HaomingX I suspect it is caused by values exploding somewhere in the computation (e.g. gradients), which leads to NaN. Reducing the amount of computation, for example by comparing fewer layers, may help.

bryanbocao, Mar 20 '24 17:03

Yep, today I found that choosing only the early layers works. Thank you.

HaomingX, Mar 20 '24 17:03