
CUFFT_INTERNAL_ERROR when finetuning

Open lukeandsee opened this issue 11 months ago • 4 comments

Hi! I'm attempting to finetune on openmic, I have the dataset and am running the recommended:

python ex_openmic.py --cuda --train --pretrained --model_name=dymn10_as --lr=2e-5 --batch_size=32

I get:

Dataset from /home/ltaylor/data/sonar/PaSST/audioset_hdf5s/mp3/openmic_train.csv_mp3.hdf with length 14915.
Mixing up waveforms from dataset of len 14915
Dataset from /home/ltaylor/data/sonar/PaSST/audioset_hdf5s/mp3/openmic_test.csv_mp3.hdf with length 5085.
Epoch 1/80: mAP: nan, val_loss: nan:   0%|                                                                     | 0/467 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 254, in <module>
    train(args)
  File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 100, in train
    x = _mel_forward(x, mel)
  File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 155, in _mel_forward
    x = mel(x)
  File "/home/ltaylor/data/sonar/EfficientAT/EAT/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ltaylor/data/sonar/EfficientAT/models/preprocess.py", line 42, in forward
    x = torch.stft(x, self.n_fft, hop_length=self.hopsize, win_length=self.win_length,
  File "/home/ltaylor/data/sonar/EfficientAT/EAT/lib/python3.10/site-packages/torch/functional.py", line 632, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR

I suspect it may be an incompatibility issue, since I probably have a different set of PyTorch, CUDA, etc. versions than you. Would you kindly share the full set of packages for which this works (pip freeze > requirements.txt)? In particular, I have pytorch 1.13.0+cu117.
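For reference, here is a small sanity-check script I can run to report the exact torch/CUDA combination in use (the helper name cuda_summary is just mine, not part of EfficientAT):

```python
import torch

def cuda_summary():
    """Collect the version info relevant to cuFFT compatibility."""
    info = {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,  # CUDA toolkit torch was built against
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)
    return info

if __name__ == "__main__":
    for key, value in cuda_summary().items():
        print(key, value)
```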

lukeandsee avatar Mar 11 '25 18:03 lukeandsee

Hi @lukeandsee

I set up a fresh environment, and for me it worked out of the box with the attached requirements:

requirements_openmic.txt

However, I think a corrupted dataset is the more likely cause.

Best, Florian

fschmid56 avatar Mar 18 '25 09:03 fschmid56

Hey, thanks for the reply! I've tried redownloading the dataset and using a fresh virtualenv with the packages you sent, and I added some code to print a sample of the input tensor (to make sure it's not all zeros or something):

          print("x shape:", x.shape)
          print("x sample", x[:4, :4])
          print("n_fft", self.n_fft)
          print("n_fft", self.n_fft)
          print("hopsize", self.hopsize)
          print("win_length", self.win_length)
          print("window", self.window[:4])

and I see:

Dataset from /home/ltaylor/data/sonar/PaSST/audioset_hdf5s/mp3/openmic_train.csv_mp3.hdf with length 14915.
Mixing up waveforms from dataset of len 14915
Dataset from /home/ltaylor/data/sonar/PaSST/audioset_hdf5s/mp3/openmic_test.csv_mp3.hdf with length 5085.
Epoch 1/80: mAP: nan, val_loss: nan:   0%|                                     | 0/467 [00:00<?, ?it/s]x shape: torch.Size([32, 319999])
x sample tensor([[ 4.0114e-03,  4.0245e-03,  5.0103e-03,  6.6861e-03],
        [ 1.4319e-08, -2.2448e-09, -1.6799e-08, -1.8434e-09],
        [ 2.8969e-02,  3.1241e-02,  2.8138e-02,  2.4127e-02],
        [-4.7662e-02, -4.5411e-02,  1.3548e-02, -1.3909e-02]], device='cuda:0')
n_fft 1024
n_fft 1024
hopsize 320
win_length 800
window tensor([0.0000e+00, 1.5467e-05, 6.1840e-05, 1.3915e-04], device='cuda:0')                                   
Epoch 1/80: mAP: nan, val_loss: nan:   0%|                                    | 0/3729 [00:01<?, ?it/s]   
Traceback (most recent call last):                   
  File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 254, in <module>                        
    train(args)                                      
  File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 100, in train                           
    x = _mel_forward(x, mel)                         
  File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 155, in _mel_forward                    
    x = mel(x)                                       
  File "/home/ltaylor/data/sonar/EfficientAT/EAT/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)            
  File "/home/ltaylor/data/sonar/EfficientAT/models/preprocess.py", line 44, in forward                   
    x = torch.stft(x, self.n_fft, hop_length=self.hopsize, win_length=self.win_length,                    
  File "/home/ltaylor/data/sonar/EfficientAT/EAT/lib/python3.10/site-packages/torch/functional.py", line 632, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]           
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR      

So I'm not sure what to make of this. I have 8 GB of VRAM; could this be a really bad error message for running out of VRAM?
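To rule out the VRAM theory, a quick check with torch.cuda.mem_get_info (available in recent torch versions) would show how much free memory the device actually has right before the STFT; the helper below is just my sketch, not code from the repo:

```python
import torch

def free_vram_mib(device=0):
    """Return (free, total) VRAM in MiB, or None if no CUDA device is visible."""
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return free_bytes // 2**20, total_bytes // 2**20

if __name__ == "__main__":
    print("VRAM (free, total) MiB:", free_vram_mib())
```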

lukeandsee avatar Mar 18 '25 10:03 lukeandsee

Ok, I think it was a package incompatibility; maybe my GPU/driver was too new to use with torch 1.x. Upgrading to the latest torch, torchvision, and torchaudio appears to have fixed things.
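For anyone else debugging this, a minimal standalone repro of the failing call (using the STFT parameters shown in the debug output above; the batch shape and Hann window are my assumptions about the frontend) makes it easy to tell whether cuFFT itself is broken, independent of the training code:

```python
import torch

def stft_smoke_test(device=None):
    """Run a torch.stft with EfficientAT-like parameters; returns the output shape."""
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.randn(2, 320000, device=device)       # ~10 s at 32 kHz, batch of 2
    window = torch.hann_window(800, device=device)  # assumed window type
    spec = torch.stft(x, n_fft=1024, hop_length=320, win_length=800,
                      window=window, center=True, return_complex=True)
    return spec.shape

if __name__ == "__main__":
    print(stft_smoke_test())
```

If this raises CUFFT_INTERNAL_ERROR on the GPU but runs fine with device="cpu", the problem is the torch/CUDA/driver combination rather than the dataset or the training script.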

Uploading my working package set in case somebody else hits this problem: requirements_updated.txt

Maybe you could try this package set, and if it works for you too, it could be useful to update the repo?

lukeandsee avatar Mar 18 '25 11:03 lukeandsee

Thanks, I'll put it on my todo list. I will have to test all files with the new requirements.

fschmid56 avatar Mar 18 '25 11:03 fschmid56