CUFFT_INTERNAL_ERROR when finetuning
Hi! I'm attempting to fine-tune on OpenMIC. I have the dataset and am running the recommended command:
python ex_openmic.py --cuda --train --pretrained --model_name=dymn10_as --lr=2e-5 --batch_size=32
I get:
Dataset from /home/ltaylor/data/sonar/PaSST/audioset_hdf5s/mp3/openmic_train.csv_mp3.hdf with length 14915.
Mixing up waveforms from dataset of len 14915
Dataset from /home/ltaylor/data/sonar/PaSST/audioset_hdf5s/mp3/openmic_test.csv_mp3.hdf with length 5085.
Epoch 1/80: mAP: nan, val_loss: nan: 0%| | 0/467 [00:02<?, ?it/s]
Traceback (most recent call last):
File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 254, in <module>
train(args)
File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 100, in train
x = _mel_forward(x, mel)
File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 155, in _mel_forward
x = mel(x)
File "/home/ltaylor/data/sonar/EfficientAT/EAT/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ltaylor/data/sonar/EfficientAT/models/preprocess.py", line 42, in forward
x = torch.stft(x, self.n_fft, hop_length=self.hopsize, win_length=self.win_length,
File "/home/ltaylor/data/sonar/EfficientAT/EAT/lib/python3.10/site-packages/torch/functional.py", line 632, in stft
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR
I have a suspicion that it may be an incompatibility issue and that I simply have a different set of PyTorch, CUDA, etc. versions than you. Would you kindly share the full set of packages with which this works (pip freeze > requirements.txt)? In particular, I have pytorch 1.13.0+cu117.
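For reference, a small snippet along these lines reports the relevant environment details (nothing here is specific to EfficientAT; it only uses standard torch APIs):

```python
import torch

# Report the environment details relevant to a cuFFT failure:
# the torch build, the CUDA toolkit it was compiled against, and the GPU.
print("torch:", torch.__version__)            # e.g. 1.13.0+cu117
print("built for CUDA:", torch.version.cuda)  # toolkit version of this build
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```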
Hi @lukeandsee
I installed a fresh setup and for me it worked out of the box with the following attached requirements.
However, I think that a corrupted dataset is more likely.
Best, Florian
Hey, thanks for the reply! I've tried redownloading the dataset and using a fresh virtualenv with the packages you sent, and I added some code to print a sample of the input tensor (to make sure it's not all zeros or something):
print("x shape:", x.shape)
print("x sample", x[:4, :4])
print("n_fft", self.n_fft)
print("n_fft", self.n_fft)
print("hopsize", self.hopsize)
print("win_length", self.win_length)
print("window", self.window[:4])
and I see:
Dataset from /home/ltaylor/data/sonar/PaSST/audioset_hdf5s/mp3/openmic_train.csv_mp3.hdf with length 14915.
Mixing up waveforms from dataset of len 14915
Dataset from /home/ltaylor/data/sonar/PaSST/audioset_hdf5s/mp3/openmic_test.csv_mp3.hdf with length 5085.
Epoch 1/80: mAP: nan, val_loss: nan: 0%| | 0/467 [00:00<?, ?it/s]x shape: torch.Size([32, 319999])
x sample tensor([[ 4.0114e-03, 4.0245e-03, 5.0103e-03, 6.6861e-03],
[ 1.4319e-08, -2.2448e-09, -1.6799e-08, -1.8434e-09],
[ 2.8969e-02, 3.1241e-02, 2.8138e-02, 2.4127e-02],
[-4.7662e-02, -4.5411e-02, 1.3548e-02, -1.3909e-02]], device='cuda:0')
n_fft 1024
n_fft 1024
hopsize 320
win_length 800
window tensor([0.0000e+00, 1.5467e-05, 6.1840e-05, 1.3915e-04], device='cuda:0')
Epoch 1/80: mAP: nan, val_loss: nan: 0%| | 0/3729 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 254, in <module>
train(args)
File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 100, in train
x = _mel_forward(x, mel)
File "/home/ltaylor/data/sonar/EfficientAT/ex_openmic.py", line 155, in _mel_forward
x = mel(x)
File "/home/ltaylor/data/sonar/EfficientAT/EAT/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/home/ltaylor/data/sonar/EfficientAT/models/preprocess.py", line 44, in forward
x = torch.stft(x, self.n_fft, hop_length=self.hopsize, win_length=self.win_length,
File "/home/ltaylor/data/sonar/EfficientAT/EAT/lib/python3.10/site-packages/torch/functional.py", line 632, in stft
return _VF.stft(input, n_fft, hop_length, win_length, window, # type: ignore[attr-defined]
RuntimeError: cuFFT error: CUFFT_INTERNAL_ERROR
So I'm not sure what to make of this. I have 8 GB of VRAM; could this be a really bad error message for running out of VRAM?
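A standalone repro along these lines would separate the environment from the dataset: random input with the same batch shape and STFT parameters as printed above (assuming the default center=True and a periodic Hann window, which matches the printed window values), plus a free-VRAM readout. An actual out-of-memory condition would normally surface as "CUDA out of memory" rather than a cuFFT internal error:

```python
import torch

# Standalone repro: same batch shape and STFT parameters as in the log above,
# but with random input, so a corrupted dataset cannot be the cause.
device = "cuda" if torch.cuda.is_available() else "cpu"
if device == "cuda":
    free, total = torch.cuda.mem_get_info()
    print(f"free VRAM: {free / 2**30:.2f} / {total / 2**30:.2f} GiB")

x = torch.randn(32, 319999, device=device)
# The window values printed above match a periodic Hann window of length 800.
window = torch.hann_window(800, device=device)
spec = torch.stft(x, n_fft=1024, hop_length=320, win_length=800,
                  window=window, return_complex=True)
print(spec.shape)  # torch.Size([32, 513, 1000])
```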
OK, I think it was a package incompatibility; maybe my GPU/driver was too new to use with torch 1.x. Upgrading to the latest torch, torchvision, and torchaudio appears to have fixed things.
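For anyone hitting the same thing: one way to check whether a torch build actually supports your GPU's architecture (my guess at what went wrong here) is to compare the device's compute capability against the build's compiled arch list:

```python
import torch

# Compare the GPU's compute capability (e.g. sm_89 for RTX 40-series)
# against the architectures this torch build was compiled for; a missing
# arch is a plausible source of opaque cuFFT internal errors.
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    arch = f"sm_{major}{minor}"
    compiled = torch.cuda.get_arch_list()
    print(arch, "is supported" if arch in compiled else "is NOT in", compiled)
```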
Uploading my working package set in case somebody else has this problem: requirements_updated.txt
Maybe you could try the package set out, and if it works for you too, it could be useful to update the repo?
Thanks, I'll put it on my to-do list. I will have to test all files with the new requirements.