
RuntimeError: stack expects each tensor to be equal size

rkjoan opened this issue 3 years ago · 3 comments

Hello there. I'm setting up a Colab notebook to train a couple of models needed for Soft-VC inference, mostly for personal ease of access. One of the first steps is obtaining a custom trained/fine-tuned HuBERT model.

Through some trial and error I've made sure the directories are properly set up for training/fine-tuning on Colab, and I believe I have something that works. There is, of course, one issue.

When I start training the model (on Colab with a Tesla T4), it initializes properly at first, but soon after I'm thrown a RuntimeError. Here's the full log:

INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 0
INFO:torch.distributed.distributed_c10d:Rank 0: Completed store-based barrier for 1 nodes.
INFO:numexpr.utils:NumExpr defaulting to 2 threads.
/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py:481: UserWarning: This DataLoader will create 8 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  cpuset_checked))
INFO:__mp_main__:********************************************************************************
INFO:__mp_main__:PyTorch version: 1.9.1+cu102
INFO:__mp_main__:CUDA version: 10.2
INFO:__mp_main__:CUDNN version: 7605
INFO:__mp_main__:CUDNN enabled: True
INFO:__mp_main__:CUDNN deterministic: False
INFO:__mp_main__:CUDNN benchmark: False
INFO:__mp_main__:# of GPUS: 1
INFO:__mp_main__:batch size: 64
INFO:__mp_main__:iterations per epoch: 1
INFO:__mp_main__:# of epochs: 25001
INFO:__mp_main__:started at epoch: 1
INFO:__mp_main__:********************************************************************************

/content/hubert/train.py:232: FutureWarning: Non-finite norm encountered in torch.nn.utils.clip_grad_norm_; continuing anyway. Note that the default behavior will change in a future release to error out if a non-finite total norm is encountered. At that point, setting error_if_nonfinite=false will be required to retain the old behavior.
  nn.utils.clip_grad_norm_(hubert.parameters(), MAX_NORM)
INFO:__mp_main__:
            train -- epoch: 1, masked loss: 5.6476, unmasked loss: 6.2735, 
                     masked accuracy: 1.80, umasked accuracy: 1.59
            
INFO:root:Reducer buckets have been rebuilt in this iteration.
INFO:__mp_main__:
            train -- epoch: 2, masked loss: 5.7180, unmasked loss: 6.3430, 
                     masked accuracy: 2.09, umasked accuracy: 2.44
            
INFO:__mp_main__:
            train -- epoch: 3, masked loss: 5.7401, unmasked loss: 6.3748, 
                     masked accuracy: 1.54, umasked accuracy: 2.33
            
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/content/hubert/train.py", line 202, in train
    for wavs, codes in train_loader:
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/content/hubert/hubert/dataset.py", line 90, in collate
    codes = torch.stack(collated_codes, dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [170] at entry 5


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 99, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 66, in _wrap
    sys.exit(1)
SystemExit: 1

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.7/multiprocessing/process.py", line 300, in _bootstrap
    util._exit_function()
  File "/usr/lib/python3.7/multiprocessing/util.py", line 357, in _exit_function
    p.join()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 140, in join
    res = self._popen.wait(timeout)
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 48, in wait
    return self.poll(os.WNOHANG if timeout == 0.0 else 0)
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 28, in poll
    pid, sts = os.waitpid(self.pid, flag)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 8288) is killed by signal: Terminated. 
Traceback (most recent call last):
  File "train.py", line 452, in <module>
    join=True,
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/content/hubert/train.py", line 202, in train
    for wavs, codes in train_loader:
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/usr/local/lib/python3.7/dist-packages/torch/_utils.py", line 425, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/content/hubert/hubert/dataset.py", line 90, in collate
    codes = torch.stack(collated_codes, dim=0)
RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [170] at entry 5

As you can see, it actually trains for up to 3 epochs before stopping suddenly with the error. On some attempts I've seen it reach at least 5 or 6 epochs before throwing the same error; in other cases it throws the error immediately, before even 1 epoch completes.
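
For what it's worth, the error itself is just torch.stack refusing to combine tensors of different lengths, so a standalone two-liner reproduces the same message (nothing repo-specific here):

```python
import torch

# torch.stack requires every tensor in the list to have the same shape;
# an empty codes tensor next to a 170-frame one triggers the same error as in the log.
torch.stack([torch.zeros(0, dtype=torch.long), torch.zeros(170, dtype=torch.long)], dim=0)
# RuntimeError: stack expects each tensor to be equal size, but got [0] at entry 0 and [170] at entry 1
```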

From what I was able to find, I believe adding some kind of padding to the collated wavs and codes in hubert/dataset.py could work, but I don't know for sure how to implement it; a rough sketch of what I mean is below. Mainly I'm just interested in training/fine-tuning a personal model to test custom Soft-VC inference.
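
If it helps, here's roughly the kind of thing I had in mind. This is purely a sketch, assuming each dataset item is a (wav, codes) pair of 1-D tensors, which may not match the repo's actual dataset layout or collate function:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_padded(batch):
    # Sketch only: assumes `batch` is a list of (wav, codes) pairs of 1-D tensors.
    wavs, codes = zip(*batch)
    # Pad waveforms with zeros (silence) up to the longest item in the batch.
    wavs = pad_sequence([w.flatten() for w in wavs], batch_first=True, padding_value=0.0)
    # Pad codes with -100 so a cross-entropy loss using ignore_index=-100
    # could skip the padded frames.
    codes = pad_sequence([c.flatten() for c in codes], batch_first=True, padding_value=-100)
    return wavs, codes
```

No idea whether that would play nicely with how the training loop masks the loss, though.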

This could also be the fault of something else. I've made sure to follow the training prep you provided (much appreciated, by the way), but I should mention that the dataset I'm using is 22 kHz. If that's an issue, please let me know.

There's great potential here! I just need to put the puzzle together, so to speak.

rkjoan avatar Oct 06 '22 16:10 rkjoan

Have you found a solution? Or is this repo already dead?

devNegative-asm avatar Jan 17 '23 05:01 devNegative-asm

I have the same problem.

EmreOzkose avatar Aug 18 '23 12:08 EmreOzkose

I resampled the wavs to 16 kHz and extracted the discrete units again. That solved it.
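
For reference, a minimal resampling pass along these lines (torchaudio; the directory names are placeholders) is all it takes before re-running the unit extraction:

```python
from pathlib import Path

import torchaudio
import torchaudio.functional as AF

# Placeholder directories; point these at your own dataset.
in_dir, out_dir = Path("wavs_22k"), Path("wavs_16k")

for wav_path in in_dir.rglob("*.wav"):
    wav, sr = torchaudio.load(str(wav_path))
    if sr != 16000:
        wav = AF.resample(wav, orig_freq=sr, new_freq=16000)
    out_path = out_dir / wav_path.relative_to(in_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    torchaudio.save(str(out_path), wav, 16000)
```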

EmreOzkose avatar Aug 18 '23 12:08 EmreOzkose