
pitch_shift consumes excessive GPU memory which is also not cleared

[Open] DanTremonti opened this issue 2 years ago · 3 comments

🐛 Describe the bug

Both torchaudio.functional.pitch_shift and torchaudio.transforms.PitchShift occupy an excessive amount of GPU memory, which is never released, while the same code works fine on CPU.

ISSUE 1

The following piece of code

import torch
from torchaudio.functional import pitch_shift

waveform = torch.randn(1600, device=torch.device("cuda:0"))
output_tensor = pitch_shift(waveform, 16000, n_steps=1)

raises the exception below on a machine whose GPU has 4 GB of memory:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dhanvanth/miniconda3/envs/z2/lib/python3.9/site-packages/torchaudio/functional/functional.py", line 1765, in pitch_shift
    waveform_shift = resample(waveform_stretch, int(sample_rate / rate), sample_rate)
  File "/home/dhanvanth/miniconda3/envs/z2/lib/python3.9/site-packages/torchaudio/functional/functional.py", line 1604, in resample
    kernel, width = _get_sinc_resample_kernel(
  File "/home/dhanvanth/miniconda3/envs/z2/lib/python3.9/site-packages/torchaudio/functional/functional.py", line 1522, in _get_sinc_resample_kernel
    kernels = torch.where(t == 0, torch.tensor(1.0).to(t), t.sin() / t)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.01 GiB (GPU 0; 3.95 GiB total capacity; 2.28 GiB already allocated; 340.44 MiB free; 3.04 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

On a machine with a ~48 GB GPU, running the same code is observed to consume 6.28 GB of memory, of which the last line alone accounts for 5.6 GB.

The transform-based equivalent below shows the same memory usage pattern:

import torch
from torchaudio.transforms import PitchShift

waveform = torch.randn(1600, device=torch.device("cuda:0"))
effect = PitchShift(
    sample_rate=16000,
    n_steps=1,
).to(waveform.device)
output_tensor = effect(waveform)
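(Not part of the original report.) A small harness along these lines can quantify both the peak spike and what remains allocated afterwards, using only standard torch.cuda memory statistics; the function name and defaults are my own:

```python
import torch


def measure_pitch_shift_memory(n_samples=1600, sample_rate=16000, n_steps=1):
    """Hypothetical harness: report peak CUDA memory used by one pitch_shift
    call and how much of it is still allocated afterwards."""
    # Deferred import so the helper can be defined even without torchaudio.
    from torchaudio.functional import pitch_shift

    device = torch.device("cuda:0")
    torch.cuda.reset_peak_memory_stats(device)
    baseline = torch.cuda.memory_allocated(device)

    waveform = torch.randn(n_samples, device=device)
    _ = pitch_shift(waveform, sample_rate, n_steps=n_steps)
    torch.cuda.synchronize(device)

    peak = torch.cuda.max_memory_allocated(device) - baseline
    leftover = torch.cuda.memory_allocated(device) - baseline
    print(f"peak: {peak / 2**30:.2f} GiB, still allocated: {leftover / 2**30:.2f} GiB")


if torch.cuda.is_available():
    measure_pitch_shift_memory()
```

On the 48 GB machine described above, this kind of measurement is how the 6.28 GB figure can be reproduced.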

ISSUE 2

The GPU memory occupied by the call is not released after the effect has been applied.
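As a stopgap for the leftover allocation (this is a workaround sketch using standard PyTorch calls, not a fix for the over-allocation itself), cached blocks can usually be returned to the driver once no tensor references them:

```python
import gc

import torch


def release_cached_cuda_memory():
    # Drop unreachable tensors, then ask PyTorch's caching allocator to hand
    # unused reserved blocks back to the driver.  This shrinks
    # torch.cuda.memory_reserved(), but it cannot free memory that live
    # tensors still reference.
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    # ... run pitch_shift here, then drop all references to its outputs ...
    release_cached_cuda_memory()
    print(torch.cuda.memory_reserved(dev))
```

Note that if the memory is still "allocated" (not just reserved) after the call returns, something is holding a reference to it and empty_cache will not help.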

ISSUE 3

The occupied memory grows further for certain values of n_steps. The code below consumes 10.67 GB of GPU memory:

import torch
from torchaudio.functional import pitch_shift

waveform = torch.randn(1600, device=torch.device("cuda:1"))
output_tensor = pitch_shift(waveform, 16000, n_steps=1)
output_tensor = pitch_shift(waveform, 16000, n_steps=2)
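A plausible mechanism, based on my reading of the traceback rather than any maintainer confirmation: torchaudio.functional.resample reduces the two rates by their greatest common divisor before building its sinc kernel, and pitch_shift resamples between int(sample_rate / rate) and sample_rate with rate = 2 ** (n_steps / 12) (up to sign convention). For most n_steps that integer is nearly coprime with the sample rate, so the kernel tensor has thousands of filter phases; n_steps = 12 reduces cleanly and stays cheap:

```python
import math

SAMPLE_RATE = 16000

# For each pitch step, estimate the intermediate rate that pitch_shift passes
# to resample, and how far gcd reduction can shrink the kernel.  A gcd of 1
# means ~16000 separate filter phases instead of a handful.
for n_steps in range(1, 13):
    rate = 2.0 ** (n_steps / 12)
    orig_freq = int(SAMPLE_RATE / rate)
    g = math.gcd(orig_freq, SAMPLE_RATE)
    print(f"n_steps={n_steps:2d}  orig_freq={orig_freq:5d}  gcd={g:5d}  "
          f"kernel phases={SAMPLE_RATE // g}")
```

This would explain why the memory cost varies so wildly with n_steps (and, at 44.1 kHz, why some steps in the second report below take seconds while others are instant).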

Versions

PyTorch version: 2.0.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.31

Python version: 3.9.16 (main, Mar 8 2023, 14:00:05) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Quadro P1000
Nvidia driver version: 470.182.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
Address sizes:         39 bits physical, 48 bits virtual
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 158
Model name:            Intel(R) Core(TM) i7-8850H CPU @ 2.60GHz
Stepping:              10
CPU MHz:               2600.000
CPU max MHz:           4300.0000
CPU min MHz:           800.0000
BogoMIPS:              5199.98
L1d cache:             192 KiB
L1i cache:             192 KiB
L2 cache:              1.5 MiB
L3 cache:              9 MiB
NUMA node0 CPU(s):     0-11
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Mitigation; Microcode
Vulnerability Tsx async abort: Mitigation; TSX disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] mypy==0.961
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.23.5
[pip3] pytorch-lightning==2.0.0
[pip3] torch==2.0.0+cu118
[pip3] torch-poly-lr-decay==0.0.1
[pip3] torchaudio==2.0.0+cu118
[pip3] torchmetrics==0.11.4
[pip3] triton==2.0.0
[conda] cudatoolkit               11.3.1               h2bc3f7f_2
[conda] numpy                     1.23.5                   pypi_0    pypi
[conda] pytorch-lightning         2.0.0                    pypi_0    pypi
[conda] torch                     2.0.0+cu118              pypi_0    pypi
[conda] torch-poly-lr-decay       0.0.1                    pypi_0    pypi
[conda] torchaudio                2.0.0+cu118              pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] triton                    2.0.0                    pypi_0    pypi

DanTremonti avatar Jun 20 '23 14:06 DanTremonti

I am also having some problems with the pitch_shift function, even on CPU.

I am testing with 5.0 seconds of audio at 44.1 kHz.

  1. Runtime varies between 0.01 s and 13 s
  2. In some cases, memory allocation fails

My test script is

import math
import time
import torch
import torchaudio


def test_pitch_shift(samplerate, f0, T, n_fft=512, hop_length=None, n_steps=(-12, 12)):
    print(f"{samplerate=}, {f0=}, {T=}, {n_fft=}, {hop_length=}")

    t = torch.arange(0, T * samplerate) / samplerate

    # Use a distinct loop variable so the n_steps range argument is not shadowed.
    for step in range(n_steps[0], n_steps[1] + 1):
        x = torch.sin(2.0 * math.pi * t * f0)

        t1 = time.perf_counter()
        x2 = torchaudio.functional.pitch_shift(
            x, samplerate, n_steps=step, n_fft=n_fft, hop_length=hop_length
        )
        t2 = time.perf_counter()
        print(step, t2 - t1)
    print()


if __name__ == "__main__":
    test_pitch_shift(44100, 2500.0, 5.0, n_fft=8192, n_steps=(-12, 12))

and the output is

samplerate=44100, f0=2500.0, T=5.0, n_fft=8192, hop_length=None
-12 0.032376209273934364
-11 1.5772914551198483
-10 0.018719196319580078
-9 0.11179743707180023
-8 12.86328736692667
-7 1.9657583087682724
-6 12.8446254003793
-5 13.580474404618144
-4 4.4222537241876125
-3 2.6473840214312077
-2 0.24345589987933636
-1 1.6089799869805574
0 0.007509412243962288
1 1.2843002695590258
2 0.022734154015779495
3 0.042376838624477386
4 6.6237110160291195
5 1.1522486191242933
6 7.690028678625822
7 0.039809875190258026
8 2.629176177084446
Traceback (most recent call last):
  File "/mnt/resource/robin/music_sep/./tests/test_pitch_shift.py", line 25, in <module>
    test_pitch_shift(44100, 2500.0, 5.0, n_fft=8192, n_steps=[-12, 12])
  File "/mnt/resource/robin/music_sep/./tests/test_pitch_shift.py", line 16, in test_pitch_shift
    x2 = torchaudio.functional.pitch_shift(
  File "/home/robin/miniconda3/envs/universe/lib/python3.10/site-packages/torchaudio/functional/functional.py", line 1765, in pitch_shift
    waveform_shift = resample(waveform_stretch, int(sample_rate / rate), sample_rate)
  File "/home/robin/miniconda3/envs/universe/lib/python3.10/site-packages/torchaudio/functional/functional.py", line 1604, in resample
    kernel, width = _get_sinc_resample_kernel(
  File "/home/robin/miniconda3/envs/universe/lib/python3.10/site-packages/torchaudio/functional/functional.py", line 1511, in _get_sinc_resample_kernel
    window = torch.cos(t * math.pi / lowpass_filter_width / 2) ** 2
RuntimeError: [enforce fail at alloc_cpu.cpp:75] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 13086939600 bytes. Error code 12 (Cannot allocate memory)

The versions are

Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.9.2009 (Core) (x86_64)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
Clang version: Could not collect
CMake version: version 2.8.12.2
Libc version: glibc-2.17

Python version: 3.10.11 | packaged by conda-forge | (main, May 10 2023, 18:58:44) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-4.19.119-1.20200430.el7.x86_64-x86_64-with-glibc2.17
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-PCIE-32GB
Nvidia driver version: 495.29.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             8
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 85
Model name:            Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Stepping:              7
CPU MHz:               2095.078
BogoMIPS:              4190.15
Virtualization:        VT-x
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
L3 cache:              16384K
NUMA node0 CPU(s):     0-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx
pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat pku ospke avx512_vnni md_clear arch_capabilities

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] onnxruntime==1.15.0
[pip3] pytorch-lightning==2.0.2
[pip3] torch==2.0.1
[pip3] torch-ema==0.3
[pip3] torchaudio==2.0.2
[pip3] torchinfo==1.8.0
[pip3] torchmetrics==0.11.4
[pip3] torchvision==0.15.2
[pip3] triton==2.0.0
[conda] blas                      1.0                         mkl    conda-forge
[conda] libblas                   3.9.0            16_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            16_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            16_linux64_mkl    conda-forge
[conda] mkl                       2022.2.1         h84fe81f_16997    conda-forge
[conda] numpy                     1.24.3          py310ha4c1d20_0    conda-forge
[conda] pytorch                   2.0.1           py3.10_cuda11.7_cudnn8.5.0_0    pytorch
[conda] pytorch-cuda              11.7                 h778d358_5    pytorch
[conda] pytorch-lightning         2.0.2                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch-ema                 0.3                      pypi_0    pypi
[conda] torchaudio                2.0.2               py310_cu117    pytorch
[conda] torchinfo                 1.8.0                    pypi_0    pypi
[conda] torchmetrics              0.11.4                   pypi_0    pypi
[conda] torchtriton               2.0.0                     py310    pytorch
[conda] torchvision               0.15.2              py310_cu117    pytorch

fakufaku avatar Nov 22 '23 08:11 fakufaku

I did notice that the pitch_shift transformation output requires grad. Is this the intended behaviour? Could it be related to this issue?

dragonsearch avatar Apr 23 '24 12:04 dragonsearch
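If autograd tracking is indeed contributing, a workaround sketch (not a fix for the underlying behaviour) is to run the transform under torch.no_grad so the output carries no graph history. The stand-in below uses a plain Linear layer to mimic an output that unexpectedly has requires_grad=True:

```python
import torch


def apply_without_grad(transform, waveform):
    # Run `transform` (e.g. an instance of torchaudio.transforms.PitchShift)
    # under no_grad so the result retains no autograd graph or its memory.
    with torch.no_grad():
        return transform(waveform)


# Stand-in transform with trainable parameters, mimicking the reported
# behaviour where the output unexpectedly requires grad.
layer = torch.nn.Linear(4, 4)
x = torch.randn(4)
y_tracked = layer(x)
y_free = apply_without_grad(layer, x)
print(y_tracked.requires_grad, y_free.requires_grad)  # True False
```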