
Add a torch implementation of "convolution reverb"

Open gwenzek opened this issue 3 years ago • 5 comments

🚀 The feature

A pure PyTorch implementation of "convolution reverb", as described in https://pytorch.org/audio/stable/tutorials/audio_data_augmentation_tutorial.html#simulating-room-reverberation

This should be implemented like "pitch shift": both as a "functional" and as a module.
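
A minimal sketch of what such a pair could look like, built on `torchaudio.functional.fftconvolve`; the names `add_reverb` and `ConvolutionReverb` are placeholders, not a committed API:

```python
import torch
import torchaudio.functional as F


def add_reverb(waveform: torch.Tensor, rir: torch.Tensor) -> torch.Tensor:
    """Convolve a waveform with a room impulse response (illustrative name)."""
    # Energy-normalize the RIR so the output loudness stays comparable.
    rir = rir / torch.linalg.vector_norm(rir, ord=2)
    # fftconvolve is pure PyTorch, so this runs on CPU or GPU alike.
    return F.fftconvolve(waveform, rir)


class ConvolutionReverb(torch.nn.Module):
    """Module wrapper, mirroring how PitchShift wraps its functional."""

    def __init__(self, rir: torch.Tensor):
        super().__init__()
        self.register_buffer("rir", rir)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        return add_reverb(waveform, self.rir)
```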

Motivation, pitch

torchaudio has recently been ramping up its data augmentation utilities, but convolution reverb hasn't been implemented so far. A pure PyTorch implementation would allow it to run efficiently on GPU.

Alternatives

Note that sox_effects can be used to apply a "reverb", but it is a CPU-only implementation. Also note that sox does not perform a convolution reverb but uses a different algorithm.
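
For reference, a hypothetical use of the existing sox path (the effect parameter and file name here are made up for illustration); it is CPU-bound and algorithmic rather than RIR-based:

```python
import torchaudio

waveform, sample_rate = torchaudio.load("speech.wav")  # hypothetical input file
# Apply sox's algorithmic "reverb" effect (50% reverberance); CPU only.
augmented, sample_rate = torchaudio.sox_effects.apply_effects_tensor(
    waveform, sample_rate, effects=[["reverb", "50"]]
)
```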

Additional context

I'm a Meta employee, and @roa-beep and @gziz will be working on it through the MLH fellowship.

gwenzek avatar Feb 09 '23 14:02 gwenzek

Note that convolution reverb requires some Room Impulse Response (RIR) samples. What is the recommended practice for downloading such resources in torchaudio?

Should we ask the user to explicitly give us such an RIR sample, or can we download a default one silently?

SAMPLE_RIR = torchaudio.utils.download_asset("tutorial-assets/Lab41-SRI-VOiCES-rm1-impulse-mc01-stu-clo-8000hz.wav")
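
(For context, the downloaded asset then still needs to be loaded; something like the following, with the trimming step discussed further below:)

```python
import torchaudio

# Load the downloaded RIR recording; the actual impulse segment still has
# to be extracted and normalized before it can be used for convolution.
rir_raw, sample_rate = torchaudio.load(SAMPLE_RIR)
```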

gwenzek avatar Feb 09 '23 14:02 gwenzek

@gwenzek thanks for the feature request. This sounds good. Could you elaborate on how such an operator would differ from our existing convolution operators, e.g. fftconvolve and convolve?

Re: RIR samples — our preference would be to have the user provide one for the sake of maintainability. If there exists some RIR dataset that we'd like users to use, we can facilitate fetching samples by adding a corresponding class to torchaudio.datasets.
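
As a rough illustration of that route, such a class could follow the usual torchaudio.datasets pattern; the class name and directory layout below are invented for the sake of the example:

```python
from pathlib import Path
from typing import Tuple

import torch
from torch.utils.data import Dataset
import torchaudio


class SomeRIRDataset(Dataset):
    """Hypothetical RIR dataset in the style of torchaudio.datasets."""

    def __init__(self, root: str):
        # Assumes the RIRs are stored as .wav files under `root`.
        self._paths = sorted(Path(root).glob("**/*.wav"))

    def __getitem__(self, n: int) -> Tuple[torch.Tensor, int]:
        # Return (rir_waveform, sample_rate), like other torchaudio datasets.
        return torchaudio.load(str(self._paths[n]))

    def __len__(self) -> int:
        return len(self._paths)
```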

hwangjeff avatar Feb 14 '23 19:02 hwangjeff

I think we can provide a couple of APIs:

  1. A low-level function that convolves the given waveform with a given RIR.
  2. A high-level transform that convolves the given waveform with an RIR from a predefined set.

One could then configure the RIR set used by the transform. Defining the interface the transform expects should be straightforward: in a very simple view, it is an iterable of infinite length that keeps returning new RIRs. If we add an RIR dataset class, an instance of such a class could be extended for use by the said transform (see the sketch below).
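
A rough sketch of that interface, with illustrative names (nothing below is a committed design): the transform just pulls the next RIR from any infinite iterator.

```python
import random
from typing import Iterator

import torch
import torchaudio.functional as F


def rir_sampler(dataset) -> Iterator[torch.Tensor]:
    """Infinite iterator that keeps yielding RIRs sampled from a dataset."""
    while True:
        rir, _ = dataset[random.randrange(len(dataset))]
        yield rir


class RandomReverb(torch.nn.Module):
    """High-level transform convolving each input with the next sampled RIR."""

    def __init__(self, rirs: Iterator[torch.Tensor]):
        super().__init__()
        self._rirs = rirs

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        rir = next(self._rirs)
        rir = rir / torch.linalg.vector_norm(rir, ord=2)
        return F.fftconvolve(waveform, rir)
```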

mthrok avatar Feb 14 '23 19:02 mthrok

I am also open to the idea of a default (set of) RIR(s). Semantically, it's hard to make sense of a default RIR (does it represent a room with average acoustic characteristics?), but maybe this is the kind of thing where users are happy with whatever default is provided.

mthrok avatar Feb 14 '23 19:02 mthrok

@mthrok That's a good question. How do we define a standard RIR? For instance, would the RIR you provided in your audio_data_augmentation.ipynb tutorial be considered a good default? Specifically, I want to focus on extracting the impulse, because what if, instead of extracting from second 1.1 to second 1.3 as is done in the tutorial, we had extracted from second 1.1 to 1.4? The augmented speech would have sounded more reverberated. Hence, how do we decide the time interval for the impulse extraction and agree that it yields a good default RIR?
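
For concreteness, the extraction step in question looks roughly like this; the window boundaries are exactly the arbitrary choice being debated:

```python
import torch
import torchaudio

rir_raw, sample_rate = torchaudio.load(SAMPLE_RIR)
# The window determines how much of the room's decay tail is kept; widening
# the end from 1.3 s to 1.4 s keeps more tail and sounds more reverberant.
rir = rir_raw[:, int(sample_rate * 1.1) : int(sample_rate * 1.3)]
rir = rir / torch.linalg.vector_norm(rir, ord=2)
```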

P.S. Thanks for doing the tutorial, it's extremely helpful.

gziz avatar Apr 04 '23 18:04 gziz