speakerbox `AudioSegment.from_file` is not restricted to WAV files

Feature Description

In preprocess.py, you use `AudioSegment.from_file` to load the audio files. According to the definition of this function, it appears to accept many file formats. However, on the previous line, you restrict the user to WAV files with `for ... in label_dir.glob("*.wav")`. Is there a reason for that? MP3/OGG/etc. files would be useful, as they take up less disk space.

Use Case

Basic preprocessing followed by training, as described in the README.

Solution

Replace `.glob("*.wav")` with a list of the formats pydub accepts. I haven't found such a list yet, but the implementation of `from_file` gives some hints about which formats are accepted.
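A minimal sketch of that change, assuming the loop only needs a list of file paths to hand to `AudioSegment.from_file`. The `AUDIO_EXTENSIONS` tuple and the `iter_audio_files` helper are hypothetical, not part of speakerbox; since pydub relies on ffmpeg to decode non-WAV input, which of these formats actually load depends on the local ffmpeg installation.

```python
from pathlib import Path
from typing import Iterable, List

# Hypothetical extension list: pydub decodes non-WAV files through ffmpeg,
# so whether each format loads depends on the installed ffmpeg build.
AUDIO_EXTENSIONS = (".wav", ".mp3", ".ogg", ".flac", ".m4a")


def iter_audio_files(
    label_dir: Path, extensions: Iterable[str] = AUDIO_EXTENSIONS
) -> List[Path]:
    """Return regular files in label_dir whose suffix is in extensions.

    Suffixes are compared case-insensitively, so "clip.MP3" also matches.
    """
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in label_dir.iterdir()
        if path.is_file() and path.suffix.lower() in wanted
    )
```

Each returned path would then be passed to `AudioSegment.from_file` exactly as the WAV files are today. Keeping the extensions in one tuple makes it easy to narrow the list back down if a format turns out not to decode on a given machine.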

Dec 04 '22 15:12 NicolasMICAUX

My only reason was that I didn't know the list of accepted formats, and I primarily work with WAV. If you want to make a PR, I would happily accept one!

Dec 05 '22 17:12 evamaxfield

Ok! I will probably do a PR for this when I have some time.

Dec 05 '22 22:12 NicolasMICAUX

Pydub is a library for loading and manipulating audio files in various formats, so it does support many file formats.

However, here it is only used as a preprocessing step. The diarization models in Pyannote were trained on .wav files and therefore expect that format.

Also, it is good practice not to use "lossy" formats when storing audio/video files if you plan to use them as inputs to downstream ML/deep learning models. Lossy formats such as .mp3 and .ogg discard information, and the models work best when as much information as possible is preserved.

This also means that if you store files in a lossy format to save on storage costs and then convert them to .wav for inference, the discarded information is never restored: converting back to a lossless format (such as .wav) cannot recover it. This will result in lower-quality outputs from the models.
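A toy illustration of that point, using only plain Python and the standard-library `wave` module. Dropping the low 8 bits of each 16-bit sample is used here as a crude stand-in for a lossy codec (real MP3/OGG encoders use perceptual transform coding, not bit truncation), and the helper names are hypothetical. The WAV round trip reproduces the degraded samples bit for bit, but never the originals.

```python
import io
import math
import struct
import wave
from typing import List


def lossy_quantize(samples: List[int], bits_dropped: int = 8) -> List[int]:
    """Crude stand-in for a lossy codec: zero the low-order bits of each sample."""
    step = 1 << bits_dropped
    return [(s // step) * step for s in samples]


def to_wav_bytes(samples: List[int], rate: int = 16000) -> bytes:
    """Encode mono 16-bit samples as an in-memory WAV file (lossless)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16-bit PCM
        w.setframerate(rate)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))
    return buf.getvalue()


def from_wav_bytes(data: bytes) -> List[int]:
    """Decode mono 16-bit WAV bytes back into a list of samples."""
    with wave.open(io.BytesIO(data), "rb") as w:
        frames = w.readframes(w.getnframes())
    return list(struct.unpack(f"<{len(frames) // 2}h", frames))


if __name__ == "__main__":
    # 0.1 s of a 440 Hz tone at 16 kHz.
    original = [int(10000 * math.sin(2 * math.pi * 440 * n / 16000)) for n in range(1600)]
    restored = from_wav_bytes(to_wav_bytes(lossy_quantize(original)))
    print("identical to original:", restored == original)
```

Saving the degraded signal as WAV keeps it exactly as it is; the information removed by the lossy step is simply not there to restore.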

Sorry for the late input...

I think it is safe to close this one?

Aug 17 '24 19:08 kristopher-smith