
ADAPTING SELF-SUPERVISED MODELS TO MULTI-TALKER SPEECH RECOGNITION USING SPEAKER EMBEDDINGS

We provide the code and models for our ICASSP 2023 paper "Adapting self-supervised models to multi-talker speech recognition using speaker embeddings".

Requirements and Installation

  • Python version == 3.7
  • torch==1.10.0, torchaudio==0.10.0
# Install fairseq
git clone -b multispk --single-branch https://github.com/HuangZiliAndy/fairseq.git
cd fairseq
pip install --editable ./

# Install apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" ./

pip install -r requirements.txt 
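
After installing, a quick sanity check confirms that the pinned versions and the patched fairseq build are importable (a minimal sketch; the version pins simply mirror the requirements listed above):

# sanity_check.py -- verify the pinned dependencies are importable
# (a minimal sketch; version pins mirror the requirements above)
import torch
import torchaudio
import fairseq

assert torch.__version__.startswith("1.10.0"), torch.__version__
assert torchaudio.__version__.startswith("0.10.0"), torchaudio.__version__
print("torch:", torch.__version__)
print("torchaudio:", torchaudio.__version__)
print("fairseq:", fairseq.__version__)
print("CUDA available:", torch.cuda.is_available())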

Data preparation

# Prepare LibriMix (https://github.com/JorisCos/LibriMix)
# We only need the 16 kHz "max" condition in our experiments;
# train-360 is not needed.

# Install Kaldi (https://github.com/kaldi-asr/kaldi)

# Link utils to current directory
ln -s <kaldi_dir>/egs/wsj/s5/utils .

# Run the following two scripts to prepare fairseq-style
# training data for LibriMix

# The difference between the two scripts is that the former
# uses forced-alignment results to create tight utterance
# boundaries (utterance-based evaluation), while the latter
# keeps the full-length mixtures (utterance group-based evaluation)
./myscripts/LibriMix/prepare_librimix.sh
./myscripts/LibriMix/prepare_librimix_full_len.sh
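
Once the preparation scripts finish, it is worth checking that the generated audio manifest and its label file line up. The sketch below assumes the standard fairseq wav2vec-style layout (a .tsv manifest whose first line is the audio root directory, plus a parallel label file such as .ltr); the exact file names produced by the scripts above may differ.

# check_manifest.py -- hypothetical sanity check for fairseq-style data
# (assumes a wav2vec-style .tsv manifest plus a parallel .ltr label file;
#  adjust the paths to whatever the preparation scripts actually emit)
import sys

tsv_path, ltr_path = sys.argv[1], sys.argv[2]

with open(tsv_path) as f:
    lines = f.read().splitlines()
root, entries = lines[0], lines[1:]  # first line is the audio root directory

with open(ltr_path) as f:
    labels = f.read().splitlines()

assert len(entries) == len(labels), (len(entries), len(labels))
print(f"{len(entries)} utterances rooted at {root}")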

Extract speaker embeddings for the enrollment utterances. We use 15 s of speech from LibriVox (not included in LibriSpeech) as the enrollment utterances (LS 15 seconds enrollment). We also provide the extracted x-vector embeddings.
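
The provided x-vectors are the easiest way to get started. If you prefer to extract enrollment embeddings yourself, the sketch below uses SpeechBrain's publicly released x-vector model as a stand-in extractor; this is an assumption for illustration, not necessarily the extractor used in the paper, so the embedding dimension and normalization may differ from the embeddings we provide.

# extract_enrollment_xvector.py -- illustrative only
# Uses SpeechBrain's pretrained x-vector model as a stand-in extractor;
# the embeddings shipped with this repo may come from a different model.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="downloads/spkrec-xvect-voxceleb",
)

wav, sr = torchaudio.load("enroll/spk1.wav")      # 15 s enrollment utterance
if sr != 16000:                                   # the extractor expects 16 kHz audio
    wav = torchaudio.functional.resample(wav, sr, 16000)

with torch.no_grad():
    emb = classifier.encode_batch(wav).squeeze()  # one embedding per utterance
torch.save(emb, "enroll/spk1_xvector.pt")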

Training

Download the WavLM models (e.g. WavLM-Base+.pt) and put them under the downloads directory.
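
A quick way to confirm the checkpoint landed where the training scripts expect it (a minimal sketch; downloads/WavLM-Base+.pt matches the path used in the conversion command below, and the exact contents of the checkpoint may vary between WavLM releases):

# check_wavlm_checkpoint.py -- minimal sketch
import torch

ckpt = torch.load("downloads/WavLM-Base+.pt", map_location="cpu")
# Public WavLM checkpoints are dictionaries holding a config and a state dict;
# the exact key names may vary between releases.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))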

We offer a few example scripts for training.

# Utterance-based evaluation (wavLM Base+ without speaker embedding)
./train_scripts/LS_wavLM.sh

# Utterance-based evaluation (wavLM Base+ with speaker embedding)
./train_scripts/LS_wavLM_spk.sh

# Utterance group-based evaluation (wavLM Base+ with speaker embedding)
./train_scripts/LS_full_len_wavLM_spk.sh

# Utterance group-based evaluation (wavLM Base+ with speaker embedding + Joint Speaker Modeling (JSM))
./train_scripts/LS_full_len_wavLM_spk_JSM.sh

Evaluation

# Utterance-based evaluation (with and without speaker embedding)
./eval_scripts/LS.sh

# Utterance group-based evaluation (wavLM Base+ with speaker embedding)
./eval_scripts/LS_full_len.sh

# Utterance group-based evaluation (wavLM Base+ with speaker embedding + JSM)
./eval_scripts/LS_full_len_JSM.sh

Pretrained models

  • Utterance-based evaluation (wavLM Base+ without speaker embedding)
  • Utterance-based evaluation (wavLM Base+ with speaker embedding)
  • Utterance group-based evaluation (wavLM Base+ with speaker embedding)
  • Utterance group-based evaluation (wavLM Base+ with speaker embedding + JSM)

When running inference with a pretrained model, please first convert the checkpoint using

python myscripts/convert_model.py <model_dir>/checkpoint_last.pt downloads/WavLM-Base+.pt <model_dir>/checkpoint_last_tmp.pt
mv <model_dir>/checkpoint_last_tmp.pt <model_dir>/checkpoint_last.pt

Citation

Please cite as:

@inproceedings{huang2023adapting,
  title={Adapting self-supervised models to multi-talker speech recognition using speaker embeddings},
  author={Huang, Zili and Raj, Desh and Garc{\'\i}a, Paola and Khudanpur, Sanjeev},
  booktitle={IEEE ICASSP},
  year={2023},
}