NeMo: Question about the settings in speech_data_simulator

Hi, I'm currently using NeMo/tools/speech_data_simulator to fine-tune the MSDD model and have some questions about the data_simulator.

1. How can I ensure that every session has exactly as many speakers as `num_speakers`?

In my case, sessions are occasionally created that contain fewer speakers than `num_speakers`.
This seemed to become more frequent as `num_speakers` became larger than 4. For example, I created 32 sessions with `num_speakers` set to 4, and 9 of them include only 3 speakers.
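
One way to measure how often this happens is to count the distinct speaker labels in each generated RTTM file. The following is a minimal sketch, assuming the simulator writes one standard RTTM file per session into a single output directory; the function names and the directory layout are illustrative, not part of NeMo.

```python
from collections import defaultdict
from pathlib import Path


def speakers_per_session(rttm_lines):
    """Map each session ID to the set of speaker labels found in RTTM lines."""
    speakers = defaultdict(set)
    for line in rttm_lines:
        fields = line.split()
        # Standard RTTM layout:
        # SPEAKER <session> <chan> <onset> <dur> <NA> <NA> <speaker> <NA> <NA>
        if len(fields) >= 8 and fields[0] == "SPEAKER":
            speakers[fields[1]].add(fields[7])
    return dict(speakers)


def find_short_sessions(rttm_dir, expected):
    """Return {session: speaker_count} for sessions with fewer speakers than expected."""
    short = {}
    for path in sorted(Path(rttm_dir).glob("*.rttm")):
        with open(path, encoding="utf-8") as f:
            for session, spk in speakers_per_session(f).items():
                if len(spk) < expected:
                    short[session] = len(spk)
    return short
```

Calling `find_short_sessions(output_dir, expected=4)` on the simulator's output directory (the path is whatever you configured) lists the sessions that came out with fewer than 4 speakers.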

I used a custom dataset as the input to the simulator, with around 50 speakers in total.
Each speaker had at least 300 utterances, and the average utterance length was about 5 seconds.

As far as I can tell, the following parameters are related to this question:

https://github.com/NVIDIA/NeMo/blob/0e744c9300ca99060696b3536978ff5629312071/tools/speech_data_simulator/conf/data_simulator.yaml#L8-L9

https://github.com/NVIDIA/NeMo/blob/0e744c9300ca99060696b3536978ff5629312071/tools/speech_data_simulator/conf/data_simulator.yaml#L79-L83

https://github.com/NVIDIA/NeMo/blob/0e744c9300ca99060696b3536978ff5629312071/tools/speech_data_simulator/conf/data_simulator.yaml#L18-L21

I tried tweaking the settings to fix this, but nothing worked.

My current setup is as follows:

config.data_simulator.session_config.num_speakers = num_speakers  # varies from 2 to 6
config.data_simulator.session_config.session_length = session_length  # varies from 10 min to 40 min
config.data_simulator.session_params.min_dominance = 1 / (num_speakers + 1)
config.data_simulator.session_params.mean_silence = 0.08
config.data_simulator.session_params.turn_prob = 0.875
config.data_simulator.session_params.min_turn_prob = 0.875
config.data_simulator.speaker_enforcement.enforce_num_speakers = True
config.data_simulator.speaker_enforcement.enforce_time = {0: 1.0, 1: 1.0}  # I've also tried {0: 0.75, 1: 1.0} and {0: 0.99, 1: 1.0}
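
For reference, the `min_dominance = 1 / (num_speakers + 1)` setting above can be tabulated per speaker count. This is only a sketch of the arithmetic, assuming `min_dominance` acts as a lower bound on each speaker's share of the total speaking time; `reserved_share` is a helper name made up for this illustration, and nothing here is claimed to be the cause of the missing speakers.

```python
def reserved_share(num_speakers):
    """Fraction of total speaking time pre-committed by the per-speaker minimums."""
    # Assumption: min_dominance is a per-speaker lower bound on speaking-time share.
    min_dominance = 1 / (num_speakers + 1)
    return num_speakers * min_dominance


for n in range(2, 7):
    print(
        f"num_speakers={n}: min_dominance={1 / (n + 1):.3f}, "
        f"reserved={reserved_share(n):.3f}, free={1 - reserved_share(n):.3f}"
    )
```

Under that assumption, the share of speaking time left unconstrained shrinks from 1/3 with 2 speakers to 1/7 with 6 speakers.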

2. Why is the default value of `sentence_length_params` not an integer?

According to the comments, the value of `sentence_length_params` must be a positive integer, but the default is 0.4. Sessions are generated without problems with this setting, but I'd like to know why this is the default.

https://github.com/NVIDIA/NeMo/blob/0f2874b270f476405f11aeb09d38a709118c67b5/tools/speech_data_simulator/conf/data_simulator.yaml#L15-L17

Thank you in advance.

May 23 '24 01:05 sappho192

Just in case, I've been using the latest version of NeMo with:

apt-get update && apt-get install -y libsndfile1 ffmpeg
git clone https://github.com/NVIDIA/NeMo
cd NeMo
./reinstall.sh

May 23 '24 06:05 sappho192

@tango4j, could you check the above issue with num_speakers and sentence_length_params?

May 23 '24 18:05 anteju

> How can I ensure that every session has exactly as many speakers as num_speakers?

We need a little more time to figure out why `enforce_num_speakers: true` is not working as expected for 10-minute sessions with more than 4 speakers. @tango4j, we have a primitive fix in mind but need to test it further.

> Why is the default value of sentence_length_params not an integer?

You're right that the k in `sentence_length_params` should usually be an integer. We use a default of 0.4 to match the segment length distribution in the AMI dataset, but you can set it to an integer value instead, which would generally increase the segment lengths.
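
To illustrate why a non-integer k can still work, here is a minimal sketch that assumes k is the shape parameter of a negative binomial distribution over sentence length in words. The negative binomial is well defined for any real k > 0 when its PMF is written with gamma functions. The value p = 0.05 is an assumption for illustration and should be checked against your own config.

```python
import math


def nb_log_pmf(x, k, p):
    """Log-PMF of a negative binomial with real-valued shape k > 0."""
    return (
        math.lgamma(x + k) - math.lgamma(k) - math.lgamma(x + 1)
        + k * math.log(p) + x * math.log(1 - p)
    )


def nb_mean(k, p):
    """Mean of the negative binomial: k * (1 - p) / p."""
    return k * (1 - p) / p


p = 0.05  # assumed value, not taken from this thread
for k in (0.4, 1, 2):
    total = sum(math.exp(nb_log_pmf(x, k, p)) for x in range(5000))
    print(f"k={k}: pmf sums to {total:.6f}, mean={nb_mean(k, p):.2f}")
```

Under these assumptions the mean grows linearly with k, which is consistent with integer values of k producing longer segments than the default 0.4.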

May 30 '24 13:05 stevehuang52

> We need a little more time to figure out why enforce_num_speakers: true is not working as expected for 10mins sessions and more than 4 speakers.

@stevehuang52 Thank you for looking into this issue. In case it helps, here are the dataset I used and the simulated meetings I generated: [dataset(4.0GB)] [alignments in simple,condensed format(2MB)] [generated sim_meet over 2~6 speakers(4.3GB)]

> You're right that the k in sentence_length_params should usually be an integer. We use a default 0.4 in order to match the segment length distribution in AMI dataset, but you can set it to other integer values, and that would generally increase the lengths of segments

Thanks a lot. I'll set it to an integer in my case, then.

May 31 '24 01:05 sappho192

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Jun 30 '24 01:06 github-actions[bot]

(Just a bump)

Jul 01 '24 00:07 sappho192

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

Aug 01 '24 01:08 github-actions[bot]

This issue was closed because it has been inactive for 7 days since being marked as stale.

Aug 09 '24 01:08 github-actions[bot]
