
Negative audio samples for M way matching

Open ak-7 opened this issue 5 years ago • 4 comments

Where are the negative audio samples generated for the M-way matching problem? I only see the load_wav function sampling the audio corresponding to the starting index of the video frames.

I only see positive samples.

ak-7 avatar Oct 15 '20 00:10 ak-7

The negative samples are the features at different timesteps within the same batch. In the output at this line: https://github.com/joonson/syncnet_trainer/blob/15e5cfcbe150da8ed5c04cfe74a011319ae60d06/SyncNetDist.py#L50 all the non-diagonal elements are negatives.
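The idea above can be sketched as follows. This is a minimal illustration, not the repo's actual loss code: the tensor names and shapes are hypothetical, and it assumes one audio and one video feature per timestep, with matching timesteps on the diagonal of a similarity matrix.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature tensors (names and shapes are illustrative):
# one audio and one video embedding per timestep from the same batch.
T, D = 10, 512                      # timesteps, embedding dimension
feat_video = torch.randn(T, D)
feat_audio = torch.randn(T, D)

# Similarity matrix: entry (i, j) compares video step i with audio step j.
sim = feat_video @ feat_audio.t()   # shape (T, T)

# Diagonal entries are the positives (matching timesteps); every
# off-diagonal entry acts as a negative, giving a T-way matching problem
# that cross-entropy over each row solves directly.
labels = torch.arange(T)
loss = F.cross_entropy(sim, labels)
```

So no explicit negative sampling is needed: the other timesteps in the batch supply the negatives for free.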

joonson avatar Oct 15 '20 14:10 joonson

Thanks for that explanation.

Is the context window for the video and audio frames decided by the kernel size of the first audio and video conv layers? For example, if we want a context window of size 5, do we set the kernel size to 5?

Also, aren't the predictions left-aligned for this context window? For every frame we take a future context of 5 frames to predict the label corresponding to that frame.

ak-7 avatar Oct 15 '20 16:10 ak-7

> Is the context window for the video and audio frames decided by the kernel size of the first audio and video conv layers? For example, if we want a context window of size 5, do we set the kernel size to 5?

Yes

> Also, aren't the predictions left-aligned for this context window? For every frame we take a future context of 5 frames to predict the label corresponding to that frame.

The predictions lose two frames on each side because of the receptive field. So, for example, you need to look at the 5th feature to see the output corresponding to frames 5-9.
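The receptive-field effect described above can be seen with a toy 1-D convolution over time. This is an illustration, not the repo's actual layer; the channel counts are made up, and only the kernel size of 5 with no padding matters here.

```python
import torch
import torch.nn as nn

# Toy temporal conv: kernel size 5, stride 1, no padding, so the kernel
# size sets the temporal context window.
T, C = 20, 8
x = torch.randn(1, C, T)            # (batch, channels, time)
conv = nn.Conv1d(C, 16, kernel_size=5, stride=1, padding=0)
y = conv(x)

# Output length is T - 4: each output step t is computed from input
# frames t..t+4, i.e. two frames are lost on each side relative to the
# window centre.
print(y.shape)                      # torch.Size([1, 16, 16])
```

With T = 20 input frames, only 16 output features remain, and output index t covers input frames t through t+4.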

joonson avatar Oct 16 '20 13:10 joonson

Changing the audio kernel size here breaks the dimensions of the model. How did you account for the context window size and M inside the model?

ak-7 avatar Nov 18 '20 01:11 ak-7