
Negative audio samples for M way matching

Open ak-7 opened this issue 5 years ago • 4 comments

Where are the negative audio samples generated for the M-way matching problem? I only see the load_wav function sampling the audio corresponding to the starting index of the video frames.

I only see positive samples.

ak-7 avatar Oct 15 '20 00:10 ak-7

The negative samples are the features at different timesteps within the same batch. In the output at this line: https://github.com/joonson/syncnet_trainer/blob/15e5cfcbe150da8ed5c04cfe74a011319ae60d06/SyncNetDist.py#L50 all the non-diagonal elements are negatives.
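The idea above can be sketched as follows. This is a minimal illustration, not the repo's actual loss code: the tensor names and shapes are hypothetical, and it assumes one audio and one video feature per timestep, with matching timesteps on the diagonal of a similarity matrix.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature tensors (names and shapes are illustrative):
# one audio and one video embedding per timestep from the same batch.
T, D = 10, 512                      # timesteps, embedding dimension
feat_video = torch.randn(T, D)
feat_audio = torch.randn(T, D)

# Similarity matrix: entry (i, j) compares video step i with audio step j.
sim = feat_video @ feat_audio.t()   # shape (T, T)

# Diagonal entries are the positives (matching timesteps); every
# off-diagonal entry acts as a negative, giving a T-way matching problem
# that cross-entropy over each row solves directly.
labels = torch.arange(T)
loss = F.cross_entropy(sim, labels)
```

So no explicit negative sampling is needed: the other timesteps in the batch supply the negatives for free.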

joonson avatar Oct 15 '20 14:10 joonson

Thanks for that explanation.

Is the context window for the video and audio frames decided by the kernel size of the first audio and video conv layers? For example, if we want a context window of size 5, do we set the kernel size to 5?

Also, aren't the predictions left-aligned for this context window? For every frame we take a future context of 5 frames to predict the label corresponding to that frame.

ak-7 avatar Oct 15 '20 16:10 ak-7

> Is the context window for the video and audio frames decided by the kernel size of the first audio and video conv layers? For example, if we want a context window of size 5, do we set the kernel size to 5?

Yes

> Also, aren't the predictions left-aligned for this context window? For every frame we take a future context of 5 frames to predict the label corresponding to that frame.

The predictions lose two frames on each side because of the receptive field. So, for example, you need to look at the 5th feature to see the output corresponding to frames 5-9.
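The receptive-field effect described above can be seen with a toy 1-D convolution over time. This is an illustration, not the repo's actual layer; the channel counts are made up, and only the kernel size of 5 with no padding matters here.

```python
import torch
import torch.nn as nn

# Toy temporal conv: kernel size 5, stride 1, no padding, so the kernel
# size sets the temporal context window.
T, C = 20, 8
x = torch.randn(1, C, T)            # (batch, channels, time)
conv = nn.Conv1d(C, 16, kernel_size=5, stride=1, padding=0)
y = conv(x)

# Output length is T - 4: each output step t is computed from input
# frames t..t+4, i.e. two frames are lost on each side relative to the
# window centre.
print(y.shape)                      # torch.Size([1, 16, 16])
```

With T = 20 input frames, only 16 output features remain, and output index t covers input frames t through t+4.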

joonson avatar Oct 16 '20 13:10 joonson

Changing the audio kernel size here breaks the dimensions of the model. How did you account for the context window size and M inside the model?

ak-7 avatar Nov 18 '20 01:11 ak-7