Siyi Chen
Thank you! Downsampling by n means selecting 1 frame out of every n frames. Do you first select those frames and then calculate RGBDiff?
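To make the question concrete, here is a minimal sketch of the two possible orderings. It assumes RGBDiff means the difference between consecutive frames and that frames are stored as a `(T, H, W, C)` uint8 array; the function names are hypothetical and not taken from the author's code.

```python
import numpy as np

def downsample(frames: np.ndarray, n: int) -> np.ndarray:
    """Keep 1 frame out of every n along the time axis."""
    return frames[::n]

def rgb_diff(frames: np.ndarray) -> np.ndarray:
    """Difference between consecutive frames (assumed definition of RGBDiff)."""
    # Cast to a signed type first so uint8 subtraction cannot wrap around.
    signed = frames.astype(np.int16)
    return signed[1:] - signed[:-1]

def diff_after_downsample(frames: np.ndarray, n: int) -> np.ndarray:
    """Option A: select frames first, then take differences (gap of n frames)."""
    return rgb_diff(downsample(frames, n))

def downsample_after_diff(frames: np.ndarray, n: int) -> np.ndarray:
    """Option B: take adjacent-frame differences first, then select every n-th."""
    return downsample(rgb_diff(frames), n)
```

The two options are not equivalent: option A measures motion across a gap of n frames, while option B keeps differences between adjacent original frames.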
Thank you! How about the result obtained in Table 12? Do you also do self-supervised training on two encoders for RGB and RGBDiff, and average the similarity of two...
Thank you! Also, I am curious: have you tried training on the full K400 dataset? Would that help or harm the model? > On Sep 13, 2023, at 4:35 AM,...
What's more, I think the author's implementation of `encode_image_mini` may always fuse all text tokens with image tokens - this is a problem during training, since the "answer" text part...