Question about details of data preprocessing.
Hi! Thanks for releasing the code!
I have trained the model with the features you provided and achieved similar performance. Now I'm trying to train a model on data I preprocessed myself. I downloaded the ImageNet-VID dataset and extracted optical flow with the OpenCV TVL1 algorithm as implemented in MMAction. The optical flow is computed on the original-size video frames, without resizing. Then I crop the region proposals according to the tubes you provided and resize them to 224×224. Finally, following this, I rescale the RGB and flow values to [-1, 1] (via x/127.5 - 1) and feed them to the official I3D model.
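For concreteness, here is roughly how I extract and quantize the flow (a minimal sketch; the `bound=20` value and the helper name are my own assumptions based on how MMAction/denseflow typically quantize TVL1 flow):

```python
import cv2
import numpy as np

# Requires opencv-contrib-python for the cv2.optflow module.
def extract_tvl1_flow(prev_gray, curr_gray, bound=20.0):
    """TVL1 flow on original-size grayscale frames, quantized to [0, 255]."""
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    flow = tvl1.calc(prev_gray, curr_gray, None)  # H x W x 2, float32
    flow = np.clip(flow, -bound, bound)           # clip to +/- bound (assumed 20)
    # Map [-bound, bound] linearly to [0, 255] and store as uint8 images.
    flow = np.round((flow + bound) / (2 * bound) * 255.0).astype(np.uint8)
    return flow[..., 0], flow[..., 1]             # x- and y-components
```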
However, the features I get are somewhat different from the ones you released. The cosine similarity is about 0.95 for the I3D RGB features and about 0.8 for the I3D flow features. The similarity is not too low, but there is still a gap, especially for the flow features. Is there anything wrong with the process described above?
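This is how I measured the similarity (sketch only; the HDF5 dataset key and the file paths are my guesses at the release layout):

```python
import h5py
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

# 'vid_i3d/1144.h5' is from your release; the dataset key inside it is assumed.
with h5py.File('vid_i3d/1144.h5', 'r') as f:
    ref = np.asarray(f[list(f.keys())[0]]).ravel()

mine = np.load('my_features/1144.npy').ravel()  # my own extraction (hypothetical path)
print(cosine_sim(mine, ref))  # ~0.95 for RGB, ~0.8 for flow
```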
Here are some questions:
- After resizing the flow images to 224×224, do their values need to be rescaled?
- How exactly do you crop and normalize the images? I read images with PIL and resize them with bilinear interpolation; see the snippet after this list.
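For reference, this is the crop/resize/normalize code I currently use (the helper is illustrative, not from your repo):

```python
from PIL import Image
import numpy as np

def load_crop_resize(path, box):
    """box = (left, upper, right, lower), taken from the provided tubes."""
    img = Image.open(path).convert('RGB')
    img = img.crop(box)
    img = img.resize((224, 224), Image.BILINEAR)
    return np.asarray(img, dtype=np.float32) / 127.5 - 1.0  # rescale to [-1, 1]
```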
Thanks a lot!
BTW, I found some I3D features and tubes that don't correspond. For example, 1144.pk in vid_rgb and 1144.pd in tube_Prp have 63 frames, but 1144.h5 in vid_i3d has 126 frames.
- Thank you for your interest and questions!
- Currently, I have no time to check the frames. One possible reason is that the video frame decoders used different fps rates (126 is exactly twice 63, which would be consistent with a 2× fps difference). Part of the frames/features were extracted by my colleagues during my internship at Tencent, so I need to re-check the feature-extraction code before I can give you a more detailed response. I am currently busy with new research projects and will come back to this issue when I have more spare time.