A question about the processing of video datasets
Hello! For video datasets such as MARS, I would like to ask whether a tracklet is treated similarly to a single-frame image. Also, are all the frames in a tracklet input into the network at the same time?
Hi,
For the first question, yes.
For the second one, during training 16 frames from each tracklet are input into the network. For inference, we input all frames of the tracklet at the same time.
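As a rough illustration of that difference, here is a minimal sketch, not the repository's actual code. The helper name `select_frames`, the random sampling without replacement, and the padding of short tracklets by repeating frames are all assumptions made for the example; the thread only states that 16 frames are used in training and all frames in inference.

```python
import random

def select_frames(tracklet, seq_len=16, training=True, rng=random):
    """Return the frames fed to the network for one tracklet (hypothetical helper)."""
    if not training:
        # Inference: use every frame of the tracklet.
        return list(tracklet)
    n = len(tracklet)
    if n >= seq_len:
        # Assumption: pick seq_len distinct frames at random, kept in temporal order.
        idx = sorted(rng.sample(range(n), seq_len))
    else:
        # Assumption: short tracklets are padded by repeating frames.
        idx = sorted(rng.choices(range(n), k=seq_len))
    return [tracklet[i] for i in idx]

# Usage: frames are stand-in integers here; in practice they would be images.
frames = list(range(1000))
train_clip = select_frames(frames, training=True)   # 16 frames
test_clip = select_frames(frames, training=False)   # all 1000 frames
```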
Why do you select 16 frames instead of inputting them all into the network during training?
Because we do not have enough GPU memory for training. The largest tracklet has more than 1,000 frames, which would need about 60 times as much GPU memory as a 16-frame clip.
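A quick back-of-the-envelope check of that figure, assuming (as a simplification) that activation memory grows linearly with the number of frames fed in per tracklet; the function name `frame_memory_ratio` is made up for the example:

```python
def frame_memory_ratio(num_frames, seq_len=16):
    """Ratio of frames processed versus a seq_len-frame clip.

    Assumption: memory cost scales linearly with the number of frames.
    """
    return num_frames / seq_len

print(frame_memory_ratio(1000))  # 62.5
```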
Thank you very much for your reply!