Why is the input shape (NT,C,H,W)?

Open zzwei1 opened this issue 5 years ago • 1 comments

Hi, I want to use TSM on my own dataset, which has video-like input (each gesture has 32 frames, so my input shape is (N,C,T,H,W)). But a 2D conv backbone (such as resnet50) needs a 4-dimensional input. How should I reshape my input into a 4-D tensor? If I use x.view(NT,C,H,W), the data becomes 4-D, but the label size is still N, so the sizes no longer match. I don't know how to solve this problem.

Nov 06 '20 03:11 zzwei1

Did you convert all your videos to a set of frames?

Jan 26 '21 10:01 Fritskee
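
For anyone landing here with the same shape mismatch, below is a minimal sketch of one common way to handle it, not necessarily how this repo does it. The idea is to fold time into the batch dimension for the 2D backbone, then unfold the per-frame outputs back to (N, T, num_classes) and average over T, so there is one prediction per clip and the label size N matches again. The sizes, the tiny stand-in `backbone`, and the helper name `forward_clips` are assumptions made up for illustration; the temporal shift itself is not shown. Note the `permute` before the reshape: calling `x.view(N*T, C, H, W)` directly on an (N,C,T,H,W) tensor would mix up channels and frames.

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration; only T = 32 comes from the question.
N, C, T, H, W = 2, 3, 32, 16, 16
NUM_CLASSES = 5

# Tiny stand-in for a 2D backbone such as resnet50 (assumption: any module
# mapping (B, C, H, W) -> (B, NUM_CLASSES) can be dropped in here).
backbone = nn.Sequential(
    nn.Conv2d(C, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, NUM_CLASSES),
)

def forward_clips(x):
    """Map clips of shape (N, C, T, H, W) to clip-level logits (N, NUM_CLASSES)."""
    n, c, t, h, w = x.shape
    # Put T next to N first, then merge them: (N, C, T, H, W) -> (N*T, C, H, W).
    frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
    frame_logits = backbone(frames)              # (N*T, NUM_CLASSES)
    # Split the batch back into clips and average the frame predictions.
    return frame_logits.view(n, t, -1).mean(dim=1)

x = torch.randn(N, C, T, H, W)
labels = torch.randint(0, NUM_CLASSES, (N,))     # one label per clip
logits = forward_clips(x)
loss = nn.functional.cross_entropy(logits, labels)
```

With this layout the loss is computed on N clip-level predictions, so the labels can stay at size N. Averaging is only one possible way to combine the frame outputs.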