temporal-shift-module
Why is the input shape (NT,C,H,W)?
Hi, I want to use TSM on my own dataset, which is video-like: each gesture has 32 frames, so my input shape is (N,C,T,H,W). A 2D conv backbone such as resnet50 needs a 4-dimensional input, so how should I convert my input to 4-D? If I use x.view(NT,C,H,W), I get a 4-D input, but the labels still have size N, so the batch sizes no longer match. I don't know how to solve this.
Did you convert all your videos to a set of frames?
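On the shape mismatch described in the question, here is a minimal sketch of one common way to handle it, not necessarily what this repository does internally: fold T into the batch dimension for the 2D backbone, then unfold the per-frame outputs back to (N, T, num_classes) and average over T so the prediction lines up with the N labels. The helper names `fold_time_into_batch` and `average_over_frames`, the tiny stand-in backbone, and the tensor sizes are all assumptions made for illustration. Note that a plain `x.view(N*T, C, H, W)` on an (N,C,T,H,W) tensor would mix channels and frames, so the sketch moves T next to N first.

```python
import torch
import torch.nn as nn


def fold_time_into_batch(x):
    """(N, C, T, H, W) -> (N*T, C, H, W), keeping each frame intact."""
    n, c, t, h, w = x.shape
    # Move T next to N before merging; reshape copies if the memory is not contiguous.
    return x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)


def average_over_frames(frame_logits, n, t):
    """(N*T, K) -> (N, K) by averaging the T per-frame predictions of each video."""
    return frame_logits.view(n, t, -1).mean(dim=1)


# Hypothetical stand-in for a 2D backbone such as resnet50 (kept tiny for the example).
backbone = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 5),  # 5 classes, chosen arbitrarily
)

n, t = 2, 32
x = torch.randn(n, 3, t, 16, 16)   # (N, C, T, H, W)
labels = torch.tensor([1, 4])      # one label per video, size N

frames = fold_time_into_batch(x)                       # (N*T, C, H, W)
frame_logits = backbone(frames)                        # (N*T, K)
video_logits = average_over_frames(frame_logits, n, t) # (N, K)
loss = nn.functional.cross_entropy(video_logits, labels)
```

With this arrangement the loss is computed on N video-level predictions, so the labels do not need to be repeated per frame.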