Yang Yang
Hello, could I join the group chat? Could you please send the QR code? Thanks. [email protected]
> Hi, may I ask why you calculate the sqrt here?
I encountered the same issue. I found that it is caused by denormal numbers (< 1e-32) in the weights; please refer to https://discuss.pytorch.org/t/conv2d-is-very-slow-on-trained-weights-vs-random-weights/43377/4. BTW, which dataset did you use for training?
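If it helps, a minimal sketch of how one might confirm and work around this (the 1e-32 cutoff and the flush call are my own suggestion, not something from this repo):

```python
import torch

def denormal_fraction(model: torch.nn.Module) -> float:
    """Fraction of nonzero weights below an assumed 1e-32 magnitude cutoff."""
    small, total = 0, 0
    for p in model.parameters():
        small += ((p != 0) & (p.abs() < 1e-32)).sum().item()
        total += p.numel()
    return small / max(total, 1)

# On CPUs that support it, flushing denormals to zero usually
# restores normal conv2d speed on trained weights.
torch.set_flush_denormal(True)
```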
I just used the default values and ran tag v1.1 with the command: `python main.py --mode train --input ./data/UTKFace --output ./results`. I still got the same error.
To study the concrete implementation, you can print the processor's input/output shapes step by step. Let me first explain the final output: after going through the processor, multiple frames become a flattened sequence of patches, with shape: (grid_t * grid_h * grid_w, in_channel * temporal_patch_size * patch_size * patch_size)
- grid_x is the number of grids along dimension x, e.g. grid_h = image_height // patch_size
- in_channel: the number of input image channels, default is 3 for RGB
- temporal_patch_size:...
The order of patches in this sequence also has to take the three dimensions and the spatial merge operation into account. Take two frames as an example, each of size (1, 6, 8); assuming in_channel = 1 and patch_size = 1, each patch is a single pixel, numbered as follows: `[[[ [ 1 1 2 3 4 5 6 7], [ 8 9 10 11 12 13 14...`
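As a cross-check of the shape described above, here is a minimal sketch (the concrete sizes are my own, and it skips the spatial-merge-aware reordering of patches, so only the final shape matches what the processor produces):

```python
import torch

# hypothetical settings: 3-channel video, 2 frames, patch_size 14
in_channel, temporal_patch_size, patch_size = 3, 2, 14
t, h, w = 2, 28, 28  # frame count and spatial size, multiples of the patch sizes

grid_t = t // temporal_patch_size
grid_h = h // patch_size
grid_w = w // patch_size

video = torch.randn(t, in_channel, h, w)

# split each axis into a (grid, patch) pair ...
patches = video.view(grid_t, temporal_patch_size, in_channel,
                     grid_h, patch_size, grid_w, patch_size)
# ... then move the grid axes to the front and flatten each patch
patches = patches.permute(0, 3, 5, 2, 1, 4, 6).reshape(
    grid_t * grid_h * grid_w,
    in_channel * temporal_patch_size * patch_size * patch_size,
)
print(patches.shape)  # torch.Size([4, 1176])
```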
I have the same question. It feels like the conv3d should be applied first and the result then flattened into a token sequence, but the code implementation does exactly the opposite. Raised a thread here to draw more visibility: https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct/discussions/39
See the question I posted on HF: judging from the attention mask, the model does not learn any association across multiple frames; attention only happens inside each two-frame group. In other words, whether or not a 3D conv is applied along the time dim, the time dimension still isn't attended to in the end. Purely in theory, the video understanding capability is therefore somewhat limited.
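To make the "attention stays inside each two-frame group" point concrete, a toy block-diagonal mask might look like this (the group size and sequence length are invented for illustration):

```python
import torch

tokens_per_group = 4   # hypothetical tokens per two-frame group
num_groups = 3
seq_len = tokens_per_group * num_groups

group_id = torch.arange(seq_len) // tokens_per_group
# True where attention is allowed: only within the same group,
# so no information flows across frame groups
attn_mask = group_id[:, None] == group_id[None, :]
print(attn_mask.int())
```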
```python
import torch
import torch.nn as nn


class FramePatchEmbed(nn.Module):
    def __init__(self, patch_size, temporal_patch_size, in_channels,
                 embed_dim, spatial_merge_size):
        super().__init__()
        self.patch_size = patch_size
        self.temporal_patch_size = temporal_patch_size
        self.in_channels = in_channels
        self.embed_dim = embed_dim
        self.spatial_merge_size = spatial_merge_size
        kernel_size = (temporal_patch_size, patch_size, patch_size)
        # completion below follows Qwen2-VL's PatchEmbed:
        # non-overlapping 3D patches, stride equals kernel size, no bias
        self.proj = nn.Conv3d(in_channels, embed_dim,
                              kernel_size=kernel_size, stride=kernel_size,
                              bias=False)

    def forward(self, hidden_states):
        # input: (num_patches, in_channels * temporal_patch_size * patch_size * patch_size)
        hidden_states = hidden_states.view(
            -1, self.in_channels, self.temporal_patch_size,
            self.patch_size, self.patch_size,
        )
        # project each patch and flatten back to (num_patches, embed_dim)
        return self.proj(hidden_states).view(-1, self.embed_dim)
```
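For reference, a quick smoke test with hypothetical ViT-style sizes (not values taken from this thread):

```python
embed = FramePatchEmbed(patch_size=14, temporal_patch_size=2,
                        in_channels=3, embed_dim=1280, spatial_merge_size=2)
flat_patches = torch.randn(4, 3 * 2 * 14 * 14)  # 4 flattened patches
print(embed(flat_patches).shape)                # torch.Size([4, 1280])
```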
I noticed that the recent Qwen2.5-VL, compared with Qwen2-VL, also applies the idea of dynamic resolution along the temporal dimension, which is an improvement addressing the problems raised in these two issues.