Inquiring about training code
Thanks for the excellent work! Could you please release the training code when you have a chance?
Hey @xuanxu92, thanks for your interest in our project.
Our training code is based on PIA. We re-implemented the attention operation of the Motion Module and use the following attention mask during temporal attention:
```python
import torch


def make_tril_block_mask(video_length: int, patch_size: int, device):
    """Build an additive attention mask for temporal attention.

    The first `patch_size` frames attend bi-directionally to each other
    (warm-up block); every later frame attends causally to all frames
    before it. Allowed positions (1 = attend) for video_length=4,
    patch_size=2:

    tensor([[1., 1., 0., 0.],
            [1., 1., 0., 0.],
            [1., 1., 1., 0.],
            [1., 1., 1., 1.]])
    """
    tmp_mask = torch.zeros(video_length, video_length)
    # warmup steps: bi-directional attention within the first patch
    for idx in range(patch_size):
        tmp_mask[idx, :patch_size] = 1
    # tril blocks: each later frame attends to itself and all earlier frames
    for idx in range(patch_size, video_length):
        tmp_mask[idx, :idx + 1] = 1
    tmp_mask = tmp_mask.type(torch.bool)
    # convert to an additive float mask: 0 where allowed, -inf where blocked
    mask = torch.zeros_like(tmp_mask, dtype=torch.float)
    mask.masked_fill_(tmp_mask.logical_not(), float('-inf'))
    return mask.to(device)
```
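For reference, a minimal usage sketch: the tensor shapes and the use of PyTorch 2.x `scaled_dot_product_attention` here are illustrative assumptions, not the exact Motion Module layout.

```python
import torch
import torch.nn.functional as F

video_length, patch_size = 4, 2
mask = make_tril_block_mask(video_length, patch_size, device="cpu")

# toy temporal attention over `video_length` frames;
# (batch, heads, frames, head_dim) shapes are illustrative only
q = torch.randn(1, 8, video_length, 64)
k = torch.randn(1, 8, video_length, 64)
v = torch.randn(1, 8, video_length, 64)

# the additive 0 / -inf mask plugs directly into SDPA's attn_mask
out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
print(out.shape)  # torch.Size([1, 8, 4, 64])
```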
Hey, thank you for the tips. I also want to try Live2Diff with uni-directional attention and see what happens. Is the training setting for the results shown in Live2Diff Figure 3(d) the same as the warm-up uni-directional attention training setting, e.g., 3000 steps, batch size 1024, lr 1e-4?
Hey @xuanxu92, sorry for the late response. I checked the history of your comments. Regarding the earlier comments: if you apply "full uni-directional" attention (i.e., the causal attention used in LLMs), it is understandable that the initial frames may get stuck, as the first few frames in Live2Diff are trained with "bi-directional" attention.
For the current comment, the answer is "yes."
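To make the distinction concrete, here is a minimal sketch of the two masks (values shown for video_length=4, patch_size=2; this is an illustration, not the training code):

```python
import torch

video_length, patch_size = 4, 2

# fully causal (LLM-style) mask: frame i attends only to frames <= i,
# so frame 0 sees nothing but itself
causal = torch.tril(torch.ones(video_length, video_length))
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

# Live2Diff's warm-up variant: identical except that the first
# `patch_size` frames attend bi-directionally to each other,
# so no frame is ever trained with a single-frame context
warmup = causal.clone()
warmup[:patch_size, :patch_size] = 1
# tensor([[1., 1., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])
```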