Cheng Yu
> @hyc9 Sorry to bother you, I have a question: when computing the multi-head attention scores, why is there no need to mask out future interactions, as earlier implementations do? For example, previous code usually adds a mask matrix: attention_scores = attention_scores + attention_mask, but the code below performs no masking at all. I'm confused about this and would appreciate an explanation! attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) attention_scores = attention_scores / math.sqrt(self.attention_head_size) > > normalize the attention scores to probabilities. attention_probs =...
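For readers unfamiliar with the masking pattern the question refers to, here is a minimal sketch of an additive causal mask, assuming the usual convention of filling future positions with -inf before the softmax. The shapes and variable names (seq_len, head_size, etc.) are illustrative only and are not taken from the repository's code.

```python
import math
import torch

# Toy tensors in the usual (batch, heads, seq_len, head_size) layout.
seq_len, head_size = 5, 8
query_layer = torch.randn(1, 2, seq_len, head_size)
key_layer = torch.randn(1, 2, seq_len, head_size)

attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(head_size)

# Causal (upper-triangular) mask: position i may only attend to positions <= i.
# Future positions receive -inf so softmax assigns them ~0 probability.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attention_mask = torch.zeros(seq_len, seq_len).masked_fill(future, float("-inf"))

attention_scores = attention_scores + attention_mask  # broadcasts over batch/heads
attention_probs = torch.softmax(attention_scores, dim=-1)
```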
> @hyc9 It looks like all of the sequential recommendation models in this code are trained autoregressively? The code here seems to use the 1,2,3,4->5 style of next-item prediction plus sequence data augmentation, right?
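A small sketch of what that comment describes, assuming "sequence data augmentation" means splitting each user sequence into all of its prefixes so that every prefix predicts the next item. The function name augment_sequence is made up for illustration and is not from the repository.

```python
def augment_sequence(items, max_len=50):
    """Split one user sequence into (input, target) next-item pairs:
    [1, 2, 3, 4, 5] -> ([1], 2), ([1, 2], 3), ([1, 2, 3], 4), ([1, 2, 3, 4], 5).
    Prefixes longer than max_len are truncated from the left."""
    samples = []
    for t in range(1, len(items)):
        prefix = items[max(0, t - max_len):t]
        samples.append((prefix, items[t]))
    return samples

print(augment_sequence([1, 2, 3, 4, 5]))
```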
Hi. For your reference, most experiments can be run on a GPU with 32 GB of memory. If you run into out-of-memory issues, you could consider adjusting the training parameters in...
> Many thanks for your instructions! When I run the code, I encounter the problem below. I guess this may be caused by my torch version. I am not...
> ```
> num_workers = 10  # set it to a lower number like 1, 2, 3 ...
> rank = torch.distributed.get_rank()
> seed = torch.initial_seed()
> ```
>
> ...
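As a usage note, these lines are often wired into a DataLoader via a worker_init_fn so that each data-loading worker gets a distinct, reproducible seed; in multi-GPU training the rank from torch.distributed.get_rank() can additionally be folded into that seed. The sketch below shows this common pattern under those assumptions and is not the repository's actual code.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # torch.initial_seed() already differs per worker; fold it into the
    # Python/NumPy RNGs so any augmentation inside the dataset is reproducible.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

dataset = TensorDataset(torch.arange(100).unsqueeze(1))
loader = DataLoader(dataset, batch_size=8, num_workers=2,
                    worker_init_fn=worker_init_fn)
```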
It seems that using torch.nn.MultiheadAttention directly works fine, but the q @ k matmul I wrote myself does not appear to be counted. Why is that?