Cheng Yu
> @hyc9 Sorry to bother you, I have a question: when computing the multi-head attention scores, why is there no need to mask out future interactions, as earlier implementations do? For example, previous code usually adds a mask matrix: attention_scores = attention_scores + attention_mask, but the code below performs no masking at all. I'm confused about this and would appreciate an explanation! attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) attention_scores = attention_scores / math.sqrt(self.attention_head_size) > > normalize the attention scores to probabilities. attention_probs =...
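For readers unfamiliar with the masking pattern the question refers to, here is a minimal sketch of an additive causal mask, assuming the usual convention of filling future positions with -inf before the softmax. The shapes and variable names (seq_len, head_size, etc.) are illustrative only and are not taken from the repository's code.

```python
import math
import torch

# Toy tensors in the usual (batch, heads, seq_len, head_size) layout.
seq_len, head_size = 5, 8
query_layer = torch.randn(1, 2, seq_len, head_size)
key_layer = torch.randn(1, 2, seq_len, head_size)

attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
attention_scores = attention_scores / math.sqrt(head_size)

# Causal (upper-triangular) mask: position i may only attend to positions <= i.
# Future positions receive -inf so softmax assigns them ~0 probability.
future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
attention_mask = torch.zeros(seq_len, seq_len).masked_fill(future, float("-inf"))

attention_scores = attention_scores + attention_mask  # broadcasts over batch/heads
attention_probs = torch.softmax(attention_scores, dim=-1)
```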
> @hyc9 It looks like all of the sequential recommendation models in this code are trained autoregressively? The code here seems to use the 1,2,3,4->5 style of next-item prediction plus sequence data augmentation, right?
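A small sketch of what that comment describes, assuming "sequence data augmentation" means splitting each user sequence into all of its prefixes so that every prefix predicts the next item. The function name augment_sequence is made up for illustration and is not from the repository.

```python
def augment_sequence(items, max_len=50):
    """Split one user sequence into (input, target) next-item pairs:
    [1, 2, 3, 4, 5] -> ([1], 2), ([1, 2], 3), ([1, 2, 3], 4), ([1, 2, 3, 4], 5).
    Prefixes longer than max_len are truncated from the left."""
    samples = []
    for t in range(1, len(items)):
        prefix = items[max(0, t - max_len):t]
        samples.append((prefix, items[t]))
    return samples

print(augment_sequence([1, 2, 3, 4, 5]))
```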
Hi. For your reference, most experiments can be run on a GPU with 32 GB of memory. If you run into out-of-memory issues, you could consider adjusting the training parameters in...
> Many thanks for your instructions! When I run the code, I encounter the problem below. I guess this may be caused by my torch version. I am not...
> ```
> num_workers = 10  # set it to a lower number like 1, 2, 3 ...
> rank = torch.distributed.get_rank()
> seed = torch.initial_seed()
> ```
>
> ...
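As a usage note, these lines are often wired into a DataLoader via a worker_init_fn so that each data-loading worker gets a distinct, reproducible seed; in multi-GPU training the rank from torch.distributed.get_rank() can additionally be folded into that seed. The sketch below shows this common pattern under those assumptions and is not the repository's actual code.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def worker_init_fn(worker_id):
    # torch.initial_seed() already differs per worker; fold it into the
    # Python/NumPy RNGs so any augmentation inside the dataset is reproducible.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

dataset = TensorDataset(torch.arange(100).unsqueeze(1))
loader = DataLoader(dataset, batch_size=8, num_workers=2,
                    worker_init_fn=worker_init_fn)
```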
It seems that using torch.nn.MultiheadAttention directly works fine, but the q @ k matmul I wrote myself does not appear to be counted. Why is that?