Explicit-Sparse-Transformer: code implementing the top-k feature

Hello, I would like to ask whether all of the code implementing the top-k feature is contained in the SparseActivatedMultiheadAttention class in sparse_activated_multihead_attention.py.

Mar 03 '22 13:03 z972778371

Yes

Mar 03 '22 14:03 zhaoguangxiang


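Hello, I have a few questions about the top-k code:

1. Many of the parameters in the code are unclear to me, for example self.onnx_trace, entmax, bmm_fp16_support, and cur_san_active. What are they for?
2. Line 260 of the code is attn_weights = self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz). Looking at the definition of apply_sparse_mask, it only returns attn_weights without processing it. What is this step for?
3. The entmax in the code uses tf (TensorFlow). Can the PyTorch version from the original paper be used as a drop-in replacement for the entmax in the code?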

Mar 06 '22 03:03 z972778371

1. You can ignore them. onnx_trace comes from fairseq, bmm_fp16_support detects whether multihead attention can use fp16, and cur_san_active decides whether sparsification is applied to encoder self-attention, decoder self-attention, and decoder cross-attention.
2. Line 260 has nothing to do with our implementation.
3. I don't follow. The entmax in sparse activated multihead attention is the PyTorch version.
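For readers wondering why a call that returns its input unchanged would exist at all: one plausible reading, an assumption rather than something confirmed in this thread, is that apply_sparse_mask is an identity hook inherited from the fairseq attention module, which subclasses can override to add a fixed sparsity pattern. The sketch below illustrates that pattern with plain Python lists. BaseAttention and BandedAttention are hypothetical names made up for illustration, not classes from this repository or from fairseq.

```python
NEG_INF = float("-inf")


class BaseAttention:
    """Hypothetical stand-in for an attention module with an overridable hook."""

    def apply_sparse_mask(self, attn_weights, tgt_len, src_len, bsz):
        # Default behaviour: identity. No extra sparsity pattern is applied.
        return attn_weights

    def forward(self, attn_weights, tgt_len, src_len, bsz):
        # The hook is always called; it only has an effect if overridden.
        return self.apply_sparse_mask(attn_weights, tgt_len, src_len, bsz)


class BandedAttention(BaseAttention):
    """Hypothetical subclass: query i may only attend to keys within a window."""

    def __init__(self, window):
        self.window = window

    def apply_sparse_mask(self, attn_weights, tgt_len, src_len, bsz):
        # Set every score outside the band |i - j| <= window to -inf.
        return [
            [
                [w if abs(i - j) <= self.window else NEG_INF
                 for j, w in enumerate(row)]
                for i, row in enumerate(batch)
            ]
            for batch in attn_weights
        ]


# Shape: bsz=1, tgt_len=3, src_len=3
scores = [[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]]
base_out = BaseAttention().forward(scores, 3, 3, 1)
banded_out = BandedAttention(window=0).forward(scores, 3, 3, 1)
```

Under this reading, the base class leaves the scores untouched, which would explain why the call on line 260 has no visible effect in a model that does not override the hook.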


Mar 06 '22 04:03 zhaoguangxiang

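Thank you very much for your reply. I have a few more questions about your code, and I would be very grateful if you could help me with them ^_^

Currently my model's attention_mask only sets the padding positions of the text to -∞, and I would like to add sparse attention to see whether it improves results further.

PS: My code splits the attention computation, the encoder, the decoder, and the transformer into four Python files, so I may need to call your code piece by piece.

1. What is the args parameter in your code? I see self.args = args, self.div = args.div, and self.lb = args.lb. The boolean cur_san_active and self.entmax are also determined from args.use_att, so I would like to know how the values in args are set.
2. The values of self.div and self.lb determine the value of top_k. Are these two parameters set manually in args, or in some other way?
3. Lines 297-312 of the code choose the normalization based on self.entmax. The original paper proposes 1.5-entmax, so in an actual run, is args.entmax set to 2?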

Mar 07 '22 14:03 z972778371

1. args is set in fairseq/model/transformer.
2. They are set by you.
3. Yes.
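For anyone adding top-k sparsification to a model that already masks padding with -∞, the idea can be sketched on a single row of attention scores using only the Python standard library. This is a minimal sketch under stated assumptions, not the code from sparse_activated_multihead_attention.py. In particular, choose_top_k is a hypothetical rule for turning div and lb into k (roughly src_len / div, with lb as a lower bound), and the repository's actual formula may differ.

```python
import math

NEG_INF = float("-inf")


def choose_top_k(src_len, div, lb):
    """Hypothetical rule (an assumption): keep about src_len / div keys,
    at least lb of them, and never more than src_len."""
    k = math.ceil(src_len / div)
    return min(src_len, max(lb, k))


def topk_softmax(scores, k):
    """Keep the k largest scores of one attention row, mask the rest, softmax.

    Assumes 1 <= k <= len(scores) and at least one finite score.
    Padded positions are expected to be -inf already, so they are never
    selected while at least k real positions exist.
    Ties at the threshold are all kept, so more than k entries may survive.
    """
    threshold = sorted(scores, reverse=True)[k - 1]
    masked = [s if s >= threshold else NEG_INF for s in scores]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # math.exp(-inf) is 0.0
    total = sum(exps)
    return [e / total for e in exps]


# One query row: three real keys and one padded key (already -inf).
row = [2.0, 1.0, 0.5, NEG_INF]
k = choose_top_k(src_len=len(row), div=2, lb=1)
probs = topk_softmax(row, k)
```

In PyTorch the same steps are usually written with torch.topk to find the k-th largest score per row and masked_fill to set everything below it to -inf before the softmax.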

Mar 07 '22 22:03 zhaoguangxiang

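Hello, how should one choose between entmax15 and top-k? In your sparse_activated_multihead_attention.py, entmax and top-k are mutually exclusive options. From your testing experience, which situations is each of them suited to?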

Mar 08 '22 07:03 z972778371

Top-k is our proposal. Entmax is also excellent.
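To make the difference concrete: top-k fixes the number of surviving keys in advance, while the entmax family lets the scores themselves decide how many entries become exactly zero. Softmax corresponds to α = 1, sparsemax to α = 2, and 1.5-entmax sits between the two. The sketch below implements only sparsemax, because it has a simple closed form. It uses only plain Python, is not the code used in this repository, and does not show how the args.entmax value maps to α, which the thread does not specify.

```python
def sparsemax(scores):
    """Sparsemax (entmax with alpha = 2) of one row of scores.

    Projects the scores onto the probability simplex; entries far enough
    below the largest score receive exactly zero probability.
    """
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, z in enumerate(z_sorted, start=1):
        cumsum += z
        # The j-th largest score is in the support if this condition holds;
        # the last j that satisfies it defines the threshold tau.
        if 1.0 + j * z > cumsum:
            tau = (cumsum - 1.0) / j
    return [max(s - tau, 0.0) for s in scores]


close = sparsemax([1.0, 0.8, 0.1])   # two entries stay non-zero
peaked = sparsemax([2.0, 1.0, 0.1])  # one score dominates
```

With close scores, two entries keep non-zero weight, and with one dominant score only that entry survives, so the support size adapts to the input. Top-k would keep the same number of entries in both cases.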


Mar 08 '22 08:03 zhaoguangxiang

Thank you very much for your patient answer, which helps me a lot.

Mar 08 '22 08:03 z972778371