Michael Gschwind
@pytorchbot merge
@byjlw please assign this issue to somebody on your team to resolve. We missed the release cut on this, but let's stop letting these issues slip through uncontrolled.
torch.nn.MultiheadAttention is defined to accept floats => https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html "For a float mask, the mask values will be added to the attention weight."
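To illustrate the documented behavior, here is a minimal sketch (shapes and hyperparameters are arbitrary, chosen only for the example): a float `attn_mask` is added to the attention weights, so 0.0 entries are no-ops and -inf entries block attention entirely.

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 8)  # (batch, seq, embed)

# Float mask of shape (L, S): values are added to the attention weights,
# so 0.0 leaves a position untouched and -inf masks it out completely.
attn_mask = torch.zeros(4, 4)
attn_mask[:, -1] = float("-inf")  # no query may attend to the last key

out, attn_weights = mha(x, x, x, attn_mask=attn_mask)
print(attn_weights[0, :, -1])  # ~0 attention to the masked key
```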
(At least) as early as November 2021, we issued a warning about byte tensors being deprecated for torch.nn.MultiheadAttention, e.g., here => https://github.com/pytorch/pytorch/issues/67999 `warnings.warn("Byte tensor for attn_mask in nn.MultiheadAttention is deprecated...")`
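For reference, a minimal sketch of the migration the warning asks for, assuming a simple causal mask: a uint8 (byte) mask triggers the deprecation warning (and may be rejected outright in newer releases), while the equivalent bool mask is the supported form, with True meaning "do not attend".

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 8)

# Causal mask: positions above the diagonal are masked out.
byte_mask = torch.triu(torch.ones(4, 4, dtype=torch.uint8), diagonal=1)  # deprecated, warns
bool_mask = byte_mask.bool()  # supported replacement: True = "do not attend"

out, _ = mha(x, x, x, attn_mask=bool_mask)
```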
I recommend that we correct the documentation for MultiheadAttention to reflect that byte masks were deprecated a while ago, e.g., here => https://pytorch.org/docs/stable/generated/torch.nn.MultiheadAttention.html as well as here => https://pytorch.org/docs/stable/generated/torch.nn.quantizable.MultiheadAttention.html#torch.nn.quantizable.MultiheadAttention.forward
It's slightly more complicated than this, because key_padding_mask might be either Boolean or Float. Are permutations allowed where one mask is Boolean and the other is Float, or should we...
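Whether mixed permutations are supported is exactly the open question here. As a concrete sketch of the case under discussion (assuming a recent PyTorch build, which merges the two masks internally, this runs without error, but verify on the target version):

```python
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(2, 4, 8)

attn_mask = torch.zeros(4, 4)    # float mask: values are added to the weights
attn_mask[0, 2] = float("-inf")  # query 0 cannot attend to key 2

key_padding_mask = torch.tensor([[False, False, False, True],
                                 [False, False, True,  True]])  # bool: True = padding

out, _ = mha(x, x, x, attn_mask=attn_mask, key_padding_mask=key_padding_mask)
```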
Presumably you had a test case that demonstrates the problem? Can you please create a PR and submit it, so we can verify the PR fixes the issue?
Can you please provide more context as to what you see as the problem? Maybe we haven’t documented it in the doc strings clearly enough that floats are intended to...
I’ll remove it until we find out whether it buys us performance (I’ve seen additional improvements for CPU SDPA land from the Intel team since I did my experiments)...
Waiting on a review, which is required to merge. Addressed @Chillee's feedback. If he's not available, who else can review? @cpuhrsch @jisaacso?