
PyTorch BigBird random attention

Bearnardd opened this issue 2 years ago · 1 comment

Reproduction

The PyTorch->Flax and Flax->PyTorch equivalence tests were failing. They are currently skipped via https://github.com/huggingface/transformers/pull/23040

Expected behavior

While working on https://github.com/huggingface/transformers/pull/21023 I found a bug in the PyTorch implementation of BigBird: random attention is used regardless of whether the model is in training or eval mode. The correct behaviour is that inference (eval) should not introduce any randomness, so random attention should not be used there.
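
To illustrate the intended behaviour, here is a minimal, hypothetical sketch (not the actual transformers code; the class and helper names are made up) of gating random block attention on the module's training flag:

```python
import torch
import torch.nn as nn


class ToyBigBirdSelfAttention(nn.Module):
    """Hypothetical sketch: only use random block attention in training mode."""

    def __init__(self, block_size: int = 64, num_rand_blocks: int = 3):
        super().__init__()
        self.block_size = block_size
        self.num_rand_blocks = num_rand_blocks

    def _rand_mask(self, n_blocks: int) -> torch.Tensor:
        # Randomly pick which "to" blocks each "from" block attends to.
        return torch.randint(0, n_blocks, (n_blocks, self.num_rand_blocks))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        n_blocks = hidden_states.size(1) // self.block_size
        if self.training:
            # Training: stochastic random attention is acceptable.
            rand_attn = self._rand_mask(n_blocks)
        else:
            # Eval/inference: no randomness, so outputs are deterministic and
            # can match a Flax model run with deterministic=True.
            rand_attn = None
        # ... build the sparse attention pattern from rand_attn (omitted) ...
        return hidden_states
```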

Bearnardd · Apr 28 '23

Hi @sanchit-gandhi @ydshieh! I have opened a PR that fixes the failing tests. I am wondering whether the changes in the PR are okay (using random attention based on the current mode), or whether we want more control over the use of random attention, e.g. by adding a `deterministic` argument to `__call__` of `BigBirdPreTrainedModel` (a sketch of that option is below). Secondly, I was wondering what the advantage is of marking `_bigbird_block_rand_mask` as a `staticmethod` and then calling it as `self._bigbird_block_rand_mask`, passing it arguments taken from `self` such as `self.max_seqlen`, instead of treating it as a regular method. It looks a bit odd to me. Am I missing something?
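
A rough sketch of what that alternative could look like; the class name and signature are hypothetical and not the actual `BigBirdPreTrainedModel` API:

```python
from typing import Optional

import torch
import torch.nn as nn


class ToyBigBirdModel(nn.Module):
    """Hypothetical sketch: expose an explicit `deterministic` override."""

    def forward(
        self,
        hidden_states: torch.Tensor,
        deterministic: Optional[bool] = None,
    ) -> torch.Tensor:
        # Default to the training flag, but let the caller force
        # deterministic (no random attention) behaviour either way.
        if deterministic is None:
            deterministic = not self.training
        use_random_attention = not deterministic
        # ... thread use_random_attention down to the attention layers ...
        return hidden_states
```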

Bearnardd · Apr 29 '23

Closed via https://github.com/huggingface/transformers/pull/23056.

sanchit-gandhi · May 30 '23