Calculate position ids in modeling utils for all generative models
What does this PR do?
As discussed in this PR, position ids in some models are not calculated/inferred from the attention mask in "forward", which yields incorrect positions when the input is left-padded.
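For context, a minimal sketch of the cumsum-over-attention-mask idea (the helper name below is hypothetical, not the exact method added in the PR):

```python
import torch

# Hypothetical helper; a minimal sketch of the cumsum approach.
def infer_position_ids(attention_mask: torch.Tensor) -> torch.Tensor:
    # Counting only attended tokens means left padding no longer shifts positions.
    position_ids = attention_mask.long().cumsum(-1) - 1
    # Padded positions are never attended, so any dummy value works; use 1.
    position_ids.masked_fill_(attention_mask == 0, 1)
    return position_ids

# Left-padded example: real tokens still start at position 0.
mask = torch.tensor([[0, 0, 1, 1, 1],
                     [1, 1, 1, 1, 1]])
print(infer_position_ids(mask))
# tensor([[1, 1, 0, 1, 2],
#         [0, 1, 2, 3, 4]])
```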
For consistency and easier maintenance, the logic for inferring position ids is moved to "modeling_utils.py", and all generative models call that method in their "forward" and "prepare_inputs_for_generation". I added two tests to check whether model outputs are the same when position ids are passed by the user vs. when inferred from input ids or embeds.
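Roughly what the tests assert (a hypothetical sketch, not the actual test code):

```python
import torch

# Hypothetical sketch of what the two tests verify: the logits must match
# whether position ids are passed by the user or inferred inside forward.
def check_position_id_inference(model, input_ids, attention_mask):
    position_ids = attention_mask.long().cumsum(-1) - 1
    position_ids.masked_fill_(attention_mask == 0, 1)
    with torch.no_grad():
        inferred = model(input_ids=input_ids, attention_mask=attention_mask)
        explicit = model(input_ids=input_ids, attention_mask=attention_mask,
                         position_ids=position_ids)
    torch.testing.assert_close(inferred.logits, explicit.logits)
```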
Also Fixes #29149.
The newly added tests are passing. I also ran the slow tests on vision models, because those still do not have "GenerationTesterMixin".
Btw, I see that non-generative models already use a "create_position_ids_from_input_ids" method, which is copied separately into each model's file. The logic differs a bit from generative models, because they start counting from "padding_idx" rather than "0". Still, I guess it is possible to merge that method with the one proposed here, so there is a single "get_position_ids" for all models in "modeling_utils".
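For reference, the per-model copies look roughly like this (RoBERTa-style; written from memory, so treat the exact offsets as an approximation):

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    # Non-pad tokens get positions padding_idx + 1, padding_idx + 2, ...;
    # pad tokens stay at padding_idx itself.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx
```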
@gante WDYT?
About the framework changes: I found that TF/Flax obtain position_ids slightly differently from the Torch models. Those frameworks generate position ids in "forward" without taking the attention mask into account, the same behavior we had in Torch before these fixes.
I made TF and Flax work the same way Torch does now, with a cumsum over the attention mask, so that the cross-framework equivalence tests pass. I am not sure if we need a test similar to "test_position_ids" for TF/Flax. Tests should pass now; at least locally everything seemed okay.
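For concreteness, the equivalent logic in the other frameworks looks roughly like this (a sketch, not the exact diff; the function names are hypothetical):

```python
import tensorflow as tf
import jax.numpy as jnp

def tf_infer_position_ids(attention_mask):
    mask = tf.cast(attention_mask, tf.int64)
    position_ids = tf.cumsum(mask, axis=-1) - 1
    # Dummy value for padded positions, mirroring torch's masked_fill_.
    return tf.where(mask == 0, tf.ones_like(position_ids), position_ids)

def flax_infer_position_ids(attention_mask):
    position_ids = jnp.cumsum(attention_mask, axis=-1) - 1
    return jnp.where(attention_mask == 0, 1, position_ids)
```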
Okay, will work on it.
- TF has only 3 decoder-only models, so I thought we would not need it. Okay, I can add it in the same way.
- These are the composite models for Flax; I think it needs a fix, but I could not find where yet.
- Okay :)
@gante the comments are addressed now. TF cannot have the "get_position_ids" method in "TFPreTrainedModel" because all input-related preparation in TF happens in a "keras.layers.Layer" class. I am not sure if we can or should move the position-id preparation into "TFPreTrainedModel", since only 3 TF models needed the change.
Also, a note on Flax-based encoder-decoder models: the attention mask for the decoder part is overridden to be full, because when a decoder-only model is used as the decoder, the position ids are calculated differently (I mean only in the unattended part). In randomly initialized models this causes a logits mismatch, even though attention masks out the unattended positions. In pre-trained models that does not happen 🤔
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Hold it for a while, not stale
Will reopen this one later as a new PR. It will need merge conflicts resolved, the changes propagated to new models, and the PR comments addressed.