
Calculate position ids in modeling utils for all generative models

Open zucchini-nlp opened this issue 1 year ago • 6 comments

What does this PR do?

As discussed under this PR, position ids in some models are not calculated/inferred from the attention mask in forward, which gives incorrect positions when the inputs are left-padded.

To be consistent and easier to maintain, the logic for inferring position ids is moved to "modeling_utils.py", and all generative models call that method in their forward and prepare_inputs_for_generation. I added two tests to check whether model outputs are the same when position ids are passed by the user vs. when they are inferred from input ids or embeds.
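
For context, a minimal sketch of the inference described above; the helper name here is hypothetical, and the actual implementation and its exact location in "modeling_utils.py" may differ:

```python
import torch

def infer_position_ids(attention_mask: torch.Tensor) -> torch.Tensor:
    """Derive position ids from a (batch, seq_len) attention mask.

    A plain arange would shift the positions of real tokens when the batch is
    left-padded; a cumulative sum over the mask keeps the first real token at
    position 0. (Hypothetical helper, not the PR's exact code.)
    """
    position_ids = attention_mask.long().cumsum(-1) - 1
    # Pin padded slots to 0 so they index a valid embedding row; they are
    # masked out by the attention mask anyway.
    return position_ids.masked_fill(attention_mask == 0, 0)

# Left-padded batch: two pad tokens, then three real tokens.
mask = torch.tensor([[0, 0, 1, 1, 1]])
print(infer_position_ids(mask))  # tensor([[0, 0, 0, 1, 2]])
```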

Also Fixes #29149.

The newly added tests are passing. I also ran the slow tests on the vision models, because they still do not have GenerationTesterMixin.
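
For illustration, a rough sketch of what such an equivalence check could look like; the model choice and structure are assumptions, not the actual test added in the PR, and the assertion only holds once forward infers position ids from the attention mask as proposed here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Left-padded batch: the shorter sequence gets pad tokens on the left.
inputs = tokenizer(["Hello world, how are you?", "Hi"], return_tensors="pt", padding=True)

# Position ids derived from the attention mask (what forward should infer).
position_ids = inputs["attention_mask"].long().cumsum(-1) - 1
position_ids = position_ids.masked_fill(inputs["attention_mask"] == 0, 0)

with torch.no_grad():
    logits_inferred = model(**inputs).logits
    logits_explicit = model(**inputs, position_ids=position_ids).logits

# With the PR's change, the two outputs should match on the non-padded tokens.
mask = inputs["attention_mask"].bool().unsqueeze(-1)
assert torch.allclose(
    logits_inferred.masked_select(mask), logits_explicit.masked_select(mask), atol=1e-5
)
```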

Btw, I see that non-generative models already use a create_position_ids_from_input_ids method, which is copied separately into each model's file. The logic is a bit different from the generative models, because they start counting from "padding_idx" and not "0". Anyway, I guess it is still possible to merge that method with the one proposed here, to have one "get_position_id" for all models in "modeling_utils". @gante WDYT?
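
Roughly, the helper copied into those models looks like the snippet below (paraphrased from the per-model copies; see the individual modeling files for the exact code). The difference from the sketch above is the padding_idx offset: real tokens count from padding_idx + 1 and pad tokens get padding_idx itself.

```python
import torch

def create_position_ids_from_input_ids(input_ids, padding_idx, past_key_values_length=0):
    # Positions of real tokens start at padding_idx + 1; pad tokens get padding_idx.
    mask = input_ids.ne(padding_idx).int()
    incremental_indices = (torch.cumsum(mask, dim=1).type_as(mask) + past_key_values_length) * mask
    return incremental_indices.long() + padding_idx

ids = torch.tensor([[1, 1, 5, 6, 7]])  # 1 is the padding idx here
print(create_position_ids_from_input_ids(ids, padding_idx=1))
# tensor([[1, 1, 2, 3, 4]]) -> counting starts at padding_idx + 1, not 0
```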

zucchini-nlp avatar Apr 04 '24 17:04 zucchini-nlp

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

About the framework changes: I found that TF/Flax get position_ids in a slightly different way from the torch models. Those frameworks generate position ids in forward without taking the attention mask into account, the same thing we had in torch before these fixes.

I made TF and Flax work the same way torch does now, with a cumsum over the attention mask, so that the framework-equivalence tests pass. I am not sure if we need a test similar to "test_position_ids" for TF/Flax. Tests should pass now; at least locally it seemed okay.
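
A minimal TF sketch of the same cumsum idea, assuming a standalone helper for clarity; in practice the logic would live inside the individual TF model layers rather than a shared function:

```python
import tensorflow as tf

def tf_infer_position_ids(attention_mask: tf.Tensor) -> tf.Tensor:
    # Same idea as the torch fix: a cumulative sum over the mask so that
    # left padding does not shift the positions of real tokens.
    position_ids = tf.cumsum(attention_mask, axis=-1) - 1
    return tf.where(attention_mask == 0, tf.zeros_like(position_ids), position_ids)

mask = tf.constant([[0, 0, 1, 1, 1]], dtype=tf.int32)
print(tf_infer_position_ids(mask))  # [[0 0 0 1 2]]
```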

zucchini-nlp avatar Apr 18 '24 14:04 zucchini-nlp

Okay, will work on it.

  1. TF has only 3 decoder-only models, so I thought we would not need it. Okay, I can add it in the same way.
  2. These are the composite models for Flax; I think it needs a fix but I could not find where yet.
  3. Okay :)

zucchini-nlp avatar Apr 22 '24 18:04 zucchini-nlp

@gante the comments are addressed now. TF cannot have the "get_position_ids" method in PretrainedModel because all input-related preparation in TF happens in a "keras.layers.Layer" class. I am not sure if we can or should move the position_id preparation into "PretrainedModel", since there are only 3 TF models that needed the change.

Also, to note for Flax-based encoder-decoder models: the attention mask for the decoder part is overridden to be full, because when a decoder-only model is used as the decoder, the position ids are calculated differently (only for the unattended part, I mean). In randomly initialized models this causes a logits mismatch, even though the attention mask masks out the unattended positions. In pre-trained models that does not happen 🤔

zucchini-nlp avatar Apr 29 '24 12:04 zucchini-nlp

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 24 '24 08:05 github-actions[bot]

Hold it for a while, not stale

zucchini-nlp avatar May 24 '24 08:05 zucchini-nlp

Will reopen this one later as a new PR. It will need merge conflicts resolved, the changes propagated to new models, and the PR comments addressed.

zucchini-nlp avatar Jun 28 '24 06:06 zucchini-nlp