
SDPA for OWL-ViT

Open · Aravind-11 opened this pull request 3 months ago • 14 comments

What does this PR do?

Implements SDPA for OWL-ViT.

Fixes #28103

Before submitting

Who can review?

@vasqu @younesbelkada

Aravind-11 avatar Nov 10 '25 23:11 Aravind-11

What does this PR do?

Implements SDPA for OWL-ViT. Revamp of #28818.

Fixes #28103

Before submitting

Who can review?

@vasqu @younesbelkada

I ran RUN_SLOW=1 python -m pytest tests/models/owlvit/test_modeling_owlvit.py against the original OWL-ViT implementation and it seemed to fail the same tests as my current implementation. I'm not sure what to infer from that.

Aravind-11 avatar Nov 11 '25 00:11 Aravind-11

Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. I think ViT is a good example in this case, e.g. see https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py

Until this is changed to match these standards, I won't take a proper look for now.

Got it. Thanks a lot!

Aravind-11 avatar Nov 11 '25 18:11 Aravind-11

Sorry, but I've got to be strict about this. We no longer implement separate classes for all the attention flavors, but one unified one. I think ViT is a good example in this case, e.g. see https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py

Until this is changed to match these standards, I won't take a proper look for now.

I made similar changes as in ViT and removed the separate SDPA class. Let me know what you think!

Aravind-11 avatar Nov 12 '25 06:11 Aravind-11
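For reference, a minimal sketch of the unified attention pattern being asked for, loosely following modeling_vit.py on main at the time of writing. The helper names (eager_attention_forward, ALL_ATTENTION_FUNCTIONS, config._attn_implementation) follow recent transformers versions and may differ in older ones; the q/k/v projections and attributes such as self.scale and self.dropout are assumed and omitted for brevity:

import torch
from torch import nn
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS

def eager_attention_forward(module, query, key, value, attention_mask, scaling, dropout=0.0, **kwargs):
    # Plain eager attention, kept as the fallback implementation.
    attn_weights = torch.matmul(query, key.transpose(-1, -2)) * scaling
    if attention_mask is not None:
        attn_weights = attn_weights + attention_mask
    attn_weights = nn.functional.softmax(attn_weights, dim=-1)
    attn_weights = nn.functional.dropout(attn_weights, p=dropout, training=module.training)
    attn_output = torch.matmul(attn_weights, value).transpose(1, 2).contiguous()
    return attn_output, attn_weights

class OwlViTAttention(nn.Module):
    # Single attention class; projections and reshaping omitted for brevity.
    def forward(self, query, key, value, attention_mask=None):
        # Dispatch on the configured implementation instead of keeping
        # separate eager/SDPA/flash subclasses.
        attention_interface = eager_attention_forward
        if self.config._attn_implementation != "eager":
            attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        return attention_interface(
            self, query, key, value, attention_mask,
            scaling=self.scale, dropout=0.0 if not self.training else self.dropout,
        )

The point of the pattern is that SDPA support stops being a separate class and becomes one entry in a shared registry, so every attention flavor goes through the same code path.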

@Aravind-11 it seems like ~180 tests are failing, so there might be some really serious regressions. Don't worry about the copies; functionality is more important first.

If pixel values are needed, for example, I would first check whether they were needed before. If yes, see what changed; if not, the input preparation in the test file may need to change.

vasqu avatar Nov 17 '25 13:11 vasqu

@Aravind-11 it seems like ~180 tests are failing, so there might be some really serious regressions. Don't worry about the copies; functionality is more important first.

If pixel values are needed, for example, I would first check whether they were needed before. If yes, see what changed; if not, the input preparation in the test file may need to change.

Hi @vasqu, thank you for the comments! I reverted the changes in owlv2 back to the original for easier debugging. Only pixel_values seems to be causing issues in OWL-ViT, per the tests.

pixel_values is also required by the original OwlViTForObjectDetection.forward().

def forward(
    self,
    input_ids: torch.Tensor,
    pixel_values: torch.FloatTensor,
    attention_mask: Optional[torch.Tensor] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    interpolate_pos_encoding: bool = False,
    return_dict: Optional[bool] = None,
)

This is the original forward signature. Please let me know what you think. Thanks!

Aravind-11 avatar Nov 17 '25 15:11 Aravind-11

Have you checked for a correct prepare_config_and_inputs_for_common, as in ViT? https://github.com/huggingface/transformers/blob/2cc9152da08dc9a2d1ac5e738b8eb764ca369717/tests/models/vit/test_modeling_vit.py#L173-L181

This might cause these mismatches, as we use that for generation-specific tests (SDPA-to-eager equivalence included). I.e., it needs to be converted to a dict for us to pass as kwargs directly.

vasqu avatar Nov 17 '25 16:11 vasqu

Have you checked for a correct prepare_config_and_inputs_for_common, as in ViT?

https://github.com/huggingface/transformers/blob/2cc9152da08dc9a2d1ac5e738b8eb764ca369717/tests/models/vit/test_modeling_vit.py#L173-L181

This might cause these mismatches, as we use that for generation-specific tests (SDPA-to-eager equivalence included). I.e., it needs to be converted to a dict for us to pass as kwargs directly.

def prepare_config_and_inputs_for_common(self):
    config_and_inputs = self.prepare_config_and_inputs()
    config, pixel_values, input_ids, attention_mask = config_and_inputs
    inputs_dict = {
        "pixel_values": pixel_values,
        "input_ids": input_ids,
        "attention_mask": attention_mask,
    }
    return config, inputs_dict

This is the current prepare_config_and_inputs_for_common() for OwlViTForObjectDetectionTester. Is this overridden?


def prepare_config_and_inputs_for_common(self):
      config_and_inputs = self.prepare_config_and_inputs()
      config, input_ids, attention_mask, pixel_values = config_and_inputs
      inputs_dict = {
          "pixel_values": pixel_values,
          "input_ids": input_ids,
          "attention_mask": attention_mask,
          "return_loss": False,
      }
      return config, inputs_dict 

Yes, pixel_values is passed for the vision model tester as well.

Aravind-11 avatar Nov 17 '25 16:11 Aravind-11

Sorry, I have a bit more to do at the moment, so I would need to debug this myself later. This is weird, but something in the input preparation does seem to go wrong 😢

vasqu avatar Nov 18 '25 16:11 vasqu

Sorry, I have a bit more to do at the moment, so I would need to debug this myself later. This is weird, but something in the input preparation does seem to go wrong 😢

Got it! No worries. Let me know if you find something. Thanks.

Aravind-11 avatar Nov 18 '25 16:11 Aravind-11

Sorry, I have a bit more to do at the moment, so I would need to debug this myself later. This is weird, but something in the input preparation does seem to go wrong 😢

Got it! No worries. Let me know if you find something. Thanks.

I have a doubt, @vasqu: I noticed some models (MPT, T5, PaliGemma, Pix2Struct) still have output_attentions and return_dict in their forward signatures in spite of having **kwargs. Why is that? And should all models have SDPA, flash, and flex attention enabled?

Aravind-11 avatar Nov 25 '25 06:11 Aravind-11

It's connected but not necessary: the attention interface allows the usage of all attentions (FA, SDPA, flex). The signature, along with output_xxx and return_dict, is a different matter and is handled via decorators such as can_return_dict or check_model_inputs. The latter needs _can_record_outputs to be properly adjusted as an attribute of that model.

tl;dr: they are separate things, but usually you need the attention interface first and then the decorators. There are cases where only one has been applied.

vasqu avatar Nov 25 '25 10:11 vasqu
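For illustration, a rough sketch of the decorator pattern described above, assuming the check_model_inputs decorator and _can_record_outputs attribute as they appear on recent transformers main; exact names and import paths may differ between versions, and the model/layer classes here are placeholders:

from transformers.utils.generic import check_model_inputs

class MyVisionModel(MyPreTrainedModel):  # placeholder classes
    # Maps output names to the module classes whose intermediate results
    # should be recorded when the caller asks for output_hidden_states
    # or output_attentions.
    _can_record_outputs = {
        "hidden_states": MyEncoderLayer,  # placeholder
        "attentions": MyAttention,        # placeholder
    }

    @check_model_inputs
    def forward(self, pixel_values, **kwargs):
        # output_attentions, output_hidden_states and return_dict no longer
        # need to appear in the signature; the decorator pulls them from
        # kwargs and wires up the output recording.
        ...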

It's connected but not necessary: the attention interface allows the usage of all attentions (FA, SDPA, flex). The signature, along with output_xxx and return_dict, is a different matter and is handled via decorators such as can_return_dict or check_model_inputs. The latter needs _can_record_outputs to be properly adjusted as an attribute of that model.

tl;dr: they are separate things, but usually you need the attention interface first and then the decorators. There are cases where only one has been applied.

Ohh okay, got it! That makes sense. Thank you!

Aravind-11 avatar Nov 25 '25 17:11 Aravind-11

Hi @vasqu! 😁 Were you able to look into this? Thank you!

Aravind-11 avatar Dec 08 '25 18:12 Aravind-11

Sorry, I didn't have time yet. Can you also check/debug what happens?

vasqu avatar Dec 12 '25 15:12 vasqu

Sorry, I didn't have time yet. Can you also check/debug what happens?

No worries, I'll take a look!

Aravind-11 avatar Dec 12 '25 20:12 Aravind-11

Hi @vasqu 😄,

I traced the issue with the failing SDPA padding test (test_eager_matches_sdpa_inference_24_fp32_pad_left_output_attentions) and found that the shared SDPA test helper in test_modeling_common.py was dropping required inputs for multi-modal models. Specifically, the helper only preserved main_input_name and additional_model_inputs, which caused OWL-ViT to lose input_ids during the SDPA inference path.

I added a small change to the helper so that it now preserves all remaining keys from inputs_dict (without overriding the special-cased ones).
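For concreteness, a hypothetical reconstruction of that change (processed_inputs is an illustrative name, not necessarily the one used in test_modeling_common.py):

# Keep any remaining inputs_dict keys instead of only main_input_name
# and additional_model_inputs.
for key, value in inputs_dict.items():
    if key not in processed_inputs:  # don't override the special-cased inputs
        processed_inputs[key] = value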

After applying this fix, the previously failing SDPA test for OWL-ViT passes. However, this change breaks approximately 280 tests for other models with errors like "RuntimeError: to - 1 is out of bounds for bool" and dimension mismatches. It seems the test framework deliberately excludes or modifies certain inputs, and my fix adds back the original unmodified versions, causing conflicts.

Could you advise on the best approach here?

Thanks!

Aravind-11 avatar Dec 17 '25 02:12 Aravind-11

Specifically, the helper only preserved main_input_name and additional_model_inputs, which caused OWL-ViT to lose input_ids during the SDPA inference path.

Could it be that we need to change the main input name?

I added a small change to the helper so that it now preserves all remaining keys from inputs_dict (without overriding the special-cased ones).

This seems a bit too specific to owlvit in this case, so I'd rather we override it in the owlvit test instead of affecting all other models.

Could you advise on the best approach here?

  • Try to find out whether we can change certain settings so that it works out of the box, e.g. changing main_input_name.
  • Override it specifically in the test file of OWL-ViT, not in the common file. Like you noticed, it breaks other models (a rough sketch of this option follows below).

vasqu avatar Dec 17 '25 10:12 vasqu
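As a sketch of the second option, the override would live in tests/models/owlvit/test_modeling_owlvit.py rather than in the common file. This is hypothetical: it assumes the shared SDPA helper honors an additional_model_inputs hook, as mentioned earlier in the thread, and the exact mechanism (class attribute vs. overriding the test method) may differ by transformers version:

# Inside the existing OwlViTModelTest class in
# tests/models/owlvit/test_modeling_owlvit.py:
class OwlViTModelTest(ModelTesterMixin, unittest.TestCase):
    # Hypothetical hook: list the text inputs so the shared SDPA helper
    # stops dropping input_ids, without touching test_modeling_common.py.
    additional_model_inputs = ["input_ids", "attention_mask"]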

Thank you, thank you! Let me see what I can do 😄

Aravind-11 avatar Dec 17 '25 17:12 Aravind-11

[For maintainers] Suggested jobs to run (before merge)

run-slow: owlvit

github-actions[bot] avatar Dec 17 '25 21:12 github-actions[bot]

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=42136&sha=caacc5

github-actions[bot] avatar Dec 17 '25 21:12 github-actions[bot]