
Dynamic Resolution Input for CLIP

Open · Starlento opened this issue 2 years ago • 14 comments

What does this PR do?

Fixes # (issue)

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [ ] Did you read the contributor guideline, Pull Request section?
  • [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

Starlento avatar Nov 13 '23 07:11 Starlento

ViT supports dynamic resolution: https://discuss.huggingface.co/t/fine-tuning-vit-with-more-patches-higher-resolution/18731. DINOv2 supports it out of the box, so I copied that code into the CLIP model file. Maybe it should be a config option that the user can set, but I am not familiar enough with the transformers Config classes to add a new member...

For performance, I just tried the example in the HF model card. With the original 224 setting, the probability is tensor([[0.9899, 0.0101]]); with the 448 setting, it is tensor([[0.9936, 0.0064]]).
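The DINOv2-style interpolation mentioned above can be sketched roughly as follows. The function name and signature here are illustrative, not the actual transformers API; the real implementation lives inside the vision embeddings module:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_encoding(pos_embed, new_height, new_width, patch_size):
    """Resize a square grid of learned position embeddings to a new image size.

    pos_embed: (1, 1 + num_patches, dim), with a leading [CLS] position.
    Illustrative sketch of the DINOv2-style bicubic interpolation.
    """
    cls_pos = pos_embed[:, :1]           # keep the [CLS] position untouched
    patch_pos = pos_embed[:, 1:]
    dim = pos_embed.shape[-1]
    old_grid = int(patch_pos.shape[1] ** 0.5)   # assumes a square patch grid

    # (1, N, dim) -> (1, dim, old_grid, old_grid) so F.interpolate can resize it
    grid = patch_pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(
        grid,
        size=(new_height // patch_size, new_width // patch_size),
        mode="bicubic",
        align_corners=False,
    )
    grid = grid.permute(0, 2, 3, 1).reshape(1, -1, dim)
    return torch.cat([cls_pos, grid], dim=1)
```

For example, a model trained on 224x224 with patch size 32 has a 7x7 grid; feeding 448x448 images resizes it to 14x14.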

Starlento avatar Nov 13 '23 07:11 Starlento

And I am new to the open-source community, so I apologize for not ticking anything above and for any improper behavior... Feel free to comment; I am glad to learn.

Starlento avatar Nov 13 '23 07:11 Starlento

Thanks for adding @Starlento!

Could you protect this behaviour behind an interpolate_pos_encoding argument flag, as is done in other models, e.g. here for ViT?

That's a good idea, but it is somewhat ugly, since the flag has to be threaded through all the related functions... I just updated the code and two docstrings.
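A minimal sketch of what gating behind the flag looks like. The helper name here is hypothetical; the real change threads the argument through the embedding and model forward methods:

```python
def check_patch_count(image_height, image_width, trained_grid, patch_size,
                      interpolate_pos_encoding=False):
    """Hypothetical helper illustrating the flag-guarded behaviour: inputs
    whose patch grid differs from the training grid are rejected unless the
    caller opts in with interpolate_pos_encoding=True."""
    got = (image_height // patch_size) * (image_width // patch_size)
    expected = trained_grid * trained_grid
    if got != expected and not interpolate_pos_encoding:
        raise ValueError(
            f"Input yields {got} patches but the position embeddings cover "
            f"{expected}; pass interpolate_pos_encoding=True to resize them."
        )
    return got  # number of patch positions the forward pass will use
```

With this shape of gating, the default behaviour stays identical to before, which is what makes the change backward compatible.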

Starlento avatar Nov 14 '23 11:11 Starlento

To resolve the code quality tests, run make fix-copies and then make fixup and push the changes

amyeroberts avatar Nov 15 '23 13:11 amyeroberts

To resolve the code quality tests, run make fix-copies and then make fixup and push the changes

Thank you for the suggestion. I ran the commands and found they were trying to add the interpolation functions to the other CLIP-family files (e.g. chinese_clip, x_clip). Since the implementation there was incomplete, I checked out those files to revert the changes.

check_repository_consistency - Failed
check_code_quality - Success

Should I also solve the check_repository_consistency thing?

Starlento avatar Nov 15 '23 13:11 Starlento

Should I also solve the check_repository_consistency thing?

@Starlento Yep, we want the changes to CLIP to be applied to any models copying its architecture.

amyeroberts avatar Nov 15 '23 14:11 amyeroberts

Should I also solve the check_repository_consistency thing?

@Starlento Yep, we want the changes to CLIP to be applied to any models copying its architecture.

The thing I least wanted to happen still happened... I checked that fixup and fix-copies only affect the first CI check. As for the failed test_torch, I found no reference in the repo and do not think I can set up an environment to test it... Could you kindly provide guidance for this situation?

Starlento avatar Nov 15 '23 14:11 Starlento

@Starlento Don't worry! The reason is that not all of CLIP's architecture will have been copied across for all models. Looking at the diff, you'll need to copy the interpolate_pos_encoding function directly, plus any additional argument passing, into modeling_clipseg.py

amyeroberts avatar Nov 15 '23 20:11 amyeroberts

@Starlento Don't worry! The reason is that not all of CLIP's architecture will have been copied across for all models. Looking at the diff, you'll need to copy the interpolate_pos_encoding function directly, plus any additional argument passing, into modeling_clipseg.py

Yes, you are right. Actually, there was a network issue and I could not access the CI log. I found that clipseg already has a similar function, interpolate_position_embeddings, so I kept the original one.

Starlento avatar Nov 16 '23 04:11 Starlento

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 29 '24 08:02 github-actions[bot]

Hi @Starlento, any update on this PR? It would be great to add this to the library!

amyeroberts avatar Mar 01 '24 10:03 amyeroberts

Hi @Starlento, any update on this PR? It would be great to add this to the library!

Hi, sorry that I have been quite busy recently. I would like to check the remaining to-dos. Could you kindly check whether my understanding is correct? Let's take altclip as an example:

  1. "Docstrings for all models will need to be updated": does this mean I need to update ALTCLIP_TEXT_INPUTS_DOCSTRING in modeling_altclip.py? Are there any other files related to docstrings that need to be modified?
  2. "A test should be added for all models to make sure they can accept images of different sizes when interpolate_pos_encoding=True": does this mean I need to modify test_modeling_altclip.py, taking test_modeling_vit.py as a reference?
  3. "Once the tests are added, then the final step is to run the slow tests for all of these models to make sure their default behaviour is the same": does that mean I need to execute the @slow tests locally on my machine, which means downloading all the pretrained models?

Starlento avatar Mar 01 '24 12:03 Starlento

@Starlento Re your questions:

  1. "Docstrings for all models will need to be updated": does this mean I need to update ALTCLIP_TEXT_INPUTS_DOCSTRING in modeling_altclip.py? Are there any other files related to docstrings that need to be modified?

It means that for all models which have interpolate_pos_encoding added as an argument, the docstrings for the methods will need to be updated to include this argument. In the case of altclip, this means updating ALTCLIP_VISION_INPUTS_DOCSTRING and ALTCLIP_INPUTS_DOCSTRING (which I can see you've been doing in the PR). You don't need to update ALTCLIP_TEXT_INPUTS_DOCSTRING, and there aren't any other places relating to the model you'd need to update either. This should be done for all the affected models, not just altclip.

  2. "A test should be added for all models to make sure they can accept images of different sizes when interpolate_pos_encoding=True": does this mean I need to modify test_modeling_altclip.py, taking test_modeling_vit.py as a reference?

Yes, using test_inference_interpolate_pos_encoding as a reference specifically.

  3. "Once the tests are added, then the final step is to run the slow tests for all of these models to make sure their default behaviour is the same": does that mean I need to execute the @slow tests locally on my machine, which means downloading all the pretrained models?

Yes, it means running the tests marked with @slow locally. This will download all the checkpoints specified in the integration tests.
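As a rough sketch of the test shape referenced above, here is a toy version that runs without downloading any checkpoint. The resize helper and class names are illustrative; the real reference is test_inference_interpolate_pos_encoding in the ViT test suite, which runs against an actual pretrained model:

```python
import unittest
import torch
import torch.nn.functional as F

def resize_patch_positions(patch_pos, new_grid):
    """Toy bicubic resize of a square patch-position grid (no [CLS] slot)."""
    num, dim = patch_pos.shape[1], patch_pos.shape[2]
    old = int(num ** 0.5)
    g = patch_pos.reshape(1, old, old, dim).permute(0, 3, 1, 2)
    g = F.interpolate(g, size=new_grid, mode="bicubic", align_corners=False)
    return g.permute(0, 2, 3, 1).reshape(1, -1, dim)

class ToyInterpolatePosEncodingTest(unittest.TestCase):
    def test_non_default_resolution(self):
        # trained on a 7x7 grid; feed a non-square 10x6 grid and check that
        # the sequence length follows the new grid, analogous to what the
        # slow integration tests verify against real checkpoints
        pos = torch.randn(1, 49, 16)
        out = resize_patch_positions(pos, (10, 6))
        self.assertEqual(tuple(out.shape), (1, 60, 16))
```

The real integration tests additionally compare a slice of the output logits against hard-coded expected values, which is why they need the actual pretrained weights.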

amyeroberts avatar Mar 04 '24 13:03 amyeroberts

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 24 '24 08:05 github-actions[bot]