[WIP] Add ZeroShotObjectDetectionPipeline (#18445)
What does this PR do?
This PR adds the ZeroShotObjectDetectionPipeline. It is tested on the OwlViTForObjectDetection model and should enable inference via the following API:
```python
from transformers import pipeline

pipe = pipeline("zero-shot-object-detection")
pipe("cats.png", ["cat", "remote"])
```
This pipeline could default to the https://huggingface.co/google/owlvit-base-patch32 checkpoint.
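For context, a sketch of the per-detection output schema such a pipeline could return, mirroring the existing `object-detection` pipeline (the field names and values here are an assumption for illustration, not the merged API):

```python
# Hypothetical result of pipe("cats.png", ["cat", "remote"]) -- illustrates the
# expected schema only; the scores and boxes below are made-up placeholder values.
example_output = [
    {"score": 0.92, "label": "cat", "box": {"xmin": 10, "ymin": 20, "xmax": 200, "ymax": 180}},
    {"score": 0.81, "label": "remote", "box": {"xmin": 40, "ymin": 70, "xmax": 110, "ymax": 100}},
]

# Every returned label comes from the text queries passed to the pipeline.
assert all(det["label"] in ["cat", "remote"] for det in example_output)
```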
Fixes #18445
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a GitHub issue or the forum? Link to the issue
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
@alaradirik @Narsil
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Hi, I just saw that the merge messed up the commit history. There are 377 changes, which makes it impossible to review and merge the PR into main.
I suggest resetting to the last clean commit locally, then using git rebase main to keep up to date with main (after pulling the latest changes from remote main into local main). Any approach works, as I am not sure what caused the current git status.
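The suggested recovery could look roughly like the following, demonstrated in a throwaway repository so the commands are safe to run as-is; on the real PR branch you would substitute your own branch name and the hash of the last clean commit:

```shell
set -e
# Build a toy repo: one clean commit on a feature branch, then a "broken" one.
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b main
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "base"
git checkout -q -b feature
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "clean work"
clean_sha=$(git rev-parse HEAD)     # note the last clean commit
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "broken merge"

# The recovery itself:
git reset --hard -q "$clean_sha"    # drop the broken history
git rebase -q main                  # replay the clean commits on top of main
git log --format=%s -n 1            # -> clean work
```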
Hi @ydshieh, sorry for that. I was in a hurry to wrap up the PR since I was going on vacation, and I messed up the rebase. I have reverted to the stable commit and will add the correct changes once I am back!
No problem, @sahamrit! I am super happy that you are able to get back to the stable commit 💯 . Have a nice vacation!
Hi @alaradirik , can you review the changes?
Thank you for this PR.
- I suggest modifying the output of the pipeline to be more "natural" (see the relevant comment).
- `text_queries` should be renamed `candidate_labels` to be in line with `zero-shot-classification`.
Hey @Narsil! I suggested using text_queries instead because it is a multi-modal model where users query images with free-form text. The queried object is either found or not and the found object's label is not chosen from a selection of candidate labels, so I think it'd make more sense to keep as it is.
> Hey @Narsil! I suggested using text_queries instead because it is a multi-modal model where users query images with free-form text. The queried object is either found or not and the found object's label is not chosen from a selection of candidate labels, so I think it'd make more sense to keep as it is.
Are you sure? I just tried your code, and it seems all the labels stem from the text being sent. Meaning I think there is a 1-1 correspondence between `label` and `text_queries` (meaning `candidate_labels` would be a fine name).
```python
from transformers import pipeline

object_detector = pipeline(
    "zero-shot-object-detection", model="hf-internal-testing/tiny-random-owlvit-object-detection"
)
outputs = object_detector(
    "./tests/fixtures/tests_samples/COCO/000000039769.png",
    text_queries=["aaa cat", "xx"],
    threshold=0.64,
)
print(outputs)
```
Hi @Narsil, sure, the output labels are taken exactly from the input text_queries. The reason for naming it "text_queries" instead of "candidate_labels" (as in zero-shot-image-classification) is that in the zero-shot-image-classification pipeline the [candidate labels are wrapped by the hypothesis template](https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/zero_shot_image_classification.py#:~:text=candidate_labels%20(%60List%5Bstr,logits_per_image ), whereas here the text_queries are free-text queries!
Hope that clarifies it.
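To illustrate the distinction: in zero-shot-image-classification each candidate label is inserted into a hypothesis template before being encoded, while here the queries are passed through as free text. A minimal sketch (the template string matches that pipeline's documented default; everything else is illustrative, not pipeline internals):

```python
candidate_labels = ["cat", "remote"]

# zero-shot-image-classification: labels are wrapped by a hypothesis template
hypothesis_template = "This is a photo of {}."
classification_texts = [hypothesis_template.format(label) for label in candidate_labels]
print(classification_texts)  # ['This is a photo of cat.', 'This is a photo of remote.']

# zero-shot-object-detection: the free-text queries themselves are the model input
text_queries = ["a photo of a cat", "remote control on a couch"]
detection_texts = list(text_queries)
print(detection_texts)
```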
> Are you sure? I just tried your code, and it seems all the labels stem from the text being sent. Meaning I think there is a 1-1 correspondence between `label` and `text_queries` (meaning `candidate_labels` would be a fine name).
Yes, there is a 1-1 correspondence, but I meant that only the query text / a single label is evaluated for each object, whereas for zero-shot-classification the label is selected from among multiple candidate labels.
> Yes, there is a 1-1 correspondence, but I meant only the query text / a single label is evaluated for each object, whereas the label is selected from among multiple candidate labels for zero-shot-classification.
I still think the zero-shot -> `candidate_labels` logic works. If we reuse names, it's easier for users to discover and use pipelines. The fact that they are slightly different doesn't justify, in my eyes, the use of a different name.
I would even argue that they are exactly the same, and the differences in how they are used are caused by classification vs object-detection, not by what candidate_labels are.
I personally think using candidate_labels would be misleading and confusing given the architecture and use case of this model. There have been other zero-shot object detection papers published very recently, and it'd be better to get the naming right in order to avoid future breaking changes.
Hi @Narsil @alaradirik, could you kindly review the changes?