[WIP] Add ZeroShotObjectDetectionPipeline (#18445)
What does this PR do?
This PR adds the ZeroShotObjectDetectionPipeline. It is tested on the OwlViTForObjectDetection model and should enable inference via the following API:
```python
from transformers import pipeline

pipe = pipeline("zero-shot-object-detection")
pipe("cats.png", ["cat", "remote"])
```
This pipeline could default to the https://huggingface.co/google/owlvit-base-patch32 checkpoint.
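For context, a sketch of the per-detection output schema such a pipeline could return, mirroring the existing `object-detection` pipeline (the field names and values here are an assumption for illustration, not the merged API):

```python
# Hypothetical result of pipe("cats.png", ["cat", "remote"]) -- illustrates the
# expected schema only; the scores and boxes below are made-up placeholder values.
example_output = [
    {"score": 0.92, "label": "cat", "box": {"xmin": 10, "ymin": 20, "xmax": 200, "ymax": 180}},
    {"score": 0.81, "label": "remote", "box": {"xmin": 40, "ymin": 70, "xmax": 110, "ymax": 100}},
]

# Every returned label comes from the text queries passed to the pipeline.
assert all(det["label"] in ["cat", "remote"] for det in example_output)
```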
Fixes #18445
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a GitHub issue or the forum? Link to the issue
- [x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you write any new necessary tests?
Who can review?
@alaradirik @Narsil
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Hi, I just saw that the merge messed up the commit history. There are 377 changes, which makes it impossible to review and merge the PR into main.
I suggest resetting to the last clean commit locally, then using git rebase main to keep up to date with main (after pulling the latest changes from remote main into local main). Any approach works, as I am not sure what caused the current git status.
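The suggested recovery could look roughly like the following, demonstrated in a throwaway repository so the commands are safe to run as-is; on the real PR branch you would substitute your own branch name and the hash of the last clean commit:

```shell
set -e
# Build a toy repo: one clean commit on a feature branch, then a "broken" one.
repo=$(mktemp -d)
cd "$repo"
git init -q
git checkout -q -b main
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "base"
git checkout -q -b feature
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "clean work"
clean_sha=$(git rev-parse HEAD)     # note the last clean commit
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "broken merge"

# The recovery itself:
git reset --hard -q "$clean_sha"    # drop the broken history
git rebase -q main                  # replay the clean commits on top of main
git log --format=%s -n 1            # -> clean work
```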
Hi @ydshieh, sorry for that. I was in a hurry to wrap up the PR since I was going on vacation, and I messed up the rebase. I have reverted to the stable commit and will add the correct changes once I am back!
No problem, @sahamrit! I am super happy that you are able to get back to the stable commit 💯 . Have a nice vacation!
Hi @alaradirik , can you review the changes?
Thank you for this PR.
- I suggest modifying the output of the pipeline to be more "natural" (see the relevant comment).
- `text_queries` should be renamed `candidate_labels` to be in line with `zero-shot-classification`.
Hey @Narsil! I suggested using text_queries instead because it is a multi-modal model where users query images with free-form text. The queried object is either found or not and the found object's label is not chosen from a selection of candidate labels, so I think it'd make more sense to keep as it is.
> Hey @Narsil! I suggested using text_queries instead because it is a multi-modal model where users query images with free-form text. The queried object is either found or not and the found object's label is not chosen from a selection of candidate labels, so I think it'd make more sense to keep as it is.
Are you sure? I just tried your code, and it seems all the labels stem from the text being sent. Meaning I think there is a 1-1 correspondence between `label` and `text_queries` (meaning `candidate_labels` would be a fine name).
```python
from transformers import pipeline

object_detector = pipeline(
    "zero-shot-object-detection", model="hf-internal-testing/tiny-random-owlvit-object-detection"
)
outputs = object_detector(
    "./tests/fixtures/tests_samples/COCO/000000039769.png",
    text_queries=["aaa cat", "xx"],
    threshold=0.64,
)
print(outputs)
```
Hi @Narsil, sure, the output labels are taken exactly from the input text_queries. The reason for naming it "text_queries" instead of "candidate_labels" (as in zero-shot-image-classification) is that in the zero-shot-image-classification pipeline the [candidate labels are wrapped by the hypothesis template](https://github.com/huggingface/transformers/blob/main/src/transformers/pipelines/zero_shot_image_classification.py#:~:text=candidate_labels%20(%60List%5Bstr,logits_per_image ), whereas here the text_queries are free-text queries!
Hope that clarifies it.
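To illustrate the distinction: in zero-shot-image-classification each candidate label is inserted into a hypothesis template before being encoded, while here the queries are passed through as free text. A minimal sketch (the template string matches that pipeline's documented default; everything else is illustrative, not pipeline internals):

```python
candidate_labels = ["cat", "remote"]

# zero-shot-image-classification: labels are wrapped by a hypothesis template
hypothesis_template = "This is a photo of {}."
classification_texts = [hypothesis_template.format(label) for label in candidate_labels]
print(classification_texts)  # ['This is a photo of cat.', 'This is a photo of remote.']

# zero-shot-object-detection: the free-text queries themselves are the model input
text_queries = ["a photo of a cat", "remote control on a couch"]
detection_texts = list(text_queries)
print(detection_texts)
```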
> Are you sure? I just tried your code, and it seems all the labels stem from the text being sent. Meaning I think there is a 1-1 correspondence between `label` and `text_queries` (meaning `candidate_labels` would be a fine name).
Yes, there is a 1-1 correspondence, but I meant that only the query text / a single label is evaluated for each object, whereas for zero-shot-classification the label is selected from among multiple candidate labels.
> Yes, there is a 1-1 correspondence, but I meant only the query text / a single label is evaluated for each object, whereas the label is selected from among multiple candidate labels for zero-shot-classification.
I still think the zero-shot -> `candidate_labels` logic works. If we reuse names, it's easier for users to discover and use pipelines. The fact that they are slightly different doesn't justify, in my eyes, the use of a different name.
I would even argue that they are exactly the same, and the differences in how they are used are caused by classification vs object-detection, not by what candidate_labels are.
I personally think using candidate_labels would be misleading and confusing given the architecture and use case of this model. There have been other zero-shot object detection papers published very recently, and it'd be better to get the naming right in order to avoid future breaking changes.
Hi @Narsil @alaradirik, could you kindly review the changes?