
Need support for Sentence Similarity Pipeline

Open timxieICN opened this issue 2 years ago • 3 comments

Feature request

HuggingFace now has a lot of Sentence Similarity models, but the pipeline does not yet support this: https://huggingface.co/docs/transformers/main_classes/pipelines

Your contribution

I can write a PR, but might need someone else's help.

timxieICN avatar Apr 21 '23 14:04 timxieICN

cc @Narsil

amyeroberts avatar Apr 21 '23 15:04 amyeroberts

Hi @timxieICN ,

Thanks for the suggestion. In general, sentence-similarity models like https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 are served by SentenceTransformers, which is a library built on top of transformers itself.

https://huggingface.co/sentence-transformers

Sentence transformers adds some configuration specifying how to compute similarity with a given model, since there are several ways to do it.

From a user's point of view, it should be relatively easy to do this:

from sentence_transformers import SentenceTransformer, util

# Any sentence-transformers checkpoint works here, e.g. the one linked above
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Same input schema as the Inference API: one source sentence compared
# against a list of candidate sentences
inputs = {
    "source_sentence": "That is a happy person",
    "sentences": ["That is a happy dog", "Today is a sunny day"],
}

embeddings1 = model.encode(inputs["source_sentence"], convert_to_tensor=True)
embeddings2 = model.encode(inputs["sentences"], convert_to_tensor=True)
similarities = util.pytorch_cos_sim(embeddings1, embeddings2)

This is exactly the code currently running on the Hub to compute those similarities: https://github.com/huggingface/api-inference-community/blob/main/docker_images/sentence_transformers/app/pipelines/sentence_similarity.py

Adding this directly to transformers would basically mean incorporating sentence-transformers within transformers, and I'm not sure that's desirable. Maybe @amyeroberts or another core maintainer can confirm or deny this.

Does this help?

Narsil avatar Apr 21 '23 15:04 Narsil

We definitely don't want a circular dependency like that!

Since the example you shared is so simple, @Narsil, I think it's a good replacement for a pipeline. Let's leave this issue open, and if there's a lot of interest or new use cases we can consider other possible options.

amyeroberts avatar Apr 21 '23 17:04 amyeroberts
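For readers who do want pipeline-like ergonomics on top of the snippet above, it can be wrapped in a small helper. This is only a sketch: the function name `sentence_similarity` is made up, and the toy bag-of-letters encoder below is a stand-in so the example runs without downloading a model; with sentence-transformers installed you would pass `model.encode` instead.

```python
import numpy as np

def sentence_similarity(encode, source_sentence, sentences):
    """Pipeline-style helper: `encode` is any text -> vector function
    (e.g. a SentenceTransformer's .encode). Returns one cosine
    similarity score per candidate sentence."""
    src = np.asarray(encode(source_sentence), dtype=float)
    return [
        float(np.dot(src, cand) / (np.linalg.norm(src) * np.linalg.norm(cand)))
        for cand in (np.asarray(encode(s), dtype=float) for s in sentences)
    ]

# Toy encoder (letter counts) so the sketch is self-contained;
# replace with a real model's encode function in practice.
def toy_encode(text):
    vec = np.zeros(26)
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1
    return vec

scores = sentence_similarity(toy_encode, "happy person", ["happy dog", "rainy day"])
```

Because cosine similarity is used, each score falls in [-1, 1] (and in [0, 1] here, since letter counts are non-negative).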

Hi @Narsil, that is the sentence-transformers API, but I want to compute sentence similarity with a T5 model. How can I do that?

Thank you

viethoang303 avatar Nov 06 '23 15:11 viethoang303
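In case it helps: one common recipe for encoder models like T5's (not an official transformers API for similarity) is to run the encoder (e.g. `T5EncoderModel`), mean-pool the token embeddings using the attention mask, and compare the pooled vectors by cosine similarity. A minimal NumPy sketch of just the pooling and similarity arithmetic, with random arrays standing in for the encoder's `last_hidden_state`:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (seq_len, hidden), e.g. an encoder's last_hidden_state
    # attention_mask: (seq_len,) with 1 for real tokens, 0 for padding
    mask = attention_mask[:, None].astype(float)
    return (token_embeddings * mask).sum(axis=0) / max(mask.sum(), 1e-9)

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
# Random placeholders standing in for encoder outputs of two sentences
tokens1 = rng.normal(size=(5, 8))
tokens2 = rng.normal(size=(5, 8))
emb1 = mean_pool(tokens1, np.array([1, 1, 1, 1, 0]))  # last position is padding
emb2 = mean_pool(tokens2, np.array([1, 1, 1, 0, 0]))
score = cos_sim(emb1, emb2)
```

Masking before pooling matters: without it, padding positions would drag the sentence embedding toward the padding vectors.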

I think that measuring the distance between embeddings produced by any embedding-generation model would indeed be desirable. I'm open to helping if you want to work on that.

wilmeragsgh avatar Nov 21 '23 20:11 wilmeragsgh