Verify compatibility between `Data2VecVision` models and existing retrievers
Context
- Part of #2418
- After the simplification of `language_model.py` and `tokenization.py`, adding new supported model types in Haystack has been heavily simplified
- The entire framework is still oriented heavily towards question answering on text, and this assumption is embedded into the code in many parts of the stack
Goal
- Verify if any existing retriever can load an image retrieval model such as `Data2VecVision` with minor changes along the way
  - If it can, consider a small refactoring to make the code paths more generic (change `get_tokenizer` into `get_feature_extractor` and so on)
  - If it cannot in its current state, even with minor adaptation, consider creating a separate `ImageRetriever` class that can do that. Also evaluate whether the underlying stack (`Inferencer`, `Processor`, `AdaptiveModel`, etc.) can be leveraged, and to what degree.
An attempt to generalize `TableTextRetriever` to work with images quickly proved too complex for the scope of this issue.
Rather than modifying an existing retriever and risking breaking working code, I opted to clone `TableTextRetriever` and its stack of supporting classes and make the changes needed to support N models rather than just 3 (query, text and tables).
The goal of this issue then changes to the following:
- Create a multimodal retriever called `MultiModalRetriever` by generalizing the concepts introduced by `TableTextRetriever`
- It introduces a stack of new subclasses to support such a retriever, such as:
  - `MultiAdaptiveModel` (from `TriAdaptiveModel`)
  - `EmbeddingSimilarityHead` (from `TextSimilarityHead`)
  - `MultiModalSimilarityProcessor` (from `TableTextSimilarityProcessor`)
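
The move from three fixed encoders (query, text, tables) to N encoders can be sketched roughly as below. This is only an illustration of the idea, not Haystack code: `MultiEncoderSketch`, `rank`, and the encoder callables are hypothetical names, and dot-product similarity is an assumption made for the example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[object], Vector]


class MultiEncoderSketch:
    """Hypothetical stand-in: one query encoder plus N document encoders,
    keyed by content type, instead of one hard-coded slot per modality."""

    def __init__(self, query_encoder: Encoder, document_encoders: Dict[str, Encoder]):
        self.query_encoder = query_encoder
        self.document_encoders = document_encoders

    def embed_query(self, query: object) -> Vector:
        return self.query_encoder(query)

    def embed_document(self, content: object, content_type: str) -> Vector:
        # Dispatch on the content type rather than on a fixed set of attributes.
        if content_type not in self.document_encoders:
            raise ValueError(f"No encoder registered for content type '{content_type}'")
        return self.document_encoders[content_type](content)


def dot_product(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("Embeddings must share the same dimension")
    return sum(x * y for x, y in zip(a, b))


def rank(
    model: MultiEncoderSketch,
    query: object,
    documents: Sequence[Tuple[str, object]],
) -> List[Tuple[float, str, object]]:
    """Score (content_type, content) pairs against the query, best match first."""
    query_embedding = model.embed_query(query)
    scored = [
        (dot_product(query_embedding, model.embed_document(content, ctype)), ctype, content)
        for ctype, content in documents
    ]
    # Sort on the score only, so contents never need to be comparable.
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

Under this layout, supporting a new modality would mean registering one more entry in `document_encoders`, provided all encoders project into the same embedding space.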
Note that this retriever will NOT be tested in pipelines, only in isolation. It will also, most likely, stay undocumented. See https://github.com/deepset-ai/haystack/issues/2418 for the rationale.
Continues in #2857