Michael Tschannen

11 comments by Michael Tschannen

PaliGemma can process a stack of frames without architecture modifications. We also released preprocessing ops to subsample videos or extract frames with a fixed stride. There are fine-tuning configs for...
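To illustrate the idea (not the released preprocessing ops; the function and argument names below are made up), a minimal sketch of the two sampling strategies:

```python
import numpy as np

def sample_frames(video, num_frames=None, stride=None):
  """Subsample frames from a video array of shape [T, H, W, C].

  Either pick `num_frames` frames spread evenly over the clip, or take
  every `stride`-th frame. Illustrative only; the released big_vision
  ops may differ in naming and behavior.
  """
  t = video.shape[0]
  if num_frames is not None:
    idx = np.linspace(0, t - 1, num_frames).round().astype(int)
  else:
    idx = np.arange(0, t, stride)
  return video[idx]

# A "video" of 32 frames stacked along the first axis.
video = np.zeros((32, 224, 224, 3), dtype=np.uint8)
print(sample_frames(video, num_frames=8).shape)  # (8, 224, 224, 3)
print(sample_frames(video, stride=4).shape)      # (8, 224, 224, 3)
```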

This could be due to a mismatch in preprocessing. My first guess is that you're not lower-casing the texts; lower-casing gives the best results for EN-focused retrieval.
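As a minimal illustration (the helper name is made up; apply it before whatever tokenizer you use):

```python
def preprocess_texts(texts):
  # Lower-case queries before tokenization; the EN-focused checkpoints
  # give the best retrieval results on lower-cased text.
  return [t.lower() for t in texts]

queries = ["A Photo of a Golden Retriever", "Two People Riding Bikes"]
print(preprocess_texts(queries))
```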

To reproduce the results using our code (i.e. with this codebase), you can select and download your target model in the [demo colab](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP2_demo.ipynb) and run the code snippet below (you...

Hi, SigLIP has a MAP head (attention pooling head) instead of a CLS token. You can try using the MAP head output (`pre_logits`) instead of the CLS token representation.
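If it helps, here is a simplified, self-contained sketch of what a MAP head does (a learned probe attending over all tokens). It is not the exact big_vision `MAPHead`, which has additional components such as an MLP block:

```python
import jax, jax.numpy as jnp
import flax.linen as nn

class MAPHead(nn.Module):
  """Multihead attention pooling: a learned probe attends over all tokens.

  Simplified sketch of SigLIP-style pooling (single probe, no MLP block);
  the real big_vision MAPHead differs in detail.
  """
  num_heads: int = 8

  @nn.compact
  def __call__(self, x):                      # x: [B, N, D] token embeddings
    d = x.shape[-1]
    probe = self.param('probe', nn.initializers.xavier_uniform(), (1, 1, d))
    probe = jnp.tile(probe, (x.shape[0], 1, 1))
    out = nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(probe, x)
    return out[:, 0]                          # pooled representation, [B, D]

tokens = jnp.zeros((2, 196, 768))             # e.g. ViT patch tokens
pooled, _ = MAPHead().init_with_output(jax.random.PRNGKey(0), tokens)
print(pooled.shape)                           # (2, 768)
```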

SigLIP 2 was trained with text length 64. The big_vision Gemma tokenizer implementation will pad/truncate to 64 if you set length=64. I'm not sure how other implementations behave (it seems...
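As a rough sketch of the pad/truncate behavior (the helper below is illustrative, not the big_vision tokenizer op):

```python
def pad_or_truncate(token_ids, length=64, pad_id=0):
  # SigLIP 2 text towers were trained with sequence length 64, so token ids
  # should be truncated or right-padded to exactly 64 entries.
  return token_ids[:length] + [pad_id] * max(0, length - len(token_ids))

print(len(pad_or_truncate(list(range(10)))))   # 64
print(len(pad_or_truncate(list(range(100)))))  # 64
```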

This seems unrelated to big_vision and might be due to a flax update in the colab environment. You should be able to work around it by adding `!pip3 install flax==0.8.5` after...

If you are using the big_vision implementation, you can adapt the preprocessing to resize to 640 x 640, and the model will internally resize the positional embedding to the new...
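Roughly, only the target size in the image part of the preprocessing string changes (illustrative pp strings; your config will contain additional ops such as decoding and tokenization):

```python
# Image preprocessing at the original and the increased resolution; the model
# interpolates its positional embeddings to match the larger input.
pp_img_224 = 'resize(224)|value_range(-1, 1)'
pp_img_640 = 'resize(640)|value_range(-1, 1)'
```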

You can increase `max_seq_length` to get higher resolution images after preprocessing. The maximum sequence length which NaFlex models were trained on is 1024. If you use the model zero-shot, you...
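As a rough rule of thumb (assuming square images and the 16 x 16 patch size of the NaFlex checkpoints), the effective resolution grows with the square root of the sequence length:

```python
import math

def max_square_side(max_seq_length, patch_size=16):
  # The number of patches is bounded by max_seq_length, so for a square
  # image the side length is at most sqrt(max_seq_length) * patch_size.
  return math.isqrt(max_seq_length) * patch_size

for s in (256, 576, 1024):
  print(s, '->', max_square_side(s), 'px')   # 256 px, 384 px, 512 px
```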

Not in the SigLIP 2 tech report, but I've seen several open-weight VLMs using it.

The text-to-image and image-to-text recall metrics are not identical because the embedding of an image and the embedding of its corresponding text are not identical. In the former, the recall is computed for every text...
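A toy sketch of how the two directions are computed from the same similarity matrix, and why they can differ (made-up scores, 3 images / 3 texts):

```python
import numpy as np

def recall_at_1(sim):
  # sim[i, j] = similarity between query i and gallery item j, where the
  # ground-truth match for query i is gallery item i.
  return float(np.mean(sim.argmax(axis=1) == np.arange(sim.shape[0])))

# Rows: image embeddings, columns: text embeddings of the matching captions.
sim = np.array([[0.9, 0.1, 0.1],
                [0.8, 0.7, 0.1],
                [0.2, 0.9, 0.6]])

print('image->text recall@1:', recall_at_1(sim))    # rank texts per image: 1/3
print('text->image recall@1:', recall_at_1(sim.T))  # rank images per text: 2/3
```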