Michael Tschannen

11 comments by Michael Tschannen

PaliGemma can process a stack of frames without architecture modifications. We also released preprocessing ops to subsample videos or extract frames with a fixed stride. There are fine-tuning configs for...
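To illustrate the idea (not the released preprocessing ops; the function and argument names below are made up), a minimal sketch of the two sampling strategies:

```python
import numpy as np

def sample_frames(video, num_frames=None, stride=None):
  """Subsample frames from a video array of shape [T, H, W, C].

  Either pick `num_frames` frames spread evenly over the clip, or take
  every `stride`-th frame. Illustrative only; the released big_vision
  ops may differ in naming and behavior.
  """
  t = video.shape[0]
  if num_frames is not None:
    idx = np.linspace(0, t - 1, num_frames).round().astype(int)
  else:
    idx = np.arange(0, t, stride)
  return video[idx]

# A "video" of 32 frames stacked along the first axis.
video = np.zeros((32, 224, 224, 3), dtype=np.uint8)
print(sample_frames(video, num_frames=8).shape)  # (8, 224, 224, 3)
print(sample_frames(video, stride=4).shape)      # (8, 224, 224, 3)
```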

This could be due to a mismatch in preprocessing. My first guess is that you're not lower-casing the texts; lower-casing gives the best results for EN-focused retrieval.
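As a minimal illustration (the helper name is made up; apply it before whatever tokenizer you use):

```python
def preprocess_texts(texts):
  # Lower-case queries before tokenization; the EN-focused checkpoints
  # give the best retrieval results on lower-cased text.
  return [t.lower() for t in texts]

queries = ["A Photo of a Golden Retriever", "Two People Riding Bikes"]
print(preprocess_texts(queries))
```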

To reproduce the results using our code (i.e. with this codebase), you can select and download your target model in the [demo colab](https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP2_demo.ipynb) and run the code snippet below (you...

Hi, SigLIP has a MAP head (attention pooling head) instead of a CLS token. You can try using the MAP head output (`pre_logits`) instead of the CLS token representation.
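If it helps, here is a simplified, self-contained sketch of what a MAP head does (a learned probe attending over all tokens). It is not the exact big_vision `MAPHead`, which has additional components such as an MLP block:

```python
import jax, jax.numpy as jnp
import flax.linen as nn

class MAPHead(nn.Module):
  """Multihead attention pooling: a learned probe attends over all tokens.

  Simplified sketch of SigLIP-style pooling (single probe, no MLP block);
  the real big_vision MAPHead differs in detail.
  """
  num_heads: int = 8

  @nn.compact
  def __call__(self, x):                      # x: [B, N, D] token embeddings
    d = x.shape[-1]
    probe = self.param('probe', nn.initializers.xavier_uniform(), (1, 1, d))
    probe = jnp.tile(probe, (x.shape[0], 1, 1))
    out = nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(probe, x)
    return out[:, 0]                          # pooled representation, [B, D]

tokens = jnp.zeros((2, 196, 768))             # e.g. ViT patch tokens
pooled, _ = MAPHead().init_with_output(jax.random.PRNGKey(0), tokens)
print(pooled.shape)                           # (2, 768)
```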

SigLIP 2 was trained with text length 64. The big_vision Gemma tokenizer implementation will pad/truncate to 64 if you set length=64. I'm not sure how other implementations behave (it seems...
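As a rough sketch of the pad/truncate behavior (the helper below is illustrative, not the big_vision tokenizer op):

```python
def pad_or_truncate(token_ids, length=64, pad_id=0):
  # SigLIP 2 text towers were trained with sequence length 64, so token ids
  # should be truncated or right-padded to exactly 64 entries.
  return token_ids[:length] + [pad_id] * max(0, length - len(token_ids))

print(len(pad_or_truncate(list(range(10)))))   # 64
print(len(pad_or_truncate(list(range(100)))))  # 64
```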

This seems unrelated to big_vision and might be due to a flax update in the colab environment. You should be able to work around it by adding `!pip3 install flax==0.8.5` after...

If you are using the big_vision implementation, you can adapt the preprocessing to resize to 640 x 640, and the model will internally resize the positional embedding to the new...
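Roughly, only the target size in the image part of the preprocessing string changes (illustrative pp strings; your config will contain additional ops such as decoding and tokenization):

```python
# Image preprocessing at the original and the increased resolution; the model
# interpolates its positional embeddings to match the larger input.
pp_img_224 = 'resize(224)|value_range(-1, 1)'
pp_img_640 = 'resize(640)|value_range(-1, 1)'
```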

You can increase `max_seq_length` to get higher resolution images after preprocessing. The maximum sequence length which NaFlex models were trained on is 1024. If you use the model zero-shot, you...
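As a rough rule of thumb (assuming square images and the 16 x 16 patch size of the NaFlex checkpoints), the effective resolution grows with the square root of the sequence length:

```python
import math

def max_square_side(max_seq_length, patch_size=16):
  # The number of patches is bounded by max_seq_length, so for a square
  # image the side length is at most sqrt(max_seq_length) * patch_size.
  return math.isqrt(max_seq_length) * patch_size

for s in (256, 576, 1024):
  print(s, '->', max_square_side(s), 'px')   # 256 px, 384 px, 512 px
```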

Not in the SigLIP 2 tech report, but I've seen several open-weight VLMs using it.

The text-to-image and image-to-text recall metrics are not identical because the embedding of an image and the embedding of its corresponding text are not identical. In the former, the recall is computed for every text...
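A toy sketch of how the two directions are computed from the same similarity matrix, and why they can differ (made-up scores, 3 images / 3 texts):

```python
import numpy as np

def recall_at_1(sim):
  # sim[i, j] = similarity between query i and gallery item j, where the
  # ground-truth match for query i is gallery item i.
  return float(np.mean(sim.argmax(axis=1) == np.arange(sim.shape[0])))

# Rows: image embeddings, columns: text embeddings of the matching captions.
sim = np.array([[0.9, 0.1, 0.1],
                [0.8, 0.7, 0.1],
                [0.2, 0.9, 0.6]])

print('image->text recall@1:', recall_at_1(sim))    # rank texts per image: 1/3
print('text->image recall@1:', recall_at_1(sim.T))  # rank images per text: 2/3
```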