
Dimension mismatch when using Coca for VQA task

Open jemmyshin opened this issue 3 years ago • 5 comments

I use the generate endpoint to do a VQA task with the CoCa model, but I got this error (see the attached screenshot):

It seems that this issue does not happen in beam_search mode but does appear in top_k or top_p mode.

Also, when I change the max_seq_len parameter in generate I get different outcomes. For example, max_seq_len = 20 with generation_type = top_p does not raise this error, but max_seq_len = 78 with generation_type = top_p does.


Am I using this the wrong way?

jemmyshin avatar Apr 29 '23 03:04 jemmyshin

Hi @jemmyshin, I think there was an issue similar to this one that was fixed some time ago; any chance you are using an older version? Otherwise this is a bug, and I will check what the issue is.

gpucce avatar Apr 29 '23 07:04 gpucce

I used the code from the CoCa Colab, so it should be 2.18.0.

jemmyshin avatar May 01 '23 08:05 jemmyshin

Hi @jemmyshin, so there is indeed a small bug, but if I understand correctly you can probably already do what you want without any changes to the codebase. In the meantime I will open a PR.

The reason a longer max_seq_len throws an error is that the model is trained with a context length of 77, one slot of which is taken by a special token, so using 76 (the default) or less is the way to go. Note that this parameter only affects the context the model uses while generating, not the length of the generation.
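To make that constraint concrete, here is a tiny sketch of the bound being described (the helper name check_max_seq_len and its error message are my own, not part of open_clip):

```python
CONTEXT_LEN = 77  # CoCa's trained text context length

def check_max_seq_len(max_seq_len, context_len=CONTEXT_LEN):
    # one slot of the context is taken by a special token, so the
    # usable generation context is context_len - 1 = 76 (the default)
    usable = context_len - 1
    if max_seq_len > usable:
        raise ValueError(
            f"max_seq_len={max_seq_len} exceeds the usable context "
            f"({usable}); use {usable} or less to avoid a dimension mismatch"
        )
    return max_seq_len
```

This matches the behavior reported above: 20 (or 76) passes, while 78 fails.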

If I understand correctly, you are not getting an answer after your prompt; the reason for that is the tokenizer. If you replace text = ... with

import torch
import open_clip

text = open_clip.tokenize(["Question: what is the color of this billboard? Answer:"])
# drop the end-of-text token and the padding the tokenizer appends,
# so generation continues from the prompt instead of after EOT
text = text[:, :torch.where(text == 0)[1][0] - 1]

you should get the answer after the prompt. The issue is that the tokenizer adds padding and an end-of-text token by default; I will make a PR to fix this, but you should be able to try this already. Let me know if it actually works!
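For prompts of different lengths, the same trimming can be applied per sample. A minimal sketch, using plain Python lists of token ids to stand in for the tokenizer's tensor rows (the helper names are mine, not open_clip API):

```python
def trim_padding(tokens, pad_id=0):
    # mirror text[:, :torch.where(text == 0)[1][0] - 1] for one sample:
    # cut at the first padding token, and also drop the EOT token
    # that sits right before the padding
    if pad_id in tokens:
        first_pad = tokens.index(pad_id)
        return tokens[:first_pad - 1]
    return tokens

def trim_batch(batch, pad_id=0):
    # per-sample trimming; the results have different lengths, so the
    # model would need to be called once per prompt (or re-padded)
    return [trim_padding(t, pad_id) for t in batch]
```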

gpucce avatar May 04 '23 08:05 gpucce

Yes, that works for a batch of size 1, but probably not for batch_size > 1, since each question may have a different length. Also, the output somehow concatenates the prompt and the answer (see the attached screenshot):

Is there a way to separate them automatically (if input text is not None)?
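For example, something like the following could strip the prompt from the decoded output on the caller's side (a hypothetical helper, split_prompt_answer is my own name and not an open_clip API; decoded stands for the detokenized model output):

```python
def split_prompt_answer(decoded: str, prompt: str) -> str:
    # generate() returns prompt + continuation in one string, so
    # removing the prompt prefix leaves just the answer
    decoded = decoded.strip()
    prompt = prompt.strip()
    if decoded.startswith(prompt):
        return decoded[len(prompt):].strip()
    return decoded  # fall back to the full output if prefixes don't match
```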

jemmyshin avatar May 05 '23 03:05 jemmyshin

@jemmyshin Hi, can you share the full VQA code for CoCa? Thanks!

LixDemon avatar Feb 08 '24 02:02 LixDemon