
No EOT when long sequence is truncated?

Open bakachan19 opened this issue 2 years ago • 0 comments

Hi. I noticed that when the input text sequence is longer than 77 tokens, it is truncated down to 77 tokens. However, no EOT token is added at the end.

For example, for a short text I get the following tokenization, with EOT = 49407 as the last token.

tensor([[49406,   518,  8809,   631,  5284,   620,   530,  7395, 12188,   267,
           593,   836,  6377,   531,   518,  2184,   537,  3326,   536,   518,
         10223,   539,   518,  1771,   269,   997,   631,   536,  3651,  2581,
          1047,  8626,   530,   518,  2867,   267,   836,  6765,   525,   911,
          8809,  1519,  3326,   631,  2862, 13314,   269,   518,  2117,  7290,
         32231,   530,   518,  5994,   267,  5524,   320, 24894, 10506,   556,
           911, 11251,   269, 49407,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], device='cuda:0')

But with longer sequences, I do not see any EOT = 49407 token added.

tensor([[49406,   518,  2867, 15305,   320, 19663,  6368,  1655,   593,  5560,
          1047, 14646,  1630,   320,  3638, 10297,   269,   997,   631,  6470,
          1047,   530,   518,  3562,   267,   836,  2862,  6377,   531,   518,
         35186,   267,  1519,  3326,  9308,   531, 24210,   320,   750, 18949,
          2445,   269,   320,  1876,  3309,   320, 11122, 12726,   525,   518,
         48812,   539,   320, 10297,   267,  2339,   518,  3562,   320,  1499,
           267,  2050,  1139, 10506,   269,   997,   631, 17082, 12033,  9729,
          6721,   267,  5256,   556,  6212,   541, 39306]], device='cuda:0')

Is this something intended? If so, what is the reasoning behind it?

I also noticed that I get identical embedding values for different text sequences longer than 77 tokens, even though tokenization produces different tokens for each of them (but no EOT).
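My guess at why the embeddings collapse (please correct me if this is wrong): if ImageBind pools the text feature at `text.argmax(dim=-1)` the way openai/CLIP's `encode_text` does, then the pooling position is the index of the largest token id. EOT = 49407 is the largest id in the vocabulary, but when it is missing, BOS = 49406 at position 0 wins instead, so every over-long sequence gets pooled at the BOS position, and with causal attention that position sees only BOS. A minimal sketch of that pooling logic (the helper name `pool_index` is mine, just for illustration):

```python
BOS, EOT = 49406, 49407  # CLIP's start/end-of-text token ids

def pool_index(token_ids):
    """Index of the largest token id, mimicking CLIP's
    `text.argmax(dim=-1)` pooling position."""
    return max(range(len(token_ids)), key=lambda i: token_ids[i])

with_eot = [BOS, 518, 8809, EOT, 0, 0]        # short text: EOT present
no_eot   = [BOS, 518, 8809, 631, 5284, 620]   # truncated: no EOT

assert pool_index(with_eot) == 3  # lands on the EOT token, as intended
assert pool_index(no_eot) == 0    # falls back to BOS at position 0
```

If that is what is happening, it would explain why every truncated-without-EOT sequence maps to the same embedding.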

Also, from my understanding (please correct me if I am wrong), ImageBind uses CLIP. However, the CLIP implementation sets the last token back to EOT when truncating a long sequence: https://github.com/openai/CLIP/blob/a1d071733d7111c9c014f024669f959182114e33/clip/clip.py#L239C1-L240C39
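For reference, here is a minimal sketch of what I would have expected, based on how openai/CLIP handles `truncate=True`: after cutting the sequence to the context length, the last slot is overwritten with EOT (the function name `pad_or_truncate` is mine, just for illustration):

```python
BOS, EOT, CONTEXT_LEN = 49406, 49407, 77  # CLIP's special ids / context length

def pad_or_truncate(token_ids):
    """Wrap token ids with BOS/EOT, then zero-pad or truncate to
    CONTEXT_LEN. When truncating, overwrite the last slot with EOT,
    as openai/CLIP does with `truncate=True`."""
    seq = [BOS] + list(token_ids) + [EOT]
    if len(seq) > CONTEXT_LEN:
        seq = seq[:CONTEXT_LEN]
        seq[-1] = EOT  # keep the end-of-text marker after truncation
    else:
        seq = seq + [0] * (CONTEXT_LEN - len(seq))
    return seq

short = pad_or_truncate(range(10))   # padded: EOT at position 11
long = pad_or_truncate(range(200))   # truncated: EOT forced into slot 76
assert short[11] == EOT and long[-1] == EOT
```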

Any idea on what I am doing wrong?

Thanks.

bakachan19 · Jul 24 '23 13:07