
Question about concatenating the features

Open memesoo99 opened this issue 1 year ago • 0 comments

In the sample code provided, the features are concatenated before being processed by the encoder:

```python
features = torch.concat([video_tokenizer(video), audio_tokenizer(audio), time_series_tokenizer(time_data)], dim=1)
```

However, when I ran the tokenizers for different modalities, the tokenized shapes were not identical. For example, an image is tokenized as [B, HW, C] and text is tokenized as [B, 1, C], where C is the embedding dimension (768).
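For what it's worth, `torch.cat` along `dim=1` only requires the batch and embedding dimensions to match, so differing token counts per modality are not themselves a problem. A minimal sketch (the shape values here are assumed for illustration, not taken from Meta-Transformer):

```python
import torch

B, C = 2, 768           # batch size, embedding dim (assumed values)
num_patches = 14 * 14   # e.g. a 14x14 patch grid -> HW = 196

img_tokens = torch.randn(B, num_patches, C)  # [B, HW, C]
text_tokens = torch.randn(B, 1, C)           # [B, 1, C]

# Concatenating along the token dimension (dim=1) works because
# dims 0 and 2 agree; the sequence lengths simply add up.
features = torch.cat([img_tokens, text_tokens], dim=1)
print(features.shape)  # torch.Size([2, 197, 768])
```

The encoder then just sees a longer token sequence; the open question is whether the model was trained to handle such mixed-length sequences, which this sketch does not answer.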

How are we supposed to process this? Also, is there sample code that uses a text tokenizer? It seems like the text_encoder is loading the wrong tokenizer: `CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")`.

memesoo99 avatar Apr 24 '24 07:04 memesoo99