Some questions about the prototype matrix
Hi, thanks for your code! I have some questions about the model. When we construct the prototype matrix(N_l x N_p x D), the 1xD vectors in it is derived from the whole image/sentence; However, when conducting subsequent operations of the Cross-modal Prototype Querying and the Cross-modal Prototype Responding, it is to look for the most suitable vector in the prototype matrix for each patch or word. Does this sound not so matching? image -patch, sentence - word?
Hi, thanks for your interest.
The cross-modal prototype matrix is to learn and record the cross-modal patterns for each class rather than the image/sentence features for some specific samples. Therefore, the cross-modal prototype is initialized from several clustered, concatenated features (high level statistics, extract the common pattern) instead of the representations from some specific samples. Note that those cross-modal patterns are designed for class-level rather than instance-level. Hence, they can be applied for both the images/sectences or tokens (fine-grained) within the same class. For example, some cross-modal patterns may guide the model how to decribe the content (style, detailed or brief).
In addition, some learned cross-modal patterns can also be fine-grained as samples within the class are grouped, hence each group may focus more on some parts of sentence or patches ( imagine you group ten different type of cars from the car category, what the model will fcous?)
Moreover, the initilization is to ensure that the cross-modal prototype matrix has a good semantic information at the begining. Through the design of cross-moal prototype quering and corresponding and the contrastive learning, the model will learn what patterns should be learned and recorded, and optimize the cross-modal matrix during the training.
Hope this can help you figure out the problem
Hi, thanks for your interest.
The cross-modal prototype matrix is to learn and record the cross-modal patterns for each class rather than the image/sentence features for some specific samples. Therefore, the cross-modal prototype is initialized from several clustered, concatenated features (high level statistics, extract the common pattern) instead of the representations from some specific samples. Note that those cross-modal patterns are designed for class-level rather than instance-level. Hence, they can be applied for both the images/sectences or tokens (fine-grained) within the same class. For example, some cross-modal patterns may guide the model how to decribe the content (style, detailed or brief).
In addition, some learned cross-modal patterns can also be fine-grained as samples within the class are grouped, hence each group may focus more on some parts of sentence or patches ( imagine you group ten different type of cars from the car category, what the model will fcous?)
Moreover, the initilization is to ensure that the cross-modal prototype matrix has a good semantic information at the begining. Through the design of cross-moal prototype quering and corresponding and the contrastive learning, the model will learn what patterns should be learned and recorded, and optimize the cross-modal matrix during the training.
Hope this can help you figure out the problem
Thank you for your reply. I almost understand. Can it be summarized in the following three points?
-
While the cluster is formed from instances, what it represents has risen to "categories".
-
In the clustering process, those instances/samples with large fine-grained similarity tend to cluster together. Thus, at initialization time, fine-grained information is included in the prototype.
3.This will be further optimized in subsequent training.
In addition, is there some ambiguity regarding the representation of r_j^s in the following figure in the paper? The subscript of r in Figure 1 represents a certain patches; The subscript of r in Figure 2 represents a certain sample. Maybe the r in Figure 2 should be bold? I don't know if I understand correctly.
Fig.1
Fig.2
