XProNet: Some questions about the prototype matrix

Hi, thanks for your code! I have some questions about the model. When we construct the prototype matrix (N_l x N_p x D), each 1 x D vector in it is derived from a whole image/sentence. However, the subsequent Cross-modal Prototype Querying and Cross-modal Prototype Responding steps look for the most suitable vectors in the prototype matrix for each patch or word. Isn't this a mismatch, with prototypes built at the image/sentence level being matched against patches/words?

Mar 22 '23 07:03 Eldo-rado

Hi, thanks for your interest.

The cross-modal prototype matrix is meant to learn and record the cross-modal patterns for each class rather than the image/sentence features of specific samples. Therefore, the cross-modal prototypes are initialized from clustered, concatenated features (high-level statistics that extract the common patterns) instead of from the representations of specific samples. Note that these cross-modal patterns are designed to be class-level rather than instance-level. Hence, they can be applied both to whole images/sentences and to tokens (fine-grained) within the same class. For example, some cross-modal patterns may guide the model on how to describe the content (style, detailed or brief).
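To make the initialization described above concrete, here is a minimal NumPy sketch. It is only an illustration, not the code in this repository: the function names `kmeans` and `init_prototypes`, the plain k-means routine, and the choice of D as the concatenated feature dimension are all assumptions made for the example. It also assumes one label per sample and at least N_p samples per class.

```python
import numpy as np

def kmeans(x, k, iters=20, seed=0):
    """Plain k-means (illustrative); returns k cluster centers of shape (k, D)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct samples chosen at random.
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance from every sample to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def init_prototypes(img_feats, txt_feats, labels, n_classes, n_protos):
    """Build an (N_l, N_p, D) prototype matrix from per-sample features.

    img_feats: (N, D_img) global image features (hypothetical input)
    txt_feats: (N, D_txt) global sentence features (hypothetical input)
    labels:    (N,) integer class label per sample
    """
    # Concatenate the two modalities so each row is one cross-modal feature.
    fused = np.concatenate([img_feats, txt_feats], axis=1)
    protos = np.zeros((n_classes, n_protos, fused.shape[1]))
    for c in range(n_classes):
        # Cluster centers summarize groups of samples, not single instances.
        protos[c] = kmeans(fused[labels == c], n_protos)
    return protos
```

Each prototype here is a cluster mean, so it already reflects what a group of similar samples has in common rather than any one image or sentence.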

In addition, some learned cross-modal patterns can also be fine-grained, because the samples within a class are grouped, so each group may focus more on certain parts of the sentence or certain patches (imagine you group ten different types of cars from the car category: what will the model focus on?).

Moreover, the initialization is there to ensure that the cross-modal prototype matrix carries good semantic information at the beginning. Through the design of cross-modal prototype querying and responding, together with the contrastive learning, the model learns which patterns should be learned and recorded, and optimizes the cross-modal prototype matrix during training.
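As a rough illustration of how class-level prototypes can be used at the token level, the sketch below lets each patch or word feature query the prototypes of its class and receive a weighted response. It is a simplified sketch under assumptions, not the repository's implementation: the function name `query_and_respond`, the dot-product similarity, the top-k selection with softmax weights, and the requirement that tokens and prototypes share the same dimension D are all choices made for the example.

```python
import numpy as np

def query_and_respond(tokens, class_protos, top_k=2):
    """Let each token query the prototypes of its class (illustrative).

    tokens:       (T, D) patch or word features of one sample
    class_protos: (N_p, D) prototypes of that sample's class
    Returns the per-token response (T, D) and selected indices (T, top_k).
    """
    # Querying: similarity between every token and every prototype.
    sims = tokens @ class_protos.T                      # (T, N_p)
    idx = np.argsort(-sims, axis=1)[:, :top_k]          # most similar first
    top_sims = np.take_along_axis(sims, idx, axis=1)    # (T, top_k)

    # Softmax over the selected similarities (shifted for stability).
    w = np.exp(top_sims - top_sims.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)

    # Responding: weighted sum of the selected prototypes per token.
    picked = class_protos[idx]                          # (T, top_k, D)
    response = (w[..., None] * picked).sum(axis=1)      # (T, D)
    return response, idx
```

Because the selection is done per token, different patches or words of the same sample can pick different prototypes, even though every prototype was initialized from sample-level features.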

Hope this helps you figure out the problem.

Mar 22 '23 10:03 Markin-Wang

Thank you for your reply. I think I mostly understand. Can it be summarized in the following three points?

1. Although the clusters are formed from instances, what they represent has been lifted to the "category" level.
2. During clustering, instances/samples with high fine-grained similarity tend to cluster together. Thus, fine-grained information is already included in the prototypes at initialization time.
3. The prototypes are further optimized in subsequent training.

In addition, is there some ambiguity in the notation r_j^s in the following figures from the paper? In Figure 1 the subscript of r denotes a particular patch, whereas in Figure 2 it denotes a particular sample. Maybe the r in Figure 2 should be bold? I am not sure whether I understand this correctly.

Fig. 1 [image] Fig. 2 [image]

Mar 22 '23 11:03 Eldo-rado