Fine-tune other models + pooling & vector normalization question
Hi -
First of all thanks for this great code base, it's really helpful.
I've been trying to use these scripts to fine-tune models other than BGE (multilingual-e5, since I need a more multilingual model and tokenizer), but the performance doesn't seem very good on my basic SentenceTransformers eval script.
I suspected the default CLS representation was at fault (why is this the default rather than mean pooling?), but I'm not sure; my tests don't show much of a difference.
I'm also wondering whether it could be linked to the normalize-vectors parameter? It isn't the default in other frameworks, so I'm curious why it seems to be here.
Thanks in advance for your insights!
Baudouin
Thanks for your interest in our work!
For e5, you should set --sentence_pooling_method mean, because e5 uses mean pooling.
The embeddings also need to be normalized, because e5's similarity score is cosine similarity.
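In plain `transformers` code, that combination looks roughly like the sketch below: mean pooling over non-padding tokens, then L2 normalization so the dot product equals cosine similarity. This is only an illustrative sketch, not the repo's training code; the checkpoint name and the "query:"/"passage:" prefixes are assumptions for the example.

```python
# Minimal sketch (not the repo's code): mean pooling + L2 normalization, e5-style.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "intfloat/multilingual-e5-base"  # assumed checkpoint, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def encode(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # ignore padding positions
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean pooling
    return F.normalize(pooled, p=2, dim=1)                 # unit vectors: dot == cosine

# e5 models expect "query: " / "passage: " prefixes on the inputs
q = encode(["query: how should e5 embeddings be pooled?"])
p = encode(["passage: e5 uses mean pooling and cosine similarity."])
print((q @ p.T).item())
```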
Great - thanks for the clear answer. Could you elaborate on why you chose the CLS token representation rather than mean pooling? The latter always seems to produce better results in the papers I've read. Thanks!
During pre-training we use CLS pooling to represent the sentence, so we use the same pooling method in fine-tuning.
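For contrast with the mean-pooling sketch above, CLS pooling simply takes the hidden state of the first token as the sentence vector, again L2-normalized. The checkpoint name below is just an assumption for illustration:

```python
# Sketch of CLS pooling (what BGE uses): sentence vector = hidden state at position 0.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "BAAI/bge-base-en-v1.5"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

batch = tokenizer(["an example sentence"], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**batch).last_hidden_state   # (batch, seq_len, dim)
cls_vec = F.normalize(hidden[:, 0], p=2, dim=1)  # [CLS] token, normalized for cosine
```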
Yep, I got that, but why did you choose CLS pooling during pre-training? I suppose it led to better accuracy? I'd love to hear more about that! I would have loved a research paper for BGE; it's really good ;)
We just followed the previous settings and have not tried mean pooling in pre-training.