mechigonft comments

Results: 58 comments by mechigonft

Is there an input length limit for code_string and later_code?

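Hello. What I mean is: when I continue training or run SFT on your base model and build my own training set, what is the input length limit for code_string and later_code? Is it the same 32K used in your pretraining? Also, does 32K refer to the number of tokens, i.e. about 32,000 tokens? Is there a script that can measure the token length of my training set? I would like to filter it, because otherwise my training results will suffer.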
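A minimal sketch of the kind of token-length filter asked about here, not the project's own script. The field names `code_string` and `later_code` come from the question; everything else is an assumption: the function name `filter_by_token_length` is hypothetical, the limit is taken as 32 * 1024 tokens (the thread does not say whether 32K means 32,000 or 32,768), and the limit is assumed to apply to both fields combined. The `tokenize` argument is any callable that returns a list of tokens. Whitespace splitting stands in for it below; in practice one would pass the model's own tokenizer, for example the `encode` method of a Hugging Face tokenizer.

```python
from typing import Callable, Dict, Iterable, List, Sequence


def filter_by_token_length(
    records: Iterable[Dict[str, str]],
    tokenize: Callable[[str], Sequence],
    max_tokens: int = 32 * 1024,  # assumed limit; adjust to the model's real context size
    fields: Sequence[str] = ("code_string", "later_code"),
) -> List[Dict[str, str]]:
    """Keep only the records whose combined token count fits within max_tokens."""
    kept = []
    for record in records:
        # Sum the token counts of every field that feeds the model input.
        total = sum(len(tokenize(record.get(field, ""))) for field in fields)
        if total <= max_tokens:
            kept.append(record)
    return kept


if __name__ == "__main__":
    samples = [
        {"code_string": "def f ( x ) :", "later_code": "return x"},
        {"code_string": "a " * 50, "later_code": "b " * 50},
    ]
    # str.split is only a stand-in tokenizer for this demonstration.
    short_enough = filter_by_token_length(samples, str.split, max_tokens=20)
    print(len(short_enough), "of", len(samples), "records kept")
```

Counting with the same tokenizer that training uses matters, because a whitespace count can differ a lot from a subword count.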

Two questions about details

The reranker's base model is xlm-roberta-base. Can you recommend a good Chinese BERT-style model to replace it with, so that I can then pretrain the reranker model again?

Training log shows: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

@staoxiao bge-embedding

Training log shows: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

![special](https://github.com/FlagOpen/FlagEmbedding/assets/90537707/f3026918-73f0-48ea-bf36-29627a60b302)

Training log shows: Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained. Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

The same message is also reported when loading the model and during inference.

Why am I asking this? Because I filtered my own training data once using the logic above and then started training, and a small number of warnings still appear: Be aware, overflowing tokens are not returned for the setting you have chosen, i.e. sequence pairs with the 'longest_first' truncation strategy. So the returned list will always be empty even...

Reranker 512-token calculation: please confirm

No, that is not it. I computed the length separately for query+pos and for query+neg.
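A rough sketch of the per-pair check described here, under stated assumptions and not taken from the repository. The record layout (a `query` string plus `pos` and `neg` lists of passages) is assumed, and the helper names `pair_length` and `overlong_pairs` are hypothetical. The default of 4 special tokens assumes the XLM-RoBERTa pair layout `<s> A </s></s> B </s>`. With a real Hugging Face tokenizer that number can be read from `tokenizer.num_special_tokens_to_add(pair=True)`. Whitespace splitting stands in for the real tokenizer.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def pair_length(
    query: str,
    passage: str,
    tokenize: Callable[[str], Sequence],
    num_special_tokens: int = 4,  # assumed: <s> A </s></s> B </s>
) -> int:
    """Approximate length of one (query, passage) input, special tokens included."""
    return len(tokenize(query)) + len(tokenize(passage)) + num_special_tokens


def overlong_pairs(
    example: Dict,
    tokenize: Callable[[str], Sequence],
    max_length: int = 512,
    num_special_tokens: int = 4,
) -> List[Tuple[str, int, int]]:
    """Return (kind, index, length) for every query+pos or query+neg pair above max_length."""
    flagged = []
    for kind in ("pos", "neg"):
        for index, passage in enumerate(example.get(kind, [])):
            length = pair_length(example["query"], passage, tokenize, num_special_tokens)
            if length > max_length:
                flagged.append((kind, index, length))
    return flagged


if __name__ == "__main__":
    example = {"query": "q1 q2", "pos": ["a b"], "neg": ["a b c d e", "x"]}
    # str.split is only a stand-in tokenizer for this demonstration.
    print(overlong_pairs(example, str.split, max_length=8))
```

Any pair this flags would still be truncated during training, which is one possible reason a truncation warning can persist after filtering.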

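Microsoft has already published a paper and a model that use an LLM to generate embeddings, reaching SOTA on the English leaderboard; looking forward to LLaRA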

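On the English leaderboard, intfloat/e5-mistral-7b-instruct leads second place by more than 2 points on average. Does that not count as "achieving a particularly large lead"? Second place leads third place by only 0.15 points.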