Investigate performance discrepancies in gte-Qwen and NV-embed models
From https://github.com/embeddings-benchmark/mteb/pull/1436
Hello,
I conducted a comparison of the models using the examples provided in the readme.md file for each model. Here's a summary of my findings:
- Alibaba-NLP/gte-Qwen2-7B-instruct
- Alibaba-NLP/gte-Qwen1.5-7B-instruct
- Alibaba-NLP/gte-Qwen2-1.5B-instruct
- Linq-AI-Research/Linq-Embed-Mistral
For these models, I found that all three implementations (i.e., Transformers AutoModel, sentence_transformers, and mteb) produce exactly the same embeddings. This consistency is great to see.
- nvidia/NV-Embed-v2
- nvidia/NV-Embed-v1
In these cases, the official implementation of Transformers AutoModel differs from the official sentence_transformers implementation, which is unexpected. The implementation in mteb aligns completely with sentence_transformers.
I also wanted to share the code I used for this comparison: View the Gist
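A condensed sketch of the approach is below (the gist linked above has the full version). The model name, sample texts, and tolerance are illustrative, and depending on the mteb version, the wrapper's `encode()` may or may not require a `task_name` argument:

```python
# Condensed sketch of the comparison; the gist has the full version.
# Assumptions: the model name, sample texts, and tolerance are illustrative,
# and recent mteb versions require a task_name argument for encode().
import numpy as np
import mteb
from sentence_transformers import SentenceTransformer

texts = ["how much protein should a female eat", "summit define"]

# Path 1: the official sentence_transformers usage from the model card
st_model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
st_emb = st_model.encode(texts)

# Path 2: the wrapper registered in mteb
mteb_model = mteb.get_model("Alibaba-NLP/gte-Qwen2-1.5B-instruct")
mteb_emb = np.asarray(mteb_model.encode(texts, task_name="STSBenchmark"))

# L2-normalize both so a normalization-only difference does not hide agreement,
# then check that the embeddings match up to numerical noise.
def l2n(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

print(np.allclose(l2n(st_emb), l2n(mteb_emb), atol=1e-5))
```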
Please note that questions regarding the correctness of prompt usage were outside the scope of this comparison. It does, however, show that the models added to mteb are implemented correctly.
P.S. I created a discussion in the nvidia repository about this problem.
The Qwen model repository includes a script to calculate scores for their models on the MTEB benchmark. I ran this script on the same tasks covered in my pull request.
The results from the original script are, in most cases, worse than those reported on the leaderboard and also fall short of the results obtained with the code from the mteb models. Here is the command I used to run the script:
OPENBLAS_NUM_THREADS=8 python scripts/eval_mteb.py -m Alibaba-NLP/gte-Qwen2-1.5B-instruct --output_dir results_qwen_2_1.5b_eval_mteb --task mteb
Additionally, there is an open discussion about this on the Qwen model repository.
Classification
| model | source | AmazonCounterfactualClassification | EmotionClassification | ToxicConversationsClassification |
|---|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 83.16 | 54.53 | 78.75 |
| gte-Qwen1.5-7B-instruct | Pull request | 81.78 | 54.91 | 77.25 |
| gte-Qwen1.5-7B-instruct | Original script | 67.87 | 46.08 | 59.06 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 83.99 | 61.37 | 82.66 |
| gte-Qwen2-1.5B-instruct | Pull request | 82.51 | 65.66 | 84.54 |
| gte-Qwen2-1.5B-instruct | Original script | 71.81 | 54.56 | 65.1 |
Clustering
| model | source | ArxivClusteringS2S | RedditClustering |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 51.45 | 73.37 |
| gte-Qwen1.5-7B-instruct | Pull request | 53.57 | 80.12 |
| gte-Qwen1.5-7B-instruct | Original script | 47.88 | 64.43 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 45.01 | 55.82 |
| gte-Qwen2-1.5B-instruct | Pull request | 44.61 | 51.36 |
| gte-Qwen2-1.5B-instruct | Original script | 41.1 | 52.53 |
PairClassification
| model | source | SprintDuplicateQuestions | TwitterSemEval2015 |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 96.07 | 79.36 |
| gte-Qwen1.5-7B-instruct | Pull request | 94.51 | 80.72 |
| gte-Qwen1.5-7B-instruct | Original script | 91.44 | 61.92 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 95.32 | 79.64 |
| gte-Qwen2-1.5B-instruct | Pull request | 91.19 | 75.93 |
| gte-Qwen2-1.5B-instruct | Original script | 93.87 | 74.59 |
Reranking
| model | source | SciDocsRR | AskUbuntuDupQuestions |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 87.89 | 66 |
| gte-Qwen1.5-7B-instruct | Pull request | 88.26 | 64.03 |
| gte-Qwen1.5-7B-instruct | Original script | 85.2 | 57.32 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 86.52 | 64.55 |
| gte-Qwen2-1.5B-instruct | Pull request | 85.67 | 62.33 |
| gte-Qwen2-1.5B-instruct | Original script | 83.51 | 60.47 |
Retrieval
| model | source | SCIDOCS | SciFact |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 27.69 | 75.31 |
| gte-Qwen1.5-7B-instruct | Pull request | 26.34 | 75.8 |
| gte-Qwen1.5-7B-instruct | Original script | 22.38 | 74.34 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 24.98 | 78.44 |
| gte-Qwen2-1.5B-instruct | Pull request | 23.4 | 77.47 |
| gte-Qwen2-1.5B-instruct | Original script | 21.92 | 75.81 |
STS
| model | source | STS16 | STSBenchmark |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 86.39 | 87.35 |
| gte-Qwen1.5-7B-instruct | Pull request | 85.98 | 86.86 |
| gte-Qwen1.5-7B-instruct | Original script | 81.33 | 83.65 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 85.45 | 86.38 |
| gte-Qwen2-1.5B-instruct | Pull request | 84.71 | 84.71 |
| gte-Qwen2-1.5B-instruct | Original script | 85.35 | 86.04 |
Summarization
| model | source | SummEval |
|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 31.46 |
| gte-Qwen1.5-7B-instruct | Pull request | 31.22 |
| gte-Qwen1.5-7B-instruct | Original script | 30.07 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 31.17 |
| gte-Qwen2-1.5B-instruct | Pull request | 30.5 |
| gte-Qwen2-1.5B-instruct | Original script | 28.99 |
From this it seems like we should update the scores on the leaderboard with the new reproducible scores. Since the authors have been made aware (issues on the NVIDIA and Qwen repositories), I believe this is a fair decision to make.
@AlexeyVatolin, have you run the models? Otherwise I will ask Niklas to rerun them.
I'm a member of the gte-Qwen model team. Sorry, we checked and found some errors in the previous script. It has now been updated and verified to be consistent with the results on the leaderboard. Please try again with the latest script and check the results.
@afalf, thanks a lot! I've run the gte-Qwen models with the updated script and will post the results as soon as I have them.
@afalf, I have reviewed the updated script and noticed a few minor errors that were preventing it from running. I plan to submit a pull request to your Hugging Face repository later. After correcting these issues, the results have been very promising. In fact, when using normalization (which I regrettably forgot to include last time, despite it being used in the example), the metrics slightly surpass those on the leaderboard. Could you please clarify whether the intended execution is with or without normalization?
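For clarity, "with normalization" here means L2-normalizing the embeddings before computing similarity. A minimal illustration (the vectors are made up, only the effect of normalization is shown):

```python
# Minimal illustration of the "with normalization" setting discussed above;
# the vectors are made up and only show the effect of L2-normalizing
# embeddings before taking dot-product similarity.
import numpy as np

query = np.array([0.3, 1.2, -0.5])
doc = np.array([0.6, 2.4, -1.0])

# Without normalization: the dot product is scale-dependent
raw_score = query @ doc

# With normalization: the dot product equals cosine similarity, scale-independent
qn = query / np.linalg.norm(query)
dn = doc / np.linalg.norm(doc)
norm_score = qn @ dn

print(raw_score, norm_score)  # 3.56 vs 1.0 (doc is just a scaled copy of query)
```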
Additionally, I compared the script with the code in mteb/gritlm and identified some differences. I have managed to adjust the model in mteb to produce results almost identical to those of the original script. You will find the corrections in my pull request. #1637
Here are the average scores:
| model | Leaderboard | Original script | Original script normalized | Pull request |
|---|---|---|---|---|
| gte-Qwen1.5-7B-instruct | 69.9129 | 69.5629 | 70.1543 | 69.5436 |
| gte-Qwen2-1.5B-instruct | 68.6643 | 68.33 | 68.74 | 68.6293 |
Classification
| model | source | AmazonCounterfactualClassification | EmotionClassification | ToxicConversationsClassification |
|---|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 83.16 | 54.53 | 78.75 |
| gte-Qwen1.5-7B-instruct | Pull request | 81.51 | 55.34 | 76.44 |
| gte-Qwen1.5-7B-instruct | Original script | 81.79 | 49.3 | 73.88 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 81.49 | 55.35 | 76.46 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 83.99 | 61.37 | 82.66 |
| gte-Qwen2-1.5B-instruct | Pull request | 85.81 | 64.67 | 82.93 |
| gte-Qwen2-1.5B-instruct | Original script | 84.04 | 61.04 | 82.29 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 85.82 | 64.68 | 82.94 |
Clustering
| model | source | ArxivClusteringS2S | RedditClustering |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 51.45 | 73.37 |
| gte-Qwen1.5-7B-instruct | Pull request | 53.16 | 80.14 |
| gte-Qwen1.5-7B-instruct | Original script | 53.17 | 80.06 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 53.16 | 80.03 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 45.01 | 55.82 |
| gte-Qwen2-1.5B-instruct | Pull request | 44.96 | 55.78 |
| gte-Qwen2-1.5B-instruct | Original script | 45.05 | 56.06 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 45.02 | 55.72 |
PairClassification
| model | source | SprintDuplicateQuestions | TwitterSemEval2015 |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 96.07 | 79.36 |
| gte-Qwen1.5-7B-instruct | Pull request | 94.96 | 80.94 |
| gte-Qwen1.5-7B-instruct | Original script | 94.98 | 80.94 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 94.96 | 80.95 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 95.32 | 79.64 |
| gte-Qwen2-1.5B-instruct | Pull request | 95.77 | 79.61 |
| gte-Qwen2-1.5B-instruct | Original script | 95.64 | 79.61 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 95.77 | 79.61 |
Reranking
| model | source | SciDocsRR | AskUbuntuDupQuestions |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 87.89 | 66 |
| gte-Qwen1.5-7B-instruct | Pull request | 85.14 | 58.03 |
| gte-Qwen1.5-7B-instruct | Original script | 87.62 | 64.4 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 87.61 | 64.4 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 86.52 | 64.55 |
| gte-Qwen2-1.5B-instruct | Pull request | 83.27 | 62.27 |
| gte-Qwen2-1.5B-instruct | Original script | 86.85 | 64.02 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 86.85 | 64.02 |
Retrieval
| model | source | SCIDOCS | SciFact |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 27.69 | 75.31 |
| gte-Qwen1.5-7B-instruct | Pull request | 26.1 | 76.33 |
| gte-Qwen1.5-7B-instruct | Original script | 25.71 | 76.58 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 25.73 | 76.57 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 24.98 | 78.44 |
| gte-Qwen2-1.5B-instruct | Pull request | 24.79 | 79.12 |
| gte-Qwen2-1.5B-instruct | Original script | 23.69 | 76.23 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 23.69 | 76.14 |
STS
| model | source | STS16 | STSBenchmark |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 86.39 | 87.35 |
| gte-Qwen1.5-7B-instruct | Pull request | 86.38 | 87.63 |
| gte-Qwen1.5-7B-instruct | Original script | 86.44 | 87.64 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 86.44 | 87.64 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 85.45 | 86.38 |
| gte-Qwen2-1.5B-instruct | Pull request | 84.85 | 85.92 |
| gte-Qwen2-1.5B-instruct | Original script | 84.92 | 86.06 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 84.92 | 86.06 |
Summarization
| model | source | SummEval |
|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 31.46 |
| gte-Qwen1.5-7B-instruct | Pull request | 31.51 |
| gte-Qwen1.5-7B-instruct | Original script | 31.37 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 31.37 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 31.17 |
| gte-Qwen2-1.5B-instruct | Pull request | 31.06 |
| gte-Qwen2-1.5B-instruct | Original script | 31.12 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 31.12 |
@AlexeyVatolin
Thanks a lot! We used execution with normalization. Sorry for these errors in our scripts.
I have reviewed the PR and everything looks good. @afalf you might want to resubmit the scores using the new scores given the improvements.
Okay, we will update the scores in our metadata and the results in https://github.com/embeddings-benchmark/results.
@afalf, I noticed that the model results from the eval_mteb.py script do not match the results from the example in the readme (example in gist). Maybe the readme should be updated to match eval_mteb.py?
Is AmazonCounterfactualClassification 83.16 or 86.15? Why does the leaderboard show 86.15? My reproduction gives 83.16, similar to the number reproduced in this issue above.
@yxchng, how do you run the evaluation?
import mteb

# get the reference implementation in MTEB; this is the implementation we assume is correct
model = mteb.get_model("{name}")

# get the task and run the evaluation
task = mteb.get_task("AmazonCounterfactualClassification")
evaluator = mteb.MTEB(tasks=[task])
results = evaluator.run(model)
@KennethEnevoldsen yes, this is how I run it. The results above also show 83.16. Is the result in the screenshot wrong?
I think the leaderboard shows the scores reported by the authors, but we can't reproduce them.
So, should we replace them with scores from our implementation if we can't reproduce them?
BTW, the results of nv-embed-v2 on FEVER, ClimateFEVER, and Touche are currently underestimated, I think due to misuse of the instruction. From my own testing, if the correct instructions are used (as stated in their paper), the results of nv-embed-v2 should be similar to or even higher than ours (FEVER 0.95, ClimateFEVER 0.45, Touche 0.65).
https://github.com/embeddings-benchmark/results/pull/205#issuecomment-2913333334
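To illustrate the point about instructions, here is a minimal sketch of how a task-specific instruction is prepended to retrieval queries for NV-Embed-style models, using the "Instruct: ...\nQuery: " prefix pattern from the NV-Embed model card. The FEVER-style instruction text is a placeholder (not necessarily the exact wording from the paper), and the full model-card setup (max sequence length, padding side, EOS handling) is omitted here:

```python
# Sketch of instruction-prefixed retrieval queries for NV-Embed-style models.
# Assumptions: the instruction text is a placeholder; the model-card setup
# (max_seq_length, padding side, EOS handling) is omitted for brevity.
from sentence_transformers import SentenceTransformer

task_instruction = "Given a claim, retrieve documents that support or refute the claim"
query_prefix = f"Instruct: {task_instruction}\nQuery: "

model = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)

queries = ["The Eiffel Tower is located in Berlin."]
passages = ["The Eiffel Tower is a wrought-iron lattice tower in Paris, France."]

# The instruction is applied only on the query side; passages are encoded as-is.
q_emb = model.encode([query_prefix + q for q in queries], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Cosine similarities between queries and passages
print(q_emb @ p_emb.T)
```

Using (or omitting) the right instruction changes the query embeddings and can therefore shift retrieval scores noticeably, which is consistent with the FEVER/ClimateFEVER/Touche gap described above.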