Investigate performance discrepancies in gte-Qwen and NV-embed models
From https://github.com/embeddings-benchmark/mteb/pull/1436
Hello,
I conducted a comparison of the models using the examples provided in the readme.md file for each model. Here's a summary of my findings:
- Alibaba-NLP/gte-Qwen2-7B-instruct
- Alibaba-NLP/gte-Qwen1.5-7B-instruct
- Alibaba-NLP/gte-Qwen2-1.5B-instruct
- Linq-AI-Research/Linq-Embed-Mistral
For these models, I found that all three implementations (i.e., Transformers AutoModel, sentence_transformers, and mteb) produce exactly the same embeddings. This consistency is great to see.
- nvidia/NV-Embed-v2
- nvidia/NV-Embed-v1
In these cases, the official implementation of Transformers AutoModel differs from the official sentence_transformers implementation, which is unexpected. The implementation in mteb aligns completely with sentence_transformers.
I also wanted to share the code I used for this comparison: View the Gist
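A condensed sketch of the approach is below (the gist linked above has the full version). The model name, sample texts, and tolerance are illustrative, and depending on the mteb version, the wrapper's `encode()` may or may not require a `task_name` argument:

```python
# Condensed sketch of the comparison; the gist has the full version.
# Assumptions: the model name, sample texts, and tolerance are illustrative,
# and recent mteb versions require a task_name argument for encode().
import numpy as np
import mteb
from sentence_transformers import SentenceTransformer

texts = ["how much protein should a female eat", "summit define"]

# Path 1: the official sentence_transformers usage from the model card
st_model = SentenceTransformer("Alibaba-NLP/gte-Qwen2-1.5B-instruct", trust_remote_code=True)
st_emb = st_model.encode(texts)

# Path 2: the wrapper registered in mteb
mteb_model = mteb.get_model("Alibaba-NLP/gte-Qwen2-1.5B-instruct")
mteb_emb = np.asarray(mteb_model.encode(texts, task_name="STSBenchmark"))

# L2-normalize both so a normalization-only difference does not hide agreement,
# then check that the embeddings match up to numerical noise.
def l2n(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

print(np.allclose(l2n(st_emb), l2n(mteb_emb), atol=1e-5))
```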
Please note that questions regarding the correctness of prompt usage were outside the scope of this comparison. It does, however, show that the models added to mteb are implemented correctly.
P.S. I created a discussion in the nvidia repository about this problem.
The Qwen model repository includes a script to calculate scores for their models on the MTEB benchmark. I ran this script on the same tasks covered in my pull request.
The results from the original script are, in most cases, worse than those reported on the leaderboard and also fall short of the results obtained with the code from the mteb models. Here is the command I used to run the script:
OPENBLAS_NUM_THREADS=8 python scripts/eval_mteb.py -m Alibaba-NLP/gte-Qwen2-1.5B-instruct --output_dir results_qwen_2_1.5b_eval_mteb --task mteb
Additionally, there is an open discussion about this on the Qwen model repository.
Classification
| model | source | AmazonCounterfactualClassification | EmotionClassification | ToxicConversationsClassification |
|---|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 83.16 | 54.53 | 78.75 |
| gte-Qwen1.5-7B-instruct | Pull request | 81.78 | 54.91 | 77.25 |
| gte-Qwen1.5-7B-instruct | Original script | 67.87 | 46.08 | 59.06 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 83.99 | 61.37 | 82.66 |
| gte-Qwen2-1.5B-instruct | Pull request | 82.51 | 65.66 | 84.54 |
| gte-Qwen2-1.5B-instruct | Original script | 71.81 | 54.56 | 65.1 |
Clustering
| model | source | ArxivClusteringS2S | RedditClustering |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 51.45 | 73.37 |
| gte-Qwen1.5-7B-instruct | Pull request | 53.57 | 80.12 |
| gte-Qwen1.5-7B-instruct | Original script | 47.88 | 64.43 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 45.01 | 55.82 |
| gte-Qwen2-1.5B-instruct | Pull request | 44.61 | 51.36 |
| gte-Qwen2-1.5B-instruct | Original script | 41.1 | 52.53 |
PairClassification
| model | source | SprintDuplicateQuestions | TwitterSemEval2015 |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 96.07 | 79.36 |
| gte-Qwen1.5-7B-instruct | Pull request | 94.51 | 80.72 |
| gte-Qwen1.5-7B-instruct | Original script | 91.44 | 61.92 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 95.32 | 79.64 |
| gte-Qwen2-1.5B-instruct | Pull request | 91.19 | 75.93 |
| gte-Qwen2-1.5B-instruct | Original script | 93.87 | 74.59 |
Reranking
| model | source | SciDocsRR | AskUbuntuDupQuestions |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 87.89 | 66 |
| gte-Qwen1.5-7B-instruct | Pull request | 88.26 | 64.03 |
| gte-Qwen1.5-7B-instruct | Original script | 85.2 | 57.32 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 86.52 | 64.55 |
| gte-Qwen2-1.5B-instruct | Pull request | 85.67 | 62.33 |
| gte-Qwen2-1.5B-instruct | Original script | 83.51 | 60.47 |
Retrieval
| model | source | SCIDOCS | SciFact |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 27.69 | 75.31 |
| gte-Qwen1.5-7B-instruct | Pull request | 26.34 | 75.8 |
| gte-Qwen1.5-7B-instruct | Original script | 22.38 | 74.34 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 24.98 | 78.44 |
| gte-Qwen2-1.5B-instruct | Pull request | 23.4 | 77.47 |
| gte-Qwen2-1.5B-instruct | Original script | 21.92 | 75.81 |
STS
| model | source | STS16 | STSBenchmark |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 86.39 | 87.35 |
| gte-Qwen1.5-7B-instruct | Pull request | 85.98 | 86.86 |
| gte-Qwen1.5-7B-instruct | Original script | 81.33 | 83.65 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 85.45 | 86.38 |
| gte-Qwen2-1.5B-instruct | Pull request | 84.71 | 84.71 |
| gte-Qwen2-1.5B-instruct | Original script | 85.35 | 86.04 |
Summarization
| model | source | SummEval |
|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 31.46 |
| gte-Qwen1.5-7B-instruct | Pull request | 31.22 |
| gte-Qwen1.5-7B-instruct | Original script | 30.07 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 31.17 |
| gte-Qwen2-1.5B-instruct | Pull request | 30.5 |
| gte-Qwen2-1.5B-instruct | Original script | 28.99 |
From this it seems like we should update the scores on the leaderboard with the new reproducible scores. Since the authors have been made aware (issues on the NVIDIA and Qwen repositories), I believe this is a fair decision to make.
@AlexeyVatolin, have you run the models? Otherwise I will ask Niklas to rerun them.
I'm a member of the gte-Qwen model team. Sorry, we checked and found some errors in the previous script. It has now been updated and verified to be consistent with the results on the leaderboard. Please try again with the latest script and check the results.
@afalf, thanks a lot! I've run the gte-Qwen models with the updated script and will post the results as soon as I have them.
@afalf, I have reviewed the updated script and noticed a few minor errors that were preventing it from running. I plan to submit a pull request to your Hugging Face repository later. After correcting these issues, the results have been very promising. In fact, when using normalization (which I regrettably forgot to include last time, despite it being used in the example), the metrics slightly surpass those on the leaderboard. Could you please clarify whether the intended execution is with or without normalization?
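For clarity, "with normalization" here means L2-normalizing the embeddings before computing similarity. A minimal illustration (the vectors are made up, only the effect of normalization is shown):

```python
# Minimal illustration of the "with normalization" setting discussed above;
# the vectors are made up and only show the effect of L2-normalizing
# embeddings before taking dot-product similarity.
import numpy as np

query = np.array([0.3, 1.2, -0.5])
doc = np.array([0.6, 2.4, -1.0])

# Without normalization: the dot product is scale-dependent
raw_score = query @ doc

# With normalization: the dot product equals cosine similarity, scale-independent
qn = query / np.linalg.norm(query)
dn = doc / np.linalg.norm(doc)
norm_score = qn @ dn

print(raw_score, norm_score)  # 3.56 vs 1.0 (doc is just a scaled copy of query)
```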
Additionally, I compared the script with the code in mteb/gritlm and identified some differences. I have managed to adjust the model in mteb to produce results almost identical to those of the original script. You will find the corrections in my pull request. #1637
Here are the average scores:
| model | Leaderboard | Original script | Original script normalized | Pull request |
|---|---|---|---|---|
| gte-Qwen1.5-7B-instruct | 69.9129 | 69.5629 | 70.1543 | 69.5436 |
| gte-Qwen2-1.5B-instruct | 68.6643 | 68.33 | 68.74 | 68.6293 |
Classification
| model | source | AmazonCounterfactualClassification | EmotionClassification | ToxicConversationsClassification |
|---|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 83.16 | 54.53 | 78.75 |
| gte-Qwen1.5-7B-instruct | Pull request | 81.51 | 55.34 | 76.44 |
| gte-Qwen1.5-7B-instruct | Original script | 81.79 | 49.3 | 73.88 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 81.49 | 55.35 | 76.46 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 83.99 | 61.37 | 82.66 |
| gte-Qwen2-1.5B-instruct | Pull request | 85.81 | 64.67 | 82.93 |
| gte-Qwen2-1.5B-instruct | Original script | 84.04 | 61.04 | 82.29 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 85.82 | 64.68 | 82.94 |
Clustering
| model | source | ArxivClusteringS2S | RedditClustering |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 51.45 | 73.37 |
| gte-Qwen1.5-7B-instruct | Pull request | 53.16 | 80.14 |
| gte-Qwen1.5-7B-instruct | Original script | 53.17 | 80.06 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 53.16 | 80.03 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 45.01 | 55.82 |
| gte-Qwen2-1.5B-instruct | Pull request | 44.96 | 55.78 |
| gte-Qwen2-1.5B-instruct | Original script | 45.05 | 56.06 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 45.02 | 55.72 |
PairClassification
| model | source | SprintDuplicateQuestions | TwitterSemEval2015 |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 96.07 | 79.36 |
| gte-Qwen1.5-7B-instruct | Pull request | 94.96 | 80.94 |
| gte-Qwen1.5-7B-instruct | Original script | 94.98 | 80.94 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 94.96 | 80.95 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 95.32 | 79.64 |
| gte-Qwen2-1.5B-instruct | Pull request | 95.77 | 79.61 |
| gte-Qwen2-1.5B-instruct | Original script | 95.64 | 79.61 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 95.77 | 79.61 |
Reranking
| model | source | SciDocsRR | AskUbuntuDupQuestions |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 87.89 | 66 |
| gte-Qwen1.5-7B-instruct | Pull request | 85.14 | 58.03 |
| gte-Qwen1.5-7B-instruct | Original script | 87.62 | 64.4 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 87.61 | 64.4 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 86.52 | 64.55 |
| gte-Qwen2-1.5B-instruct | Pull request | 83.27 | 62.27 |
| gte-Qwen2-1.5B-instruct | Original script | 86.85 | 64.02 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 86.85 | 64.02 |
Retrieval
| model | source | SCIDOCS | SciFact |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 27.69 | 75.31 |
| gte-Qwen1.5-7B-instruct | Pull request | 26.1 | 76.33 |
| gte-Qwen1.5-7B-instruct | Original script | 25.71 | 76.58 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 25.73 | 76.57 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 24.98 | 78.44 |
| gte-Qwen2-1.5B-instruct | Pull request | 24.79 | 79.12 |
| gte-Qwen2-1.5B-instruct | Original script | 23.69 | 76.23 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 23.69 | 76.14 |
STS
| model | source | STS16 | STSBenchmark |
|---|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 86.39 | 87.35 |
| gte-Qwen1.5-7B-instruct | Pull request | 86.38 | 87.63 |
| gte-Qwen1.5-7B-instruct | Original script | 86.44 | 87.64 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 86.44 | 87.64 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 85.45 | 86.38 |
| gte-Qwen2-1.5B-instruct | Pull request | 84.85 | 85.92 |
| gte-Qwen2-1.5B-instruct | Original script | 84.92 | 86.06 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 84.92 | 86.06 |
Summarization
| model | source | SummEval |
|---|---|---|
| gte-Qwen1.5-7B-instruct | Leaderboard | 31.46 |
| gte-Qwen1.5-7B-instruct | Pull request | 31.51 |
| gte-Qwen1.5-7B-instruct | Original script | 31.37 |
| gte-Qwen1.5-7B-instruct | Original script normalized | 31.37 |
| gte-Qwen2-1.5B-instruct | Leaderboard | 31.17 |
| gte-Qwen2-1.5B-instruct | Pull request | 31.06 |
| gte-Qwen2-1.5B-instruct | Original script | 31.12 |
| gte-Qwen2-1.5B-instruct | Original script normalized | 31.12 |
@AlexeyVatolin
Thanks a lot! We used execution with normalization. Sorry for these errors in our scripts.
I have reviewed the PR and everything looks good. @afalf you might want to resubmit the scores using the new scores given the improvements.
Okay, we will update the scores in our metadata and the results in https://github.com/embeddings-benchmark/results.
@afalf, I noticed that the model results from the eval_mteb.py script do not match the results from the example in the readme (example in gist). Maybe the readme should be updated to match eval_mteb.py?
Is AmazonCounterfactualClassification 83.16 or 86.15? Why does the leaderboard show 86.15? My reproduction gives 83.16, similar to the number reproduced in this issue above.
@yxchng, how do you run the evaluation?
import mteb

# get the reference implementation in MTEB; this is the implementation we assume is correct
model = mteb.get_model("{name}")

# get the task and run the evaluation
task = mteb.get_task("AmazonCounterfactualClassification")
evaluator = mteb.MTEB(tasks=[task])
results = evaluator.run(model)
@KennethEnevoldsen yes, this is how I run it. The results above also show 83.16. Is the result in the screenshot wrong?
I think the leaderboard shows the scores reported by the authors, but we can't reproduce them.
So, should we replace them with scores from our implementation if we can't reproduce them?
BTW, the results of nv-embed-v2 on FEVER, ClimateFEVER, and Touche are currently underestimated, I think due to misuse of the instruction. From my own testing, if the correct instructions are used (as stated in their paper), the results of nv-embed-v2 should be similar to or even higher than ours (FEVER 0.95, ClimateFEVER 0.45, Touche 0.65).
https://github.com/embeddings-benchmark/results/pull/205#issuecomment-2913333334
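To illustrate the point about instructions, here is a minimal sketch of how a task-specific instruction is prepended to retrieval queries for NV-Embed-style models, using the "Instruct: ...\nQuery: " prefix pattern from the NV-Embed model card. The FEVER-style instruction text is a placeholder (not necessarily the exact wording from the paper), and the full model-card setup (max sequence length, padding side, EOS handling) is omitted here:

```python
# Sketch of instruction-prefixed retrieval queries for NV-Embed-style models.
# Assumptions: the instruction text is a placeholder; the model-card setup
# (max_seq_length, padding side, EOS handling) is omitted for brevity.
from sentence_transformers import SentenceTransformer

task_instruction = "Given a claim, retrieve documents that support or refute the claim"
query_prefix = f"Instruct: {task_instruction}\nQuery: "

model = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)

queries = ["The Eiffel Tower is located in Berlin."]
passages = ["The Eiffel Tower is a wrought-iron lattice tower in Paris, France."]

# The instruction is applied only on the query side; passages are encoded as-is.
q_emb = model.encode([query_prefix + q for q in queries], normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# Cosine similarities between queries and passages
print(q_emb @ p_emb.T)
```

Using (or omitting) the right instruction changes the query embeddings and can therefore shift retrieval scores noticeably, which is consistent with the FEVER/ClimateFEVER/Touche gap described above.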