Qwen2.5-0.5B outputs JSON-like text instead of SID tokens during evaluation
Hi authors, thank you for releasing MiniOneRec! I am trying to reproduce the SFT results, but using a smaller backbone:
Base model: Qwen2.5-0.5B-Instruct
Dataset: Amazon-18 Musical Instruments (processed as described)
Training script: same as yours (LoRA + SFT)
Evaluation: unchanged evaluate.py
Training runs fine and the loss decreases normally. However, during evaluation, the model's predict field looks like this:
... { "501": { "780": { "700": { ...
While the expected output should be SID tokens:
<a_56><b_249><c_7>
My questions:
Is Qwen2.5-0.5B too small to follow the strict SID token format?
Does MiniOneRec require freezing the base LLM during SFT?
Is the natural-language prompt (e.g., “Can you predict the next item…?”) incompatible with the SFT format?
Should I enforce a structured prompt template for evaluation?
Do you recommend using only <a_XX><b_XX><c_XX>-style prompts without natural language for smaller backbones?
Any suggestions would be greatly appreciated. Thank you!
Thank you for your interest in our work!
Based on your description, the issue is most likely that the constrained decoding part is not configured correctly. Its code is located in minionerec_trainer, in the prefixID section; please check whether that part is set up correctly. When it is configured properly, the model can only output valid item SIDs.
As for your other questions: Qwen2.5-0.5B is sufficient for a basic reproduction and can understand natural language, and SFT does not require freezing the base LLM. Wishing you success with your experiments!
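P.S. In case it helps while debugging, prefix-constrained decoding with Hugging Face `generate` looks roughly like the sketch below. This is only an illustration of the general mechanism, not the repo's actual prefixID implementation; `sid_strings`, the prompt, and the beam sizes are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # swap in your SFT checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder catalog: in practice this would be the SID of every valid item.
sid_strings = ["<a_56><b_249><c_7>", "<a_56><b_51><c_119>", "<a_158><b_206><c_66>"]

# Build a prefix tree (trie) over the token ids of every valid SID.
sid_trie = {}
for sid in sid_strings:
    node = sid_trie
    for tid in tokenizer.encode(sid, add_special_tokens=False):
        node = node.setdefault(tid, {})

prompt = "Can you predict the next item the user will interact with?"  # placeholder
inputs = tokenizer(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # Walk the trie along what has been generated so far;
    # only the children of the current node are legal next tokens.
    node = sid_trie
    for tid in input_ids[prompt_len:].tolist():
        if tid not in node:
            return [tokenizer.eos_token_id]
        node = node[tid]
    return list(node.keys()) or [tokenizer.eos_token_id]

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=24,
        num_beams=4,
        num_return_sequences=4,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
    )
print(tokenizer.batch_decode(out[:, prompt_len:], skip_special_tokens=False))
```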
Hi, thanks for the great project!
I successfully enabled the SID-constrained generation — the model now produces valid semantic-ID outputs, so that part works well. Thank you!
However, the evaluation performance is extremely low on my side. For example, on Musical Instruments, the results are:
<a_56><b_51><c_119>
<a_56><b_51><c_119>
<a_56><b_51><c_119>
816it [00:00, 8522.52it/s]
50
[1, 3, 5, 10, 20, 50]
NDCG: [0.0122549 0.02195651 0.02554362 0.0370783 0.04688769 0.05256056]
HR [0.0122549 0.02941176 0.0379902 0.07352941 0.11151961 0.13970588]
8682
Completed processing for category: Musical_Instruments
Results saved to: ./results/final_checkpoint/final_result_Musical_Instruments.json
----------------------------------------
All categories processed!
This is much lower than expected.
I used the official training/evaluation pipeline and the official dataset format (<a_x><b_y><c_z> title id>). The SID constraints are correctly applied (I checked that generation outputs are valid SIDs), but the final ranking accuracy is still very poor.
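For reference, my quick format check was roughly the following; the result path and the "predict" field reflect my local output files and may not match the official schema exactly.

```python
import json
import re

# Check that every saved prediction is a well-formed <a_x><b_y><c_z> SID.
SID_PATTERN = re.compile(r"^<a_\d+><b_\d+><c_\d+>$")

with open("./results/final_checkpoint/final_result_Musical_Instruments.json") as f:
    results = json.load(f)

malformed = [
    pred
    for sample in results
    for pred in sample.get("predict", [])
    if not SID_PATTERN.match(pred.strip())
]
print(f"malformed predictions: {len(malformed)}")
```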
Could you please advise what might be causing this performance issue, or whether there is something specific required for evaluation?
Thanks again for your work!
Hello!
The "8682" under "HR [0.0122549 0.02941176 0.0379902 0.07352941 0.11151961 0.13970588]" means that there are 8682 outputs that are not in item_dict. This indicates that the model is still producing a large number of invalid items.
Please check whether the info_file is configured correctly. You can also manually check whether the items written to the results/ directory actually exist in item_dict.
Feel free to reach out if you have any further questions!
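If it is easier, you can also count the misses directly from the saved results, along these lines. This is only a rough sketch: the "predict" field and both file paths are placeholders, so adapt them to your actual result and info_file formats.

```python
import json

# Count predictions that do not appear in item_dict (placeholders for local paths/fields).
with open("./results/final_checkpoint/final_result_Musical_Instruments.json") as f:
    results = json.load(f)
with open("path/to/info_file.json") as f:  # the info_file passed to evaluate.py
    item_dict = json.load(f)

missing = sum(
    1
    for sample in results
    for pred in sample.get("predict", [])
    if pred.strip() not in item_dict
)
print(f"predictions not found in item_dict: {missing}")
```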
Hello authors, thank you for your excellent work~ During reproduction, I found that my self-trained model sometimes fails to produce 50 results during inference, similar to the "8682" case above. The tail of my evaluation log looks like this:
<a_158><b_206><c_66>
<a_158><b_206><c_66>
<a_158><b_206><c_66>
<a_158><b_206><c_66>
<a_158><b_206><c_66>
<a_158><b_206><c_66>
<a_158><b_206><c_66>
4533it [00:01, 3368.78it/s]
50
[1, 3, 5, 10, 20, 50]
NDCG: [0.06287227 0.08072239 0.08715868 0.09401486 0.0973854 0.09965383]
HR [0.06287227 0.09331568 0.1089786 0.13037723 0.1436135 0.15486433]
92752
However, your publicly released model does not exhibit this issue. Could you please suggest a possible cause? I checked the decoded items, and they all met the SID specification; the "92752" count was due to empty results.