Reproduction Results of SFT and RL Are Lower Than Reported — Any Suggestions?

Open · maobenz opened this issue 1 month ago · 6 comments

Firstly, thank you for the great work!

We have reproduced the experiments using both SFT and RL, following the exact hyperparameters provided in the paper. The experiments were conducted on a single node with 8× H100 GPUs.

After SFT, our results are:

| Top-K | 1 | 3 | 5 | 10 | 20 | 50 |
|-------|--------|--------|--------|--------|--------|--------|
| NDCG  | 0.0671 | 0.0804 | 0.0865 | 0.0969 | 0.1045 | 0.1142 |
| HR    | 0.0671 | 0.0902 | 0.1050 | 0.1370 | 0.1672 | 0.2160 |

After RL, our results are:

| Top-K | 1 | 3 | 5 | 10 | 20 | 50 |
|-------|--------|--------|--------|--------|--------|--------|
| NDCG  | 0.0743 | 0.0883 | 0.0935 | 0.1001 | 0.1052 | 0.1100 |
| HR    | 0.0743 | 0.0984 | 0.1107 | 0.1313 | 0.1516 | 0.1756 |
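For reference, these follow the standard single-target definitions: HR@K is the fraction of users whose ground-truth item appears in the top K, and with one relevant item per user NDCG@K reduces to 1/log2(rank + 1). This is also why NDCG@1 equals HR@1 in both tables. A minimal sketch of how we compute them, assuming a list of 1-based target ranks:

```python
import math

def hr_ndcg_at_k(ranks, ks=(1, 3, 5, 10, 20, 50)):
    """ranks: 1-based rank of the single ground-truth item for each user,
    or None when the target is absent from the generated candidates."""
    n = len(ranks)
    hr, ndcg = {}, {}
    for k in ks:
        hits = [r for r in ranks if r is not None and r <= k]
        hr[k] = len(hits) / n                                    # HR@K
        ndcg[k] = sum(1.0 / math.log2(r + 1) for r in hits) / n  # NDCG@K
    return hr, ndcg

# e.g. hr_ndcg_at_k([1, 4, None, 2]) gives HR@1 = NDCG@1 = 0.25
```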

However, these results are consistently lower than those reported in the paper. Since we did not change any hyperparameters on our side, we are wondering about the following:

1. Are there any additional implementation details that are not explicitly mentioned in the paper?
2. Is there any recommended training trick or configuration that significantly affects performance?
3. Are the reported results averaged over multiple runs, or obtained with a particular random seed? (For reference, we fix seeds as in the sketch after this list.)
4. Do you have any advice on improving reproducibility?
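For completeness, this is roughly how we fix seeds before each run (a minimal sketch; MiniOneRec's own entry point may already do the equivalent):

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Seed every RNG that affects training; the cuDNN flags trade speed for repeatability.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```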

Any insights or suggestions would be greatly appreciated!

maobenz · Dec 02 '25 09:12

Hi, thanks for your interest in our work!

MiniOneRec is fully reproducible in our environment. Regarding the issue you mentioned, we found that differences in dependency versions can sometimes lead to noticeable variations in performance. For example, earlier versions of trl caused a significant drop in model accuracy.

Since my compute resources are currently limited, could you please first make sure your setup matches the latest version of our code? I will re-verify full reproducibility on my side within the next few days.
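As a quick first check, you can print the installed versions and compare them against the repo's requirements file; a minimal sketch (the package list below is illustrative, not the exact pin set):

```python
import importlib.metadata as md

# Packages whose versions tend to matter for this kind of stack;
# adjust the list to match the repo's requirements file.
for pkg in ["trl", "transformers", "torch", "peft", "accelerate"]:
    try:
        print(f"{pkg}=={md.version(pkg)}")
    except md.PackageNotFoundError:
        print(f"{pkg}: not installed")
```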

If you have any further observations or updates, feel free to reach out at any time.

Thanks again for your patience and support!

AkaliKong · Dec 02 '25 10:12

We initially suspected that RL might be the cause of the issue. Could you confirm whether our SFT results are within the expected range? We compared our results with the figure in your paper, although it only reports HR@10. Based on this, we believe our results are comparable.

maobenz · Dec 02 '25 11:12

One version of our SFT-only model achieves the following metrics:

| Top-K | 1 | 3 | 5 | 10 | 20 | 50 |
|-------|--------|--------|--------|--------|--------|--------|
| NDCG  | 0.0655 | 0.0828 | 0.0907 | 0.0999 | 0.1095 | 0.1230 |
| HR    | 0.0655 | 0.0953 | 0.1145 | 0.1430 | 0.1811 | 0.2491 |

We then observed a clear performance improvement after applying reinforcement learning. Hope this information is helpful to you!

AkaliKong · Dec 02 '25 11:12

Thanks. But I have one more question: is that result from the Industrial_and_Scientific dataset? In Figure 1 of the paper, HR@10 after SFT is below 0.14 (about 0.135), but your result shows 0.1430. Is there a mistake somewhere?

maobenz · Dec 02 '25 12:12

Hello! After the paper was released, we made several updates to the code, including new RQ methods and fixes for some minor bugs, which led to some performance improvement.

AkaliKong · Dec 02 '25 12:12

My collaborators and I have verified that we can successfully reproduce MiniOneRec in our respective environments. We believe the issues you are seeing may be related to differences in dependency versions.

We wish you smooth experiments and look forward to your further feedback!

AkaliKong · Dec 02 '25 12:12

@maobenz Hi, were you able to figure this out?

I'm using the 1.5B model due to limited compute, and after SFT I get a test NDCG@10 of 0.0961, which seems in line with the original 7B results. But after 1 epoch of RL it drops to 0.0896 (test NDCG drops, although val NDCG does improve by 0.006, which is still not much).
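In case it matters, I'm also planning to pick the RL checkpoint by validation NDCG@10 instead of just taking the last epoch; a minimal sketch of what I mean (the checkpoint names and scores below are made up):

```python
# Hypothetical per-checkpoint validation NDCG@10, e.g. collected from eval logs.
val_ndcg = {
    "ckpt-500": 0.0951,
    "ckpt-1000": 0.0973,
    "ckpt-1500": 0.0968,
}
best_ckpt = max(val_ndcg, key=val_ndcg.get)
print(f"evaluate {best_ckpt} on test (val NDCG@10 = {val_ndcg[best_ckpt]:.4f})")
```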

ashleys0 · Dec 09 '25 07:12