
Facing an issue in replicating the numbers


Please suggest solutions for the following issues:

  1. Why are the results from your checkpoint not the same as my replicated ones, even though the seed was set to 42 in your code? Did you average over multiple runs and then report those numbers?

       **Results from your checkpoint** (close to the reported numbers):
    
       | Task     | Version | Metric | Value |   | Stderr |
       |----------|--------:|--------|------:|---|-------:|
       | copa     |       0 | acc    | 0.900 | ± | 0.0302 |
       | xcopa_sw |       0 | acc    | 0.750 | ± | 0.0194 |
       | xcopa_ta |       0 | acc    | 0.772 | ± | 0.0188 |
       | xcopa_zh |       0 | acc    | 0.852 | ± | 0.0159 |
    
    
       **Results from my replicated checkpoint** (with all hyperparameters the same):
    
       | Task     | Version | Metric | Value |   | Stderr |
       |----------|--------:|--------|------:|---|-------:|
       | copa     |       0 | acc    | 0.930 | ± | 0.0256 |
       | xcopa_sw |       0 | acc    | 0.744 | ± | 0.0195 |
       | xcopa_ta |       0 | acc    | 0.738 | ± | 0.0197 |
       | xcopa_zh |       0 | acc    | 0.836 | ± | 0.0166 |
    
  2. Also, changing the transformers version makes a huge difference at inference time; please check the last comment on issue #11.

ayushayush591 avatar Jan 27 '25 08:01 ayushayush591

Also, training again led to different results even though the seed is set to 42, which I confirmed in your code. Why is this happening? Did you face a similar problem?
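
For reference, setting the seed alone usually isn't enough for bit-identical PyTorch runs. Here's a minimal sketch of the extra determinism settings that are typically needed (an illustrative helper, not the LangBridge training code):

```python
import os
import random

import numpy as np
import torch


def set_full_determinism(seed: int = 42) -> None:
    """Seed every RNG and force deterministic kernels (hypothetical helper)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Required for deterministic cuBLAS matmuls on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Raise an error if any op falls back to a non-deterministic kernel.
    torch.use_deterministic_algorithms(True)
```

Even with all of this, some CUDA ops have no deterministic implementation, and data-loader worker seeding can still introduce run-to-run variance.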

ayushayush591 avatar Jan 29 '25 06:01 ayushayush591

Hey @ayushayush591, thank you for your interest in our work.

  1. COPA and XCOPA have relatively small val/test sets: 100 examples for COPA and 500 for XCOPA. I speculate that's what's causing the variance; at those sizes, swings of a few points are within roughly one standard error (see the sketch after this list). To me, your replication numbers don't seem too far off.

  2. I'm aware of that issue, but I currently don't have the capacity to investigate why. Sorry for the inconvenience.
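
As a quick back-of-the-envelope check (illustrative Python, not code from the repo), the binomial standard error of an accuracy estimate shows why swings of this size are expected:

```python
import math


def acc_stderr(p: float, n: int) -> float:
    """Standard error of an accuracy estimate over n examples."""
    return math.sqrt(p * (1 - p) / n)


# COPA has 100 examples and XCOPA 500, so a 0.90-vs-0.93 difference
# on COPA sits within about one standard error.
print(f"COPA  (n=100): +/-{acc_stderr(0.90, 100):.4f}")  # ~0.0300
print(f"XCOPA (n=500): +/-{acc_stderr(0.75, 500):.4f}")  # ~0.0194
```

These values line up with the Stderr column in the tables above.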

MattYoon avatar Feb 03 '25 06:02 MattYoon

Thank you for your response!

  1. That makes sense, but what I have observed is that when I retrain the same model with the same seed, I get different results. For example, on COPA in English I sometimes get 87, other times 90, and sometimes 93. It seems like the evaluation metric, i.e. log-likelihood, might not be consistent for the same examples across different runs, which could be causing the variance (see the sketch after this list).

  2. No worries! I will try to look into it and will let you know if I manage to find anything.
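
One way to test that speculation (a rough sketch; `sequence_logprob` is a hypothetical helper, not the lm-eval-harness implementation) is to score the same context/continuation pair twice with the same model and compare:

```python
import torch


@torch.no_grad()
def sequence_logprob(model, tokenizer, context: str, continuation: str) -> float:
    """Sum of log-probs the model assigns to `continuation` given `context`."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + continuation, return_tensors="pt").input_ids
    logits = model(full_ids).logits
    # Position i predicts token i+1, so shift before gathering target log-probs.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # Keep only the continuation tokens (ignores tokenizer merging at the boundary).
    return token_lp[0, ctx_len - 1:].sum().item()


# If two identical calls disagree, the scoring itself is non-deterministic:
# lp1 = sequence_logprob(model, tok, premise, choice)
# lp2 = sequence_logprob(model, tok, premise, choice)
# assert abs(lp1 - lp2) < 1e-6
```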

ayushayush591 avatar Feb 03 '25 06:02 ayushayush591

Hmm, I don't recall experiencing such an issue, but I'm not 100% sure.

Does the issue exist for other datasets, or only for COPA and XCOPA? Since those are the only non-generation tasks in the paper, your speculation might be correct. The lm-eval-harness code for log-likelihood might not work well on our custom architecture.

MattYoon avatar Feb 03 '25 08:02 MattYoon