ProgramFC
Results on GPT-4 are lower than the results presented in the paper?
Great job. I have some questions for the authors.
- I ran the code with GPT-4 as the program generator (N=1, gold) using the same parameter settings, but the macro-F1 on FEVEROUS is lower than the text-davinci-003 result presented in the GitHub repository:
  - FEVEROUS with GPT-4: 91.05
  - FEVEROUS with text-davinci-003: 92.32 (presented in the GitHub repository)

  I find this result very confusing.
- Are the results reported in the paper and in the GitHub repository computed on the full dataset or on a partially sampled subset?