Excessive length of generated correct/incorrect answers and the string-matching ASR evaluation standard
Hi, the attack method in your work is very interesting, but I noticed that when I run gen_adv.py from scratch, the generated correct and incorrect answers come out quite long. This is inconsistent with the short answers you provide in adv_targeted_results. Could you clarify whether you applied manual review or additional prompt constraints?
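For reference, this is the kind of workaround I have been experimenting with to match the short answers in adv_targeted_results: post-processing the raw model answer down to its first sentence and a word cap. `constrain_answer` is a hypothetical helper I wrote, not part of gen_adv.py; I'd like to know whether you did something similar (or used a prompt instruction like "answer in at most ten words").

```python
def constrain_answer(raw_answer: str, max_words: int = 10) -> str:
    """Hypothetical post-processing step (not in gen_adv.py):
    keep only the first sentence of the model's answer, then cap
    it at max_words words, so it resembles the short answers
    shipped in adv_targeted_results."""
    first_sentence = raw_answer.split(".")[0].strip()
    words = first_sentence.split()
    return " ".join(words[:max_words])


# Example: a verbose gpt-4o-mini style answer gets trimmed
long_answer = "Paris is the capital of France. It has been so since 987 AD."
print(constrain_answer(long_answer))  # → "Paris is the capital of France"
```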
The LLM I am using is gpt-4o-mini, but I saw in the issues that someone using gpt-4 observed the same behavior.
In addition, the correct-answer generation step in gen_adv.py calls the LLM twice (once with a direct query and once with the ground-truth document included) and compares the two responses via string matching. Because of the length issue above, a large number of queries fail the match and get skipped.
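As a possible mitigation on my side, I tried replacing the exact comparison with a more tolerant check that normalizes both responses and accepts a substring match in either direction. The function names here (`normalize`, `answers_match`) are my own sketch, not code from your repo; I'm unsure whether this matches the matching rule you intended:

```python
def normalize(text: str) -> str:
    # Lowercase and drop punctuation so trivial formatting
    # differences between the two LLM calls don't cause a mismatch.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def answers_match(direct_answer: str, grounded_answer: str) -> bool:
    # Hypothetical replacement for exact string equality: treat the
    # two answers as consistent if either normalized answer contains
    # the other, which tolerates one of them being much longer.
    a, b = normalize(direct_answer), normalize(grounded_answer)
    return a in b or b in a


print(answers_match("Paris.", "The capital of France is Paris"))  # → True
print(answers_match("London", "Paris"))                           # → False
```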
The same issue of overly long generated answers also breaks the string-matching-based ASR evaluation, causing a significant drop in the measured ASR.
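Concretely, for the ASR side I currently score a query like this, checking whether the targeted incorrect answer appears anywhere in the (possibly long) model output after whitespace/case normalization. `attack_succeeded` is my own approximation of the evaluation, so please correct me if your criterion differs:

```python
def attack_succeeded(model_output: str, incorrect_answer: str) -> bool:
    """Approximate ASR check (my assumption, not necessarily the
    paper's exact rule): the attack counts as successful if the
    targeted incorrect answer occurs as a substring of the model
    output, after lowercasing and collapsing whitespace."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(incorrect_answer) in norm(model_output)


print(attack_succeeded("I believe the answer is  London, actually.", "London"))  # → True
print(attack_succeeded("The answer is Paris.", "London"))                        # → False
```

With exact-match scoring instead of this substring check, almost every long answer is marked as a failure, which seems to explain the ASR drop I'm seeing.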
Could you please suggest how to reproduce the experiments and the results you provided?