Excessive length of generated correct/incorrect answers and the string-matching ASR evaluation standard
Hi, the attack method in your work is very interesting, but I noticed that when I run gen_adv.py from scratch, the generated correct and incorrect answers come out quite long. This is inconsistent with the short answers you provide in adv_targeted_results. Could you clarify whether you applied manual review or additional prompt constraints?
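For reference, this is the kind of workaround I have been experimenting with to match the short answers in adv_targeted_results: post-processing the raw model answer down to its first sentence and a word cap. `constrain_answer` is a hypothetical helper I wrote, not part of gen_adv.py; I'd like to know whether you did something similar (or used a prompt instruction like "answer in at most ten words").

```python
def constrain_answer(raw_answer: str, max_words: int = 10) -> str:
    """Hypothetical post-processing step (not in gen_adv.py):
    keep only the first sentence of the model's answer, then cap
    it at max_words words, so it resembles the short answers
    shipped in adv_targeted_results."""
    first_sentence = raw_answer.split(".")[0].strip()
    words = first_sentence.split()
    return " ".join(words[:max_words])


# Example: a verbose gpt-4o-mini style answer gets trimmed
long_answer = "Paris is the capital of France. It has been so since 987 AD."
print(constrain_answer(long_answer))  # → "Paris is the capital of France"
```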
The LLM I am using is gpt-4o-mini, but I saw in the issues that someone using gpt-4 observed the same behavior.
In addition, the correct-answer generation step in gen_adv.py calls the LLM twice (once with a direct query and once with the ground-truth document included) and compares the two responses via string matching. Because of the length issue above, a large number of queries fail the match and get skipped.
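As a possible mitigation on my side, I tried replacing the exact comparison with a more tolerant check that normalizes both responses and accepts a substring match in either direction. The function names here (`normalize`, `answers_match`) are my own sketch, not code from your repo; I'm unsure whether this matches the matching rule you intended:

```python
def normalize(text: str) -> str:
    # Lowercase and drop punctuation so trivial formatting
    # differences between the two LLM calls don't cause a mismatch.
    return "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace()).strip()


def answers_match(direct_answer: str, grounded_answer: str) -> bool:
    # Hypothetical replacement for exact string equality: treat the
    # two answers as consistent if either normalized answer contains
    # the other, which tolerates one of them being much longer.
    a, b = normalize(direct_answer), normalize(grounded_answer)
    return a in b or b in a


print(answers_match("Paris.", "The capital of France is Paris"))  # → True
print(answers_match("London", "Paris"))                           # → False
```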
The same issue of overly long generated answers also breaks the string-matching-based ASR evaluation, causing a significant drop in the measured ASR.
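Concretely, for the ASR side I currently score a query like this, checking whether the targeted incorrect answer appears anywhere in the (possibly long) model output after whitespace/case normalization. `attack_succeeded` is my own approximation of the evaluation, so please correct me if your criterion differs:

```python
def attack_succeeded(model_output: str, incorrect_answer: str) -> bool:
    """Approximate ASR check (my assumption, not necessarily the
    paper's exact rule): the attack counts as successful if the
    targeted incorrect answer occurs as a substring of the model
    output, after lowercasing and collapsing whitespace."""
    norm = lambda s: " ".join(s.lower().split())
    return norm(incorrect_answer) in norm(model_output)


print(attack_succeeded("I believe the answer is  London, actually.", "London"))  # → True
print(attack_succeeded("The answer is Paris.", "London"))                        # → False
```

With exact-match scoring instead of this substring check, almost every long answer is marked as a failure, which seems to explain the ASR drop I'm seeing.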
Could you please suggest how to reproduce the experiments and the results you provided?