Issues with Prompt Mutation in SAMMO with vLLM and Azure GPT-4 Integration
Hi Team,
We are currently using SAMMO with our own LLMs, including models served by vLLM (through its OpenAI-compatible local endpoint) and Azure GPT-4. However, we've encountered two bugs that we are unable to address because of how the system is encapsulated. The only part we've modified is the runner.
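For context, the runner setup is roughly along these lines (a simplified sketch only; the parameter names follow the SAMMO quickstart, and pointing the client at the local vLLM server is part of our modification, not stock SAMMO):

```python
# Simplified sketch of the runner setup (parameter names follow the SAMMO
# quickstart; routing requests to the local vLLM endpoint lives in our
# modified runner and is not part of stock SAMMO).
import os
from sammo.runners import OpenAIChat

runner = OpenAIChat(
    model_id="gpt-4",  # or the model name served by the local vLLM instance
    api_config={"api_key": os.getenv("OPENAI_API_KEY", "EMPTY")},
    cache="cache.tsv",
    timeout=30,
)
```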
Bug 1: Mutated Prompts Not Generated After Calling APO
When running APO, no mutated prompts are generated. Checking the automatically generated logs, the gradients appear to be generated successfully, but the meta-prompt that is supposed to apply these gradients fails to render `{{gradients}}` correctly:
```
[prompt to get gradient]
Give 1 reasons why the prompt could have gotten these examples wrong.Wrap each reason with <START> and </START>.
->
<START>One significant reason the prompt could have gotten these examples wrong is due to its reliance on explicit "yes" or "no" answers without considering the context or the implied meaning behind Speaker 2's responses. The prompt does not guide the model to interpret the nuances and indirect ways people often communicate agreement or disagreement, especially in conversational language where affirmations or negations can be implied rather than directly stated. This limitation makes it challenging for the model to accurately classify responses that are contextually affirmative or negative but do not contain clear "yes" or "no" keywords.<END>

[prompt to apply gradient]
Based on these examples the problem with this prompt is that:
{This section is unexpectedly empty}
Based on the above information, I wrote 1 different improved prompt. Each prompt is wrapped with <START> and </START>.
The 1 new prompts are:
```
Bug 2: Failure to Generate Mutated Prompts at Any Search Depth in APO
Regardless of the search depth setting, the system fails to generate mutated prompts. Below is a snippet from the logs showing this behavior:
```
USER ~/sammo$ python instruction_tuning_sammo.py --llm vllm --task-id implicatures --method apo
search depth[ ]0/8[00:00<00:00]
search depth[ ]0/8[00:00<00:01] >> eval[
search depth[ ]0/8[00:00<00:01] >> eval[ ]0/1 >> tasks[ ]0/100[00:00<00:01, 58
search depth[ ]0/8[00:00<00:01] >> eval[ ]0/1 >> tasks[######]100/100[00:00<00:00, 230768.85it/s]
10:07:39,270: Best at depth=-1: 0.33999999999999997
search depth[ ]0/8[00:00<00:01] >> mutate[
search depth[ ]0/8[00:00<00:01] >> mutate[###########################################################]1/1
10:07:39,349: Best: 0.33999999999999997
```
Fitting Log (1 entry):
| iteration | action | objective | costs | parse_errors | prev_actions |
|---|---|---|---|---|---|
| -1 | `{'decision_0': '"Does Speaker 2's answer mean yes or no?"'}` | 0.33999999999999997 | `{'input': 6169, 'output': 178}` | 0.0 | `{'decision_0': '"Does Speaker 2's answer mean yes or no?"'}` |
Action Stats:
| action | stats |
|---|---|
| inference | inference[ ]0/100[00:00<00:00] >> inference[###################################################]100/100[00:00<00:00] |
Test Score:
| name | value |
|---|---|
| score | 0.4 |
Any insights on how we can resolve these issues would be greatly appreciated.
Thank you for your assistance!
Hi there,
Can you share a minimal reproducible example? The second issue is probably a consequence of the first one. Happy to look into it.
@HenryLau7: In the newest release (0.2.5), I added a more in-depth test case for APO here: https://github.com/microsoft/sammo/blob/24ee50d83182d21f7581b4d38a907c66e723ea75/sammo/mutators_test.py#L107-L126
You can also set a breakpoint here: https://github.com/microsoft/sammo/blob/24ee50d83182d21f7581b4d38a907c66e723ea75/sammo/mutators.py#L515
Then use `(await Output(prompt_variants).arun(runner)).outputs.raw_values[0].plot_call_trace()` to see what intermediate values are being computed in the prompt program.
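For example, at the breakpoint it would look roughly like this (a sketch only; `prompt_variants` and `runner` are whatever is in scope there, and the `Output` import path may differ between releases):

```python
# Sketch: inspecting intermediate values at the suggested breakpoint.
# Assumes `prompt_variants` and `runner` are already in scope; the Output
# import path may differ between SAMMO releases.
from sammo.components import Output

async def inspect(prompt_variants, runner):
    result = await Output(prompt_variants).arun(runner)
    # Render the call trace of the first output to see each intermediate
    # value (gradients, rewritten prompts, ...) computed by the prompt program.
    result.outputs.raw_values[0].plot_call_trace()
```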
@t-schn: Hi, thanks for your help! Using `plot_call_trace()`, I was able to confirm that the gradients and mutated prompts are generated successfully. However, it seems that the mutated prompts aren't being evaluated via beam search: even when I set the depth to a larger value, the fitting log of the main program only shows traces for rounds -1 and 0. Could you assist me in debugging this issue? Is there anything additional I should provide?
@HenryLau7: It might be an issue with the objective function. It might help to run `fit_transform` and then look at the predicted values, or to set a breakpoint in the objective function. Can you also show the log?
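For reference, an objective along the lines of the quickstart accuracy function with a quick debug print would show what the mutated prompts actually return (a sketch only; exact import paths and `DataTable` accessors may differ between releases):

```python
# Rough sketch of an accuracy-style objective with a debug hook, loosely
# following the SAMMO quickstart; adjust imports and accessors to your release.
from sammo.base import EvaluationScore
from sammo.data import DataTable

def accuracy(y_true: DataTable, y_pred: DataTable) -> EvaluationScore:
    true_vals = y_true.outputs.normalized_values(on_empty="")
    pred_vals = y_pred.outputs.normalized_values(on_empty="")
    # Debug: inspect what the candidate prompt actually predicted.
    print(list(zip(true_vals, pred_vals))[:5])
    n_correct = sum(t == p for t, p in zip(true_vals, pred_vals))
    return EvaluationScore(n_correct / len(true_vals))
```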
@t-schn: Hi, thank you for your suggestion! We eventually found that the issue was caused by the `ExtractRegex` step failing when generating and applying the gradients: the model's output often didn't follow the prompt's required format of wrapping answers in <START> and </START>, and instead used <START> and <END> (which may simply be GPT-4's preference for such placeholders). After addressing this, APO is now functioning correctly.
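For anyone hitting the same problem, here is a minimal illustration of making the extraction tolerant to both closing tags (a sketch only, not the exact change we made):

```python
import re

# Accept either </START> or <END> as the closing tag, since GPT-4 sometimes
# emits <END> even when asked to wrap each reason in <START>...</START>.
PATTERN = re.compile(r"<START>(.*?)(?:</START>|<END>)", re.DOTALL)

def extract_wrapped(text: str) -> list[str]:
    return [m.strip() for m in PATTERN.findall(text)]

# Example: the gradient response from the log above would now be captured.
assert extract_wrapped("<START>reason one<END>") == ["reason one"]
```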
I have another question: I would like to reproduce SAMMO. Could you kindly provide a reproducible demo, especially one focusing on the construction of the search space (`InstructionTuningSearchSpace` and `BagOfMutators`)? Thanks a lot!
@HenryLau7: Great! Yes, older LLMs might not follow instructions as well. We will also integrate support for constrained decoding soon. Here are the examples from the paper, which are more advanced and show how to combine mutators: https://github.com/microsoft/sammo/tree/main/examples
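Roughly, combining mutators into a search looks like the sketch below (loosely based on the published examples; the constructor arguments are simplified assumptions and may differ between releases, and `search_space` stands in for whatever produces your initial candidates, e.g. an `InstructionTuningSearchSpace`):

```python
# Sketch of wiring mutators into a beam search, loosely following the SAMMO
# examples; argument names are simplified and may vary between releases.
from sammo.mutators import BagOfMutators, InduceInstructions, Paraphrase
from sammo.search import BeamSearch

def build_optimizer(runner, search_space, d_incontext, objective):
    # `search_space` is a callable producing initial candidate prompts.
    mutation_operators = BagOfMutators(
        search_space,
        InduceInstructions("#instructions", d_incontext),
        Paraphrase("#instructions"),
    )
    return BeamSearch(runner, mutation_operators, objective, maximize=True, depth=4)

# Usage (sketch): prompt_optimizer = build_optimizer(...); prompt_optimizer.fit(d_train)
```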