Issues with Prompt Mutation in SAMMO with vLLM and Azure GPT-4 Integration
Hi Team,
We are currently using SAMMO with our own LLMs, including models served by vLLM (through its OpenAI-compatible local endpoint) and Azure GPT-4. However, we've encountered two bugs that we are unable to address because of how the system is encapsulated. The only part we've modified is the runner.
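For context, the runner setup is roughly along these lines (a simplified sketch only; the parameter names follow the SAMMO quickstart, and pointing the client at the local vLLM server is part of our modification, not stock SAMMO):

```python
# Simplified sketch of the runner setup (parameter names follow the SAMMO
# quickstart; routing requests to the local vLLM endpoint lives in our
# modified runner and is not part of stock SAMMO).
import os
from sammo.runners import OpenAIChat

runner = OpenAIChat(
    model_id="gpt-4",  # or the model name served by the local vLLM instance
    api_config={"api_key": os.getenv("OPENAI_API_KEY", "EMPTY")},
    cache="cache.tsv",
    timeout=30,
)
```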
Bug 1: Mutated Prompts Not Generated After Calling APO
When running APO, no mutated prompts are generated. Checking the automatically generated logs, the gradients appear to be generated successfully, but the meta-prompt that is supposed to apply these gradients fails to render `{{gradients}}` correctly:
```
[prompt to get gradient]
Give 1 reasons why the prompt could have gotten these examples wrong.Wrap each reason with <START> and </START>.
->
<START>One significant reason the prompt could have gotten these examples wrong is due to its reliance on explicit "yes" or "no" answers without considering the context or the implied meaning behind Speaker 2's responses. The prompt does not guide the model to interpret the nuances and indirect ways people often communicate agreement or disagreement, especially in conversational language where affirmations or negations can be implied rather than directly stated. This limitation makes it challenging for the model to accurately classify responses that are contextually affirmative or negative but do not contain clear "yes" or "no" keywords.<END>

[prompt to apply gradient]
Based on these examples the problem with this prompt is that:
{This section is unexpectedly empty}
Based on the above information, I wrote 1 different improved prompt. Each prompt is wrapped with <START> and </START>.
The 1 new prompts are:
```
Bug 2: Failure to Generate Mutated Prompts at Any Search Depth in APO
Regardless of the search depth setting, the system fails to generate mutated prompts. Below is a snippet from the logs showing this behavior:
```
USER ~/sammo$ python instruction_tuning_sammo.py --llm vllm --task-id implicatures --method apo
search depth[ ]0/8[00:00<00:00]
search depth[ ]0/8[00:00<00:01] >> eval[
search depth[ ]0/8[00:00<00:01] >> eval[ ]0/1 >> tasks[ ]0/100[00:00<00:01, 58
search depth[ ]0/8[00:00<00:01] >> eval[ ]0/1 >> tasks[######]100/100[00:00<00:00, 230768.85it/s]
10:07:39,270: Best at depth=-1: 0.33999999999999997
search depth[ ]0/8[00:00<00:01] >> mutate[
search depth[ ]0/8[00:00<00:01] >> mutate[###########################################################]1/1
10:07:39,349: Best: 0.33999999999999997
```
Fitting Log (1 entry):
| iteration | action | objective | costs | parse_errors | prev_actions |
|---|---|---|---|---|---|
| -1 | `{'decision_0': '"Does Speaker 2's answer mean yes or no?"'}` | 0.33999999999999997 | `{'input': 6169, 'output': 178}` | 0.0 | `{'decision_0': '"Does Speaker 2's answer mean yes or no?"'}` |
Action Stats:
| action | stats |
|---|---|
| inference | inference[ ]0/100[00:00<00:00] >> inference[###################################################]100/100[00:00<00:00] |
Test Score:
| name | value |
|---|---|
| score | 0.4 |
Any insights on how we can resolve these issues would be greatly appreciated.
Thank you for your assistance!
Hi there,
Can you share a minimal reproducible example? The second issue is probably a consequence of the first one. Happy to look into it.
@HenryLau7: In the newest release (0.2.5), I added a more in-depth test case for APO here: https://github.com/microsoft/sammo/blob/24ee50d83182d21f7581b4d38a907c66e723ea75/sammo/mutators_test.py#L107-L126
You can also set a breakpoint here: https://github.com/microsoft/sammo/blob/24ee50d83182d21f7581b4d38a907c66e723ea75/sammo/mutators.py#L515
Then use `(await Output(prompt_variants).arun(runner)).outputs.raw_values[0].plot_call_trace()` to see what intermediate values are being computed in the prompt program.
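For example, at the breakpoint it would look roughly like this (a sketch only; `prompt_variants` and `runner` are whatever is in scope there, and the `Output` import path may differ between releases):

```python
# Sketch: inspecting intermediate values at the suggested breakpoint.
# Assumes `prompt_variants` and `runner` are already in scope; the Output
# import path may differ between SAMMO releases.
from sammo.components import Output

async def inspect(prompt_variants, runner):
    result = await Output(prompt_variants).arun(runner)
    # Render the call trace of the first output to see each intermediate
    # value (gradients, rewritten prompts, ...) computed by the prompt program.
    result.outputs.raw_values[0].plot_call_trace()
```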
@t-schn: Hi, thanks for your help! Using `plot_call_trace()`, I was able to confirm that the gradients and mutated prompts are generated successfully. However, it seems that the mutated prompts aren't being evaluated via beam search: even when I set the depth to a larger value, the fitting log of the main program only shows traces for rounds -1 and 0. Could you assist me in debugging this issue? Is there anything additional I should provide?
@HenryLau7: It might be an issue with the objective function. It might help to run `fit_transform` and then look at the predicted values, or to set a breakpoint in the objective function. Can you also show the log?
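For reference, an objective along the lines of the quickstart accuracy function with a quick debug print would show what the mutated prompts actually return (a sketch only; exact import paths and `DataTable` accessors may differ between releases):

```python
# Rough sketch of an accuracy-style objective with a debug hook, loosely
# following the SAMMO quickstart; adjust imports and accessors to your release.
from sammo.base import EvaluationScore
from sammo.data import DataTable

def accuracy(y_true: DataTable, y_pred: DataTable) -> EvaluationScore:
    true_vals = y_true.outputs.normalized_values(on_empty="")
    pred_vals = y_pred.outputs.normalized_values(on_empty="")
    # Debug: inspect what the candidate prompt actually predicted.
    print(list(zip(true_vals, pred_vals))[:5])
    n_correct = sum(t == p for t, p in zip(true_vals, pred_vals))
    return EvaluationScore(n_correct / len(true_vals))
```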
@t-schn: Hi, thank you for your suggestion! We eventually found that the issue was caused by the `ExtractRegex` step failing when generating and applying the gradients: the model's output often didn't follow the prompt's required format of wrapping answers in <START> and </START>, and instead used <START> and <END> (which may simply be GPT-4's preference for such placeholders). After addressing this, APO is now functioning correctly.
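For anyone hitting the same problem, here is a minimal illustration of making the extraction tolerant to both closing tags (a sketch only, not the exact change we made):

```python
import re

# Accept either </START> or <END> as the closing tag, since GPT-4 sometimes
# emits <END> even when asked to wrap each reason in <START>...</START>.
PATTERN = re.compile(r"<START>(.*?)(?:</START>|<END>)", re.DOTALL)

def extract_wrapped(text: str) -> list[str]:
    return [m.strip() for m in PATTERN.findall(text)]

# Example: the gradient response from the log above would now be captured.
assert extract_wrapped("<START>reason one<END>") == ["reason one"]
```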
I have another question: I would like to reproduce SAMMO. Could you kindly provide a reproducible demo, especially one focusing on the construction of the search space (`InstructionTuningSearchSpace` and `BagOfMutators`)? Thanks a lot!
@HenryLau7: Great! Yes, older LLMs might not follow instructions as well. We will also integrate support for constrained decoding soon. Here are the examples from the paper, which are more advanced and show how to combine mutators: https://github.com/microsoft/sammo/tree/main/examples
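Roughly, combining mutators into a search looks like the sketch below (loosely based on the published examples; the constructor arguments are simplified assumptions and may differ between releases, and `search_space` stands in for whatever produces your initial candidates, e.g. an `InstructionTuningSearchSpace`):

```python
# Sketch of wiring mutators into a beam search, loosely following the SAMMO
# examples; argument names are simplified and may vary between releases.
from sammo.mutators import BagOfMutators, InduceInstructions, Paraphrase
from sammo.search import BeamSearch

def build_optimizer(runner, search_space, d_incontext, objective):
    # `search_space` is a callable producing initial candidate prompts.
    mutation_operators = BagOfMutators(
        search_space,
        InduceInstructions("#instructions", d_incontext),
        Paraphrase("#instructions"),
    )
    return BeamSearch(runner, mutation_operators, objective, maximize=True, depth=4)

# Usage (sketch): prompt_optimizer = build_optimizer(...); prompt_optimizer.fit(d_train)
```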