
Issues with Prompt Mutation in SAMMO with vllm and Azure GPT4 Integration

Open HenryLau7 opened this issue 1 year ago • 6 comments

Hi Team,

We are currently using SAMMO with our own LLM deployments, including vllm (served through an OpenAI-compatible localhost endpoint) and Azure GPT-4. However, we've encountered two bugs that we are unable to address ourselves because of how the system is encapsulated. The only modification we've made is to the runner.

Bug 1: Mutated Prompts Not Generated After Calling APO

When running APO, no mutated prompts are generated. Checking the automatically generated logs, gradients appear to be produced successfully; however, the meta-prompt that is supposed to apply these gradients fails to render {{gradients}} correctly.

[prompt to get gradient]

Give 1 reasons why the prompt could have gotten these examples wrong.Wrap each reason with <START> and </START>.
->

<START>One significant reason the prompt could have gotten these examples wrong is due to its reliance on explicit "yes" or "no" answers without considering the context or the implied meaning behind Speaker 2's responses. The prompt does not guide the model to interpret the nuances and indirect ways people often communicate agreement or disagreement, especially in conversational language where affirmations or negations can be implied rather than directly stated. This limitation makes it challenging for the model to accurately classify responses that are contextually affirmative or negative but do not contain clear "yes" or "no" keywords.<END>

[prompt to apply gradient]


Based on these examples the problem with this prompt is that:

{This section is unexpectedly empty}

Based on the above information, I wrote 1 different improved prompt. Each prompt is wrapped with <START> and </START>.
The 1 new prompts are:
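
Note that {{...}}-style (Mustache/Handlebars) templates render an empty or missing variable as an empty string rather than raising an error, which matches the blank section above. A minimal illustration using the generic chevron renderer (chosen here only for illustration; SAMMO's own templating may differ):

```python
# Minimal illustration with a generic Mustache renderer (chevron), not SAMMO's
# actual templating engine: an empty or missing variable renders silently as "",
# which is exactly what the blank section above looks like.
import chevron  # pip install chevron

template = "Based on these examples the problem with this prompt is that:\n{{gradients}}"
print(chevron.render(template, {"gradients": ""}))  # placeholder renders empty
print(chevron.render(template, {}))                 # missing key also renders empty
```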

Bug 2: Failure to Generate Mutated Prompts at Any Search Depth in APO

Regardless of the search depth setting, the system fails to generate mutated prompts. Below is a snippet from the logs showing this behavior:

USER ~/sammo$ python instruction_tuning_sammo.py --llm vllm --task-id implicatures --method apo

search depth[                                                 ]0/8[00:00<00:00]
search depth[                ]0/8[00:00<00:01] >> eval[
search depth[                ]0/8[00:00<00:01] >> eval[    ]0/1 >> tasks[       ]0/100[00:00<00:01, 58
search depth[                ]0/8[00:00<00:01] >> eval[    ]0/1 >> tasks[######]100/100[00:00<00:00, 230768.85it/s]
10:07:39,270: Best at depth=-1: 0.33999999999999997
search depth[                ]0/8[00:00<00:01] >> mutate[
search depth[                ]0/8[00:00<00:01] >> mutate[###########################################################]1/1
10:07:39,349: Best: 0.33999999999999997

Fitting Log (1 entry):

iteration action objective costs parse_errors prev_actions
-1 {'decision_0': '"Does Speaker 2's answer mean yes or no?"'} 0.33999999999999997 {'input': 6169, 'output': 178} 0.0 {'decision_0': '"Does Speaker 2's answer mean yes or no?"'}

Action Stats:

action stats
inference inference[ ]0/100[00:00<00:00] >> inference[###################################################]100/100[00:00<00:00]

Test Score:

name value
score 0.4

Any insights on how we can resolve these issues would be greatly appreciated.

Thank you for your assistance!

HenryLau7 · Sep 24 '24 03:09

Hi there,

Can you share a minimal reproducible example? The second issue is probably a consequence of the first one. Happy to look into it.

t-schn · Sep 27 '24 20:09

@HenryLau7: In the newest release (0.2.5), I added a more in-depth test case for APO here: https://github.com/microsoft/sammo/blob/24ee50d83182d21f7581b4d38a907c66e723ea75/sammo/mutators_test.py#L107-L126

You can also set a breakpoint here: https://github.com/microsoft/sammo/blob/24ee50d83182d21f7581b4d38a907c66e723ea75/sammo/mutators.py#L515

Then use (await Output(prompt_variants).arun(runner)).outputs.raw_values[0].plot_call_trace() to see what intermediate values are being computed in the prompt program.
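
For convenience, a small async wrapper around that call might look like this (a sketch: it assumes you are paused at the breakpoint above, where runner and prompt_variants are in scope, and that Output is imported from sammo.components as in the repo's examples):

```python
# Sketch wrapping the one-liner above, assuming you are paused at the
# mutators.py breakpoint where `runner` and `prompt_variants` are in scope.
import asyncio
from sammo.components import Output  # import path as used in the repo's examples

async def inspect_variants(runner, prompt_variants):
    result = await Output(prompt_variants).arun(runner)
    # Render the call trace to inspect each intermediate value the prompt
    # program computed (gradients, rendered meta-prompt, extracted candidates).
    result.outputs.raw_values[0].plot_call_trace()

# From synchronous code: asyncio.run(inspect_variants(runner, prompt_variants))
```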

t-schn · Sep 30 '24 16:09

@t-schn: Hi, thanks for your help! Using plot_call_trace(), I was able to confirm that the gradients and mutated prompts are generated successfully. However, the mutated prompts don't seem to be evaluated by the beam search: even when I set the depth to a larger value, the fitting log of the main program only shows traces for rounds -1 and 0. Could you assist me in debugging this issue? Is there anything additional I should provide?

HenryLau7 · Oct 06 '24 06:10

@HenryLau7: It might be an issue with the objective function. It could help to run fit_transform and then look at the predicted values, or to set a breakpoint inside the objective function. Can you also share the log?
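
Roughly like this (a sketch: prompt_optimizer, d_train, and the objective's signature here are placeholders based on the tutorials, not exact code):

```python
# Sketch of the suggested checks; `prompt_optimizer`, `d_train`, and the
# objective's signature are assumptions based on the SAMMO tutorials.
y_pred = prompt_optimizer.fit_transform(d_train)  # run the search, keep predictions
print(y_pred)  # compare predicted values against the expected labels

def accuracy(y_true, y_pred):
    breakpoint()  # pause here to inspect what the beam search passes in
    ...  # compute and return the score as before
```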

t-schn · Oct 07 '24 18:10

@t-schn: Hi, thank you for your suggestion! We eventually found that the issue was caused by the ExtractRegex step failing when extracting the generated gradients and mutated prompts. The model outputs often didn't follow the prompt's required format of wrapping each item in <START> and </START>, and instead used <START> and <END> (which might be a GPT-4 preference for such placeholders). After accounting for this, APO is now functioning correctly.
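
For anyone hitting the same issue, the workaround amounts to widening the extraction pattern to accept either closing tag, roughly like this illustrative sketch (not the exact code we patched):

```python
# Illustrative sketch of the workaround: accept both the </START> tag the
# prompt asks for and the <END> tag GPT-4 tends to emit instead.
import re

PATTERN = re.compile(r"<START>(.*?)(?:</START>|<END>)", re.DOTALL)

llm_output = "<START>The prompt relies on explicit yes/no keywords.<END>"
gradients = [m.strip() for m in PATTERN.findall(llm_output)]
print(gradients)  # ['The prompt relies on explicit yes/no keywords.']
```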

I have another question: I would like to reproduce the SAMMO paper's results. Could you kindly provide a reproducible demo, especially one focusing on the construction of the search space (InstructionTuningSearchSpace and BagOfMutators)? Thanks a lot!

HenryLau7 · Oct 11 '24 08:10

@HenryLau7: Great! Yes, older LLMs might not follow instructions as well. We will also integrate support for constrained decoding soon. The examples from the paper are more advanced and show how to combine mutators: https://github.com/microsoft/sammo/tree/main/examples
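
At a high level, those pieces combine roughly as in the following condensed sketch (hypothetical: apart from BagOfMutators and InstructionTuningSearchSpace, which are named in this thread, the imports, path descriptors, and parameters are assumptions to check against the examples directory):

```python
# Condensed, hypothetical sketch of an instruction-tuning setup; verify all
# names and parameters against examples/ for your SAMMO version.
from sammo.mutators import BagOfMutators, InduceInstructions, Paraphrase
from sammo.search import BeamSearch

search_space = InstructionTuningSearchSpace(d_train)  # defined in the paper examples
mutators = BagOfMutators(
    search_space,                                  # where candidate programs start
    InduceInstructions("#instructions", d_train),  # derive new instructions from data
    Paraphrase("#instructions"),                   # reword the targeted node
)
optimizer = BeamSearch(runner, mutators, accuracy, depth=4)
optimizer.fit(d_train)
optimizer.show_report()
```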

t-schn · Oct 11 '24 18:10