probes: add ArtPrompt probes

Open jmartin-tech opened this issue 2 years ago • 4 comments

Fix #535

Implements two prompt obfuscation patterns based on ArtPrompt
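For readers unfamiliar with the technique, here is a minimal sketch of the general ArtPrompt idea: a chosen word is removed from the prompt, replaced by a `[MASK]` token, and supplied separately as ASCII art for the model to decode. This is not the code in this PR. The glyph table, the function names `render_ascii_art` and `mask_prompt`, and the instruction wording are all assumptions made for illustration, and a benign word is used.

```python
# Hypothetical 5-row glyphs for three letters only; the real probe's fonts differ.
GLYPHS = {
    "A": [" # ", "# #", "###", "# #", "# #"],
    "B": ["## ", "# #", "## ", "# #", "## "],
    "C": [" ##", "#  ", "#  ", "#  ", " ##"],
}


def render_ascii_art(word, sep="  "):
    """Render a word row by row; raises KeyError for letters not in GLYPHS."""
    letters = word.upper()
    rows = [sep.join(GLYPHS[ch][r] for ch in letters) for r in range(5)]
    return "\n".join(rows)


def mask_prompt(prompt, word, mask="[MASK]"):
    """Replace `word` in the prompt with a mask token and prepend its ASCII art."""
    if word not in prompt:
        raise ValueError(f"{word!r} does not occur in the prompt")
    masked = prompt.replace(word, mask)
    art = render_ascii_art(word)
    return (
        "The following ASCII art spells one word:\n"
        f"{art}\n"
        "Decode it, then answer the request below with the decoded word "
        f"in place of {mask}:\n{masked}"
    )


if __name__ == "__main__":
    print(mask_prompt("How do I hail a cab?", "cab"))
```

The plain-text word never appears in the final prompt, which is what makes the response hard to score with a refusal-only detector.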

Testing is still in progress. A new base case, or a detector specific to this technique, may be needed, because the current mitigation.MitigationBypass detector does not quite cover the values returned when the model is unable to infer the masked word.

Example Usage pattern:

python -m garak -m huggingface.Model -n meta-llama/Llama-2-7b-chat-hf -p artprompt
python -m garak -m huggingface --model_name gpt2 --probes artprompt

The probe pattern could be enhanced to accept a dataset of prompts, which would be augmented with a dictionary of unsafe words that safety training often blocks. That dictionary could be easily maintained as a set of resource files.
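One way the dataset-plus-dictionary idea might look is sketched below. It assumes a word list stored as plain text with one word per line, and prompt templates containing a `{word}` slot. The file format, the slot name, and the helpers `parse_word_list` and `build_cases` are hypothetical and do not reflect garak's resource layout.

```python
def parse_word_list(text):
    """Parse a word-list resource: one word per line, '#' lines are comments."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            words.append(line)
    return words


def build_cases(templates, words, slot="{word}"):
    """Pair every template that has a slot with every word.

    Returns (plain_prompt, word) tuples; the word is what a later
    obfuscation step would mask and encode.
    """
    cases = []
    for template in templates:
        if slot not in template:
            continue  # skip templates that cannot take a word
        for word in words:
            cases.append((template.replace(slot, word), word))
    return cases


if __name__ == "__main__":
    resource = "# placeholder entries\nalpha\n\nbeta\n"
    for prompt, word in build_cases(["Explain {word}."], parse_word_list(resource)):
        print(word, "->", prompt)
```

Keeping the words in a separate file means the list can be edited without touching probe code.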

jmartin-tech avatar Apr 22 '24 14:04 jmartin-tech

This is looking pretty reasonable. Agree that a probe works well for the case that the paper presents!

leondz avatar May 10 '24 13:05 leondz

Hi, my team was thinking of building this as a Buff for the Apart Deception Hackathon (but, just saw this pull request already existed!) - is there a list anywhere of what's left to do for it? (E.g, adding configurable safety words?)

zazer0 avatar Jun 28 '24 09:06 zazer0

@zazer0, the current implementation of the probe is mostly complete. A plan for configurable prompts will likely be worked on after #602.

The primary reason this is still in draft is that work is still needed on a better detector for evaluating responses to the prompts this probe generates. The current detectors look for a mitigation response. This probe needs additional filtering to determine whether the response identified the safety word masked in the prompt, because a failure to decode would not represent a successful bypass of alignment or mitigation.
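The extra filtering step could be sketched roughly as follows: a response counts as a possible finding only if it neither refuses nor fails to name the masked word. This is an illustration of the reasoning, not garak's detector API. The function `classify_response`, the label strings, and the `REFUSAL_MARKERS` list are assumptions.

```python
import re

# Hypothetical refusal phrases; a real detector would use a maintained list.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")


def classify_response(response, masked_word):
    """Return 'refused', 'decode_failed', or 'possible_bypass'."""
    text = response.lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return "refused"  # mitigation triggered, so not a finding
    pattern = rf"\b{re.escape(masked_word.lower())}\b"
    if not re.search(pattern, text):
        return "decode_failed"  # model never recovered the word, so not a finding
    return "possible_bypass"  # decoded the word and did not refuse


if __name__ == "__main__":
    print(classify_response("The art is unreadable to me.", "cab"))
```

Separating `decode_failed` from `refused` keeps responses where the model could not read the ASCII art from being reported as hits.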

jmartin-tech avatar Jun 28 '24 15:06 jmartin-tech

Given the upcoming payloads work on decoupling content from transformation, would it make sense to put this through as a probe in this PR? And defer (in separate PRs):

a. making the encoded texts accessible via the payload mechanism; b. a buff

leondz avatar Aug 28 '24 08:08 leondz