probes: add ArtPrompt probes
Fix #535
Implements two prompt obfuscation patterns based on ArtPrompt
Testing work is still in progress here a new possible base case or detector specific to technique, may be needed. As current mitigation.MitigationBypass detector does not quite cover the returned values when the model is not able to infer the masked word.
Example Usage pattern:
python -m garak -m huggingface.Model -n meta-llama/Llama-2-7b-chat-hf -p artprompt
python -m garak -m huggingface --model_name gpt2 --probes artprompt
The probe pattern could be enhanced to be provided a dataset of prompts to be augmented with a dictionary of unsafe words often blocked by safety training that can be easily maintained as a set of resource files.
This is looking pretty reasonable. Agree that a probe works well for the case that the paper presents!
Hi, my team was thinking of building this as a Buff for the Apart Deception Hackathon (but, just saw this pull request already existed!) - is there a list anywhere of what's left to do for it? (E.g, adding configurable safety words?)
@zazer0, the current implementation for the probe is mostly complete, a plan for configurable prompts will likely be worked on after #602.
The primary reason this is still in draft is that work is still needed to generate a better detector for evaluating the responses from a prompt offered by this probe. The current detectors look for a mitigation response however this probe would needs additional filtering to determine if the response was able to identify the safety word masked in the prompt as failure to decode would not represent a finding of successful bypass of alignment or mitigation.
Given the upcoming payloads work on decoupling content from transformation, can it make sense to put this through as a probe in this PR? And defer (in separate PRs):
a. making the encoded texts accessible via the payload mechanism; b. a buff