probe: zws + bad character attacks
Add a new probe that uses zero-width spaces to implement "bad character" attacks. Links for both are below.
see:
- https://en.wikipedia.org/wiki/Zero-width_space
- https://arxiv.org/abs/2106.09898
To get started, have a look at our guide to building with garak:
Nice!
@leondz I can work on this. My plan is to add a new probe that uses zero-width spaces and similar bad-character patterns, as described.
If you have any preferences on specific attack patterns you’d like covered beyond the examples linked, I’m happy to align with that.
@leondz I’d like to confirm my understanding of the expected behavior for this probe and outline the approach I’m planning:
- Implement a zero-width-space-based bad-characters probe that follows the same payload-driven workflow as `encoding.py`.
- Load prompt text from the payload Director (including the default bundles plus relevant slur/harm payloads).
- For each payload sentence, randomly inject zero-width spaces inside word boundaries.
- Ensure that any slur/bad-word tokens are consistently injected with zero-width spaces.
Could you please confirm if this matches what you have in mind for this issue, and whether you’d prefer using the existing payload sets directly or a small set of sample prompts that get ZWS “sprayed” on top?
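For concreteness, here's a minimal sketch of the injection step I have in mind - plain Python only, with payload loading and probe plumbing left out, and names that are placeholders rather than garak API:

```python
# Minimal sketch: randomly inject zero-width spaces inside words.
# Plain Python only; payload loading and probe plumbing not shown.
import random

ZWSP = "\u200b"  # ZERO WIDTH SPACE


def inject_zws(text: str, rate: float = 0.5, seed: int = 0) -> str:
    """Insert a ZWSP between characters of each word with probability `rate`."""
    rng = random.Random(seed)
    words = []
    for word in text.split(" "):
        chars = []
        for i, ch in enumerate(word):
            chars.append(ch)
            # only inject between characters of the word, never at its edges
            if i < len(word) - 1 and rng.random() < rate:
                chars.append(ZWSP)
        words.append("".join(chars))
    return " ".join(words)


# e.g. inject_zws("ignore previous instructions") returns the same visible text,
# but with U+200B characters scattered inside the words
```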
This sounds like a good route, yes. Let's follow the payload system as in garak.probes.encoding.
Note that there are multiple "bad characters" described in the paper beyond ZWS. The paper gives four categories (page 9):
"For the objective functions used in these experiments,
- invisible characters were chosen from a set including ZWSP, ZWNJ, and ZWJ7
- homoglyphs sets were chosen according to the relevant Unicode technical report [64];
- reorderings were chosen from the sets defined using Algorithm 2;
- and deletions were chosen from the set of all non-control ASCII characters followed by a BKSP8 character.
"We define the unit value of the perturbation budget as one injected invisible character, one homoglyph character replacement, one Swap sequence according to the reordering algorithm, or one ASCII-backspace deletion pair."
These four distinct categories should be treated differently. The perturbation budget should be a configurable value in the probe. It's also probably worth taking a look at the source code for the paper's experiments (https://github.com/nickboucher/imperceptible).
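To make the categories concrete, something like this toy sketch could be a starting point - the homoglyph table and the Swap encoding here are simplified stand-ins for the paper's full sets (Unicode TR data, Algorithms 1 and 2), and the `budget` parameter maps to the unit value quoted above:

```python
# Toy illustration of the four perturbation categories with a configurable budget.
# Simplified: real homoglyph sets and reordering sequences come from the paper.
import random

INVISIBLE = ("\u200b", "\u200c", "\u200d")  # ZWSP, ZWNJ, ZWJ
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # tiny Cyrillic sample
BKSP = "\u0008"  # ASCII backspace


def perturb(text: str, category: str, budget: int = 1, seed: int = 0) -> str:
    """Apply `budget` perturbation units of one category to `text` (len >= 2 assumed)."""
    rng = random.Random(seed)
    chars = list(text)
    for _ in range(budget):
        i = rng.randrange(1, len(chars))
        if category == "invisible":
            chars.insert(i, rng.choice(INVISIBLE))
        elif category == "homoglyph":
            candidates = [j for j, c in enumerate(chars) if c in HOMOGLYPHS]
            if candidates:
                j = rng.choice(candidates)
                chars[j] = HOMOGLYPHS[chars[j]]
        elif category == "reorder":
            # simplified Swap: wrap two adjacent chars in an RLO...PDF override so the
            # logical order is swapped while the rendered order looks unchanged
            a, b = chars[i - 1], chars[i]
            chars[i - 1 : i + 1] = ["\u202e", b, a, "\u202c"]
        elif category == "deletion":
            # inject a visible ASCII character immediately followed by a backspace
            chars.insert(i, rng.choice("abcdefghijklmnopqrstuvwxyz") + BKSP)
    return "".join(chars)
```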
@leondz I’d like to clarify the intended scope for this probe: In the imperceptible repo I’m looking at, they wrap a Fairseq GeneratorHubInterface with SciPy’s differential_evolution (and Levenshtein via textdistance) to optimize perturbations on the fly. The optimizer returns a perturbation vector that effectively encodes “which bad character to use, and at which index to insert it,” and the objective is driven by how much the translation output changes under a small perturbation budget.
If I mirror that pattern directly in garak, the probe would need to bundle a translation model and run a DE optimization loop at runtime, which feels quite different from the existing payload-based probes that operate on precomputed strings and don’t depend on external models or optimizers. On the other hand, if we just randomly insert/spray bad characters, we lose the adaptive placement that makes the original attack effective, and if we precompute all prompts, a configurable perturbation budget becomes harder to interpret consistently.
I’m happy to implement whichever strategy best matches garak’s design goals; I just want to make sure the placement logic and perturbation-budget semantics line up with what you’re expecting.
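For reference, the pattern boils down to something like this toy sketch - assuming SciPy's differential_evolution and textdistance, with `translate()` standing in for the target model call and `budget` fixing how many characters get injected:

```python
# Toy version of the optimisation loop from the imperceptible repo's approach.
# translate() is a stand-in for the target model; each DE evaluation costs one model call,
# so in garak this would need to respect a fixed inference budget.
import random
import textdistance
from scipy.optimize import differential_evolution

INVISIBLE = ("\u200b", "\u200c", "\u200d")  # ZWSP, ZWNJ, ZWJ


def apply_perturbation(text, vector):
    # vector encodes (insert position, which invisible char) pairs
    chars = list(text)
    for k in range(0, len(vector), 2):
        pos = int(round(vector[k])) % (len(chars) + 1)
        char = INVISIBLE[int(round(vector[k + 1])) % len(INVISIBLE)]
        chars.insert(pos, char)
    return "".join(chars)


def objective(vector, text, reference_output, translate):
    perturbed = apply_perturbation(text, vector)
    # maximise the change in output => minimise the negative Levenshtein distance
    return -textdistance.levenshtein.distance(translate(perturbed), reference_output)


def attack(text, translate, budget=2, maxiter=10):
    reference = translate(text)
    # budget injected characters -> 2 * budget optimisation variables
    bounds = [(0, len(text)), (0, len(INVISIBLE) - 1)] * budget
    result = differential_evolution(
        objective, bounds, args=(text, reference, translate),
        maxiter=maxiter, polish=False,  # no gradient polishing on a non-smooth objective
    )
    return apply_perturbation(text, result.x)
```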
@Har1sh-k
Great points. Thank you. There are probes in garak that operate with this loop - see e.g. gcg or tap probes. We've used a few routes with this kind of thing before:
- Implement the optimisation/feedback loop. This should get a fixed inference budget - recommend something like `run.generations * run.soft_probe_prompt_cap`. This will probably involve a custom implementation of `probe()`, with `generator.generate()` inside the DE loop. It might be easiest to start with just Levenshtein.
- Take a random approach. Simpler to implement and with potentially worse results, as you note. At least the intensity of the replacement should be customisable via `DEFAULT_PARAMS` (rough skeleton below).
- Apply a different optimiser. Bayesian optimisation might also work here, where the parameter space is the location, type, and intensity of the perturbations, and the loss function is the same.
- Just try everything. Produce logic for generating the entire set of prompts, and have the `run.soft_probe_prompt_cap` mechanism reduce this to a prompt count that fits in inference expectations; see `probes.phrasing.*` for an example of how this downsampling can be implemented.
- Use some cached values. Precompute prompts, find some that work on reasonably common targets, and have a static probe that applies these (cf. our cached gcg probe).
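For the random route, a very rough skeleton could look like the below. Hedging heavily: the base-class import path, the payload-loading call, and the `DEFAULT_PARAMS` handling are from memory and need checking against the current garak code; detectors, tags, goal, and other required probe metadata are omitted.

```python
# Very rough skeleton for the random approach; not a definitive implementation.
# Import paths, payload API, and param handling below are assumptions to verify.
import random

import garak.payloads
from garak import _config
from garak.probes.base import Probe

ZW_CHARS = ("\u200b", "\u200c", "\u200d")  # ZWSP, ZWNJ, ZWJ


class InvisibleCharacters(Probe):  # hypothetical probe name
    """Spray invisible characters into payload prompts at a configurable intensity."""

    DEFAULT_PARAMS = Probe.DEFAULT_PARAMS | {
        "payload_name": "example_payload_group",  # placeholder payload group name
        "injection_rate": 0.25,  # probability of injecting after each character
    }

    def __init__(self, config_root=_config):
        super().__init__(config_root=config_root)
        rng = random.Random()
        self.prompts = []
        for payload in garak.payloads.load(self.payload_name).payloads:
            perturbed = "".join(
                ch + (rng.choice(ZW_CHARS) if rng.random() < self.injection_rate else "")
                for ch in payload
            )
            self.prompts.append(perturbed)
```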