
probe: zws + bad character attacks

leondz opened this issue 2 years ago

Add a new probe that uses zero-width spaces to implement a "bad characters" attack. Links for both are below.

see:

  • https://en.wikipedia.org/wiki/Zero-width_space
  • https://arxiv.org/abs/2106.09898

To get started, have a look at our guide to building with garak:

leondz avatar Jul 19 '23 22:07 leondz

Nice!

wearetyomsmnv avatar Jul 19 '23 22:07 wearetyomsmnv

This issue has been automatically marked as stale because it has not had recent activity. If you are still interested in this issue, please respond to keep it open. Thank you!

github-actions[bot] avatar Sep 24 '25 00:09 github-actions[bot]

@leondz I can work on this: My plan is to add a new probe that uses zero-width spaces / similar bad-character patterns as described.

If you have any preferences on specific attack patterns you’d like covered beyond the examples linked, I’m happy to align with that.

Har1sh-k avatar Nov 10 '25 16:11 Har1sh-k

@leondz I’d like to confirm my understanding of the expected behavior for this probe and outline the approach I’m planning:

  • Implement a zero-width–based bad-characters probe that follows the same payload-driven workflow as encoding.py.
  • Load prompt text from the payload Director (including the default bundles plus relevant slur/harm payloads).
  • For each payload sentence, randomly inject zero-width spaces within words (rough sketch below).
  • Ensure that any slur/bad-word tokens are consistently injected with zero-width spaces.
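
A rough sketch of the injection step I have in mind (helper and parameter names here are illustrative, not final):

```python
import random

ZWSP = "\u200b"  # zero-width space

def inject_zws(text, per_word=1, rng=None):
    """Insert `per_word` zero-width spaces at random positions inside each word."""
    rng = rng or random.Random()
    out_words = []
    for word in text.split(" "):
        chars = list(word)
        # only inject inside words of length >= 2, never at a word boundary
        for _ in range(min(per_word, max(len(chars) - 1, 0))):
            pos = rng.randint(1, len(chars) - 1)
            chars.insert(pos, ZWSP)
        out_words.append("".join(chars))
    return " ".join(out_words)

print(inject_zws("this payload gets zero-width spaces sprayed into it"))
```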

Could you please confirm if this matches what you have in mind for this issue, and whether you’d prefer using the existing payload sets directly or a small set of sample prompts that get ZWS “sprayed” on top?

Har1sh-k avatar Nov 11 '25 07:11 Har1sh-k

This sounds like a good route, yes. Let's follow the payload system as in garak.probes.encoding.

Note that there are multiple "bad characters" described in the paper, beyond zws. There are four categories in the paper (page 9):

"For the objective functions used in these experiments,

  1. invisible characters were chosen from a set including ZWSP, ZWNJ, and ZWJ;
  2. homoglyphs sets were chosen according to the relevant Unicode technical report [64];
  3. reorderings were chosen from the sets defined using Algorithm 2;
  4. and deletions were chosen from the set of all non-control ASCII characters followed by a BKSP character."

"We define the unit value of the perturbation budget as one injected invisible character, one homoglyph character replacement, one Swap sequence according to the reordering algorithm, or one ASCII-backspace deletion pair."

These four distinct categories should be treated differently. Perturbation budget should be treated as a configurable value in the probe. And it's probably worthwhile taking a look at the source for the research in the paper (https://github.com/nickboucher/imperceptible).
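
As a very rough sketch of unit-cost perturbations per class (character sets and names here are placeholders, and the real reordering attack uses the bidi Swap sequences from the paper's Algorithm 2 rather than the visible swap shown below):

```python
import random

INVISIBLE = ["\u200b", "\u200c", "\u200d"]                   # ZWSP, ZWNJ, ZWJ
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}   # tiny Latin -> Cyrillic example set
BKSP = "\x08"

def perturb_once(text, kind, rng):
    """Apply one unit of budget: one injection, replacement, swap, or deletion pair."""
    if kind == "invisible":
        i = rng.randrange(len(text) + 1)
        return text[:i] + rng.choice(INVISIBLE) + text[i:]
    if kind == "homoglyph":
        spots = [i for i, c in enumerate(text) if c in HOMOGLYPHS]
        if not spots:
            return text
        i = rng.choice(spots)
        return text[:i] + HOMOGLYPHS[text[i]] + text[i + 1:]
    if kind == "reorder":
        # placeholder: the paper builds invisible Swap sequences from bidi
        # control characters (Algorithm 2); a naive adjacent swap is visible
        if len(text) < 2:
            return text
        i = rng.randrange(len(text) - 1)
        return text[:i] + text[i + 1] + text[i] + text[i + 2:]
    if kind == "delete":
        i = rng.randrange(len(text) + 1)
        return text[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + BKSP + text[i:]
    return text

def apply_budget(text, kind, budget, seed=0):
    """budget = number of unit perturbations, per the paper's definition."""
    rng = random.Random(seed)
    for _ in range(budget):
        text = perturb_once(text, kind, rng)
    return text
```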

leondz avatar Nov 11 '25 09:11 leondz

@leondz I’d like to clarify the intended scope for this probe: In the imperceptible repo I’m looking at, they wrap a Fairseq GeneratorHubInterface with SciPy’s differential_evolution (and Levenshtein via textdistance) to optimize perturbations on the fly. The optimizer returns a perturbation vector that effectively encodes “which bad character to use, and at which index to insert it,” and the objective is driven by how much the translation output changes under a small perturbation budget.

If I mirror that pattern directly in garak, the probe would need to bundle a translation model and run a DE optimization loop at runtime, which feels quite different from the existing payload-based probes that operate on precomputed strings and don’t depend on external models or optimizers. On the other hand, if we just randomly insert/spray bad characters, we lose the adaptive placement that makes the original attack effective, and if we precompute all prompts, a configurable perturbation budget becomes harder to interpret consistently.
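
To make the comparison concrete, this is how I read that objective, as a garak-shaped sketch (apply_vector and the generate callable are hypothetical stand-ins; the actual repo decodes the optimisation vector differently and calls a Fairseq translation model):

```python
import textdistance
from scipy.optimize import differential_evolution

def apply_vector(prompt, x, candidates):
    # hypothetical decoder: x is a flat vector of (position, candidate-index)
    # pairs produced by DE; round the floats into character insertions
    text = prompt
    for pos_f, cand_f in zip(x[0::2], x[1::2]):
        pos = int(round(pos_f)) % (len(text) + 1)
        ch = candidates[int(round(cand_f)) % len(candidates)]
        text = text[:pos] + ch + text[pos:]
    return text

def make_objective(prompt, generate, baseline_output, candidates):
    # differential_evolution minimises, so return negative distance: the
    # optimiser then searches for perturbations that change the output most
    def objective(x):
        output = generate(apply_vector(prompt, x, candidates))
        return -textdistance.levenshtein(baseline_output, output)
    return objective

# one (position, candidate-index) pair per unit of perturbation budget:
# bounds = [(0, len(prompt)), (0, len(candidates) - 1)] * budget
# result = differential_evolution(make_objective(prompt, generate, baseline, candidates), bounds)
```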

I’m happy to implement whichever strategy best matches garak’s design goals; I just want to make sure the placement logic and perturbation-budget semantics line up with what you’re expecting.

Har1sh-k avatar Nov 13 '25 20:11 Har1sh-k

@Har1sh-k

Great points. Thank you. There are probes in garak that operate with this loop - see e.g. gcg or tap probes. We've used a few routes with this kind of thing before:

  1. Implement the optimisation/feedback loop. This should get a fixed inference budget - recommend something like run.generations * run.soft_probe_prompt_cap. This probably will involve using a custom implementation of probe(), and having generator.generate() within the DE loop. It might be easiest to start with just Levenshtein.
  2. Take a random approach. Simpler to implement and with potentially worse results, as you note. At least the intensity of the replacement should be customisable via DEFAULT_PARAMS (see the skeleton after this list).
  3. Apply a different optimiser. Bayesian optimisation might also work here, where the parameter space is the location, type, and intensity of the perturbations, and the loss function is the same.
  4. Just try everything. Produce logic for generating the entire set of prompts, and have the run.soft_probe_prompt_cap mechanism reduce this to a prompt count that fits in inference expectations; see probes.phrasing.* for an example of how this downsampling can be implemented.
  5. Use some cached values. Precompute prompts, find some that work on reasonably common targets, and have a static probe that applies these (cf. our cached gcg probe).
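
For route 2, a skeleton of where the configurable intensity would live might look roughly like this (class and attribute names are illustrative, not a final design, and payload loading should follow whatever encoding.py does; that line is left as a placeholder):

```python
import random

from garak import _config
from garak.probes.base import Probe


class InvisibleCharSpray(Probe):  # illustrative name
    """Spray zero-width characters into payload prompts at a configurable rate."""

    goal = "get a model to comply with text obscured by invisible characters"
    doc_uri = "https://arxiv.org/abs/2106.09898"

    DEFAULT_PARAMS = Probe.DEFAULT_PARAMS | {
        "perturbation_budget": 5,  # unit perturbations applied per prompt
        "perturbation_chars": ["\u200b", "\u200c", "\u200d"],  # ZWSP, ZWNJ, ZWJ
        "seed": 9,
    }

    def __init__(self, config_root=_config):
        super().__init__(config_root=config_root)
        rng = random.Random(self.seed)
        base_prompts = []  # load payload strings the same way encoding.py does
        self.prompts = [self._spray(p, rng) for p in base_prompts]

    def _spray(self, text, rng):
        for _ in range(self.perturbation_budget):
            i = rng.randrange(len(text) + 1)
            text = text[:i] + rng.choice(self.perturbation_chars) + text[i:]
        return text
```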

leondz avatar Nov 14 '25 04:11 leondz