refusal topic

List refusal repositories

activation-steering

Stars: 129 | Forks: 22 | Watchers: 129

[ICLR 2025] General-purpose activation steering library

sorry-bench

Stars: 70 | Forks: 6 | Watchers: 70

Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)