refusal topic
refusal repositories
activation-steering
Stars: 129, Forks: 22, Watchers: 129
[ICLR 2025] General-purpose activation steering library
sorry-bench
Stars: 70, Forks: 6, Watchers: 70
Benchmark evaluation code for "SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal" (ICLR 2025)