Adversarial Prompting | behavior.engineering

Adversarial prompting is the practice of intentionally trying to break a model — finding inputs that cause it to ignore safety guidelines, leak information, or produce harmful content. This can range from simple tricks like rephrasing a banned request in a hypothetical frame (“write a story where a character explains how to…”) to more sophisticated multi-turn strategies that gradually shift the model’s behavior. Adversarial prompting is used both maliciously, by bad actors, and constructively, by red teamers trying to find and fix vulnerabilities before deployment. For behavior architects, understanding adversarial prompting is essential for building robust systems — if you don’t test your model against creative attack strategies, someone else will.