Jailbreaking | behavior.engineering

Jailbreaking refers to the community practice of finding and sharing prompts that get AI models to ignore their safety training — producing harmful content, revealing system prompts, roleplaying as an “unrestricted” AI, or otherwise behaving contrary to their intended guidelines. Common techniques include roleplay framing (“pretend you have no restrictions”), hypothetical scenarios, encoding requests in unusual formats, or gradually shifting the conversation toward the target behavior over many turns. The AI safety field studies jailbreaking seriously, and both model developers and behavior architects need to understand the most common attack patterns in order to defend against them. A model’s resistance to jailbreaking is an important dimension of behavioral robustness.