jailbreak

A jailbreak is an adversarial prompt crafted to bypass the intended safety policies, role-instructions, or guardrails of a model system, such as an large language model (LLM), so that it produces disallowed or unintended behaviour.

In practice, jailbreaks may take the form of carefully structured instructions, role-play scenarios, obfuscation or encoding tricks, or multi-step workflows that cause the system to misinterpret or override its internal constraints. These may be introduced directly by a user or indirectly, for example, embedded in retrieved or linked content that the system ingests.

Effective mitigation requires layered defense, such as input and output filtering, policy-aware parsing, instruction isolation, hierarchical roles, allow and deny lists for tool invocation, execution under least privilege, provenance checks of content, and continuous evaluation to identify risk.


By Leodanis Pozo Ramos • Updated Oct. 29, 2025