prompt injection
Prompt injection is an attack where adversarial text is crafted to steer a model or model-powered app into ignoring its original instructions and performing unintended actions, such as leaking secrets, executing unsafe steps, or following attacker-supplied goals.
Variants include direct prompt injection, where malicious instructions are entered through the model interface, and indirect prompt injection, where instructions are hidden in retrieved or linked content that the system ingests during workflows like RAG or tool use. Prompt injection exploits the lack of a strict boundary between instructions and data.
Mitigations should adopt a defense-in-depth approach, including:
- Input and output filtering and sanitisation
- Isolating and clearly delineating system and user instructions
- Enforcing least-privilege access for tool integrations and sandboxes
- Implementing allow or deny lists for tool-use
- Verifying provenance or trustworthiness of external content
- Hardening retrieval pipelines
- Monitoring and adversarial testing to detect residual risk
By Leodanis Pozo Ramos • Updated Oct. 24, 2025