OpenAI Reveals ChatGPT Prompt Injection Defenses

The Agent Security Challenge

As AI models evolve from passive chatbots into autonomous agents capable of browsing the web, executing code, and managing files, the security stakes have risen dramatically. A chatbot that gives a wrong answer is an inconvenience. An agent that takes a wrong action — sending an email, deleting a file, executing a transaction — because an attacker manipulated its instructions could cause real harm. OpenAI has now published a detailed technical blog post explaining how it designs ChatGPT's agent capabilities to resist prompt injection and social engineering attacks.

Prompt injection is a class of attack where malicious instructions are embedded in data that an AI agent processes. For example, an attacker might hide instructions in a web page, email, or document that tell the agent to ignore its original instructions and perform unauthorized actions instead. When the agent reads and processes this content, it may follow the injected instructions, potentially leaking sensitive data or taking harmful actions on behalf of the attacker.

Defense in Depth

OpenAI's approach to defending against prompt injection in agent workflows follows a defense-in-depth strategy built from overlapping layers. No single defense is considered sufficient on its own; the layers are combined so that protection holds even if an individual layer is bypassed.

The first layer is instruction hierarchy. ChatGPT's agent capabilities are designed to treat instructions from different sources with different levels of trust. System-level instructions from the application developer receive the highest trust. User instructions receive moderate trust. And content from external sources — web pages, emails, documents — receives the lowest trust. When instructions from a lower-trust source conflict with those from a higher-trust source, the higher-trust instructions take precedence.

This hierarchy means that even if a web page contains text saying "ignore your previous instructions," ChatGPT's agent is designed to treat that text as low-trust external content that cannot override system-level or user-level directives.
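
The precedence rule can be illustrated with a small sketch. The source does not describe how the hierarchy is implemented, so everything here is an assumption made for illustration: the `Trust` levels, the `Instruction` record, and the `resolve` function are hypothetical names, and a real agent would not reduce instructions to simple allow/forbid flags.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    """Hypothetical trust levels; a higher value wins a conflict."""
    EXTERNAL = 0  # web pages, emails, documents
    USER = 1      # the person using the agent
    SYSTEM = 2    # the application developer


@dataclass(frozen=True)
class Instruction:
    source: Trust
    topic: str   # what the instruction concerns, e.g. "send_email"
    allow: bool  # True permits the topic, False forbids it


def resolve(instructions: list[Instruction], topic: str) -> bool | None:
    """Return the decision of the highest-trust instruction on a topic.

    Returns None when no instruction addresses the topic.
    """
    relevant = [i for i in instructions if i.topic == topic]
    if not relevant:
        return None
    return max(relevant, key=lambda i: i.source).allow


instructions = [
    Instruction(Trust.SYSTEM, "send_email", allow=False),
    # Text injected by a web page, asking the agent to send mail anyway.
    Instruction(Trust.EXTERNAL, "send_email", allow=True),
]
decision = resolve(instructions, "send_email")
```

In this toy model the system-level instruction decides the outcome, so the injected request to send email is not honored.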

Constraining Risky Actions

The second major defense mechanism involves constraining the actions that agents can take in response to external content. OpenAI categorizes agent actions along a risk spectrum, from low-risk read-only operations like searching the web to high-risk operations like sending emails, making purchases, or modifying files.

High-risk actions require explicit user confirmation before execution, regardless of what instructions the agent has received. This creates a human-in-the-loop checkpoint that prevents automated exploitation even if an attacker successfully injects instructions that the agent's other defenses fail to catch.
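
A minimal sketch of such a confirmation gate follows. The action names, the `ACTION_RISK` table, and the `execute` function are hypothetical and not taken from OpenAI's post; the sketch also assumes that unknown actions should default to the high-risk tier, which is a design choice of this example rather than something the source states.

```python
from enum import Enum
from typing import Callable


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical risk table based on the examples given in the article.
ACTION_RISK = {
    "search_web": Risk.LOW,
    "send_email": Risk.HIGH,
    "make_purchase": Risk.HIGH,
    "modify_file": Risk.HIGH,
}


def execute(action: str, confirm: Callable[[str], bool]) -> str:
    """Run an action, asking the user first when it is high risk.

    `confirm` stands in for a user-facing prompt and returns True
    only when the user approves the action.
    """
    # Assumption: anything not in the table is treated as high risk.
    risk = ACTION_RISK.get(action, Risk.HIGH)
    if risk is Risk.HIGH and not confirm(action):
        return "blocked"
    return "executed"
```

The gate depends only on the action's risk tier and the user's answer, not on the instructions that led to the action, which is what lets it hold even when injected text has slipped past other checks.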

For medium-risk actions, the system applies contextual analysis to determine whether the requested action is consistent with the user's original intent. If an agent is asked only to summarize web pages and one of those pages contains instructions to draft an email, the contextual mismatch triggers additional scrutiny and a request for user confirmation.
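
The mismatch check can be sketched with a static allowlist of the actions each task is expected to need. This is a deliberately crude stand-in: the article does not say how the contextual analysis is performed, and the task names, action names, and `needs_confirmation` function below are all hypothetical.

```python
# Hypothetical mapping from a user's task to the actions it plausibly needs.
TASK_EXPECTED_ACTIONS = {
    "summarize_pages": {"search_web", "read_page"},
    "reply_to_email": {"read_email", "draft_email"},
}


def needs_confirmation(task: str, action: str) -> bool:
    """Return True when an action falls outside what the task implies.

    Unknown tasks have no expected actions, so every action requested
    under them is flagged for confirmation.
    """
    expected = TASK_EXPECTED_ACTIONS.get(task, set())
    return action not in expected


# A page being summarized asks the agent to draft an email.
flagged = needs_confirmation("summarize_pages", "draft_email")
```

Here drafting an email is flagged during a summarization task but not during an email-reply task, mirroring the example in the paragraph above.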
