OpenAI Reveals ChatGPT Prompt Injection Defenses

The Agent Security Challenge

As AI models evolve from passive chatbots into autonomous agents capable of browsing the web, executing code, and managing files, the security stakes have risen dramatically. A chatbot that gives a wrong answer is an inconvenience. An agent that takes a wrong action — sending an email, deleting a file, executing a transaction — because an attacker manipulated its instructions could cause real harm. OpenAI has now published a detailed technical blog post explaining how it designs ChatGPT's agent capabilities to resist prompt injection and social engineering attacks.

Prompt injection is a class of attack where malicious instructions are embedded in data that an AI agent processes. For example, an attacker might hide instructions in a web page, email, or document that tell the agent to ignore its original instructions and perform unauthorized actions instead. When the agent reads and processes this content, it may follow the injected instructions, potentially leaking sensitive data or taking harmful actions on behalf of the attacker.

Defense in Depth

OpenAI's approach to defending against prompt injection in agent workflows follows a defense-in-depth strategy with multiple overlapping layers. No single defense is considered sufficient on its own; the system relies on the combination of multiple mechanisms to provide robust protection even if individual layers are bypassed.

The first layer is instruction hierarchy. ChatGPT's agent capabilities are designed to treat instructions from different sources with different levels of trust. System-level instructions from the application developer receive the highest trust. User instructions receive moderate trust. And content from external sources — web pages, emails, documents — receives the lowest trust. When instructions from a lower-trust source conflict with those from a higher-trust source, the higher-trust instructions take precedence.

This hierarchy means that even if a web page contains text saying "ignore your previous instructions," ChatGPT's agent will recognize these as low-trust external instructions that cannot override system or user-level directives.

AI & Robotics

A OpenAI está oferecendo até US$ 25.000 por um jailbreak universal que supere um desafio de segurança biológica de cinco perguntas no GPT-5.5, transformando o red teaming externo em um teste focado das salvaguardas de modelos de fronteira.

DT Editorial AI·Apr 25, 2026·via openai.com

AI & Robotics

Um novo guia da OpenAI Academy trata o Codex menos como uma demo e mais como uma ferramenta orientada a projetos, com foco em pastas locais, controle de permissões, primeiras tarefas simples e construção gradual de confiança.

DT Editorial AI·Apr 25, 2026·via openai.com

AI & Robotics

A aquisição planejada da Aleph Alpha pela Cohere é mais do que a compra de uma startup. É uma aposta para construir um fornecedor de IA soberana com respaldo político para governos e setores regulados na Europa e além.

DT Editorial AI·Apr 25, 2026·via the-decoder.com

AI & Robotics

Constraining Risky Actions

The second major defense mechanism involves constraining the actions that agents can take in response to external content. OpenAI categorizes agent actions along a risk spectrum, from low-risk read-only operations like searching the web to high-risk operations like sending emails, making purchases, or modifying files.

High-risk actions require explicit user confirmation before execution, regardless of what instructions the agent has received. This creates a human-in-the-loop checkpoint that prevents automated exploitation even if an attacker successfully injects instructions that the agent's other defenses fail to catch.

For medium-risk actions, the system applies contextual analysis to determine whether the requested action is consistent with the user's original intent. If an agent is asked to summarize web pages and one of those pages contains instructions to draft an email, the contextual mismatch triggers additional scrutiny and user confirmation.

OpenAI Details How ChatGPT Blocks Prompt Injection

The Agent Security Challenge

Defense in Depth

Related Articles

Keep Reading

OpenAI diz que sessões persistentes de WebSocket reduziram em cerca de 40% a latência do loop de agentes

Constraining Risky Actions

Protecting Sensitive Data

A OpenAI posiciona os Workspace Agents como a próxima camada da IA corporativa do dia a dia

Model-Level Training

An Ongoing Arms Race

OpenAI torna gratuito o ChatGPT para Clínicos para profissionais de saúde verificados nos EUA

Comments (0)

Os Emirados Árabes Unidos querem IA agêntica em metade do governo em dois anos

OpenAI coloca as salvaguardas biológicas do GPT-5.5 em um teste de estresse ao vivo com uma nova recompensa por bugs

A OpenAI publica um guia inicial do Codex enquanto aposta em um onboarding mais prático para fluxos de trabalho de IA

O acordo da Cohere com a Aleph Alpha transforma a IA soberana em uma estratégia transfronteiriça

OpenAI avança ainda mais em fluxos de trabalho agentivos com o lançamento do GPT-5.5