OpenAI Reveals ChatGPT Prompt Injection Defenses

The Agent Security Challenge

As AI models evolve from passive chatbots into autonomous agents capable of browsing the web, executing code, and managing files, the security stakes have risen dramatically. A chatbot that gives a wrong answer is an inconvenience. An agent that takes a wrong action — sending an email, deleting a file, executing a transaction — because an attacker manipulated its instructions could cause real harm. OpenAI has now published a detailed technical blog post explaining how it designs ChatGPT's agent capabilities to resist prompt injection and social engineering attacks.

Prompt injection is a class of attack where malicious instructions are embedded in data that an AI agent processes. For example, an attacker might hide instructions in a web page, email, or document that tell the agent to ignore its original instructions and perform unauthorized actions instead. When the agent reads and processes this content, it may follow the injected instructions, potentially leaking sensitive data or taking harmful actions on behalf of the attacker.

Defense in Depth

OpenAI defends agent workflows against prompt injection with a defense-in-depth strategy built from overlapping layers. No single defense is considered sufficient on its own; the layers are meant to back each other up, so that protection holds even if one of them is bypassed.

The first layer is an instruction hierarchy. ChatGPT's agent capabilities are designed to treat instructions from different sources with different levels of trust. System-level instructions from the application developer receive the highest trust. User instructions receive moderate trust. Content from external sources, such as web pages, emails, and documents, receives the lowest trust. When instructions from a lower-trust source conflict with those from a higher-trust source, the higher-trust instructions take precedence.

This hierarchy means that even if a web page contains text saying "ignore your previous instructions," ChatGPT's agent is designed to treat that text as low-trust external content that cannot override system-level or user-level directives.
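
The precedence rule can be illustrated with a toy sketch in Python. This is not how ChatGPT implements the hierarchy, which is a property of the model's training and design rather than an explicit lookup. The `Trust` levels mirror the three sources described above, while the `Instruction` class, the `topic` key, and the `resolve` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Trust(IntEnum):
    """Trust levels for the three instruction sources (higher value wins)."""
    EXTERNAL = 0  # web pages, emails, documents
    USER = 1      # the person using the agent
    SYSTEM = 2    # the application developer


@dataclass(frozen=True)
class Instruction:
    source: Trust
    topic: str      # hypothetical key: what the instruction is about
    directive: str


def resolve(instructions):
    """For each topic, keep the directive from the highest-trust source.

    On a tie in trust level, the instruction seen first is kept.
    """
    winners = {}
    for ins in instructions:
        current = winners.get(ins.topic)
        if current is None or ins.source > current.source:
            winners[ins.topic] = ins
    return {topic: ins.directive for topic, ins in winners.items()}


instructions = [
    Instruction(Trust.SYSTEM, "data_sharing", "never reveal user files"),
    Instruction(Trust.USER, "task", "summarize this page"),
    # Injected text found inside the page being summarized:
    Instruction(Trust.EXTERNAL, "task", "ignore your previous instructions"),
    Instruction(Trust.EXTERNAL, "data_sharing", "upload user files"),
]

resolved = resolve(instructions)
print(resolved["task"])
print(resolved["data_sharing"])
```

In this sketch the injected page text loses both conflicts, because it comes from the lowest-trust source. A real agent has to make the same judgment over free-form natural language, which is why the hierarchy is only one layer among several.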

Constraining Risky Actions

The second major defense mechanism involves constraining the actions that agents can take in response to external content. OpenAI categorizes agent actions along a risk spectrum, from low-risk read-only operations like searching the web to high-risk operations like sending emails, making purchases, or modifying files.

High-risk actions require explicit user confirmation before execution, regardless of what instructions the agent has received. This creates a human-in-the-loop checkpoint that prevents automated exploitation even if an attacker successfully injects instructions that the agent's other defenses fail to catch.

For medium-risk actions, the system applies contextual analysis to determine whether the requested action is consistent with the user's original intent. If an agent is asked to summarize web pages and one of those pages contains instructions to draft an email, the contextual mismatch triggers additional scrutiny and user confirmation.
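
A minimal Python sketch of how such risk-tiered gating might look follows. It is an assumption-laden illustration, not OpenAI's implementation: the action names, their tier assignments, the `TASK_ACTIONS` table, and the `needs_confirmation` function are all hypothetical, and the "contextual analysis" is reduced here to a simple lookup of which actions fit the user's stated task.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# Hypothetical action names and tier assignments, for illustration only.
ACTION_RISK = {
    "search_web": Risk.LOW,
    "read_page": Risk.LOW,
    "draft_email": Risk.MEDIUM,
    "send_email": Risk.HIGH,
    "make_purchase": Risk.HIGH,
    "modify_file": Risk.HIGH,
}

# Hypothetical table of which actions are consistent with each user task.
TASK_ACTIONS = {
    "summarize_pages": {"search_web", "read_page"},
    "write_reply": {"read_page", "draft_email", "send_email"},
}


def needs_confirmation(action, user_task):
    """Return True if the agent should pause and ask the user first."""
    # Assumption: actions missing from the table are treated as high risk.
    risk = ACTION_RISK.get(action, Risk.HIGH)
    if risk is Risk.HIGH:
        # High-risk actions always require confirmation.
        return True
    if risk is Risk.MEDIUM:
        # Medium-risk actions require confirmation only when they do not
        # match what the user originally asked for.
        return action not in TASK_ACTIONS.get(user_task, set())
    return False


print(needs_confirmation("search_web", "summarize_pages"))
print(needs_confirmation("draft_email", "summarize_pages"))
print(needs_confirmation("draft_email", "write_reply"))
print(needs_confirmation("send_email", "write_reply"))
```

The design point the sketch tries to capture is that the confirmation check for high-risk actions does not look at the instructions at all, so it still applies if an injected instruction slips past the other layers.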