Code, Not Just Language, Is Emerging as the Core Substrate for AI Agents

A new review paper from researchers at Meta, Stanford, and the University of Illinois Urbana-Champaign makes a direct argument about how modern AI agents actually work: code is no longer just an output format, but a central medium through which agents reason, act, and coordinate. The paper, as described by The Decoder, shifts attention away from the language model alone and toward the surrounding software system that turns a stateless model into an operating agent.

The authors call that surrounding layer the “harness.” It includes tools, interfaces, sandboxed execution environments, memory, permission boundaries, testing infrastructure, execution loops, and feedback channels. Their point is straightforward: without that scaffolding, a model remains a generator of responses. With it, the model can iteratively plan, execute, inspect results, and continue working over longer task horizons.
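To make the idea concrete, here is a minimal sketch of such a loop. It is not taken from the paper: the `Harness` class, the `scripted_model` stand-in, and the `add` tool are all hypothetical names chosen for illustration, and a real system would call an actual language model and run tools inside a sandbox.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (tool name, argument, result)


@dataclass
class Harness:
    """Hypothetical minimal harness: a tool registry plus an execution loop."""
    tools: Dict[str, Callable[[str], str]]
    max_steps: int = 5
    history: List[Step] = field(default_factory=list)

    def run(self, model: Callable[[List[Step]], Tuple[str, str]]) -> str:
        for _ in range(self.max_steps):
            # The model sees the history so far and proposes the next action.
            tool_name, argument = model(self.history)
            if tool_name == "finish":
                return argument
            if tool_name not in self.tools:
                result = f"error: unknown tool {tool_name!r}"
            else:
                result = self.tools[tool_name](argument)
            # Feedback channel: the result is recorded and visible next turn.
            self.history.append((tool_name, argument, result))
        return "stopped: step limit reached"


def scripted_model(history: List[Step]) -> Tuple[str, str]:
    """Stand-in for a language model: a fixed policy, for illustration only."""
    if not history:
        return ("add", "2 3")
    # Report the most recent tool result as the final answer.
    return ("finish", history[-1][2])


def add_tool(argument: str) -> str:
    a, b = argument.split()
    return str(int(a) + int(b))


harness = Harness(tools={"add": add_tool})
answer = harness.run(scripted_model)
```

Without the loop, the scripted model can only emit a single proposal. With it, the proposal is executed, the result is fed back, and the step limit bounds how long the agent may keep going.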

Why the Harness Matters

The review frames long-running agent systems as a combination of three parts. First are the model’s native capabilities, such as planning and reasoning. Second is the infrastructure provided around the model. Third is the code the agent writes or uses while working, including scripts, helper tools, tests, workflows, and reusable skills. In that framing, the bottleneck for more capable agents may increasingly be the reliability and transparency of the software environment rather than the model in isolation.

The authors argue that code has several properties that make it especially useful for agent behavior. It is executable, which means outputs can be turned into operations whose results can be checked. It is traceable, because intermediate steps can be recorded as structured artifacts. And it is persistent, allowing agents to store progress in a form they can pick up again across multiple steps.
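A small sketch can show how these three properties fit together. The `run_steps` function, the step names, and the JSON trace file below are assumptions made for illustration, not anything described in the review.

```python
import json
import os
import tempfile


def run_steps(steps, state_path):
    """Run named steps in order, skipping any already recorded in state_path."""
    # Persistent: load progress saved by an earlier run, if there is any.
    if os.path.exists(state_path):
        with open(state_path) as f:
            trace = json.load(f)
    else:
        trace = []
    done = {entry["step"] for entry in trace}
    for name, func in steps:
        if name in done:
            continue
        # Executable: the step is an operation whose result can be checked.
        result = func()
        # Traceable: each intermediate result is kept as a structured record.
        trace.append({"step": name, "result": result})
        with open(state_path, "w") as f:
            json.dump(trace, f)
    return trace


# Hypothetical two-step task, interrupted after the first step and resumed.
state_path = os.path.join(tempfile.mkdtemp(), "trace.json")
steps = [("parse", lambda: 3), ("double", lambda: 6)]
first_run = run_steps(steps[:1], state_path)
second_run = run_steps(steps, state_path)
```

On the second call, the "parse" step is not executed again because its record is already in the trace file, so the run resumes from where the first one stopped.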

That view helps explain why current commercial systems increasingly blur the line between model and software runtime. The Decoder notes that systems such as Claude Code and OpenAI’s Codex already operate on this principle, relying on tool use and controlled execution rather than treating model responses as a final endpoint.

Execution Brings New Risks

The paper does not present the harness as a simple solution. The authors also warn that current software tests can create a false sense of confidence. Incomplete or narrow test suites may allow systems to appear trustworthy while masking failure modes, especially when agents are producing or modifying code as they go.
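A toy illustration of the problem, using a deliberately flawed `median` function and two hypothetical test suites (none of this comes from the paper):

```python
def median(values):
    """Flawed on purpose: only correct for odd-length inputs."""
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


def narrow_suite():
    # Only an odd-length case is exercised, so the flaw stays hidden.
    return median([3, 1, 2]) == 2


def broader_suite():
    # An even-length case exposes the bug: the true median of 1..4 is 2.5.
    return median([3, 1, 2]) == 2 and median([1, 2, 3, 4]) == 2.5


narrow_passed = narrow_suite()
broader_passed = broader_suite()
```

The narrow suite passes and the broader one fails on the same code. An agent judged only by the narrow suite would look correct while shipping a bug.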

That concern matters because tests and execution traces are often treated as objective signals of success. The review argues that what is needed is more transparent evaluation mechanisms, not just more automation. In practice, that means scrutiny of what the agent was allowed to do, what it actually did, what evidence was collected, and which kinds of failures may have been ignored.
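One way to picture that kind of scrutiny is an audit record that keeps permitted, performed, and refused actions side by side. The `AuditLog` class and the action names below are hypothetical, a sketch under simple assumptions rather than a mechanism the review prescribes.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class AuditLog:
    """Hypothetical record of what an agent may do versus what it did."""
    allowed: Set[str]
    performed: List[str] = field(default_factory=list)
    denied: List[str] = field(default_factory=list)

    def request(self, action: str) -> bool:
        # Every request is recorded, whether or not it is permitted.
        if action in self.allowed:
            self.performed.append(action)
            return True
        self.denied.append(action)
        return False

    def unused_permissions(self) -> Set[str]:
        # Permissions granted but never exercised, e.g. tests that never ran.
        return self.allowed - set(self.performed)


log = AuditLog(allowed={"read_file", "run_tests"})
read_ok = log.request("read_file")
delete_ok = log.request("delete_repo")
```

Here the log shows that the agent read a file, was refused a destructive action, and never ran the tests it was permitted to run. That last detail is the kind of gap a bare pass/fail signal would not reveal.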

The broader implication is that AI safety and capability are becoming more tightly linked to engineering discipline. Sandboxes, permissions, logging, test design, and tool boundaries are no longer peripheral implementation details. They are part of the system’s intelligence and part of its risk surface.

A Reframing for the AI Industry

This reframing arrives at a moment when agentic systems are moving from demos into operational products. If the paper’s thesis holds, the next major gains in autonomy may come less from scaling models alone and more from improving the software structures around them. Better tool interfaces, stronger memory systems, clearer permissions, more rigorous test environments, and more faithful audit trails could all matter as much as another jump in model size.

It also suggests that evaluation standards will need to evolve. Measuring an agent only by a benchmark score or a single-turn response misses the role of infrastructure in determining whether the system can complete real tasks safely and reliably. The paper’s emphasis on executable workflows and harness design points toward a more systems-level view of AI performance.

For developers and companies building agents, the message is practical. If code is part of how agents think and act, then the quality of the runtime around the model becomes a first-order product decision. That includes what tools are exposed, how outputs are verified, how memory is stored, and how much operational freedom an agent is granted.

The review does not argue that models no longer matter. Instead, it argues that capability emerges from the interaction between model and environment. In that sense, the harness is not an accessory. It is the mechanism that turns prediction into sustained action.

This article is based on reporting by The Decoder.

Originally published on the-decoder.com