AI Must Finish Tasks to Become a Coworker

From chatbot to “digital colleague”

A new survey paper from Tencent’s Youtu Lab and several Chinese universities makes a blunt argument about the next stage of artificial intelligence: better answers are not enough. If AI systems are going to function as real coworkers, the researchers say, they need to move beyond response generation and reliably complete full tasks inside persistent work environments.

That shift, described in a June 28 report from The Decoder, reframes one of the central questions in AI development. The issue is no longer just whether a model can produce a more fluent or more accurate reply. It is whether the model can take a user’s intent, interact with tools and files, adapt to unexpected conditions, and keep going until the work is actually done.

In the paper’s terms, the destination is a “digital colleague” rather than a chatbot. That sounds like branding language on first read, but the underlying distinction is practical. A chatbot answers. A coworker executes.

The limit of one-shot intelligence

The survey traces the evolution of large language models through five stages. In the earliest phase, systems mainly generated text quickly by predicting the most likely next token. Their capabilities depended heavily on patterns and information compressed into model parameters. That made them useful for drafting, summarizing, and general question answering, but it also imposed obvious limits.

According to The Decoder’s summary of the paper, those systems typically did not search broadly for solutions, validate intermediate steps, or maintain a durable sense of state while solving problems. They produced outputs in one pass, and that meant their reliability often collapsed when a task required multiple dependent actions or verification across time.

The researchers describe a later “thinking LLM” stage in which models spend more compute during inference to explore solution paths, check intermediate reasoning, and correct mistakes. The report links that phase to systems such as OpenAI’s o1 and DeepSeek-R1, which are framed as moving from fast, intuitive behavior toward slower, more deliberate reasoning.

That change matters, but the paper argues it is still not sufficient. Better reasoning improves the quality of an answer. It does not automatically create a dependable agent that can operate inside a real workflow.

Figure: The paper traces the evolution of large language models through five stages, from basic chatbot to autonomous digital colleague. The illustration shows a mountain path climbing from chatbot through thinking LLM, agent, and OpenClaw to a summit labeled Next Paradigm for human-AI partnership. | Image: Tencent Youtu Lab

Why agents still break

The survey identifies four structural weaknesses in first-generation AI agents. As summarized by The Decoder, those agents perceive their environment only in fragments, fail to preserve lasting state across tool calls, break when something unexpected happens, and often do not finish tasks.

Those problems are familiar to anyone who has tried to use an LLM as an autonomous assistant for coding, research, file operations, or administrative work. A model may invoke an API, open a browser, or write code, yet still stall because it loses track of what changed, cannot recover from a small error, or lacks a stable workspace in which prior actions remain available.

The paper’s answer is environmental as much as cognitive. It points toward persistent, secure workspaces where files, sessions, logs, permissions, browser state, and reusable skills remain available across the entire task. In that setup, the model is not merely producing isolated tool calls. It is operating within a continuity of context.
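To make the idea of continuity concrete, here is a minimal Python sketch of a workspace whose state survives between separate tool calls. It is an illustration under stated assumptions, not the design described in the survey: the `Workspace` class, the `state.json` file name, and the `record` and `load` methods are all hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class Workspace:
    """Hypothetical persistent workspace: state lives on disk, not in the model's context."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.state_file = self.root / "state.json"  # assumed file name

    def load(self) -> dict:
        # Return the saved state, or a fresh one if nothing has been recorded yet.
        if self.state_file.exists():
            return json.loads(self.state_file.read_text())
        return {"log": [], "files": []}

    def record(self, action: str, created_file: Optional[str] = None) -> None:
        # Append the action to the log and write the state back to disk immediately.
        state = self.load()
        state["log"].append(action)
        if created_file is not None:
            (self.root / created_file).write_text("")
            state["files"].append(created_file)
        self.state_file.write_text(json.dumps(state, indent=2))


root = Path(tempfile.mkdtemp())

# First tool call: do some work and record it.
Workspace(root).record("created report draft", created_file="report.md")

# A later, separate tool call opens the same workspace and still sees the history.
later = Workspace(root).load()
print(later["log"])
```

The second `Workspace` object shares nothing in memory with the first. Everything it knows comes from the files left behind, which is the property a persistent environment is meant to provide.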

The role of reusable skills

One of the paper’s strongest ideas is that progress toward dependable AI coworkers hinges on reusable “skills.” The Decoder presents this as a core requirement for turning intent into finished work. Skills, in this framing, are not vague talents but portable task procedures the system can apply repeatedly inside a stable environment.

That emphasis is notable because it shifts the benchmark for AI usefulness. The industry has often rewarded impressive single-turn performance: a better summary, a sharper answer, a more polished block of code. The survey argues that real utility lies elsewhere. The valuable system is the one that can execute a sequence of actions again and again with enough consistency to be trusted.

Persistent environments make those skills possible. If files, logs, permissions, and task context disappear after each action, the model has to reconstruct the world over and over. If that state persists, the system can build routines, verify outcomes, and recover from failure without restarting from scratch.
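One way to picture a reusable skill is as a named procedure that reads and returns shared task state, so it can be chained and rerun. The Python sketch below assumes that interpretation. The `SKILLS` registry, the `skill` decorator, and the two toy skills are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a skill name to a procedure over shared state.
SKILLS: Dict[str, Callable[[dict], dict]] = {}


def skill(name: str):
    """Register a function as a named, reusable skill."""
    def register(fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
        SKILLS[name] = fn
        return fn
    return register


@skill("dedupe_lines")
def dedupe_lines(state: dict) -> dict:
    # Drop duplicate lines while keeping first-seen order.
    seen: List[str] = []
    for line in state["lines"]:
        if line not in seen:
            seen.append(line)
    return {**state, "lines": seen}


@skill("sort_lines")
def sort_lines(state: dict) -> dict:
    # Return a new state with the lines in sorted order.
    return {**state, "lines": sorted(state["lines"])}


def run(names: List[str], state: dict) -> dict:
    # Apply the named skills in order, passing the state from one to the next.
    for name in names:
        state = SKILLS[name](state)
    return state


result = run(["dedupe_lines", "sort_lines"], {"lines": ["b", "a", "b"]})
print(result["lines"])
```

Because each skill returns a new state instead of mutating hidden context, the same sequence can be rerun on the same input and checked for the same result.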

A change in how AI progress is measured

The report suggests that the emphasis on persistent environments and reusable skills marks a deeper change in AI evaluation. Under the older chatbot paradigm, progress could be measured by answer quality: fluency, factuality, coding accuracy, or benchmark scores on discrete problems. Under the “digital colleague” paradigm, success has to be measured against completed tasks.

Figure: Thinking LLMs invest extra compute at inference time, exploring solution paths, verifying intermediate steps, and correcting errors before the final answer. The diagram shows an input, a reasoning core with a branching thought tree, error detection, and backtracking, and a structured chain-of-thought output. | Image: Tencent Youtu Lab

That is a harder standard. Finished work requires the model to understand the goal, choose tools, maintain state, detect errors, verify outputs, and stop only when completion criteria are met. It also requires some degree of robustness in messy, real-world conditions where the environment can change under the model’s feet.
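The requirement to stop only when completion criteria are met can be sketched as a loop that acts, records errors, and checks an explicit "done" test after every step. The Python example below is a toy under stated assumptions: `Task`, `run_until_done`, and the simulated `flaky_append` tool are hypothetical names, and a real agent would call a model and real tools where this one calls a stub.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Task:
    goal: str
    is_done: Callable[[dict], bool]  # explicit completion criterion
    max_steps: int = 10              # budget so the loop cannot run forever


def run_until_done(task: Task, act: Callable[[dict], dict], state: dict) -> Tuple[bool, int, dict]:
    """Repeat act-then-verify until the criterion holds or the budget runs out."""
    for step in range(1, task.max_steps + 1):
        try:
            state = act(state)
        except RuntimeError as err:
            # Recover from a tool failure: record the error and keep going.
            state = {**state, "errors": state.get("errors", []) + [str(err)]}
            continue
        if task.is_done(state):
            return True, step, state
    return False, task.max_steps, state


def flaky_append(state: dict) -> dict:
    # Simulated tool: fails on its first call, then adds one row per call.
    if not state.get("errors"):
        raise RuntimeError("tool unavailable")
    return {**state, "items": state.get("items", []) + ["row"]}


task = Task(goal="collect three rows", is_done=lambda s: len(s.get("items", [])) >= 3)
finished, steps, final = run_until_done(task, flaky_append, {})
print(finished, steps, final["errors"])
```

Here the first step fails, the loop records the error and continues, and the task finishes on the fourth step once three rows exist. A single-pass system would have stopped at the first failure.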

The survey reportedly cites systems such as OpenHands and SWE-agent as examples connected to this newer era, which The Decoder labels the “OpenClaw era.” The naming matters less than the architectural point: AI systems become more capable when they are embedded in environments designed for sustained execution rather than isolated text generation.

Why this matters now

The paper lands in the middle of a broader industry transition. AI companies continue to improve models’ reasoning and multimodal capabilities, but enterprise buyers and technical teams increasingly care about labor substitution in narrow workflows: can the system resolve a ticket, update a codebase, gather documents, run checks, and produce a verifiable result without constant supervision?

The survey’s answer is that this will not happen simply by scaling the same chatbot pattern. More intelligence at the point of answering helps, but it does not eliminate the need for persistent context, durable state, tool grounding, and reusable execution patterns.

That position also clarifies why some demos feel more impressive than the products they are meant to represent. A model can look highly capable when shown solving a single polished prompt. It becomes much less convincing when asked to navigate an entire work process with interruptions, ambiguity, and the need for verification.

The practical takeaway

The most useful contribution of the survey may be conceptual discipline. It gives language to a problem many users already see: AI often performs as a brilliant respondent and an unreliable finisher. By separating answer generation from task completion, the paper points developers toward the infrastructure and product design choices that matter for closing that gap.

If the researchers are right, the next major leap in AI will not be defined only by smarter models. It will be defined by systems that can persist, act, remember, and verify long enough to turn instructions into completed work. In other words, the future coworker will need to do more than talk like one.

This article is based on reporting by The Decoder.

Originally published on the-decoder.com