Benchmark wins are colliding with deployment reality
One of the more influential ideas in agentic AI over the past year has been the rise of “skills”: reusable text files that package workflows, best practices, API instructions, and domain-specific guidance so an agent can call on them while completing a task. In controlled evaluations, that idea has looked powerful. A new study highlighted by The Decoder argues the effect is much weaker under realistic conditions.
The researchers, from UC Santa Barbara, MIT CSAIL, and the MIT-IBM Watson AI Lab, tested 34,198 real-world skills gathered from open-source repositories. Their conclusion is blunt: the gains from skills are fragile. When conditions become less curated and closer to practical deployment, performance improvements shrink sharply, and in the hardest settings they barely exceed the no-skill baseline.
That matters because skills have become a central promise in the pitch for AI agents. If a general model can dynamically pull in a relevant playbook at the right moment, advocates argue, it should behave more like a specialist without retraining. The new work does not dismiss that vision outright. It does, however, suggest that much of the confidence around it may rest on benchmarks that make the retrieval problem unrealistically easy.
Why earlier benchmarks may have overstated the upside
The study takes aim at SKILLSBENCH, an existing benchmark that supplies agents with hand-curated, task-specific skills. According to the researchers, those materials often contain instructions so closely matched to the assignment that they amount to a near-solution rather than a realistic resource. In one cited example involving flood-day identification at USGS gauging stations, the provided skills reportedly included the exact API, the threshold source, and ready-made code patterns needed to solve the task.
That setup is useful for measuring whether a model can follow instructions. It is much less useful for measuring whether an agent can navigate a messy repository of heterogeneous skills, decide which ones matter, adapt them to the task, and ignore the ones that do not. Real systems rarely receive a perfectly staged bundle of three ideal documents. They must search through noise, ambiguity, and partial overlap.
The distinction is not academic. If an agent only looks strong when it is effectively handed a tailored recipe, then benchmark success says little about how it will behave in an enterprise codebase, an open-source tooling environment, or a general productivity workflow where relevant instructions may be incomplete, poorly named, or missing altogether.
A larger and noisier test set changes the picture
To examine that gap, the researchers assembled a large corpus of permissively licensed skills from skillhub.club and skills.sh, deduplicated them, and tested models in conditions closer to practical use. Instead of presenting the right instructions directly, the evaluation forced agents to identify and use skills from a much broader collection.
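The paper's exact preprocessing pipeline is not detailed in the reporting, but deduplicating a scraped skill corpus is conceptually simple. A minimal sketch, assuming skills arrive as name/body records and using content hashing after whitespace normalization (all field names and example skills below are invented for illustration), might look like this:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially reformatted
    # copies of the same skill hash to the same value.
    return " ".join(text.lower().split())

def deduplicate(skills: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct skill body.

    `skills` is assumed to be a list of {"name": ..., "body": ...}
    records scraped from public repositories (hypothetical schema).
    """
    seen: set[str] = set()
    unique = []
    for skill in skills:
        digest = hashlib.sha256(normalize(skill["body"]).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(skill)
    return unique

corpus = [
    {"name": "flood-api", "body": "Query the gauge API daily."},
    {"name": "flood-api-copy", "body": "Query  the gauge API daily. "},
    {"name": "csv-cleanup", "body": "Strip malformed rows first."},
]
print(len(deduplicate(corpus)))  # near-identical copies collapse: 2 skills remain
```

Real pipelines would likely go further (near-duplicate detection via shingling or embeddings rather than exact hashes), but even this exact-match pass removes the reposted copies that proliferate across public skill repositories.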
That change appears to have done most of the work. Under those more realistic constraints, the gains that looked substantial in benchmark-friendly settings largely evaporated. The study also reports an important secondary effect: weaker models can perform worse with skills. That suggests retrieval and application are not free benefits. They add another layer of reasoning burden, and models that are already brittle may be distracted rather than helped by extra instructions.
In practical terms, that means skills may improve outcomes only when several separate systems are already functioning well at once:
- The agent must recognize that a skill is needed.
- It must retrieve the right one from a noisy pool.
- It must understand which parts are relevant.
- It must adapt those instructions to the specific task at hand.
- It must avoid being misled by irrelevant or overly general guidance.
If any one of those steps fails, the expected benefit can collapse.
What this changes for the agent stack
The study does not show that skills are useless. It shows that the hardest part of the problem may not be writing modular instructions, but orchestrating reliable discovery and application. That shifts attention away from the idea of skills as a standalone capability and toward the surrounding infrastructure: indexing, ranking, retrieval quality, task decomposition, and evaluation design.
It also complicates the competitive narrative that has built up around agent platforms. Anthropic introduced skills for Claude Code in late 2025, and the concept quickly spread to OpenAI’s Codex ecosystem and a range of open-source projects. The rapid adoption made skills seem like an emerging standard for agent extensibility. This study suggests the standard may still be immature, especially if its strongest public evidence came from favorable test setups.
For teams deploying agents, the takeaway is more operational than philosophical. Skills may still be valuable in tightly scoped environments with clean repositories and consistent naming. They may also work well when a human curates a smaller set around a specific workflow. But the results imply that throwing tens of thousands of skills at an agent and expecting robust self-serve specialization is not yet a solved problem.
That is a meaningful correction for a field that often treats modularity as an automatic upgrade. In agent systems, modularity only helps if the model can navigate it. The newest evidence suggests that benchmark-friendly promise and production-grade usefulness are still far apart.
This article is based on reporting by The Decoder.
Originally published on the-decoder.com




