
AI agent “skills” show limited gains once testing mirrors real-world use
A large study of more than 34,000 real-world agent skills suggests the modular instructions praised in benchmark settings deliver far smaller gains when models must find and apply them on their own.
- Researchers tested 34,198 real-world skills from open-source repositories.
- The study argues existing benchmarks overstate gains because they hand agents the exact task-specific instructions up front, rather than requiring the agent to locate a relevant skill itself (a contrast sketched in the example below).
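
To make that contrast concrete, here is a minimal Python sketch of the two evaluation setups the study compares. Everything here is illustrative: the `Skill` dataclass, the `oracle_prompt` and `retrieval_prompt` helpers, and the naive keyword-overlap retriever are assumptions for exposition, not the paper's actual code or method.

```python
# A minimal sketch of the evaluation gap described above (illustrative only).
# "Oracle" benchmarks inject the one skill written for the task directly;
# realistic use forces the agent to retrieve a skill from a large library
# first, so retrieval errors eat into the measured gains.

from dataclasses import dataclass

@dataclass
class Skill:
    name: str
    instructions: str  # modular, task-specific guidance for the agent

def oracle_prompt(task: str, matched_skill: Skill) -> str:
    """Benchmark-style setup: the matching skill is handed to the agent."""
    return f"{matched_skill.instructions}\n\nTask: {task}"

def retrieval_prompt(task: str, library: list[Skill], top_k: int = 1) -> str:
    """Realistic setup: the agent must first find a skill on its own.
    A naive keyword-overlap score stands in for whatever retriever a real
    agent system would use (an assumption for illustration)."""
    def overlap(skill: Skill) -> int:
        task_words = set(task.lower().split())
        return len(task_words & set(skill.instructions.lower().split()))

    best = sorted(library, key=overlap, reverse=True)[:top_k]
    chosen = "\n\n".join(s.instructions for s in best)
    return f"{chosen}\n\nTask: {task}"
```

Any real retriever would be stronger than this keyword stand-in, but the structural point the study makes holds either way: with tens of thousands of candidate skills, measured gains now include the cost of retrieval mistakes, which an oracle-injection benchmark never pays.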
