benchmarks Articles | Developments Today

Agent skills look great in benchmarks but fall apart under realistic conditions, researchers find

AI Agent “Skills” Show Limited Gains When Tests Look More Like the Real World

A large study of more than 34,000 real-world agent skills suggests that the modular instructions praised in benchmark settings deliver far smaller benefits when models must find and apply them on their own.

Key Takeaways

Researchers tested 34,198 real-world skills drawn from open-source repositories.
The study argues that existing benchmarks overstate gains by handing agents highly task-specific instructions.

DT Editorial AI·Apr 12, 2026·via the-decoder.com

AI & Robotics

AI Models Still Prefer Guessing Over Asking for Help, Benchmark Finds

A new benchmark built to test whether multimodal AI systems know when to ask for more information shows that most current models still guess, hallucinate, or refuse instead of requesting the visual context they need.

Key Takeaways

ProactiveBench tests whether multimodal AI models ask for missing visual information.
Across 22 models, average performance fell from 79.8 percent in standard settings to 17.5 percent on the proactive benchmark.

DT Editorial AI·Apr 12, 2026·via the-decoder.com

AI & Robotics

Google AI Overviews Clears 90% Accuracy in Benchmark, but Scale Turns Errors Into a Major Problem

A new analysis found Google’s AI Overviews answered benchmark questions correctly about nine times out of 10, but the remaining error rate could still translate into a vast volume of wrong answers.

Key Takeaways

A benchmark found Google AI Overviews answered factual questions correctly about 90% of the time.
The reported result improved from an earlier score around 85% after a model update.

DT Editorial AI·Apr 7, 2026·via arstechnica.com

MacBook Neo rivals cloud servers in database workload test - 9to5Mac

News

MacBook Neo Matches Cloud Servers in Database Tests

Apple's new entry-level MacBook Neo with 512GB of unified memory delivered benchmark results comparable to powerful cloud instances on DuckDB database workloads, raising new questions about the economics of cloud computing.

DT Editorial AI·Mar 18, 2026·4 min read·via 9to5mac.com

News

Early Geekbench results for the iPhone 17e's A19 processor reveal notable single-core and multi-core performance improvements over its predecessor.

DT Editorial AI·Mar 7, 2026·3 min read·via 9to5mac.com

News

Anthropic's latest mid-tier model, Sonnet 4.6, debuts with record scores in software engineering and computer use benchmarks, plus a doubled context window of one million tokens. The release becomes the new default for free and pro users.

DT Editorial AI·Feb 17, 2026·5 min read·via techcrunch.com

#benchmarks

iPhone 17e A19 Chip Shows Strong Benchmark Gains

Anthropic Releases Claude Sonnet 4.6 with Record Benchmarks and Million-Token Context