
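AI agent "skills" show limited gains once testing moves closer to the real world
A large-scale study of more than 34,000 real-world agent skills shows that the modular instructions so highly praised in benchmark settings deliver much smaller gains when models have to find and apply them on their own.
- Researchers tested 34,198 real-world skills drawn from open-source repositories.
- The study argues that existing benchmarks overstate the gains by handing models highly task-specific instructions directly.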
All articles tagged "benchmarks"

A new benchmark built to test whether multimodal AI systems know when to ask for more information shows that most current models still guess, hallucinate, or refuse instead of requesting the visual context they need.

A new analysis found Google’s AI Overviews answered benchmark questions correctly about nine times out of 10, but the remaining error rate could still translate into a vast volume of wrong answers.

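Apple's new MacBook Neo with 512GB of unified memory delivers benchmark results on DuckDB database workloads comparable to those of powerful cloud instances, raising new questions about the economics of cloud computing.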
Early Geekbench results for the iPhone 17e's A19 processor reveal notable single-core and multi-core performance improvements over its predecessor.
Anthropic's latest mid-tier model, Sonnet 4.6, debuts with record scores in software engineering and computer use benchmarks, plus a doubled context window of one million tokens. The release becomes the new default for free and pro users.