#benchmarks

All articles tagged with "benchmarks"

Benchmark Finds AI Systems Often Answer Correctly but Cite the Wrong Evidence

A new benchmark called CiteVQA shows that leading AI models frequently answer questions about documents correctly while failing to identify the passage that actually supports the answer, a gap researchers call attribution hallucination.

Key Takeaways

CiteVQA measures both answer correctness and citation correctness in long documents.
A correct answer with a wrong citation receives no credit under the benchmark’s strict metric.

DT Editorial Team·May 25, 2026·via the-decoder.com

Mathematicians Build a Tougher AI Test by Including Problems With No Valid Answer

A new benchmark called SOOHAK pushes frontier AI models beyond standard competition math by testing research-level reasoning and whether systems can recognize when a problem has no solution at all.

Key Takeaways

SOOHAK contains 439 original math tasks, including 99 deliberately unsolvable ones.
Gemini 3 Pro led the challenge set at 30%, with GPT-5 variants at 26%.

DT Editorial Team·May 18, 2026·via the-decoder.com

A new exploit benchmark shows frontier AI models edging deeper into offensive security

Researchers at Carnegie Mellon University say a new benchmark measuring real V8 exploitation found Anthropic's Claude Mythos Preview well ahead of OpenAI's GPT-5.5, though at far higher cost.

Key Takeaways

Carnegie Mellon researchers built a benchmark that measures progress from vulnerability work to full code execution in V8.
Claude Mythos Preview outperformed GPT-5.5 in both assisted and fully autonomous modes.

DT Editorial Team·May 16, 2026·via the-decoder.com

New Benchmark Shows Why Better-Looking AI Video Still Fails at Basic World Logic

A new academic benchmark shifts attention away from cinematic polish and toward whether AI video models can continue scenes in ways that obey physics, social norms, and basic logic.

DT Editorial Team·May 16, 2026·via the-decoder.com

AI & Robotics

A large study of more than 34,000 real-world agent skills suggests the modular instructions praised in benchmark settings deliver far smaller gains when models must find and apply them on their own.

DT Editorial Team·Apr 12, 2026·via the-decoder.com

AI & Robotics

A new benchmark built to test whether multimodal AI systems know when to ask for more information shows that most current models still guess, hallucinate, or refuse instead of requesting the visual context they need.

DT Editorial Team·Apr 12, 2026·via the-decoder.com

AI & Robotics

A new analysis found Google’s AI Overviews answered benchmark questions correctly about nine times out of 10, but the remaining error rate could still translate into a vast volume of wrong answers.

DT Editorial Team·Apr 7, 2026·via arstechnica.com

News

Apple's new entry-level MacBook Neo with 512GB unified memory delivered benchmark results comparable to powerful cloud instances in DuckDB database workloads, raising fresh questions about the economics of cloud computing.

DT Editorial Team·Mar 18, 2026·4 min read·via 9to5mac.com

News

Early Geekbench results for the iPhone 17e's A19 processor reveal notable single-core and multi-core performance improvements over its predecessor.

DT Editorial Team·Mar 7, 2026·3 min read·via 9to5mac.com

News

Anthropic's latest mid-tier model, Sonnet 4.6, debuts with record scores in software engineering and computer use benchmarks, plus a doubled context window of one million tokens. The model becomes the new default for free and pro users.

DT Editorial Team·Feb 17, 2026·5 min read·via techcrunch.com

AI agent “skills” show limited gains once testing looks more like the real world

AI Models Still Prefer Guessing Over Asking for Help, Benchmark Finds

Google AI Overviews Clears 90% Accuracy in Benchmark, but Scale Turns Errors Into a Major Problem

MacBook Neo Matches Cloud Servers in Database Tests

iPhone 17e A19 Chip Shows Strong Benchmark Gains

Anthropic Releases Claude Sonnet 4.6 with Record Benchmarks and Million-Token Context