Good enough in the lab, risky at internet scale
Google’s AI Overviews appears to be getting more accurate, but a new benchmark result underscores how even visible improvement can still leave a large reliability problem. According to the report, a New York Times analysis conducted with the startup Oumi found that AI Overviews answered questions on a factual benchmark correctly about 90% of the time, up from an earlier result in the mid-80% range.
That sounds solid until it is placed in context. Google Search operates at enormous scale, and an error rate of roughly one in ten would still translate into a substantial stream of incorrect answers every day. That is the core tension in the story: a system can be measurably better and still create a serious trust problem when it is used so widely.
The benchmark and the dispute around it
The test cited in the reporting used SimpleQA, a benchmark of short questions with verifiable factual answers. Oumi reportedly began testing while Gemini 2.5 was Google’s top model and measured about 85% accuracy. After the Gemini 3 update, AI Overviews reportedly answered 91% of the questions correctly.
Google disputed the relevance of the result. A company spokesperson said the study had major holes and argued that SimpleQA does not reflect what people are actually searching for on Google. The company’s position, as described in the report, is that its own evaluations rely on a more tightly vetted variant called SimpleQA Verified.
This disagreement is important because benchmark selection shapes the narrative. A narrow factual test can reveal whether a system invents or mishandles concrete details, but it may not capture the full range of real search behavior. At the same time, search products are routinely trusted for exactly these factual lookups, which means a benchmark of short, checkable questions is not irrelevant simply because it is limited.
Why the miss rate matters
The reporting highlights examples where AI Overviews cited sources but still produced the wrong answer or contradicted the very material it referenced. That pattern is especially concerning because it can make errors look more authoritative: users often treat a citation as a signal that an answer is grounded, even when the system has misread the source or confidently picked the wrong answer from conflicting material.
At search-engine scale, the difference between 90% and 99% accuracy is not academic: cutting the error rate from 10% to 1% reduces the expected volume of false claims tenfold. That is why the practical debate around AI search is no longer just whether these systems can be useful. The harder question is whether their failure rate is low enough for the product position they occupy at the top of results pages.
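To make the scale arithmetic concrete, here is a minimal back-of-envelope sketch in Python. The daily answer volume is a purely hypothetical illustrative figure, not a number from the reporting; only the accuracy rates come from the benchmark results cited above.

    # Back-of-envelope: expected incorrect answers per day at a given accuracy.
    # NOTE: daily_answers is a hypothetical illustrative figure, not from the
    # reporting; only the accuracy rates below come from the cited benchmark.
    daily_answers = 1_000_000_000  # hypothetical: 1 billion AI Overview answers/day

    for label, accuracy in [("Gemini 2.5 era (85%)", 0.85),
                            ("Gemini 3 era (91%)", 0.91),
                            ("hypothetical 99%", 0.99)]:
        wrong_per_day = daily_answers * (1 - accuracy)
        print(f"{label}: ~{wrong_per_day:,.0f} incorrect answers per day")

At those assumed volumes, the improvement from 85% to 91% accuracy takes roughly 150 million daily errors down to about 90 million, while even a hypothetical 99% system would still produce about 10 million.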
A product challenge, not just a model challenge
This story is not simply about whether one model improved after an update. It is about how search engines package probabilistic systems as front-door information tools. Traditional search lets users compare sources. AI Overviews compresses that process into a single synthesized answer, which can be efficient when it is right and misleading when it is wrong.
The benchmark result therefore cuts both ways. It gives Google evidence that the product is improving, but it also gives critics a clear argument that the current error rate remains materially significant. Both can be true at once.
For users, the takeaway is straightforward: AI summaries may be increasingly useful, but they are still not reliable enough to be treated as an unquestioned authority. For Google and its competitors, the pressure is broader. They are no longer being judged only on whether generative search works; they are being judged on whether its remaining mistakes are acceptable at the scale of the modern web.
This article is based on reporting by Ars Technica. Read the original article.