Benchmark reveals a blind spot in multimodal AI behavior
A new study centered on a benchmark called ProactiveBench argues that many of today’s multimodal AI systems remain fundamentally reactive when they encounter missing visual information. Instead of asking for clarification, a better angle, or an unobstructed view, most models tested either produced incorrect answers or declined to answer at all. The result points to a practical weakness in systems that are increasingly being positioned as assistants for visual tasks, from image analysis to navigation and robotics.
The benchmark was designed around a simple human intuition: when people cannot see enough to answer a question, they ask for what they need. If an object is blocked, they ask for the obstruction to be moved. If an image is noisy, they ask for a clearer one. The researchers behind ProactiveBench wanted to know whether leading multimodal language models do the same. Their answer, based on 22 tested models, was largely no.
What ProactiveBench measures
ProactiveBench converts seven existing datasets into situations that cannot be solved correctly without additional user input. The tasks include hidden objects, unhelpful viewpoints, noisy images, sketches, temporal ambiguities, and scenarios where a model would need to request camera movement or another perspective. The full benchmark includes more than 108,000 images arranged into 18,000 samples.
To make the test stricter, the benchmark removes tasks a model can already solve on the first attempt. That means passing is not just about recognition accuracy. It requires a model to recognize that it lacks enough information and then explicitly ask for the missing piece. In other words, the benchmark is less about what a model sees and more about whether it knows when it cannot see enough.
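The filter-then-score protocol described above can be illustrated with a minimal toy sketch. Everything here is hypothetical: the function names, the stand-in "model," and the pass criterion are illustrative, not the benchmark's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    question: str
    visible: bool   # whether enough visual information is present
    label: str

# Toy stand-in for a multimodal model: it answers when the object is
# visible; otherwise it either guesses or asks for help, depending on
# a behavior flag.
def model_answer(sample, proactive):
    if sample.visible:
        return sample.label
    return "Could you move the obstruction?" if proactive else "guess"

def is_clarification(answer):
    # Crude proxy: treat a question back to the user as a clarification request.
    return answer.endswith("?")

def build_benchmark(samples, proactive):
    # Strictness filter: drop any sample the model already solves outright,
    # so what remains can only be passed by asking for more information.
    return [s for s in samples if model_answer(s, proactive) != s.label]

def score(samples, proactive):
    # A sample counts as a pass only if the model explicitly asks
    # for the missing visual information.
    bench = build_benchmark(samples, proactive)
    passes = sum(is_clarification(model_answer(s, proactive)) for s in bench)
    return passes / len(bench) if bench else 1.0

samples = [
    Sample("What is this object?", True, "cup"),
    Sample("What is this object?", False, "cup"),
    Sample("What is this object?", False, "key"),
]

print(score(samples, proactive=True))   # asks when it cannot see -> 1.0
print(score(samples, proactive=False))  # guesses instead -> 0.0
```

The toy numbers mirror the study's qualitative finding: a model that guesses scores zero on the filtered set even if its raw recognition is strong, because recognition never gets a chance to apply.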
That distinction matters because many real deployments depend on interaction, not just one-shot prediction. An assistant helping a user identify an item in a cluttered room, a system interpreting a rough sketch, or a machine vision tool working from an unclear image all benefit if the model can request a clearer view rather than improvise.
Performance drops sharply when help is required
According to the reported results, model performance falls dramatically once proactive behavior becomes necessary. In standard reference settings with clearly visible objects, the tested models averaged 79.8 percent accuracy. On ProactiveBench, where success depends on asking for more information, average performance dropped to 17.5 percent.
One of the starkest examples came from a dataset involving occluded objects. There, model accuracy reportedly fell from 98.3 percent in the reference setting to 8.2 percent when the object was hidden and the system needed to ask for help. The underlying recognition capability did not disappear. What failed was the model’s tendency to request the condition needed to use that capability.
The researchers also found that larger models did not consistently do better at this behavior. In at least one comparison described in the report, InternVL3-1B outperformed InternVL3-8B, suggesting that scale alone does not solve the problem. That is an important finding for developers who assume more parameters naturally translate into better judgment under uncertainty.
Why the finding matters
The benchmark highlights a mismatch between how these systems are often described and how they actually behave in edge cases. A helpful assistant is not only one that answers correctly when the input is complete. It is also one that knows when to pause, ask a question, and gather what it needs before acting.
That has implications well beyond benchmarks. In consumer apps, guessing can produce frustrating errors. In workplace settings, it can waste time or create false confidence. In more sensitive contexts, an AI system that hallucinates instead of seeking clarification could create safety or reliability problems. The issue is especially relevant for multimodal systems because visual ambiguity is common in the real world.
Proactive behavior is also a step toward more capable embodied AI systems. A robot or assistant operating in a home, lab, warehouse, or vehicle cannot assume perfect information. It needs a mechanism for recognizing uncertainty and for resolving it through interaction before it proceeds.
A possible path to improvement
The source report indicates that a simple reinforcement learning approach showed signs of improving this behavior, suggesting the failure may be trainable rather than inherent. That matters because it implies model builders may not need entirely new architectures to address the issue. Instead, they may need better objectives, more realistic evaluation environments, and reward structures that value clarification over bluffing.
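One way to picture a reward structure that values clarification over bluffing is a simple shaping function like the sketch below. The constants and the function itself are illustrative assumptions, not the training setup from the paper.

```python
def reward(answered_correctly, asked_clarification, info_was_missing):
    """Hypothetical reward shaping for proactive behavior.

    The ordering of outcomes is the point, not the exact values:
    correct answer > justified question > unnecessary question >
    wrong guess, with 'bluffing' (a confident wrong answer when
    clarification was needed) penalized hardest.
    """
    if asked_clarification:
        # Asking is rewarded only when information is actually missing;
        # unnecessary questions get a small penalty to avoid over-asking.
        return 0.5 if info_was_missing else -0.1
    if answered_correctly:
        return 1.0
    # A wrong direct answer when a question was needed is the worst case.
    return -1.0 if info_was_missing else -0.5
```

Under a signal like this, a policy that bluffs through occluded images collects the largest penalties, while one that asks first can still reach the full reward after the missing information arrives.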
The research does not suggest that current multimodal systems are incapable of asking questions in general. Rather, it shows that they rarely do so when proactive inquiry is the condition for success. That is a narrower but still consequential weakness. Many benchmark suites reward direct answers, and models appear to have been shaped accordingly.
If the goal is to produce systems that collaborate with users, future training may need to reward behaviors such as identifying uncertainty, requesting more context, and sequencing interaction before final output. Those abilities look less like raw perception and more like judgment.
What this says about the next phase of AI evaluation
ProactiveBench also reflects a broader shift in AI testing. Traditional evaluations ask whether a model can produce the right answer from a fixed prompt. Newer benchmarks increasingly ask whether the model can navigate a situation intelligently when the prompt itself is insufficient. That is a harder and arguably more realistic standard.
For users, the study reinforces a practical lesson: a polished answer from a multimodal model does not necessarily mean the model had enough information to answer well. For developers, the lesson is sharper. If systems are going to function as real assistants, they must be trained and measured on their ability to ask before they answer.
Right now, the evidence from ProactiveBench suggests that most do not.
This article is based on reporting by The Decoder.