Beyond Language in Autonomous Driving
For years, researchers have pursued the idea that autonomous vehicles should "think" in human-readable language — planning maneuvers by generating textual descriptions of road scenarios, intentions, and actions. Vision-Language-Action (VLA) models have dominated recent self-driving research by combining visual perception with language-based reasoning to produce driving commands.
Now, a new research paper challenges that assumption head-on. LatentVLA, a latent reasoning architecture for autonomous driving, argues that forcing an AI driver to articulate its reasoning in natural language introduces unnecessary bottlenecks — and that a system which reasons in compressed latent space can outperform language-based planners on critical driving benchmarks.
The Problem With Verbal Reasoning
Current VLA models process camera feeds, interpret the scene using a vision encoder, and then generate language tokens that describe the driving situation before outputting steering and acceleration commands. This chain-of-thought approach mirrors how large language models reason through complex problems — by "thinking aloud" in text.
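The staged pipeline described above can be sketched as a chain of functions. Everything here is illustrative: the function names, the toy 64-dimensional feature, the three-word vocabulary, and the action mapping are invented for exposition, not taken from any real VLA system.

```python
import numpy as np

rng = np.random.default_rng(0)

def vision_encoder(frame):
    # Stand-in for a real vision backbone (e.g. a ViT): image -> scene feature.
    return frame.reshape(-1)[:64]  # toy 64-d feature vector

def scene_to_tokens(feature, vocab=("car_ahead", "slowing", "lane_clear")):
    # The chain-of-thought bottleneck: continuous scene features are
    # quantized into a handful of discrete words before planning.
    idx = int(abs(feature.sum())) % len(vocab)
    return [vocab[idx]]

def tokens_to_action(tokens):
    # Map the verbalized "thought" to [steering, acceleration] commands.
    if "slowing" in tokens:
        return np.array([0.0, -0.5])  # brake
    return np.array([0.0, 0.1])       # maintain speed

frame = rng.standard_normal((3, 64, 64))   # fake camera frame
action = tokens_to_action(scene_to_tokens(vision_encoder(frame)))
```

The point of the sketch is the shape of the computation: every decision must pass through `scene_to_tokens`, so the planner only ever sees what survived verbalization.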
However, driving is fundamentally a continuous, real-time sensorimotor task. A human driver does not narrate every lane change or braking decision in words. Instead, experienced drivers rely on highly compressed, intuitive spatial reasoning that operates far below the level of verbal articulation. LatentVLA takes inspiration from this observation.
The core insight is that language tokens are an inefficient intermediate representation for spatio-temporal reasoning. Words like "the car ahead is slowing down" are lossy compressions of rich perceptual data. By the time the model has verbalized its understanding, critical nuances about relative velocity, trajectory curvature, and collision risk may have been lost or simplified.
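A toy comparison makes the lossiness concrete. The state fields, thresholds, and phrasing below are invented for illustration; the contrast is simply that a verbal summary collapses distinct continuous states into one string, while a latent vector preserves the raw quantities.

```python
import numpy as np

# Invented driving state: lead-car closing speed (m/s), path curvature (1/m),
# and time to collision (s).
state = {"relative_velocity": -3.7,
         "trajectory_curvature": 0.012,
         "time_to_collision": 4.2}

def verbalize(s):
    # Lossy: any negative closing speed yields the same coarse phrase.
    if s["relative_velocity"] < 0:
        return "the car ahead is slowing down"
    return "the road ahead is clear"

def latent(s):
    # Lossless for these fields: the planner sees the numbers themselves.
    return np.array([s["relative_velocity"],
                     s["trajectory_curvature"],
                     s["time_to_collision"]])

phrase = verbalize(state)
z = latent(state)

# A gentle drift (-0.1 m/s) and a hard brake (-10 m/s) verbalize identically,
# but their latent representations remain distinct.
assert verbalize({**state, "relative_velocity": -0.1}) == phrase
assert not np.array_equal(latent({**state, "relative_velocity": -0.1}), z)
```

This is the gap LatentVLA targets: a planner conditioned on `phrase` cannot choose between easing off and emergency braking, while one conditioned on `z` can.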

