In a previous article, we discussed AlpamayoR1 (AR1), an autonomous driving model that integrates a VLM as a reasoning backbone. It relies on a carefully collected chain-of-causation dataset; training on this dataset enables AR1 to “reason” in natural language to solve challenging driving situations.
But what if natural language is not the best medium for reasoning in driving scenarios? After all, when faced with a driving situation that requires an immediate reaction, human drivers generally act reflexively rather than reasoning in language step by step. What is the alternative for driving models?
In this article, we break down the LatentVLA architecture, a compelling counterpoint to language-based approaches: it requires no natural language dataset, performs reasoning in latent space, and uses knowledge distillation to meet real-time constraints.
Latent Action Learning
A large part of AR1’s success resides in the chain-of-causation dataset, the collection of which required industrial-scale efforts, a carefully elaborated labeling pipeline and extensive validation.
In contrast, LatentVLA takes the opposite direction: the authors argue that raw driving data already contains the structure required to train a large model, and that natural language is inherently biased and difficult to align with actions. Moreover, generating natural language reasoning chains is inefficient, since some tokens (e.g. stop words) do not contribute meaningfully to the reasoning process.
Therefore, they introduce a self-supervised framework that predicts ego-centric latent actions in a small latent space. In other words, the model uses unlabelled driving data to infer which action the driver must have taken to produce the observed transition. These latent actions serve as the building blocks for latent-space reasoning.
Representation Learning
To predict latent actions from unlabeled data, the authors use a method reminiscent of LAPO (Learning to Act without Actions) [2]. This approach relies on an encoder-decoder setup where the encoder (also called the “inverse dynamics model”, IDM) uses two subsequent frames to predict a continuous action vector, and the decoder (the “forward dynamics model”, FDM) uses the current frame and the predicted action vector to reconstruct the next frame.
This clever setup forces the learned action representation to describe what action must have been taken to observe the state transitions in our dataset. However, this continuous action representation is still incompatible with the VLMs we intend to use. To discretise it, the authors use a VQ-VAE (Vector-Quantised Variational Auto-Encoder), which maps continuous vectors to the closest discrete vectors in a learned codebook (i.e. a dictionary of discrete actions) in a differentiable way. The quantised action is then used by the FDM to decode the next frame.
By optimising the next-frame reconstruction error, the IDM and FDM are jointly trained to encode a predictive, discrete action representation.
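To make the IDM/FDM/VQ interplay concrete, here is a minimal PyTorch sketch of the training objective. It is illustrative, not the authors' implementation: frames are represented as flat feature vectors and MLPs stand in for the real vision encoders; dimensions and layer sizes are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Maps continuous action vectors to the nearest codebook entry.
    Gradients flow through the lookup via the straight-through estimator."""
    def __init__(self, num_codes=16, dim=32):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        d = torch.cdist(z, self.codebook.weight)   # (B, num_codes) distances
        idx = d.argmin(dim=-1)                     # discrete action ids
        z_q = self.codebook(idx)                   # quantised vectors
        # straight-through: copy gradients from z_q back to z
        z_q_st = z + (z_q - z).detach()
        # standard VQ-VAE codebook + commitment losses
        vq_loss = F.mse_loss(z_q, z.detach()) + F.mse_loss(z, z_q.detach())
        return z_q_st, idx, vq_loss

class LatentActionModel(nn.Module):
    def __init__(self, obs_dim=128, act_dim=32):
        super().__init__()
        # IDM: (o_t, o_{t+1}) -> continuous latent action
        self.idm = nn.Sequential(nn.Linear(2 * obs_dim, 256), nn.ReLU(),
                                 nn.Linear(256, act_dim))
        self.vq = VectorQuantizer(num_codes=16, dim=act_dim)
        # FDM: (o_t, a_t) -> reconstructed o_{t+1}
        self.fdm = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                                 nn.Linear(256, obs_dim))

    def forward(self, o_t, o_next):
        z = self.idm(torch.cat([o_t, o_next], dim=-1))
        z_q, idx, vq_loss = self.vq(z)
        o_pred = self.fdm(torch.cat([o_t, z_q], dim=-1))
        recon = F.mse_loss(o_pred, o_next)   # next-frame reconstruction error
        return recon + vq_loss, idx
```

Minimising this combined loss is what forces the 16-entry codebook to capture actions that are actually predictive of the next frame.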

Distinguishing Ego-Actions from Environmental Noise
Now you might think: “The driver’s actions are not the only factor influencing the next frame when driving; what if a bird flies in front of the camera? Does this pollute the action representation?”. The authors’ answer: yes, unless there is a mechanism that disentangles the impact of the driver’s actions on the future from environmental dynamics.
The elegant solution to this problem is to use a two-stage encoder-decoder setup:
- Conditioned on the ground-truth trajectory, ego-state and previous frame, the encoder predicts a latent action. Since this action is conditioned on vehicle dynamics through the trajectory and ego-state, it only needs to model environmental dynamics to enable the decoder to reconstruct the next frame. This “environmental action” is then quantised and the codebook used to this end is frozen for the next stage.
- Conditioned on the previous frame and the environmental action, the encoder predicts another latent action. Since the environmental dynamics are now known and part of the conditioning, this second latent action is forced to encode ego-centric dynamics. Using a new codebook, it is quantised into a discrete ego-action.
Finally, we feed both actions to the decoder to reconstruct the next frame. This setup ensures a clear separation of ego-actions and environmental dynamics.
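The two-stage conditioning can be sketched as follows. Again a simplified, hypothetical sketch: frames, trajectories and ego-states are flat vectors, quantisation is omitted for brevity, and all dimensions are invented.

```python
import torch
import torch.nn as nn

class TwoStageEncoder(nn.Module):
    """Stage 1 predicts an 'environmental' latent action conditioned on the
    ground-truth trajectory and ego-state; stage 2, given that environmental
    action, is left to encode only ego-centric dynamics. The decoder must use
    both actions to reconstruct the next frame."""
    def __init__(self, obs_dim=128, traj_dim=24, ego_dim=8, act_dim=32):
        super().__init__()
        self.env_enc = nn.Sequential(
            nn.Linear(obs_dim + traj_dim + ego_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim))
        self.ego_enc = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim))
        self.decoder = nn.Sequential(
            nn.Linear(obs_dim + 2 * act_dim, 256), nn.ReLU(),
            nn.Linear(256, obs_dim))

    def forward(self, frame, traj, ego_state):
        # stage 1: vehicle dynamics are given, so this action models the environment
        a_env = self.env_enc(torch.cat([frame, traj, ego_state], dim=-1))
        # stage 2: environment is given, so this action models ego dynamics
        a_ego = self.ego_enc(torch.cat([frame, a_env], dim=-1))
        next_frame = self.decoder(torch.cat([frame, a_env, a_ego], dim=-1))
        return a_env, a_ego, next_frame
```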
VLM Training
Building on the learned action representation, the authors train a Qwen2.5-VL model to predict the same latent actions as the encoder-decoder model. This is achieved by having the encoder predict a trajectory of 12 latent actions for a given input frame and having the VLM optimise its negative log likelihood:
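Denoting the encoder’s discrete action sequence $a_{1:12}$ and the input observation $o$, a standard autoregressive form of this objective would be:

```latex
\mathcal{L}_{\text{VLM}} = -\sum_{t=1}^{12} \log p_\theta\!\left(a_t \mid o,\, a_{<t}\right)
```

where each $a_t$ is one of the 16 latent-action tokens and $\theta$ are the VLM’s parameters. (The exact conditioning in the paper may differ; this is the generic next-token log-likelihood.)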
A striking difference with other approaches employing action codebooks is the number of action tokens used by LatentVLA. Where models like AutoVLA use an action codebook of 2048 special tokens, LatentVLA only uses 16.
This results in:
- A simpler learning task: in a 2048-entry codebook, actions probably represent very precise driving decisions like “steer left at a 16-degree angle”. With only 16 tokens, the model probably adopts higher-level directives like “accelerate slightly” or “take a narrow right turn”, which require fewer demonstrations to learn.
- Preserving the VLM’s pre-training knowledge: it doesn’t have to learn over 2000 “new words”.
Knowledge Distillation
Where AlpamayoR1 relied on efficient tokenisation and flow-matching diffusion to maintain real-time performance, LatentVLA goes for a completely different approach: knowledge distillation. To this end, the authors introduce a fusion module within existing E2E architectures (iPad [4] and Transfuser [5]). The VLM feeds visual and action embeddings to this fusion module, which outputs features in Bird’s-Eye-View (BEV) space. These features serve as keys and values in cross-attention with BEV queries produced by the E2E model, allowing the E2E model to integrate insights from the VLM.
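The cross-attention step can be sketched in a few lines of PyTorch. This is a generic BEV-queries-attend-to-VLM-tokens layer, not the paper’s exact module; dimensions and the residual connection are my assumptions.

```python
import torch
import torch.nn as nn

class BEVFusion(nn.Module):
    """BEV queries from the E2E planner cross-attend to the VLM's visual and
    latent-action embeddings, which act as keys and values."""
    def __init__(self, bev_dim=256, vlm_dim=256, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(bev_dim, n_heads,
                                          kdim=vlm_dim, vdim=vlm_dim,
                                          batch_first=True)
        self.norm = nn.LayerNorm(bev_dim)

    def forward(self, bev_queries, vlm_tokens):
        # bev_queries: (B, N_bev, bev_dim); vlm_tokens: (B, N_vlm, vlm_dim)
        fused, _ = self.attn(bev_queries, vlm_tokens, vlm_tokens)
        # residual keeps the planner's own features intact
        return self.norm(bev_queries + fused)
```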

However, the VLM remains too large to run efficiently at test time. Therefore, a small 50M-parameter decision transformer is trained to imitate the large 3.8B-parameter Qwen2.5-VL teacher. This is achieved by minimising the KL divergence between the teacher and student distributions:
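A generic version of this distillation loss, over the 16 latent-action tokens, might look like the following. The temperature and its squared rescaling are standard knowledge-distillation conventions, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between teacher and student action distributions.

    Logits have shape (B, T, 16): batch, action-sequence length, codebook
    size. The temperature softens both distributions; the T**2 factor
    rescales gradients, as is common in knowledge distillation."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * t**2
```

When student and teacher agree exactly, the loss is zero; the further the student’s 16-way action distribution drifts from the teacher’s, the larger the penalty.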
This framework enables LatentVLA to operate with a very compact reasoning backbone and provides a general approach to integrating VLM knowledge into traditional E2E architectures at a lesser cost.

Evaluation
LatentVLA is trained and evaluated on NavSim [6], a benchmark composed of over 100,000 frames collected from real-world driving data. NavSim also includes a non-reactive simulator to evaluate open-loop planning.
In other words, the model predicts a trajectory over the next few seconds given input images. This trajectory is then executed in a BEV simulation operating on the assumption that the ego-vehicle’s actions do not affect the actions of other agents (hence “non-reactive”). This makes it easy to measure planning-related metrics such as the Predictive Driver Model Score (PDMS): a composite metric that quantifies driving safety, performance, and risk by aggregating simulation outputs.
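To get a feel for how such a composite score works, here is an illustrative (not NavSim’s exact) computation: hard constraints like collision avoidance multiply the score to zero when violated, while soft sub-scores enter a weighted average. The weights below are made up.

```python
def composite_driving_score(no_collision, drivable_area, ttc, comfort,
                            progress, weights=(5, 2, 5)):
    """Illustrative composite score in the spirit of PDMS.

    `no_collision` and `drivable_area` are hard pass/fail multipliers (0 or 1);
    `ttc` (time-to-collision margin), `comfort` and `progress` are soft
    sub-scores in [0, 1]."""
    w_ttc, w_comfort, w_progress = weights
    soft = (w_ttc * ttc + w_comfort * comfort + w_progress * progress) / sum(weights)
    return no_collision * drivable_area * soft
```

A single collision zeroes the episode score regardless of how comfortable or fast the drive was, which is why small PDMS differences near the top of the leaderboard are hard to interpret.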
However, this type of evaluation has some important shortcomings, as we’ll discuss later.

On this benchmark, LatentVLA obtains state-of-the-art results, improving upon standard E2E and LLM-based architectures. However, the performance increase obtained by integrating VLM knowledge into iPad and Transfuser seems limited. Focusing on the PDMS, we observe that the iPad baseline obtains a score of 91.7. The distilled LatentVLA alternative increases the score to 92.1 (+0.4 points) and the non-distilled version reaches 92.4 (another +0.3).
This small improvement raises the question of whether higher-level reasoning and world knowledge really are essential to driving.
In my opinion, they have the potential to unlock a new level of driving performance, but this is poorly measured by non-interactive planning simulators.

The limitations of open-loop planning
In recent years, it has become widely accepted that evaluating driving models only on open-loop planning gives an incomplete picture of their real driving abilities. Indeed, open-loop planning is fundamentally different from driving, and arguably easier: it involves no interaction with the environment (the simulator is at best non-reactive) and reduces to imitating the trajectory of an expert. This creates multiple problems in real scenarios:
- Small deviations from the learned trajectories lead to cascading errors: without dynamic interactions with the environment and other agents, open-loop models struggle to rectify trajectories that are slightly misaligned with ones they learned.
- Trajectories are inherently multimodal: for each driving situation, there exist multiple trajectories and acceleration patterns leading to safe driving outcomes. However, imitation learning on a single expert trajectory collapses this multi-modality, limiting the generalisation capabilities of the model.
For these reasons, it is important to thoroughly evaluate driving models in closed-loop (i.e. reactive) simulators; this also warrants the use of RL post-training methods, as discussed in the AR1 article.
I would bet that the gap between LatentVLA and its non-VLM baselines is larger in these scenarios, as reasoning could help alleviate the limitations of open-loop training.
Conclusion
In this article, we discussed LatentVLA, an approach that integrates VLM knowledge into standard E2E models without relying on natural language. It is innovative in that it learns useful representations from unlabeled data, whereas competing works like AR1 rely on carefully annotated large-scale datasets to circumvent the ambiguity of natural language.
However, LatentVLA would benefit from more thorough evaluation, in particular in closed-loop settings.
Thank you for reading this far!
If you found this article useful, please consider sharing it; it genuinely helps support the time and effort that goes into producing this work. As always, feel free to contact me if you have questions, thoughts, or ideas for follow-ups. If you’d like to support my independent research and writing, feel free to buy me a coffee 😉
Until next time! 👋