onevl by xiaomi-research

Autonomous driving trajectory prediction framework

Created 2 months ago

446 stars

Top 66.5% on SourcePulse

Project Summary

OneVL is a Vision-Language-Action (VLA) framework for autonomous driving that targets the trade-off between interpretability and speed in trajectory prediction models. It reports state-of-the-art accuracy at an inference latency comparable to faster, non-interpretable models, so it can serve real-time applications while still producing explainable reasoning.

How It Works

The core innovation is a pair of auxiliary decoders, one per modality, that supervise compact latent tokens during training. A language auxiliary decoder reconstructs explicit Chain-of-Thought (CoT) reasoning from the language latents, while a visual auxiliary decoder predicts future scene frames from the visual latents and so acts as a world model. At inference, both decoders are removed and all latent tokens are prefilled in a single parallel pass. The model therefore runs at the speed of answer-only autoregressive (AR) prediction while remaining interpretable, and it avoids the performance degradation that prior latent CoT methods showed on driving tasks.
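The latency argument above can be made concrete by counting sequential forward passes. The following is a toy cost model, not OneVL's code: the class, the function, the mode names, and the token counts are all hypothetical. It assumes that every autoregressively generated token costs one sequential pass, that a prefill of any length costs one parallel pass, and that the latent prefill is a pass of its own (if it is merged into the prompt prefill, the count equals the answer-only case exactly).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecodeBudget:
    """Hypothetical token counts for one driving query (illustrative only)."""
    cot_tokens: int     # explicit Chain-of-Thought tokens, generated one by one
    latent_tokens: int  # compact latent tokens, prefilled together
    answer_tokens: int  # trajectory/answer tokens, generated one by one


def sequential_passes(budget: DecodeBudget, mode: str) -> int:
    """Count sequential forward passes under the simple cost model above."""
    prompt_prefill = 1  # the whole prompt is processed in one parallel pass
    if mode == "explicit_cot":
        # Reasoning and answer are both decoded token by token.
        return prompt_prefill + budget.cot_tokens + budget.answer_tokens
    if mode == "answer_only":
        # No reasoning at all: only the answer is decoded.
        return prompt_prefill + budget.answer_tokens
    if mode == "latent_prefill":
        # All latent tokens go through one extra parallel pass, so their
        # number (budget.latent_tokens) does not add sequential steps.
        return prompt_prefill + 1 + budget.answer_tokens
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up numbers, chosen only to show the shape of the comparison.
budget = DecodeBudget(cot_tokens=200, latent_tokens=16, answer_tokens=20)
for mode in ("explicit_cot", "answer_only", "latent_prefill"):
    print(f"{mode}: {sequential_passes(budget, mode)} sequential passes")
```

Under these assumptions the latent-prefill path costs one pass more than answer-only decoding regardless of how many latent tokens are used, whereas explicit CoT grows with the length of the reasoning text.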

Quick Start & Requirements

Installation: Create and activate a Python 3.12 virtual environment, then run pip install -r requirements.txt.
Prerequisites: Python 3.10+ (3.12 recommended), CUDA-enabled GPU (≥16 GB VRAM recommended for inference with auxiliary decoders). Key dependencies include torch==2.10.0, transformers==4.57.0, and omegaconf>=2.3.0.
Resources: Inference code and model weights are available. Visualization requires downloading the BAAI/Emu3.5-VisionTokenizer model.
Links:
- Tech Report: https://arxiv.org/abs/2604.18486
- Project Page: https://xiaomi-embodied-intelligence.github.io/OneVL/
- Model Weights: https://huggingface.co/collections/xiaomi-research/onevl-models/
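Before running inference, it can help to confirm that the active environment matches the prerequisites listed above. The following is a minimal sketch, not part of the OneVL repository: the helper names `parse_version` and `check_environment` are hypothetical, and every version is treated as a minimum, which is looser than the `==` pins quoted above. The repository's requirements.txt remains the authoritative list.

```python
import sys
from importlib import metadata

# Versions as quoted in this summary; treated here as minimums (assumption).
MIN_PYTHON = (3, 10)
REQUIRED = {
    "torch": (2, 10, 0),
    "transformers": (4, 57, 0),
    "omegaconf": (2, 3, 0),
}


def parse_version(text: str) -> tuple[int, ...]:
    """Return the leading numeric components, e.g. '2.10.0+cu128' -> (2, 10, 0)."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def check_environment(required=REQUIRED, min_python=MIN_PYTHON,
                      python_version=None, get_version=metadata.version):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    current = tuple(python_version or sys.version_info[:2])
    if current < min_python:
        problems.append(f"Python {current} is older than {min_python}")
    for name, minimum in required.items():
        try:
            installed = parse_version(get_version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Pad short versions so that '2.3' compares equal to (2, 3, 0).
        installed += (0,) * (len(minimum) - len(installed))
        if installed < minimum:
            problems.append(f"{name}: found {installed}, need at least {minimum}")
    return problems


for problem in check_environment():
    print(problem)
```

The `get_version` and `python_version` parameters exist only so the check can be exercised without the real packages installed; in normal use the defaults query the active interpreter.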

Highlighted Details

Achieves state-of-the-art trajectory prediction accuracy across benchmarks like NAVSIM and ROADWork.
Inference latency is comparable to answer-only AR models, significantly faster than explicit CoT methods.
The language auxiliary decoder recovers 97% of explicit CoT quality while operating at answer-only speed.
Outperforms previous latent CoT methods, which suffered performance degradation on driving tasks.
Staged training is identified as essential for achieving full performance.

Maintenance & Community

The README lists no community channels (e.g., Discord, Slack) and gives no details on active maintenance beyond the author list.

Licensing & Compatibility

License: Apache 2.0.
Compatibility: Model weights are based on Qwen3-VL-4B-Instruct and Emu3.5-VisionTokenizer; their respective licenses must also be considered. The Apache 2.0 license is permissive for commercial use.

Limitations & Caveats

Requires a recent version of the transformers library (>= 4.57.0). Generating visual explanations requires downloading the external Emu3.5 VQ-VAE model. Performance depends critically on the staged training methodology.

Health Check

Last Commit: 1 month ago
Responsiveness: Inactive
Star History: 25 stars in the last 30 days