onevl  by xiaomi-research

Autonomous driving trajectory prediction framework

Created 3 weeks ago

386 stars

Top 73.9% on SourcePulse

Project Summary

OneVL is a Vision-Language-Action (VLA) framework for autonomous driving that targets the trade-off between interpretability and speed in trajectory prediction models. It reports state-of-the-art accuracy at inference latency comparable to faster, non-interpretable models, so it can serve real-time applications while still providing explainable reasoning.

How It Works

The core innovation is a pair of auxiliary decoders, one per modality, that supervise compact latent tokens during training. A language auxiliary decoder reconstructs explicit Chain-of-Thought (CoT) reasoning from the language latents, while a visual auxiliary decoder predicts future scene frames from the visual latents, acting as a world model. At inference, both decoders are removed and all latent tokens are prefilled in a single parallel pass. The model therefore runs at the speed of answer-only autoregressive (AR) prediction while retaining interpretability, and it avoids the performance degradation that prior latent CoT methods showed on driving tasks.
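To make the speed argument concrete, the following is a minimal back-of-the-envelope sketch, not code from the OneVL repository. It counts the forward passes that must run one after another under three decoding modes. The class `DecodeBudget`, the function `sequential_passes`, and all token counts are hypothetical, and the sketch assumes the latent tokens share the same parallel prefill pass as the prompt.

```python
from dataclasses import dataclass


@dataclass
class DecodeBudget:
    """Hypothetical token counts for one prediction; not measured values."""
    cot_tokens: int     # length of an explicit Chain-of-Thought rationale
    latent_tokens: int  # number of compact latent tokens (language + visual)
    answer_tokens: int  # tokens in the trajectory answer


def sequential_passes(budget: DecodeBudget, mode: str) -> int:
    """Count forward passes that cannot be parallelized.

    Autoregressive decoding costs one pass per generated token, while a
    prefill processes any number of already-known tokens in one pass.
    """
    prompt_prefill = 1  # every mode prefills the prompt once
    if mode == "explicit_cot":
        # The rationale is generated token by token before the answer.
        return prompt_prefill + budget.cot_tokens + budget.answer_tokens
    if mode == "answer_only":
        return prompt_prefill + budget.answer_tokens
    if mode == "latent_prefill":
        # Assumption: latent tokens are prefilled together with the prompt,
        # so their count does not add sequential passes.
        return prompt_prefill + budget.answer_tokens
    raise ValueError(f"unknown mode: {mode}")


# Illustrative numbers only.
budget = DecodeBudget(cot_tokens=200, latent_tokens=8, answer_tokens=20)
for mode in ("explicit_cot", "answer_only", "latent_prefill"):
    print(mode, sequential_passes(budget, mode))
```

Under these assumptions the latent mode needs the same number of sequential passes as answer-only decoding, because the latent tokens add width to the prefill rather than extra decoding steps. Real latency also depends on hardware and sequence length, which this count ignores.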

Quick Start & Requirements

Highlighted Details

  • Achieves state-of-the-art trajectory prediction accuracy on benchmarks including NAVSIM and ROADWork.
  • Inference latency is comparable to answer-only AR models and significantly lower than that of explicit CoT methods.
  • The language auxiliary decoder recovers 97% of explicit CoT quality while operating at answer-only speed.
  • Outperforms previous latent CoT methods, which suffer performance degradation on driving tasks.
  • Staged training is identified as essential for achieving full performance.

Maintenance & Community

No specific community channels (e.g., Discord, Slack) or details on active maintenance beyond the listed authors are provided in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Model weights are based on Qwen3-VL-4B-Instruct and Emu3.5-VisionTokenizer; their respective licenses must also be considered. The Apache 2.0 license is permissive for commercial use.

Limitations & Caveats

Requires version 4.57.0 or later of the transformers library. Generating visual explanations requires downloading external models for the Emu3.5 VQ-VAE. Full performance depends critically on the staged training methodology.
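The version requirement can be checked before running anything else. The snippet below is a minimal sketch using only the Python standard library; the helper names `release_tuple`, `meets_minimum`, and `check_transformers` are hypothetical and are not part of OneVL or transformers. It compares only the numeric release components, so a pre-release such as 4.57.0.dev0 is treated as 4.57.0.

```python
import re
from importlib import metadata

MIN_TRANSFORMERS = "4.57.0"  # minimum stated in the summary above


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components, padded to three parts."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    parts = tuple(int(p) for p in match.group().split("."))
    # Pad so that "4.57" compares equal to "4.57.0".
    return parts + (0,) * (3 - len(parts))


def meets_minimum(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True if the installed release is at least the minimum release."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_transformers() -> bool:
    """Look up the installed transformers version, if any, and compare it."""
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return False
    return meets_minimum(installed)


if __name__ == "__main__":
    print("transformers OK:", check_transformers())
```

For stricter handling of pre-release and post-release tags, a dedicated version parser would be a better fit than this regex-based comparison.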

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 1
  • Issues (30d): 11
  • Star history: 387 stars in the last 27 days
