Qwen-VLA by QwenLM

Unified vision-language-action model for embodied AI

Created 2 months ago

714 stars

Top 47.3% on SourcePulse

Project Summary

Qwen-VLA is a unified generalist model for embodied AI tasks such as manipulation and navigation. It targets robotics researchers and engineers, offering a single model that is reported to match or outperform task-specific specialists across diverse robot platforms and environments.

How It Works

Qwen-VLA pairs a Qwen3.5-4B vision-language backbone with a 1.15B-parameter DiT (diffusion transformer) action decoder trained with flow matching. Heterogeneous embodied data is mapped into a shared action-and-trajectory prediction space, so a single model can learn from diverse tasks and robot embodiments. Embodiment-aware prompt conditioning selects the target platform, which eliminates per-platform output heads. A progressive training recipe (action pretraining, multimodal continued pretraining, SFT, RL) bridges discrete tokens and continuous actions.
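A flow-matching decoder produces a continuous action by integrating a velocity field from Gaussian noise (t = 0) to an action (t = 1). The sketch below illustrates only that sampling loop, assuming forward Euler integration. The closed-form `toy_velocity` stands in for the learned DiT, which would condition on vision-language features rather than on a known target. The function names, the step count, and the 7-dimensional action are illustrative assumptions and do not come from the README.

```python
import random


def toy_velocity(x, t, target):
    """Closed-form stand-in (assumption) for a learned velocity network.

    On the straight-line path x_t = (1 - t) * x_0 + t * target, the velocity
    that carries x_t to the target at t = 1 is (target - x_t) / (1 - t).
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(velocity_fn, target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with forward Euler steps."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # t stays strictly below 1, so 1 - t is never zero
        v = velocity_fn(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Hypothetical 7-dimensional action, e.g. six joint deltas plus a gripper command.
target_action = [0.1, -0.2, 0.05, 0.0, 0.3, -0.1, 1.0]
action = sample_action(toy_velocity, target_action)
```

With this toy field the Euler iterates stay on the straight line between the noise sample and the target, so the final step lands on the target up to floating-point error. A learned field gives no such guarantee, and the number of integration steps becomes a trade-off between latency and accuracy.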

Quick Start & Requirements

Official information, a demo, and a technical report are available.

Demo: https://huggingface.co/spaces/Qwen/Qwen-VL-Chat
Technical Report: https://arxiv.org/abs/2605.30280

The provided README does not detail installation instructions, hardware prerequisites (e.g., GPU, CUDA), or setup times.

Highlighted Details

Generalist Performance: A single Qwen-VLA model matches or outperforms task-specific specialists across multiple simulation and real-world benchmarks.
Unified Framework: Manipulation, navigation, and trajectory prediction are handled within one shared action-and-trajectory prediction space.
Embodiment Agnosticism: Embodiment-aware prompt conditioning allows a single model to adapt to multiple robot platforms via text prompts.
OOD Generalization: Large-scale embodied pretraining yields robust out-of-distribution generalization in real-world deployments.
Real-World Validation: On ALOHA, pre-trained Qwen-VLA-aloha achieved an 83.6% average success rate, surpassing specialist models.
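The README does not describe the prompt format or how actions of different widths share one output space. The sketch below shows one common way such a design could work, under stated assumptions: a text prompt names the embodiment, and each robot's action vector is zero-padded to a fixed width with a mask marking the valid entries. `SHARED_ACTION_DIM`, the `Embodiment` fields, the prompt wording, and both example robots are hypothetical.

```python
from dataclasses import dataclass

SHARED_ACTION_DIM = 16  # assumed width of the shared action space


@dataclass(frozen=True)
class Embodiment:
    name: str
    action_dim: int
    description: str


def build_prompt(embodiment, instruction):
    """Prefix the task instruction with a text description of the robot."""
    return (
        f"Embodiment: {embodiment.name} ({embodiment.description}, "
        f"{embodiment.action_dim}-dim actions). Task: {instruction}"
    )


def to_shared_space(action, embodiment):
    """Zero-pad a robot-specific action to the shared width and return a validity mask."""
    if len(action) != embodiment.action_dim:
        raise ValueError("action length does not match the embodiment")
    pad = SHARED_ACTION_DIM - embodiment.action_dim
    if pad < 0:
        raise ValueError("embodiment exceeds the shared action width")
    padded = list(action) + [0.0] * pad
    mask = [1] * embodiment.action_dim + [0] * pad
    return padded, mask


def from_shared_space(padded, embodiment):
    """Recover the robot-specific action by dropping the padded entries."""
    return list(padded[: embodiment.action_dim])


# Hypothetical embodiments: a bimanual arm pair and a planar mobile base.
bimanual = Embodiment("bimanual_arms", 14, "two 7-DoF arms")
base = Embodiment("mobile_base", 3, "planar base with vx, vy, yaw rate")

prompt = build_prompt(base, "drive to the kitchen door")
padded, mask = to_shared_space([0.5, 0.0, -0.1], base)
recovered = from_shared_space(padded, base)
```

Because the platform is selected by the prompt and the mask rather than by a dedicated output head, supporting a new robot in this sketch means adding a new `Embodiment` entry instead of a new network branch.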

Maintenance & Community

Developed by the Qwen Team. The README lists no community channels (e.g., Discord, Slack) and no roadmap. The extensive author list suggests a significant research effort.

Licensing & Compatibility

The provided README specifies no license. Confirm the licensing terms before commercial use or integration into closed-source projects.

Limitations & Caveats

The provided README does not explicitly state any limitations, unsupported platforms, or known bugs. The model is presented as a generalist solution achieving state-of-the-art performance across various benchmarks.

Health Check

Last Commit: 1 month ago

Responsiveness: Inactive

Star History

62 stars in the last 30 days