QwenLM: Unified vision-language-action model for embodied AI
Qwen-VLA is a unified generalist model for embodied AI tasks such as manipulation and navigation. It targets robotics researchers and engineers, offering a single model that, according to the README, surpasses task-specific specialists across diverse platforms and environments.
How It Works
Qwen-VLA integrates a Qwen3.5-4B vision-language backbone with a 1.15B-parameter DiT flow-matching action decoder. It maps heterogeneous embodied data into a shared action-and-trajectory prediction space, so a single model can learn from diverse tasks and robot embodiments. Embodiment-aware prompt conditioning takes the place of per-platform output heads. A progressive training recipe (action pretraining, multimodal continued pretraining, SFT, and RL) bridges discrete tokens and continuous actions.
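To make the flow-matching idea concrete, here is a minimal sketch of the general technique in plain Python. It is not Qwen-VLA's implementation: the README summary only names a "DiT flow-matching action decoder", so the straight-line noise-to-action path, the Euler sampler, the `condition` dictionary, and every function name below are illustrative assumptions. A toy closed-form velocity field stands in for the learned decoder, and the condition carries a hypothetical embodiment label in place of real backbone features.

```python
# Minimal flow-matching sketch (assumed linear path, toy velocity field).
# Nothing here is taken from the Qwen-VLA code base.

def interpolate(noise, action, t):
    """Point on the straight-line path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Regression target for the linear path: d/dt x_t = action - noise."""
    return [a - n for n, a in zip(noise, action)]


def flow_matching_loss(predict_velocity, noise, action, t, condition):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(noise, action, t)
    pred = predict_velocity(x_t, t, condition)
    target = target_velocity(noise, action)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def sample_action(predict_velocity, noise, condition, steps=10):
    """Euler-integrate the velocity field from t=0 (noise) toward t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays strictly below 1 inside the loop
        v = predict_velocity(x, t, condition)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x, t, condition):
    """Toy stand-in for a learned decoder: the exact velocity field when the
    data distribution is a single known action stored in the condition."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(condition["target"], x)]


# Hypothetical condition: a real system would pass backbone features here.
action = [0.1, 0.2, -0.3]
condition = {"embodiment": "7-dof-arm (hypothetical label)", "target": action}
noise = [0.5, -1.0, 2.0]

sampled = sample_action(oracle_velocity, noise, condition, steps=10)
loss = flow_matching_loss(oracle_velocity, noise, action, 0.3, condition)
```

With the oracle field, the sampler lands on the stored action and the loss is zero up to floating-point error, which is a quick sanity check of the path and target definitions. In a trained decoder, `predict_velocity` would be a network conditioned on vision-language features, and the loss would be averaged over sampled noise, timesteps, and demonstrations.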
Quick Start & Requirements
Official information, a demo, and a technical report are available.
Highlighted Details
Maintenance & Community
Developed by the "Qwen Team." No specific community channels (e.g., Discord, Slack) or detailed roadmap information are provided in the README. The extensive author list suggests a significant research effort.
Licensing & Compatibility
No license information is specified in the provided README. This omission requires further investigation for commercial use or integration into closed-source projects.
Limitations & Caveats
The provided README does not explicitly state any limitations, unsupported platforms, or known bugs. The model is presented as a generalist solution achieving state-of-the-art performance across various benchmarks.