VLA codebase for integrating vision-language models into robot policies
RoboVLMs provides a flexible codebase for integrating Vision-Language Models (VLMs) into Vision-Language-Action (VLA) models for robotics. It aims to simplify the process of adapting existing VLMs for robotic control tasks, enabling researchers and practitioners to build more generalist robot policies.
How It Works
RoboVLMs facilitates VLA model creation by abstracting the core components of VLMs. It defines a standardized interface for VLM integration, requiring users to specify key attributes like the image processor, hidden size, and the model's internal components (vision tower, text tower, etc.). This modular design allows for seamless adaptation of various VLMs, including models like PaliGemma and KosMos, by providing a clear structure for handling image-to-token conversion and feature fusion.
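As a concrete illustration, a minimal sketch of what such a standardized interface might look like is shown below; the class and attribute names are hypothetical and do not reflect RoboVLMs' actual API.

```python
# Hypothetical sketch of a standardized VLM backbone interface, in the spirit
# of the abstraction described above (all names here are illustrative).
from abc import ABC, abstractmethod

import torch
import torch.nn as nn


class VLMBackbone(nn.Module, ABC):
    """Uniform adapter around a pretrained VLM (e.g., PaliGemma, KosMos)."""

    @property
    @abstractmethod
    def image_processor(self):
        """Preprocessor that turns raw images into model-ready pixel values."""

    @property
    @abstractmethod
    def hidden_size(self) -> int:
        """Width of the token embeddings used by the language model."""

    @property
    @abstractmethod
    def vision_tower(self) -> nn.Module:
        """Image encoder of the underlying VLM."""

    @property
    @abstractmethod
    def text_tower(self) -> nn.Module:
        """Language model of the underlying VLM."""

    @abstractmethod
    def encode_images(self, pixel_values: torch.Tensor) -> torch.Tensor:
        """Image-to-token conversion: (B, C, H, W) -> (B, N, hidden_size)."""

    def fuse(self, visual_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        """Default feature fusion: prepend visual tokens to the text sequence."""
        return torch.cat([visual_tokens, text_embeds], dim=1)
```

A policy built on top of this would feed the fused token sequence into an action head (for example, an MLP over the fused features) to predict robot actions; the point of the abstraction is that swapping the underlying VLM only requires implementing the adapter, not rewriting the policy.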
Quick Start & Requirements
Create a conda environment with `conda create -n robovlms python=3.8.10` (for CALVIN) or `python=3.10` (for SIMPLER), then run `conda activate robovlms` and `conda install cudatoolkit cudatoolkit-dev -y`, followed by `pip install -e .`. For OXE dataset training, clone https://github.com/lixinghang12/openvla and install from there. Benchmark setup scripts (`scripts/setup_calvin.sh`, `scripts/simplerenv.sh`) are provided.
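Consolidated into a shell session, the steps above look roughly as follows (the OpenVLA install command and the benchmark script invocations are assumptions; follow the upstream instructions for details):

```bash
# Environment setup per the README (use python=3.10 instead for SIMPLER).
conda create -n robovlms python=3.8.10
conda activate robovlms
conda install cudatoolkit cudatoolkit-dev -y
pip install -e .

# Optional: OXE dataset training uses the referenced OpenVLA fork.
git clone https://github.com/lixinghang12/openvla
pip install -e ./openvla   # assumed editable install; see that repo's README

# Benchmark-specific setup scripts shipped with the repository.
bash scripts/setup_calvin.sh   # CALVIN
bash scripts/simplerenv.sh     # SIMPLER
```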
Maintenance & Community
The project is led by Xinghang Li and includes authors from Tsinghua University and ByteDance Research. The README encourages community contributions.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.
Limitations & Caveats
Some supported VLMs are marked as "Not fully tested," indicating potential performance issues or a need for additional hyperparameter tuning. The project relies on specific versions of `transformers` for certain VLMs, which may require careful dependency management.