RoboVLMs by Robot-VLAs

VLA codebase for integrating vision-language models into robot policies

created 7 months ago
377 stars

Top 76.5% on sourcepulse

Project Summary

RoboVLMs provides a flexible codebase for integrating Vision-Language Models (VLMs) into Vision-Language-Action (VLA) models for robotics. It aims to simplify the process of adapting existing VLMs for robotic control tasks, enabling researchers and practitioners to build more generalist robot policies.

How It Works

RoboVLMs facilitates VLA model creation by abstracting the core components of VLMs. It defines a standardized interface for VLM integration, requiring users to specify key attributes such as the image processor, hidden size, and the model's internal components (vision tower, text tower, etc.). This modular design provides a clear structure for image-to-token conversion and feature fusion, allowing various VLMs, such as PaliGemma and KosMos, to be adapted with little change to the rest of the pipeline.
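
As a rough illustration of this interface, the sketch below shows what a backbone adapter might look like. The class names, properties, and PaliGemma accessors are assumptions chosen for illustration, not the exact RoboVLMs API.

```python
# Illustrative sketch, not the actual RoboVLMs API: a VLM is exposed to the
# VLA pipeline through a small set of standardized properties.
import torch.nn as nn


class BaseVLMBackbone(nn.Module):
    """Hypothetical interface a wrapped VLM is expected to satisfy."""

    @property
    def image_processor(self):
        """Preprocessor that turns raw images into model-ready pixel tensors."""
        raise NotImplementedError

    @property
    def hidden_size(self) -> int:
        """Width of the fused vision-language token embeddings."""
        raise NotImplementedError

    @property
    def vision_tower(self) -> nn.Module:
        """Image encoder that converts pixels into visual tokens."""
        raise NotImplementedError

    @property
    def text_tower(self) -> nn.Module:
        """Language model that consumes instruction and visual tokens."""
        raise NotImplementedError


class PaliGemmaBackbone(BaseVLMBackbone):
    """Example adapter around a Hugging Face PaliGemma checkpoint."""

    def __init__(self, model, processor):
        super().__init__()
        self.model = model          # e.g. a loaded PaliGemmaForConditionalGeneration
        self.processor = processor  # the matching AutoProcessor

    @property
    def image_processor(self):
        return self.processor.image_processor

    @property
    def hidden_size(self) -> int:
        return self.model.config.text_config.hidden_size

    @property
    def vision_tower(self) -> nn.Module:
        return self.model.vision_tower

    @property
    def text_tower(self) -> nn.Module:
        return self.model.language_model
```

Adapters for other backbones (Flamingo, KosMos, Qwen, LLaVA) would follow the same pattern, so the action head and training loop never need to know which VLM sits underneath.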

Quick Start & Requirements

  • Installation: Create a conda environment with conda create -n robovlms python=3.8.10 (for CALVIN) or python=3.10 (for SimplerEnv), run conda activate robovlms and conda install cudatoolkit cudatoolkit-dev -y, then install the package with pip install -e .. For training on the OXE dataset, clone https://github.com/lixinghang12/openvla and install from that fork.
  • Prerequisites: CUDA toolkit, Python 3.8.10 or 3.10. For simulation benchmarks (CALVIN, SimplerEnv), specific environment setup scripts (scripts/setup_calvin.sh, scripts/simplerenv.sh) are provided.
  • Setup Time: Environment setup for benchmarks may take time depending on dataset downloads.
  • Links: Technical Report, openvla fork.

Highlighted Details

  • Achieves state-of-the-art performance on CALVIN benchmarks with its KosMos VLM backbone.
  • Supports integration of various VLMs (Flamingo, Qwen, LLaVA, KosMos2, etc.) with different VLA architectures.
  • Offers detailed tutorials for integrating new VLMs and configuring training pipelines.
  • Supports multiple action heads, including LSTM, MLP, and GPT2 decoders (see the configuration sketch after this list).
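
As a rough sketch of how a backbone and an action head might be paired, the hypothetical configuration fragment below is illustrative only; the keys and default values are assumptions, not the repository's actual config schema.

```python
# Hypothetical config fragment (keys and values are illustrative, not the
# exact RoboVLMs schema): pair a VLM backbone with one of the action heads.
vla_config = {
    "vlm_backbone": {
        "name": "kosmos",        # e.g. "flamingo", "qwen", "llava", "kosmos"
        "pretrained": True,
    },
    "action_head": {
        "type": "lstm",          # "lstm", "mlp", or "gpt2"
        "hidden_size": 1024,
        "action_dim": 7,         # e.g. 6-DoF end-effector delta plus gripper
    },
    "training": {
        "window_size": 8,        # observation history length fed to the policy
        "learning_rate": 1e-4,
    },
}
```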

Maintenance & Community

The project is led by Xinghang Li and includes authors from Tsinghua University and ByteDance Research. The README encourages community contributions.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

Some supported VLMs are marked "Not fully tested," so they may underperform or require additional hyperparameter tuning. Certain VLMs also rely on specific versions of transformers, which calls for careful dependency management.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 46 stars in the last 90 days
