RoboVLMs by Robot-VLAs

VLA codebase for integrating vision-language models into robot policies

created 7 months ago
377 stars

Top 76.5% on sourcepulse

Project Summary

RoboVLMs provides a flexible codebase for integrating Vision-Language Models (VLMs) into Vision-Language-Action (VLA) models for robotics. It aims to simplify the process of adapting existing VLMs for robotic control tasks, enabling researchers and practitioners to build more generalist robot policies.

How It Works

RoboVLMs facilitates VLA model creation by abstracting the core components of VLMs. It defines a standardized interface for VLM integration, requiring users to specify key attributes such as the image processor, hidden size, and the model's internal components (vision tower, text tower, etc.). This modular design provides a clear structure for image-to-token conversion and feature fusion, allowing various VLMs, such as PaliGemma and KosMos, to be adapted with little change to the rest of the pipeline.
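
As a rough illustration of this interface, the sketch below shows what a backbone adapter might look like. The class names, properties, and PaliGemma accessors are assumptions chosen for illustration, not the exact RoboVLMs API.

```python
# Illustrative sketch, not the actual RoboVLMs API: a VLM is exposed to the
# VLA pipeline through a small set of standardized properties.
import torch.nn as nn


class BaseVLMBackbone(nn.Module):
    """Hypothetical interface a wrapped VLM is expected to satisfy."""

    @property
    def image_processor(self):
        """Preprocessor that turns raw images into model-ready pixel tensors."""
        raise NotImplementedError

    @property
    def hidden_size(self) -> int:
        """Width of the fused vision-language token embeddings."""
        raise NotImplementedError

    @property
    def vision_tower(self) -> nn.Module:
        """Image encoder that converts pixels into visual tokens."""
        raise NotImplementedError

    @property
    def text_tower(self) -> nn.Module:
        """Language model that consumes instruction and visual tokens."""
        raise NotImplementedError


class PaliGemmaBackbone(BaseVLMBackbone):
    """Example adapter around a Hugging Face PaliGemma checkpoint."""

    def __init__(self, model, processor):
        super().__init__()
        self.model = model          # e.g. a loaded PaliGemmaForConditionalGeneration
        self.processor = processor  # the matching AutoProcessor

    @property
    def image_processor(self):
        return self.processor.image_processor

    @property
    def hidden_size(self) -> int:
        return self.model.config.text_config.hidden_size

    @property
    def vision_tower(self) -> nn.Module:
        return self.model.vision_tower

    @property
    def text_tower(self) -> nn.Module:
        return self.model.language_model
```

Adapters for other backbones (Flamingo, KosMos, Qwen, LLaVA) would follow the same pattern, so the action head and training loop never need to know which VLM sits underneath.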

Quick Start & Requirements

  • Installation: Create a conda environment with conda create -n robovlms python=3.8.10 (for CALVIN) or python=3.10 (for SimplerEnv), run conda activate robovlms and conda install cudatoolkit cudatoolkit-dev -y, then install the package with pip install -e .. For training on the OXE dataset, clone https://github.com/lixinghang12/openvla and install from that fork.
  • Prerequisites: CUDA toolkit, Python 3.8.10 or 3.10. For simulation benchmarks (CALVIN, SimplerEnv), specific environment setup scripts (scripts/setup_calvin.sh, scripts/simplerenv.sh) are provided.
  • Setup Time: Environment setup for benchmarks may take time depending on dataset downloads.
  • Links: Technical Report, openvla fork.

Highlighted Details

  • Achieves state-of-the-art performance on CALVIN benchmarks with its KosMos VLM backbone.
  • Supports integration of various VLMs (Flamingo, Qwen, LLaVA, KosMos2, etc.) with different VLA architectures.
  • Offers detailed tutorials for integrating new VLMs and configuring training pipelines.
  • Supports multiple action heads, including LSTM, MLP, and GPT2 decoders (see the configuration sketch after this list).
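
As a rough sketch of how a backbone and an action head might be paired, the hypothetical configuration fragment below is illustrative only; the keys and default values are assumptions, not the repository's actual config schema.

```python
# Hypothetical config fragment (keys and values are illustrative, not the
# exact RoboVLMs schema): pair a VLM backbone with one of the action heads.
vla_config = {
    "vlm_backbone": {
        "name": "kosmos",        # e.g. "flamingo", "qwen", "llava", "kosmos"
        "pretrained": True,
    },
    "action_head": {
        "type": "lstm",          # "lstm", "mlp", or "gpt2"
        "hidden_size": 1024,
        "action_dim": 7,         # e.g. 6-DoF end-effector delta plus gripper
    },
    "training": {
        "window_size": 8,        # observation history length fed to the policy
        "learning_rate": 1e-4,
    },
}
```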

Maintenance & Community

The project is led by Xinghang Li and includes authors from Tsinghua University and ByteDance Research. The README encourages community contributions.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration into closed-source projects.

Limitations & Caveats

Some supported VLMs are marked "Not fully tested," so they may underperform or require additional hyperparameter tuning. Certain VLMs also rely on specific versions of transformers, which calls for careful dependency management.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 46 stars in the last 90 days
