RoboFlamingo by RoboFlamingo

Robotics learning framework for language-conditioned robot skills via fine-tuning

created 1 year ago
396 stars

Top 74.0% on sourcepulse

View on GitHub
Project Summary

RoboFlamingo provides a framework for adapting Vision-Language Models (VLMs) to robot imitation learning, enabling robots to perform a variety of language-conditioned skills. It targets researchers and practitioners in robotics and AI, offering a cost-effective and user-friendly approach to fine-tuning robotic policies.

How It Works

RoboFlamingo leverages pre-trained VLMs, integrating vision encoders (like OpenCLIP) with language models (e.g., MPT, LLaMA) via cross-attention mechanisms. This allows the model to process both visual and textual inputs to generate robot actions. The framework supports different decoder types (LSTM, FC, Diffusion, GPT) and offers flexibility in configuring cross-attention frequency, enabling fine-tuning on diverse imitation datasets.
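To make the architecture concrete, here is a minimal sketch of one of the supported decoder types, an LSTM policy head that turns per-timestep fused vision-language features into robot actions. All class names, dimensions, and the arm/gripper output split are illustrative assumptions, not the project's actual API.

```python
import torch
import torch.nn as nn

class LSTMPolicyHead(nn.Module):
    """Hypothetical LSTM decoder: VLM features -> action sequence."""

    def __init__(self, feat_dim=1024, hidden_dim=512, action_dim=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.arm_head = nn.Linear(hidden_dim, action_dim - 1)  # e.g. 6-DoF pose delta
        self.gripper_head = nn.Linear(hidden_dim, 1)           # open/close logit

    def forward(self, vlm_feats):
        # vlm_feats: (batch, time, feat_dim) features from the VLM backbone
        h, _ = self.lstm(vlm_feats)
        return self.arm_head(h), self.gripper_head(h)

head = LSTMPolicyHead()
feats = torch.randn(2, 12, 1024)  # 2 trajectories, 12 timesteps each
arm, grip = head(feats)
print(arm.shape, grip.shape)      # torch.Size([2, 12, 6]) torch.Size([2, 12, 1])
```

The FC, Diffusion, and GPT decoder options would replace the LSTM with a feed-forward, diffusion, or autoregressive transformer head over the same feature sequence.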

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Requires Python 3.8+, PyTorch, pre-trained OpenFlamingo VLM checkpoints, and the CALVIN dataset. GPU memory requirements depend on model size; the reported experiments used 8x A100 GPUs (80GB).
  • Setup: Download CALVIN dataset and OpenFlamingo models. Configure paths in robot_flamingo/models/factory.py.
  • Links: OpenFlamingo, CALVIN
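The steps above might look like the following in practice; the clone URL and directory layout are assumptions, so check the repository README for the authoritative instructions.

```shell
# Hypothetical setup sketch -- paths and URLs are illustrative
git clone https://github.com/RoboFlamingo/RoboFlamingo.git
cd RoboFlamingo
pip install -r requirements.txt

# Separately download the CALVIN dataset and an OpenFlamingo checkpoint,
# then edit robot_flamingo/models/factory.py to point at their locations.
```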

Highlighted Details

  • Achieves state-of-the-art performance on the CALVIN benchmark, significantly outperforming existing methods.
  • Supports fine-tuning on single GPU servers, with pre-trained models available on Hugging Face.
  • Enables co-finetuning with both robot data (CALVIN) and general vision-language data (COCO, VQA) to balance task-specific and general VLM capabilities.
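The co-finetuning point can be sketched as a simple batch-mixing loop that interleaves robot imitation batches with general vision-language batches. The interleaving scheme and ratio here are assumptions for illustration, not RoboFlamingo's exact training recipe.

```python
from itertools import cycle

def mix_batches(robot_batches, vl_batches, vl_every=2):
    """Yield robot-data batches, inserting one vision-language batch
    (e.g. COCO/VQA) after every `vl_every` robot batches, so the model
    keeps its general VLM capabilities while learning the task."""
    vl_iter = cycle(vl_batches)
    for i, rb in enumerate(robot_batches, 1):
        yield ("robot", rb)
        if i % vl_every == 0:
            yield ("vl", next(vl_iter))

schedule = [kind for kind, _ in mix_batches(range(4), ["coco", "vqa"])]
print(schedule)  # ['robot', 'robot', 'vl', 'robot', 'robot', 'vl']
```

In a real training loop, the "robot" batches would drive the imitation loss and the "vl" batches a captioning/VQA loss, with gradients accumulated across both.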

Maintenance & Community

The project is associated with the paper "Vision-Language Foundation Models as Effective Robot Imitators." Links to relevant datasets and models are provided.

Licensing & Compatibility

The project utilizes code and datasets licensed under MIT. This generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The diffusion decoder has known dataloader bugs. Although the reported experiments ran on 8x A100 GPUs (80GB), the README also claims single-GPU fine-tuning is feasible; actual memory needs depend heavily on the chosen VLM size and configuration.

Health Check
Last commit

1 year ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
0
Star History
29 stars in the last 90 days
