RoboFlamingo by RoboFlamingo

Robotics learning framework for language-conditioned robot skills via fine-tuning

created 1 year ago
396 stars

Top 74.0% on sourcepulse

View on GitHub
Project Summary

RoboFlamingo provides a framework for adapting Vision-Language Models (VLMs) to robot imitation learning, enabling robots to perform a variety of language-conditioned skills. It targets researchers and practitioners in robotics and AI, offering a cost-effective and user-friendly approach to fine-tuning robotic policies.

How It Works

RoboFlamingo leverages pre-trained VLMs, integrating vision encoders (like OpenCLIP) with language models (e.g., MPT, LLaMA) via cross-attention mechanisms. This allows the model to process both visual and textual inputs to generate robot actions. The framework supports different decoder types (LSTM, FC, Diffusion, GPT) and offers flexibility in configuring cross-attention frequency, enabling fine-tuning on diverse imitation datasets.
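To make the architecture concrete, here is a minimal sketch of one of the supported decoder types, an LSTM policy head that turns per-timestep fused vision-language features into robot actions. All class names, dimensions, and the arm/gripper output split are illustrative assumptions, not the project's actual API.

```python
import torch
import torch.nn as nn

class LSTMPolicyHead(nn.Module):
    """Hypothetical LSTM decoder: VLM features -> action sequence."""

    def __init__(self, feat_dim=1024, hidden_dim=512, action_dim=7):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.arm_head = nn.Linear(hidden_dim, action_dim - 1)  # e.g. 6-DoF pose delta
        self.gripper_head = nn.Linear(hidden_dim, 1)           # open/close logit

    def forward(self, vlm_feats):
        # vlm_feats: (batch, time, feat_dim) features from the VLM backbone
        h, _ = self.lstm(vlm_feats)
        return self.arm_head(h), self.gripper_head(h)

head = LSTMPolicyHead()
feats = torch.randn(2, 12, 1024)  # 2 trajectories, 12 timesteps each
arm, grip = head(feats)
print(arm.shape, grip.shape)      # torch.Size([2, 12, 6]) torch.Size([2, 12, 1])
```

The FC, Diffusion, and GPT decoder options would replace the LSTM with a feed-forward, diffusion, or autoregressive transformer head over the same feature sequence.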

Quick Start & Requirements

  • Install: Clone the repository and install dependencies via pip install -r requirements.txt.
  • Prerequisites: Requires Python 3.8+, PyTorch, pre-trained OpenFlamingo VLM checkpoints, and the CALVIN dataset. GPU memory requirements depend on model size; the reported experiments used 8x A100 GPUs (80GB).
  • Setup: Download CALVIN dataset and OpenFlamingo models. Configure paths in robot_flamingo/models/factory.py.
  • Links: OpenFlamingo, CALVIN
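The steps above might look like the following in practice; the clone URL and directory layout are assumptions, so check the repository README for the authoritative instructions.

```shell
# Hypothetical setup sketch -- paths and URLs are illustrative
git clone https://github.com/RoboFlamingo/RoboFlamingo.git
cd RoboFlamingo
pip install -r requirements.txt

# Separately download the CALVIN dataset and an OpenFlamingo checkpoint,
# then edit robot_flamingo/models/factory.py to point at their locations.
```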

Highlighted Details

  • Achieves state-of-the-art performance on the CALVIN benchmark, significantly outperforming existing methods.
  • Supports fine-tuning on single GPU servers, with pre-trained models available on Hugging Face.
  • Enables co-finetuning with both robot data (CALVIN) and general vision-language data (COCO, VQA) to balance task-specific and general VLM capabilities.
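The co-finetuning point can be sketched as a simple batch-mixing loop that interleaves robot imitation batches with general vision-language batches. The interleaving scheme and ratio here are assumptions for illustration, not RoboFlamingo's exact training recipe.

```python
from itertools import cycle

def mix_batches(robot_batches, vl_batches, vl_every=2):
    """Yield robot-data batches, inserting one vision-language batch
    (e.g. COCO/VQA) after every `vl_every` robot batches, so the model
    keeps its general VLM capabilities while learning the task."""
    vl_iter = cycle(vl_batches)
    for i, rb in enumerate(robot_batches, 1):
        yield ("robot", rb)
        if i % vl_every == 0:
            yield ("vl", next(vl_iter))

schedule = [kind for kind, _ in mix_batches(range(4), ["coco", "vqa"])]
print(schedule)  # ['robot', 'robot', 'vl', 'robot', 'robot', 'vl']
```

In a real training loop, the "robot" batches would drive the imitation loss and the "vl" batches a captioning/VQA loss, with gradients accumulated across both.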

Maintenance & Community

The project is associated with the paper "Vision-Language Foundation Models as Effective Robot Imitators." Links to relevant datasets and models are provided.

Licensing & Compatibility

The project utilizes code and datasets licensed under MIT. This generally permits commercial use and integration into closed-source projects.

Limitations & Caveats

The diffusion decoder has known dataloader bugs. Although the reported experiments ran on 8x A100 GPUs (80GB), the README also claims single-GPU fine-tuning is feasible; actual memory needs depend heavily on the chosen VLM size and configuration.

Health Check
Last commit

1 year ago

Responsiveness

1 week

Pull Requests (30d)
0
Issues (30d)
0
Star History
29 stars in the last 90 days
