Robotics learning framework for language-conditioned robot skills via fine-tuning
Top 74.0% on sourcepulse
RoboFlamingo provides a framework for adapting Vision-Language Models (VLMs) to robot imitation learning, enabling robots to perform a variety of language-conditioned skills. It targets researchers and practitioners in robotics and AI, offering a cost-effective and user-friendly approach to fine-tuning robotic policies.
How It Works
RoboFlamingo leverages pre-trained VLMs, integrating vision encoders (like OpenCLIP) with language models (e.g., MPT, LLaMA) via cross-attention mechanisms. This allows the model to process both visual and textual inputs to generate robot actions. The framework supports different decoder types (LSTM, FC, Diffusion, GPT) and offers flexibility in configuring cross-attention frequency, enabling fine-tuning on diverse imitation datasets.
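The snippet below is a minimal sketch of the decoder idea described above: per-timestep features from the VLM backbone are fed to a recurrent head that predicts arm and gripper actions. Module and parameter names here are hypothetical illustrations, not the repository's actual classes.

```python
import torch
import torch.nn as nn

class LSTMActionHead(nn.Module):
    """Recurrent policy head: maps per-step VLM features to arm + gripper actions.

    Hypothetical sketch of an LSTM-style decoder, not RoboFlamingo's actual code.
    """

    def __init__(self, feat_dim=1024, hidden_dim=512, action_dim=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.arm_head = nn.Linear(hidden_dim, action_dim)   # e.g. 6-DoF end-effector delta
        self.gripper_head = nn.Linear(hidden_dim, 1)        # open/close logit

    def forward(self, vlm_feats, state=None):
        # vlm_feats: (batch, time, feat_dim) features produced by the VLM backbone
        h, state = self.lstm(vlm_feats, state)
        return self.arm_head(h), torch.sigmoid(self.gripper_head(h)), state

# Toy forward pass on random features standing in for VLM outputs.
feats = torch.randn(2, 8, 1024)                  # 2 trajectories, 8 timesteps
arm, gripper, _ = LSTMActionHead()(feats)
print(arm.shape, gripper.shape)                  # (2, 8, 6) and (2, 8, 1)
```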
Quick Start & Requirements
Install dependencies with pip install -r requirements.txt.
Model construction (vision encoder, language model, and decoder type) is configured in robot_flamingo/models/factory.py.
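The call below is a sketch of how a model might be instantiated, assuming an OpenFlamingo-style create_model_and_transforms entry point; the actual argument names and defaults in robot_flamingo/models/factory.py may differ.

```python
# Hypothetical usage sketch; check the factory's real signature before running.
from robot_flamingo.models.factory import create_model_and_transforms

model, image_processor, tokenizer = create_model_and_transforms(
    clip_vision_encoder_path="ViT-L-14",                     # OpenCLIP vision backbone
    clip_vision_encoder_pretrained="openai",
    lang_encoder_path="mosaicml/mpt-1b-redpajama-200b",      # example MPT language model
    tokenizer_path="mosaicml/mpt-1b-redpajama-200b",
    cross_attn_every_n_layers=1,                             # cross-attention frequency
)
```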
Highlighted Details
Maintenance & Community
The project is associated with the paper "Vision-Language Foundation Models as Effective Robot Imitators." Links to relevant datasets and models are provided.
Licensing & Compatibility
The project's code and datasets are released under the MIT license, which generally permits commercial use and integration into closed-source projects.
Limitations & Caveats
The diffusion decoder currently has known dataloader bugs. The README reports experiments on 8x A100 GPUs while also claiming single-GPU feasibility; actual hardware requirements likely depend on the chosen VLM size and configuration.