VIMA by vimalabs

Robot manipulation via multimodal prompts (ICML'23 paper)

Created 2 years ago
824 stars

Top 43.0% on SourcePulse

View on GitHub
Project Summary

VIMA provides an official implementation of a general robot manipulation system that uses multimodal prompts (text and vision) to control robotic agents. It targets researchers and engineers working on embodied AI and large-scale robotics, enabling a wide spectrum of tasks with a single, scalable model.

How It Works

VIMA employs an encoder-decoder transformer architecture, leveraging a pretrained language model for encoding multimodal prompts. Visual information is processed via an object-centric approach, using off-the-shelf detectors to flatten images into object tokens. The transformer decoder autoregressively generates robot control actions, conditioned on the prompt through cross-attention layers. This design offers a conceptually simple yet scalable solution for diverse robot manipulation tasks.
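
The following is a minimal PyTorch sketch of that conditioning pattern, not the official implementation: the class name, dimensions, and single discretized action head are illustrative placeholders, and the real model additionally uses a pretrained language model to encode the prompt.

    import torch
    import torch.nn as nn

    class PromptConditionedPolicy(nn.Module):  # hypothetical name, for illustration only
        def __init__(self, d_model=256, n_heads=8, n_layers=4, n_action_bins=256):
            super().__init__()
            # Each decoder layer self-attends over the trajectory history and
            # cross-attends over the encoded prompt tokens (the "memory").
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, n_layers)
            self.action_head = nn.Linear(d_model, n_action_bins)  # discretized action logits

        def forward(self, history_tokens, prompt_tokens):
            # history_tokens: (B, T, d_model) object + past-action tokens from the rollout
            # prompt_tokens:  (B, L, d_model) encoded text + image-object prompt tokens
            T = history_tokens.size(1)
            # Causal mask so step t cannot attend to future steps (autoregressive decoding).
            causal_mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
            h = self.decoder(history_tokens, prompt_tokens, tgt_mask=causal_mask)
            return self.action_head(h)  # per-step logits over discretized actions

    # Toy usage: one trajectory of 10 steps conditioned on a 16-token prompt.
    policy = PromptConditionedPolicy()
    logits = policy(torch.randn(1, 10, 256), torch.randn(1, 16, 256))
    print(logits.shape)  # torch.Size([1, 10, 256])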

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/vimalabs/VIMA
  • Requires Python ≥ 3.9; tested on Ubuntu 20.04.
  • Pretrained models are available on Hugging Face.
  • Running the live demo requires installing VimaBench and a display for the PyBullet GUI.
  • Demo command (an example invocation with the placeholders filled in follows this list): python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}
  • Links: Website, arXiv, VIMA-Bench, Training Data
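
As a concrete example, the demo placeholders might be filled in as below; the checkpoint filename, evaluation partition, and task name are illustrative assumptions and should be checked against the Hugging Face model card and the VIMA-Bench documentation:

    python3 scripts/example.py --ckpt=200M.ckpt --device=cuda:0 \
        --partition=placement_generalization --task=visual_manipulation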

Highlighted Details

  • Uniform sequence I/O interface via multimodal prompts.
  • Scalable multi-task robot learner.
  • Object-centric approach instead of raw pixels (see the sketch after this list).
  • Baseline implementations (VIMA-Gato, VIMA-Flamingo, VIMA-GPT) included.
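
To make the prompt interface concrete, here is a toy sketch of how a multimodal prompt can be flattened into a single interleaved sequence of text tokens and detector-produced object tokens; the dataclasses and function below are illustrative assumptions, not the project's actual data structures.

    from dataclasses import dataclass
    from typing import List, Tuple, Union

    @dataclass
    class TextToken:
        word: str

    @dataclass
    class ObjectToken:
        bbox: Tuple[int, int, int, int]  # detector bounding box: (x_min, y_min, x_max, y_max)
        crop_id: str                     # handle to the cropped image patch for this object

    PromptToken = Union[TextToken, ObjectToken]

    def flatten_prompt(segments: List[Union[str, List[ObjectToken]]]) -> List[PromptToken]:
        """Interleave text segments and per-image object tokens into one sequence."""
        tokens: List[PromptToken] = []
        for seg in segments:
            if isinstance(seg, str):
                tokens.extend(TextToken(w) for w in seg.split())
            else:  # a list of detected objects standing in for one prompt image
                tokens.extend(seg)
        return tokens

    # "Put {image of a red block} into {image of a green bowl}"
    prompt = flatten_prompt([
        "Put",
        [ObjectToken((12, 40, 60, 88), "red_block_crop")],
        "into",
        [ObjectToken((100, 30, 160, 90), "green_bowl_crop")],
    ])
    print(len(prompt))  # 4 tokens: "Put", <red block>, "into", <green bowl>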

Maintenance & Community

  • The codebase accompanies an ICML'23 paper.
  • Links to related projects like VIMA-Bench and VIMA-Data are provided.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The live demonstration requires a display and may not work on headless machines. The codebase is focused on the VIMA algorithm and may require additional setup for specific robotic hardware.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 5 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Wing Lian (founder of Axolotl AI), and 10 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

4k stars · Top 0.1% on SourcePulse
Created 2 years ago · Updated 1 year ago