VIMA by vimalabs

Robot manipulation via multimodal prompts (ICML'23 paper)

created 2 years ago
818 stars

Top 44.3% on sourcepulse

View on GitHub
Project Summary

VIMA provides an official implementation of a general robot manipulation system that uses multimodal prompts (text and vision) to control robotic agents. It targets researchers and engineers working on embodied AI and large-scale robotics, enabling a wide spectrum of tasks with a single, scalable model.

How It Works

VIMA employs an encoder-decoder transformer architecture, leveraging a pretrained language model for encoding multimodal prompts. Visual information is processed via an object-centric approach, using off-the-shelf detectors to flatten images into object tokens. The transformer decoder autoregressively generates robot control actions, conditioned on the prompt through cross-attention layers. This design offers a conceptually simple yet scalable solution for diverse robot manipulation tasks.
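
To make this conditioning pattern concrete, here is a minimal PyTorch sketch of an autoregressive decoder whose every step cross-attends to encoded prompt tokens. All class names, dimensions, and the discretized action head are illustrative assumptions, not VIMA's actual modules:

    import torch
    import torch.nn as nn

    class PromptConditionedDecoder(nn.Module):
        """Autoregressive decoder conditioned on multimodal prompt tokens."""
        def __init__(self, d_model=512, n_heads=8, n_layers=4, n_action_bins=256):
            super().__init__()
            layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
            self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
            self.action_head = nn.Linear(d_model, n_action_bins)  # discretized actions

        def forward(self, action_tokens, prompt_tokens):
            # Causal mask: step t attends only to steps <= t (autoregressive).
            T = action_tokens.size(1)
            mask = nn.Transformer.generate_square_subsequent_mask(T)
            # Cross-attention to prompt_tokens injects the encoded text and
            # object tokens into every decoding step.
            h = self.decoder(action_tokens, prompt_tokens, tgt_mask=mask)
            return self.action_head(h)

    # Usage: a 20-token encoded prompt conditions a 5-step action rollout.
    prompt = torch.randn(1, 20, 512)
    actions = torch.randn(1, 5, 512)
    logits = PromptConditionedDecoder()(actions, prompt)  # shape (1, 5, 256)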

Quick Start & Requirements

  • Install via pip: pip install git+https://github.com/vimalabs/VIMA
  • Requires Python >= 3.9; tested on Ubuntu 20.04.
  • Pretrained models are available on Hugging Face.
  • A live demo requires installing VimaBench and a display for the PyBullet GUI (a hedged sketch of the loop the demo script runs appears after this list).
  • Demo command: python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}
  • Links: Website, arXiv, VIMA-Bench, Training Data
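
For orientation, the following sketch shows the overall shape of the evaluation loop that a demo script like scripts/example.py drives. Every name here (run_episode, policy.act, env.prompt) is a hypothetical placeholder, not the repository's actual API; the real script also loads the checkpoint named by --ckpt and selects the evaluation partition and task via the flags shown above:

    import torch

    def run_episode(policy, env):
        # env.prompt: the multimodal task specification (interleaved text and
        # reference images) for this episode. Hypothetical attribute name.
        obs = env.reset()
        prompt = env.prompt
        done = False
        while not done:
            with torch.no_grad():
                # Prompt-conditioned action prediction (hypothetical method).
                action = policy.act(prompt, obs)
            obs, reward, done, info = env.step(action)
        return info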

Highlighted Details

  • Uniform sequence I/O interface via multimodal prompts.
  • Scalable multi-task robot learner.
  • Object-centric representation: off-the-shelf detectors flatten images into object tokens instead of raw pixels (see the sketch after this list).
  • Baseline implementations (VIMA-Gato, VIMA-Flamingo, VIMA-GPT) included.
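
The object-centric bullet above can be made concrete: per detected object, a cropped image patch and its bounding box are embedded and fused into a single token for the prompt encoder. The encoders and dimensions below are illustrative assumptions, not VIMA's actual tokenizer:

    import torch
    import torch.nn as nn

    class ObjectTokenizer(nn.Module):
        """Turns per-object detector outputs into transformer tokens."""
        def __init__(self, d_model=512):
            super().__init__()
            # Illustrative encoders: a tiny CNN for the cropped patch and a
            # linear layer for the normalized bounding box.
            self.crop_encoder = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, d_model // 2),
            )
            self.bbox_encoder = nn.Linear(4, d_model // 2)

        def forward(self, crops, bboxes):
            # crops: (N, 3, H, W) patches; bboxes: (N, 4) normalized to [0, 1].
            # Returns one d_model-dim token per detected object.
            return torch.cat(
                [self.crop_encoder(crops), self.bbox_encoder(bboxes)], dim=-1
            )

    tokens = ObjectTokenizer()(torch.randn(8, 3, 32, 32), torch.rand(8, 4))
    print(tokens.shape)  # torch.Size([8, 512])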

Maintenance & Community

  • The project is associated with ICML'23.
  • Links to related projects like VIMA-Bench and VIMA-Data are provided.

Licensing & Compatibility

  • Licensed under the MIT License.
  • Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

The live demonstration requires a display for the PyBullet GUI and may not work on headless machines without a virtual framebuffer such as Xvfb. The codebase focuses on the VIMA algorithm itself and may require additional setup for specific robotic hardware.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 19 stars in the last 90 days
