Robot manipulation via multimodal prompts (ICML'23 paper)
Top 44.3% on sourcepulse
VIMA is the official implementation of a general robot manipulation agent that follows multimodal prompts in which text and images are interleaved. It targets researchers and engineers working on embodied AI and large-scale robotics, supporting a wide spectrum of manipulation tasks with a single, scalable model.
How It Works
VIMA employs an encoder-decoder transformer architecture, using a pretrained language model to encode the multimodal prompt. Visual input is handled in an object-centric way: an off-the-shelf detector turns each image into a set of object tokens (cropped regions plus bounding boxes) rather than a grid of patches. The transformer decoder then autoregressively predicts robot control actions, conditioning on the encoded prompt through cross-attention layers. This design is conceptually simple yet scales well across diverse manipulation tasks.
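The following is a minimal sketch (not the official VIMA code) of that prompt-conditioned decoding idea: prompt tokens stand in for the interleaved text/object embeddings, and a causal transformer decoder cross-attends to them while autoregressively predicting discretized action tokens. All names and dimensions here are illustrative assumptions.

# Illustrative sketch of prompt-conditioned, object-centric action decoding.
# Not the VIMA implementation; module names and sizes are placeholders.
import torch
import torch.nn as nn

class PromptConditionedPolicy(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_action_bins=256):
        super().__init__()
        # Stand-ins for the pretrained text encoder and the object-token
        # embedder (in VIMA these come from a pretrained LM and an
        # off-the-shelf detector feeding crop/bounding-box embeddings).
        self.prompt_proj = nn.Linear(d_model, d_model)
        self.action_embed = nn.Embedding(n_action_bins, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_action_bins)

    def forward(self, prompt_tokens, action_history):
        # prompt_tokens: (B, P, d_model) interleaved text/object embeddings
        # action_history: (B, T) previously emitted discrete action tokens
        memory = self.prompt_proj(prompt_tokens)
        tgt = self.action_embed(action_history)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, memory, tgt_mask=causal)  # cross-attend to prompt
        return self.action_head(h)  # (B, T, n_action_bins) logits

policy = PromptConditionedPolicy()
prompt = torch.randn(1, 12, 256)          # e.g. 12 text + object tokens
actions = torch.randint(0, 256, (1, 5))   # 5 past discretized action tokens
print(policy(prompt, actions).shape)      # torch.Size([1, 5, 256])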
Quick Start & Requirements
pip install git+https://github.com/vimalabs/VIMA
python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}
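For instance, with illustrative values filled in (the actual checkpoint files, evaluation partitions, and task names come from the repository's model zoo and benchmark docs; the ones below are assumptions):

python3 scripts/example.py --ckpt=200M.ckpt --device=cuda:0 --partition=placement_generalization --task=visual_manipulation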
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The live demonstration requires a display and may not work on headless machines. The codebase is focused on the VIMA algorithm and may require additional setup for specific robotic hardware.
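If you must run the example on a headless server, one common workaround (a general technique, not taken from the VIMA docs) is to wrap the command in a virtual framebuffer such as Xvfb:

xvfb-run -a python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}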