Robot manipulation via multimodal prompts (ICML'23 paper)
Top 44.3% on sourcepulse
VIMA is the official implementation of a general robot manipulation agent that follows multimodal prompts in which text and images are interleaved. It targets researchers and engineers working on embodied AI and large-scale robotics, supporting a wide spectrum of manipulation tasks with a single, scalable model.
How It Works
VIMA employs an encoder-decoder transformer architecture, using a pretrained language model to encode the multimodal prompt. Visual input is handled in an object-centric way: an off-the-shelf detector turns each image into a set of object tokens (cropped regions plus bounding boxes) rather than a grid of patches. The transformer decoder then autoregressively predicts robot control actions, conditioning on the encoded prompt through cross-attention layers. This design is conceptually simple yet scales well across diverse manipulation tasks.
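The following is a minimal sketch (not the official VIMA code) of that prompt-conditioned decoding idea: prompt tokens stand in for the interleaved text/object embeddings, and a causal transformer decoder cross-attends to them while autoregressively predicting discretized action tokens. All names and dimensions here are illustrative assumptions.

# Illustrative sketch of prompt-conditioned, object-centric action decoding.
# Not the VIMA implementation; module names and sizes are placeholders.
import torch
import torch.nn as nn

class PromptConditionedPolicy(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_action_bins=256):
        super().__init__()
        # Stand-ins for the pretrained text encoder and the object-token
        # embedder (in VIMA these come from a pretrained LM and an
        # off-the-shelf detector feeding crop/bounding-box embeddings).
        self.prompt_proj = nn.Linear(d_model, d_model)
        self.action_embed = nn.Embedding(n_action_bins, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_action_bins)

    def forward(self, prompt_tokens, action_history):
        # prompt_tokens: (B, P, d_model) interleaved text/object embeddings
        # action_history: (B, T) previously emitted discrete action tokens
        memory = self.prompt_proj(prompt_tokens)
        tgt = self.action_embed(action_history)
        causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.decoder(tgt, memory, tgt_mask=causal)  # cross-attend to prompt
        return self.action_head(h)  # (B, T, n_action_bins) logits

policy = PromptConditionedPolicy()
prompt = torch.randn(1, 12, 256)          # e.g. 12 text + object tokens
actions = torch.randint(0, 256, (1, 5))   # 5 past discretized action tokens
print(policy(prompt, actions).shape)      # torch.Size([1, 5, 256])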
Quick Start & Requirements
pip install git+https://github.com/vimalabs/VIMA
python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}
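For instance, with illustrative values filled in (the actual checkpoint files, evaluation partitions, and task names come from the repository's model zoo and benchmark docs; the ones below are assumptions):

python3 scripts/example.py --ckpt=200M.ckpt --device=cuda:0 --partition=placement_generalization --task=visual_manipulation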
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The live demonstration requires a display and may not work on headless machines. The codebase is focused on the VIMA algorithm and may require additional setup for specific robotic hardware.
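If you must run the example on a headless server, one common workaround (a general technique, not taken from the VIMA docs) is to wrap the command in a virtual framebuffer such as Xvfb:

xvfb-run -a python3 scripts/example.py --ckpt={ckpt_path} --device={device} --partition={eval_level} --task={task}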