3D embodied generalist agent (research paper)
LEO is an embodied, multi-modal generalist agent designed for interaction within 3D environments, capable of grounding, reasoning, chatting, planning, and acting. It targets researchers and developers in embodied AI and robotics, offering a unified framework for complex 3D world tasks.
How It Works
LEO employs a two-stage training process: 3D vision-language (VL) alignment and 3D vision-language-action (VLA) instruction tuning. This approach leverages extensive, diverse datasets including object captioning, referring expressions, scene captioning, QA, dialogue, task planning, navigation, and manipulation. The architecture integrates a large language model (Vicuna-7B) with 3D perception modules (PointNet++, PointBERT), enabling it to process and act upon 3D scene information and natural language instructions.
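Below is a minimal sketch of how such a pipeline fits together. The module and parameter names (PointCloudEncoder, LeoStyleAgent, feat_dim) are illustrative assumptions, not LEO's actual API; the real codebase is structured differently.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Stand-in for a PointNet++/PointBERT backbone: maps per-object
    point clouds to one feature vector per object."""
    def __init__(self, feat_dim=768):
        super().__init__()
        # A real backbone uses set-abstraction or transformer layers;
        # a pooled MLP keeps this sketch self-contained.
        self.mlp = nn.Sequential(nn.Linear(6, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, points):          # points: (B, n_obj, n_pts, 6) xyz+rgb
        feats = self.mlp(points)        # (B, n_obj, n_pts, feat_dim)
        return feats.max(dim=2).values  # pool over points -> (B, n_obj, feat_dim)

class LeoStyleAgent(nn.Module):
    """Projected 3D tokens are prepended to the text-instruction embeddings;
    the LLM then autoregressively emits text that is decoded into answers,
    plans, or discretized actions."""
    def __init__(self, llm, llm_dim=4096, feat_dim=768):
        super().__init__()
        self.encoder = PointCloudEncoder(feat_dim)
        self.proj = nn.Linear(feat_dim, llm_dim)  # 3D features -> LLM token space
        self.llm = llm                            # e.g., a LoRA-tuned Vicuna-7B

    def forward(self, points, text_embeds):
        scene_tokens = self.proj(self.encoder(points))          # (B, n_obj, llm_dim)
        inputs = torch.cat([scene_tokens, text_embeds], dim=1)  # scene tokens as prefix
        return self.llm(inputs_embeds=inputs)
```

Roughly, stage 1 (VL alignment) would train the projection on captioning/QA data, while stage 2 (VLA instruction tuning) fine-tunes on the full instruction mix, with actions represented as discretized text tokens.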
Quick Start & Requirements
Create a conda environment (conda create -n leo python=3.9), activate it, and install PyTorch (e.g., 1.12.1 with cudatoolkit=11.3), PEFT (0.5.0), and other dependencies (pip install -r requirements.txt). Point cloud backbones may require manual compilation or downloading pre-compiled files.
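A consolidated sketch of the steps above; the version pins come from the README, but the exact conda/pip invocations are assumptions and may need adjusting for your CUDA setup.

```bash
# Create and activate the environment (Python version per the README).
conda create -n leo python=3.9
conda activate leo

# PyTorch 1.12.1 with CUDA 11.3, as suggested; adjust for your hardware.
conda install pytorch==1.12.1 cudatoolkit=11.3 -c pytorch

# PEFT pinned to 0.5.0, then the remaining dependencies.
pip install peft==0.5.0
pip install -r requirements.txt
```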
Maintenance & Community
The project accompanies an ICML 2024 paper. No further community interaction channels are explicitly listed in the README. The repository was last updated three months ago and is currently marked inactive.
Licensing & Compatibility
The repository does not explicitly state a license. The code is provided for research purposes, and commercial use would require clarification.
Limitations & Caveats
Data and code organization for the Embodied AI (EAI) tasks (navigation and manipulation) are still being released. Installing third-party point cloud libraries may require manual intervention, and the README notes that the accelerate library needs manual modifications for specific functionalities.