embodied-generalist by embodied-generalist

3D embodied generalist agent (research paper)

created 1 year ago
447 stars

Top 68.2% on sourcepulse

Project Summary

LEO is an embodied, multi-modal generalist agent designed for interaction within 3D environments, capable of grounding, reasoning, chatting, planning, and acting. It targets researchers and developers in embodied AI and robotics, offering a unified framework for complex 3D world tasks.

How It Works

LEO is trained in two stages: 3D vision-language (VL) alignment followed by 3D vision-language-action (VLA) instruction tuning. Training draws on a large, diverse dataset spanning object captioning, referring expressions, scene captioning, question answering, dialogue, task planning, navigation, and manipulation. The architecture couples a large language model (Vicuna-7B) with 3D point cloud encoders (PointNet++, PointBERT), allowing the agent to ground natural-language instructions in 3D scene representations and produce responses or actions.
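The snippet below is a minimal sketch of this design: per-object point clouds are encoded, projected into the LLM's embedding space, and prepended to the text tokens as a prefix. The encoder and module names are illustrative stand-ins (the real model uses PointNet++/PointBERT and Vicuna-7B) and do not reflect the repository's actual API.

```python
# Minimal sketch of LEO-style prefix fusion: 3D object features are projected
# into the LLM embedding space and prepended to the text tokens. All modules
# and names here are placeholders, not the repository's implementation.
import torch
import torch.nn as nn


class PointEncoderStub(nn.Module):
    """Stand-in for PointNet++/PointBERT: maps (B, N_obj, N_pts, 3) -> (B, N_obj, D)."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # Per-point features, max-pooled over the points of each object.
        return self.mlp(pts).max(dim=2).values


class SceneToLLMPrefix(nn.Module):
    """Projects per-object 3D features to the LLM hidden size and builds fused input embeddings."""

    def __init__(self, feat_dim: int = 256, llm_hidden: int = 4096):
        super().__init__()
        self.point_encoder = PointEncoderStub(feat_dim)
        self.projector = nn.Linear(feat_dim, llm_hidden)

    def forward(self, object_points: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        obj_tokens = self.projector(self.point_encoder(object_points))  # (B, N_obj, H)
        # Prepend 3D object tokens to the text embeddings; the fused sequence
        # would then be consumed by the (LoRA-tuned) language model.
        return torch.cat([obj_tokens, text_embeds], dim=1)


if __name__ == "__main__":
    fusion = SceneToLLMPrefix()
    object_points = torch.randn(1, 8, 1024, 3)  # 8 objects, 1024 points each
    text_embeds = torch.randn(1, 32, 4096)      # 32 text tokens in LLM space
    print(fusion(object_points, text_embeds).shape)  # torch.Size([1, 40, 4096])
```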

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n leo python=3.9), activate it, and install PyTorch (e.g., 1.12.1 with cudatoolkit=11.3), PEFT (0.5.0), and other dependencies (pip install -r requirements.txt). Point cloud backbones may require manual compilation or downloading pre-compiled files.
  • Prerequisites: Python 3.9, PyTorch 1.12.1, CUDA 11.3, PEFT 0.5.0 (an optional version check follows this list). Requires disk space for scan data and annotations (under 10 GB with the streamlined data).
  • Resources: Training requires substantial GPU resources (e.g., NVIDIA A100/A800).
  • Links: GitHub Repo, Huggingface Demo, Model Weights, Data Preparation.
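As referenced in the Prerequisites item above, the following is an optional sanity check, not part of the repository, for the versions listed there.

```python
# Optional environment check (not from the repository) against the versions
# stated above: Python 3.9, PyTorch 1.12.1 with CUDA 11.3, PEFT 0.5.0.
import sys

import peft
import torch

print("Python :", sys.version.split()[0])   # expect 3.9.x
print("PyTorch:", torch.__version__)        # expect 1.12.1
print("CUDA   :", torch.version.cuda)       # expect 11.3
print("PEFT   :", peft.__version__)         # expect 0.5.0
print("GPU OK :", torch.cuda.is_available())
```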

Highlighted Details

  • Official code for ICML 2024 paper "An Embodied Generalist Agent in 3D World".
  • Trained on over 1.5 million data points across various 3D vision-language and action tasks.
  • Supports multiple 3D backbones including PointNet++ and PointBERT.
  • Includes scripts for two-stage training, inference, and scaling law analysis (a generic illustration of such a fit appears below).
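The repository's scaling-law script is not documented in this summary. As a generic, self-contained illustration, a power-law curve (loss ≈ a · N^(−b)) can be fit to (data size, loss) pairs in log-log space; the numbers below are synthetic.

```python
# Generic scaling-law fit illustration; NOT the repository's analysis script,
# and the losses below are synthetic values, not reported results.
import numpy as np

rng = np.random.default_rng(0)
data_sizes = np.array([1e4, 5e4, 1e5, 5e5, 1.5e6])
# Synthetic losses following a pure power law with small multiplicative noise.
val_losses = 9.0 * data_sizes ** -0.25 * np.exp(rng.normal(0.0, 0.01, size=data_sizes.size))

# Fit log(loss) = log(a) - b * log(N) by least squares in log-log space.
slope, intercept = np.polyfit(np.log(data_sizes), np.log(val_losses), deg=1)
a, b = np.exp(intercept), -slope
print(f"loss ≈ {a:.2f} * N^(-{b:.3f})")
```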

Maintenance & Community

The project accompanies an ICML 2024 paper. The README does not list dedicated community interaction channels.

Licensing & Compatibility

The repository does not explicitly state a license. The code is provided for research purposes; commercial use would require clarification from the authors.

Limitations & Caveats

Data and code organization for the embodied AI (EAI) tasks (navigation and manipulation) are still being released. Installing third-party point cloud libraries may require manual intervention. The README also notes that some functionality requires manual modifications to the accelerate library.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

15 stars in the last 90 days
