embodied-generalist by embodied-generalist

3D embodied generalist agent (research paper)

Created 2 years ago
475 stars

Top 64.4% on SourcePulse

Project Summary

LEO is an embodied, multi-modal generalist agent designed for interaction within 3D environments, capable of grounding, reasoning, chatting, planning, and acting. It targets researchers and developers in embodied AI and robotics, offering a unified framework for complex 3D world tasks.

How It Works

LEO employs a two-stage training process: 3D vision-language (VL) alignment and 3D vision-language-action (VLA) instruction tuning. This approach leverages extensive, diverse datasets including object captioning, referring expressions, scene captioning, QA, dialogue, task planning, navigation, and manipulation. The architecture integrates a large language model (Vicuna-7B) with 3D perception modules (PointNet++, PointBERT), enabling it to process and act upon 3D scene information and natural language instructions.

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n leo python=3.9), activate it, and install PyTorch (e.g., 1.12.1 with cudatoolkit=11.3), PEFT (0.5.0), and other dependencies (pip install -r requirements.txt). Point cloud backbones may require manual compilation or downloading pre-compiled files.
  • Prerequisites: Python 3.9, PyTorch 1.12.1, CUDA 11.3, PEFT 0.5.0. Requires significant disk space for scan data (under 10GB for streamlined data) and annotations.
  • Resources: Training requires substantial GPU resources (e.g., NVIDIA A100/A800).
  • Links: GitHub Repo, Huggingface Demo, Model Weights, Data Preparation.
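The install steps above can be sketched as a shell session. This is a minimal sketch: the repository URL, the `-c pytorch` channel, and the exact PyTorch package pins are assumptions based on the versions listed above; point cloud backbones may still need manual compilation afterwards, as noted.

```shell
# Sketch of the Quick Start steps; exact wheel/channel may differ per platform.
git clone https://github.com/embodied-generalist/embodied-generalist.git  # assumed URL
cd embodied-generalist

# Conda environment with the pinned Python version
conda create -n leo python=3.9 -y
conda activate leo

# PyTorch 1.12.1 with CUDA 11.3 (channel is an assumption; see pytorch.org for your setup)
conda install pytorch==1.12.1 cudatoolkit=11.3 -c pytorch -y

# PEFT at the pinned version, then the remaining dependencies
pip install peft==0.5.0
pip install -r requirements.txt
```

If a point cloud backbone fails to build here, the README points to pre-compiled files as an alternative to compiling from source.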

Highlighted Details

  • Official code for ICML 2024 paper "An Embodied Generalist Agent in 3D World".
  • Trained on over 1.5 million data points across various 3D vision-language and action tasks.
  • Supports multiple 3D backbones including PointNet++ and PointBERT.
  • Includes scripts for two-stage training, inference, and scaling law analysis.

Maintenance & Community

The project is associated with ICML 2024. Further community interaction channels are not explicitly listed in the README.

Licensing & Compatibility

The repository does not explicitly state a license. The code is provided for research purposes; commercial use would require clarification from the maintainers.

Limitations & Caveats

Data and organization for the Embodied AI (EAI) tasks (navigation and manipulation) are still being released. Installing third-party point cloud libraries may require manual intervention, and the README notes that specific functionalities depend on manual modifications to the accelerate library.

Health Check

  • Last Commit: 10 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
