3D embodied generalist agent (research paper)
LEO is an embodied, multi-modal generalist agent designed for interaction within 3D environments, capable of grounding, reasoning, chatting, planning, and acting. It targets researchers and developers in embodied AI and robotics, offering a unified framework for complex 3D world tasks.
How It Works
LEO employs a two-stage training process: 3D vision-language (VL) alignment and 3D vision-language-action (VLA) instruction tuning. This approach leverages extensive, diverse datasets including object captioning, referring expressions, scene captioning, QA, dialogue, task planning, navigation, and manipulation. The architecture integrates a large language model (Vicuna-7B) with 3D perception modules (PointNet++, PointBERT), enabling it to process and act upon 3D scene information and natural language instructions.
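Below is a minimal sketch of how such a pipeline fits together. The module and parameter names (PointCloudEncoder, LeoStyleAgent, feat_dim) are illustrative assumptions, not LEO's actual API; the real codebase is structured differently.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Stand-in for a PointNet++/PointBERT backbone: maps per-object
    point clouds to one feature vector per object."""
    def __init__(self, feat_dim=768):
        super().__init__()
        # A real backbone uses set-abstraction or transformer layers;
        # a pooled MLP keeps this sketch self-contained.
        self.mlp = nn.Sequential(nn.Linear(6, 256), nn.ReLU(),
                                 nn.Linear(256, feat_dim))

    def forward(self, points):          # points: (B, n_obj, n_pts, 6) xyz+rgb
        feats = self.mlp(points)        # (B, n_obj, n_pts, feat_dim)
        return feats.max(dim=2).values  # pool over points -> (B, n_obj, feat_dim)

class LeoStyleAgent(nn.Module):
    """Projected 3D tokens are prepended to the text-instruction embeddings;
    the LLM then autoregressively emits text that is decoded into answers,
    plans, or discretized actions."""
    def __init__(self, llm, llm_dim=4096, feat_dim=768):
        super().__init__()
        self.encoder = PointCloudEncoder(feat_dim)
        self.proj = nn.Linear(feat_dim, llm_dim)  # 3D features -> LLM token space
        self.llm = llm                            # e.g., a LoRA-tuned Vicuna-7B

    def forward(self, points, text_embeds):
        scene_tokens = self.proj(self.encoder(points))          # (B, n_obj, llm_dim)
        inputs = torch.cat([scene_tokens, text_embeds], dim=1)  # scene tokens as prefix
        return self.llm(inputs_embeds=inputs)
```

Roughly, stage 1 (VL alignment) would train the projection on captioning/QA data, while stage 2 (VLA instruction tuning) fine-tunes on the full instruction mix, with actions represented as discretized text tokens.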
Quick Start & Requirements
Create a conda environment (conda create -n leo python=3.9), activate it, and install PyTorch (e.g., 1.12.1 with cudatoolkit=11.3), PEFT (0.5.0), and other dependencies (pip install -r requirements.txt). Point cloud backbones may require manual compilation or downloading pre-compiled files.
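A consolidated sketch of the steps above; the version pins come from the README, but the exact conda/pip invocations are assumptions and may need adjusting for your CUDA setup.

```bash
# Create and activate the environment (Python version per the README).
conda create -n leo python=3.9
conda activate leo

# PyTorch 1.12.1 with CUDA 11.3, as suggested; adjust for your hardware.
conda install pytorch==1.12.1 cudatoolkit=11.3 -c pytorch

# PEFT pinned to 0.5.0, then the remaining dependencies.
pip install peft==0.5.0
pip install -r requirements.txt
```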
Maintenance & Community
The project accompanies an ICML 2024 paper. No further community interaction channels are explicitly listed in the README. The repository was last updated three months ago and is currently marked inactive.
Licensing & Compatibility
The repository does not explicitly state a license. The code is provided for research purposes, and commercial use would require clarification.
Limitations & Caveats
Data and code organization for the Embodied AI (EAI) tasks (navigation and manipulation) are still being released. Installing third-party point cloud libraries may require manual intervention, and the README notes that the accelerate library needs manual modifications for specific functionalities.