3D-VLA  by UMass-Embodied-AGI

Research paper for a 3D vision-language-action generative world model

created 1 year ago
548 stars

Top 59.1% on sourcepulse

Project Summary

This repository provides 3D-VLA, a framework for connecting vision-language-action (VLA) models to the 3D physical world. It enables generative world modeling for embodied AI agents, allowing them to perceive, reason, and act in 3D environments based on natural language instructions and visual input. The target audience includes researchers and developers in robotics, embodied AI, and generative models.

How It Works

3D-VLA integrates 3D perception, reasoning, and action through a generative world model. It leverages a 3D Large Language Model (LLM) and employs "interaction tokens" to interface with the environment. Embodied diffusion models are trained and aligned with the LLM to predict goal images and point clouds, facilitating goal-conditioned generation and planning.
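The summary above does not say what an interaction token looks like. As a purely illustrative sketch, assume such tokens are special vocabulary entries that mark an object mention and encode its 3D position as discrete bins. The token names (`<obj>`, `<locN>`), the 256-bin resolution, and the helper functions below are all hypothetical and are not taken from the repository.

```python
# Hypothetical sketch: token names and bin count are illustrative assumptions,
# not the 3D-VLA repository's actual vocabulary or API.
NUM_BINS = 256


def quantize(value, low, high, num_bins=NUM_BINS):
    """Map a continuous value in [low, high] to an integer bin index."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)  # clamp out-of-range values
    ratio = (clipped - low) / (high - low)
    # The upper bound maps to the last bin rather than to num_bins.
    return min(int(ratio * num_bins), num_bins - 1)


def location_tokens(point, bounds):
    """Turn an (x, y, z) point into one discrete location token per axis."""
    return [f"<loc{quantize(v, lo, hi)}>" for v, (lo, hi) in zip(point, bounds)]


def tag_object(name, point, bounds):
    """Wrap an object name in marker tokens and append its location tokens."""
    return f"<obj> {name} </obj> " + " ".join(location_tokens(point, bounds))
```

With scene bounds of `[(-1.0, 1.0)] * 3`, calling `tag_object("pepsi can", (0.0, -1.0, 1.0), bounds)` yields `<obj> pepsi can </obj> <loc128> <loc0> <loc255>`. Discretizing coordinates this way lets a language model read and emit spatial information as ordinary tokens, which is the general idea behind interfacing an LLM with a 3D scene.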

Quick Start & Requirements

  • Installation: run conda create -n 3dvla python=3.9, then conda activate 3dvla, then pip install -r requirements.txt
  • Prerequisites: Python 3.9 and PyTorch. The README does not list CUDA explicitly, but the diffusion models and LLM training effectively require a CUDA-capable GPU. xFormers is recommended for accelerating point cloud diffusion.
  • Models: Pretrained diffusion models for goal images and point clouds are available on Hugging Face.
  • Demos:
    • Goal Image Generation: python inference_ldm_goal_image.py --ckpt_folder anyezhy/3dvla-diffusion --image docs/cans.png --text "knock pepsi can over" --save_path result.png
    • Goal Point Cloud Generation: python inference_pe_goal_pcd.py --input_npy docs/point_cloud.npy --text "close bottom drawer" --output_dir SAVE_PATH
  • Documentation: Model Card
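The two demo commands above can also be assembled from a script, which helps when running them over many inputs. The following is a minimal sketch that uses only the script names and flags shown in the demos. The helper names (`goal_image_command`, `goal_pcd_command`, `run_demo`) are hypothetical, and the sketch assumes it is run from the repository root inside the activated 3dvla environment.

```python
import shlex
import subprocess


def goal_image_command(image, text, save_path,
                       ckpt_folder="anyezhy/3dvla-diffusion"):
    """Build the argument list for the goal image demo (flags from the README)."""
    return [
        "python", "inference_ldm_goal_image.py",
        "--ckpt_folder", ckpt_folder,
        "--image", image,
        "--text", text,
        "--save_path", save_path,
    ]


def goal_pcd_command(input_npy, text, output_dir):
    """Build the argument list for the goal point cloud demo."""
    return [
        "python", "inference_pe_goal_pcd.py",
        "--input_npy", input_npy,
        "--text", text,
        "--output_dir", output_dir,
    ]


def run_demo(command, dry_run=True):
    """Return the shell-quoted command; execute it only when dry_run is False."""
    printable = shlex.join(command)
    if not dry_run:
        # check=True raises CalledProcessError if the script exits non-zero.
        subprocess.run(command, check=True)
    return printable
```

For example, `run_demo(goal_image_command("docs/cans.png", "knock pepsi can over", "result.png"))` returns the quoted command line without executing anything. Passing `dry_run=False` launches the script. Building the command as a list avoids shell-quoting mistakes in instructions that contain spaces.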

Highlighted Details

  • Accepted to ICML 2024.
  • Released pretrained diffusion models for goal image and point cloud generation.
  • Leverages existing codebases and models from LAVIS, 3D-LLM, Diffusers, and Point-E.
  • Supports inclusion of depth information for goal image generation.

Maintenance & Community

The project is associated with UMass-Embodied-AGI. The README does not specify community channels or a roadmap.

Licensing & Compatibility

The README does not explicitly state a license. Given the reliance on other libraries, users should verify compatibility with their intended use, especially for commercial applications.

Limitations & Caveats

The file structure and installation process are subject to future updates. Training scripts require specifying GPU and node counts.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

61 stars in the last 90 days
