3D-VLA: A 3D Vision-Language-Action Generative World Model
This repository provides 3D-VLA, a framework for connecting vision-language-action (VLA) models to the 3D physical world. It enables generative world modeling for embodied AI agents, allowing them to perceive, reason, and act in 3D environments based on natural language instructions and visual input. The target audience includes researchers and developers in robotics, embodied AI, and generative models.
How It Works
3D-VLA integrates 3D perception, reasoning, and action through a generative world model. It leverages a 3D Large Language Model (LLM) and employs "interaction tokens" to interface with the environment. Embodied diffusion models are trained and aligned with the LLM to predict goal images and point clouds, facilitating goal-conditioned generation and planning.
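A minimal sketch of this control flow is given below. It only mirrors the description above (LLM planning with interaction tokens, then diffusion-based goal prediction); the class names, token strings, and function signatures are illustrative placeholders, not the repository's actual API.

```python
# Hypothetical sketch of the 3D-VLA control flow; all names are illustrative.
from dataclasses import dataclass

import numpy as np

# Special "interaction tokens" the 3D LLM emits to interface with the scene.
SCENE_TOKEN = "<scene>"
OBJ_TOKEN = "<obj>"
GOAL_IMG_TOKEN = "<goal_image>"


@dataclass
class GoalPrediction:
    goal_image: np.ndarray  # H x W x 3 predicted goal RGB image
    goal_pcd: np.ndarray    # N x 3 predicted goal point cloud


def plan_with_llm(instruction: str) -> str:
    """Stub for the 3D LLM: wrap the instruction with interaction tokens."""
    return f"{SCENE_TOKEN} {OBJ_TOKEN} {instruction} {GOAL_IMG_TOKEN}"


def diffuse_goal(prompt: str, rng: np.random.Generator) -> GoalPrediction:
    """Stub for the embodied diffusion models: sample a goal image and point cloud."""
    return GoalPrediction(
        goal_image=rng.random((256, 256, 3)),
        goal_pcd=rng.random((4096, 3)),
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompt = plan_with_llm("knock pepsi can over")
    goal = diffuse_goal(prompt, rng)
    # A downstream policy would be conditioned on these predicted goals.
    print(prompt, goal.goal_image.shape, goal.goal_pcd.shape)
```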
Quick Start & Requirements
conda create -n 3dvla python=3.9
conda activate 3dvla
pip install -r requirements.txt

xFormers is recommended for accelerating point cloud diffusion.

Goal image and goal point cloud inference can be run with:

python inference_ldm_goal_image.py --ckpt_folder anyezhy/3dvla-diffusion --image docs/cans.png --text "knock pepsi can over" --save_path result.png
python inference_pe_goal_pcd.py --input_npy docs/point_cloud.npy --text "close bottom drawer" --output_dir SAVE_PATH
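Before running the point cloud script, it can help to sanity-check the input .npy. The layout assumed below (N x 3 XYZ, or N x 6 XYZ plus RGB) is an assumption; this summary does not document the format.

```python
import numpy as np

# Inspect the input point cloud before running goal point cloud inference.
# The expected layout (N x 3 or N x 6) is an assumption, not documented behavior.
pcd = np.load("docs/point_cloud.npy")
assert pcd.ndim == 2 and pcd.shape[1] in (3, 6), f"unexpected shape {pcd.shape}"
xyz = pcd[:, :3]
print(f"{xyz.shape[0]} points, bounds {xyz.min(axis=0)} to {xyz.max(axis=0)}")
```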
Maintenance & Community
The project is associated with UMass-Embodied-AGI. The README does not specify community channels or a roadmap.
Licensing & Compatibility
The README does not explicitly state a license. Because the project depends on third-party libraries, users should verify license compatibility for their intended use, especially for commercial applications.
Limitations & Caveats
The file structure and installation process are subject to future updates. Training scripts require specifying GPU and node counts.