3D-VLA: A 3D Vision-Language-Action Generative World Model
This repository provides 3D-VLA, a framework for connecting vision-language-action (VLA) models to the 3D physical world. It enables generative world modeling for embodied AI agents, allowing them to perceive, reason, and act in 3D environments based on natural language instructions and visual input. The target audience includes researchers and developers in robotics, embodied AI, and generative models.
How It Works
3D-VLA integrates 3D perception, reasoning, and action through a generative world model. It leverages a 3D Large Language Model (LLM) and employs "interaction tokens" to interface with the environment. Embodied diffusion models are trained and aligned with the LLM to predict goal images and point clouds, facilitating goal-conditioned generation and planning.
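A minimal sketch of this control flow is given below. It only mirrors the description above (LLM planning with interaction tokens, then diffusion-based goal prediction); the class names, token strings, and function signatures are illustrative placeholders, not the repository's actual API.

```python
# Hypothetical sketch of the 3D-VLA control flow; all names are illustrative.
from dataclasses import dataclass

import numpy as np

# Special "interaction tokens" the 3D LLM emits to interface with the scene.
SCENE_TOKEN = "<scene>"
OBJ_TOKEN = "<obj>"
GOAL_IMG_TOKEN = "<goal_image>"


@dataclass
class GoalPrediction:
    goal_image: np.ndarray  # H x W x 3 predicted goal RGB image
    goal_pcd: np.ndarray    # N x 3 predicted goal point cloud


def plan_with_llm(instruction: str) -> str:
    """Stub for the 3D LLM: wrap the instruction with interaction tokens."""
    return f"{SCENE_TOKEN} {OBJ_TOKEN} {instruction} {GOAL_IMG_TOKEN}"


def diffuse_goal(prompt: str, rng: np.random.Generator) -> GoalPrediction:
    """Stub for the embodied diffusion models: sample a goal image and point cloud."""
    return GoalPrediction(
        goal_image=rng.random((256, 256, 3)),
        goal_pcd=rng.random((4096, 3)),
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompt = plan_with_llm("knock pepsi can over")
    goal = diffuse_goal(prompt, rng)
    # A downstream policy would be conditioned on these predicted goals.
    print(prompt, goal.goal_image.shape, goal.goal_pcd.shape)
```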
Quick Start & Requirements
conda create -n 3dvla python=3.9
conda activate 3dvla
pip install -r requirements.txt

xFormers is recommended for accelerating point cloud diffusion.

Goal image and goal point cloud inference can be run with:

python inference_ldm_goal_image.py --ckpt_folder anyezhy/3dvla-diffusion --image docs/cans.png --text "knock pepsi can over" --save_path result.png
python inference_pe_goal_pcd.py --input_npy docs/point_cloud.npy --text "close bottom drawer" --output_dir SAVE_PATH
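Before running the point cloud script, it can help to sanity-check the input .npy. The layout assumed below (N x 3 XYZ, or N x 6 XYZ plus RGB) is an assumption; this summary does not document the format.

```python
import numpy as np

# Inspect the input point cloud before running goal point cloud inference.
# The expected layout (N x 3 or N x 6) is an assumption, not documented behavior.
pcd = np.load("docs/point_cloud.npy")
assert pcd.ndim == 2 and pcd.shape[1] in (3, 6), f"unexpected shape {pcd.shape}"
xyz = pcd[:, :3]
print(f"{xyz.shape[0]} points, bounds {xyz.min(axis=0)} to {xyz.max(axis=0)}")
```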
Maintenance & Community
The project is associated with UMass-Embodied-AGI. The README does not specify community channels or a roadmap.
Licensing & Compatibility
The README does not explicitly state a license. Because the project depends on third-party libraries, users should verify license compatibility for their intended use, especially for commercial applications.
Limitations & Caveats
The file structure and installation process are subject to future updates. Training scripts require specifying GPU and node counts.