LLM for 2D/3D vision-language tasks
LLaVA-3D empowers Large Multimodal Models (LMMs) with 3D spatial awareness, enabling them to understand and interact with 3D environments. It targets researchers and developers working on 3D vision-and-language tasks, offering state-of-the-art performance on 3D benchmarks while maintaining 2D capabilities.
How It Works
LLaVA-3D builds on the LLaVA architecture by introducing "3D Patches": 2D patch visual tokens derived from multi-view images, augmented with 3D position embeddings. This lets the model process 3D spatial information directly; the 3D Patches are compressed by 3D pooling, mapped into the LLM's embedding space by a projection layer, and aligned with language using 3D vision-language data. The approach is notable for its simplicity and its effectiveness in adding 3D awareness.
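The construction can be pictured with a short PyTorch sketch. This is a minimal illustration under assumed shapes and layer choices, not the repository's implementation: the `ThreeDPatchEncoder` name, the MLP position encoder, the 1024/4096 dimensions, and the fixed-group pooling are all placeholders.

```python
import torch
import torch.nn as nn


class ThreeDPatchEncoder(nn.Module):
    """Hypothetical module: 2D patch tokens + 3D position embeddings -> LLM tokens."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Encode each patch's (x, y, z) center into the visual token space.
        self.pos_embed = nn.Sequential(
            nn.Linear(3, vision_dim), nn.GELU(), nn.Linear(vision_dim, vision_dim)
        )
        # Project pooled 3D Patches into the LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_tokens: torch.Tensor, patch_xyz: torch.Tensor,
                pool_size: int = 4) -> torch.Tensor:
        # patch_tokens: (views, patches, vision_dim) 2D patch visual tokens
        # patch_xyz:    (views, patches, 3) back-projected 3D patch centers
        tokens_3d = patch_tokens + self.pos_embed(patch_xyz)  # "3D Patches"
        v, n, d = tokens_3d.shape
        # Naive fixed-group pooling to shrink the token sequence; the actual
        # 3D pooling would group tokens by their 3D location instead.
        pooled = tokens_3d.reshape(v * n // pool_size, pool_size, d).mean(dim=1)
        return self.projector(pooled)  # (reduced_tokens, llm_dim)


# Dummy usage: 4 views, 576 patches per view.
encoder = ThreeDPatchEncoder()
tokens = torch.randn(4, 576, 1024)
xyz = torch.randn(4, 576, 3)
print(encoder(tokens, xyz).shape)  # torch.Size([576, 4096])
```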
Quick Start & Requirements
Installation uses `conda` and `pip`, and requires PyTorch 2.1.0 with CUDA 11.8 along with `torch-scatter`. Setup: install `torch-scatter`, install the package (`pip install -e .`), and download the camera parameters. Training requires additional packages (`pip install -e ".[train]"` and `flash-attn`). For inference, run `llava/eval/run_llava_3d.py` with `--model-path ChaimZhu/LLaVA-3D-7B` and either `--image-file` for 2D tasks or `--video-path` for 3D tasks, as sketched below.
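For reference, invocations might look like the following. The `--query` flag, prompts, and file paths are illustrative assumptions rather than documented arguments; only `--model-path`, `--image-file`, and `--video-path` are named above.

```bash
# Illustrative 2D inference on a single image (paths and extra flags are assumptions).
python llava/eval/run_llava_3d.py \
    --model-path ChaimZhu/LLaVA-3D-7B \
    --image-file path/to/image.jpg \
    --query "What is shown in this image?"

# Illustrative 3D inference on a scene capture via --video-path.
python llava/eval/run_llava_3d.py \
    --model-path ChaimZhu/LLaVA-3D-7B \
    --video-path path/to/scene \
    --query "How many chairs are in the room?"
```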
Highlighted Details

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model zoo currently provides only a 7B-parameter checkpoint. The Gradio demo and evaluation scripts are still pending release.