ZCMax/LLaVA-3D: an LMM for 2D/3D vision-language tasks
Top 81.2% on SourcePulse
LLaVA-3D empowers Large Multimodal Models (LMMs) with 3D spatial awareness, enabling them to understand and interact with 3D environments. It targets researchers and developers working on 3D vision-and-language tasks, offering state-of-the-art performance on 3D benchmarks while maintaining 2D capabilities.
How It Works
LLaVA-3D builds on the LLaVA architecture by introducing "3D Patches": 2D patch visual tokens from multi-view images augmented with 3D position embeddings. This lets the model process 3D spatial information directly; the 3D patches are mapped into the LLM's embedding space via 3D pooling and a projection layer, and aligned with language using 3D vision-language data. The method's main advantages are its simplicity and its effectiveness in achieving 3D awareness.
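To make the idea concrete, here is a minimal PyTorch sketch of a 3D-patch encoder. It is not the actual LLaVA-3D code: the class name, tensor shapes, the MLP used for 3D position embeddings, and the mean pooling across views are illustrative assumptions.

```python
import torch
import torch.nn as nn


class Patch3DEncoder(nn.Module):
    """Toy 3D-patch encoder: adds 3D position embeddings to 2D patch tokens,
    pools across views, and projects into the LLM's embedding space."""

    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Maps each patch's (x, y, z) coordinate to the visual feature dimension.
        self.pos_mlp = nn.Sequential(
            nn.Linear(3, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, vis_dim),
        )
        # Projects pooled 3D patches into the LLM token space.
        self.projector = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_tokens: torch.Tensor, patch_xyz: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (views, patches, vis_dim) 2D patch features from the image encoder
        # patch_xyz:    (views, patches, 3) 3D coordinates of each patch, assumed to be
        #               precomputed from depth maps and camera parameters
        patches_3d = patch_tokens + self.pos_mlp(patch_xyz)  # "3D patches"
        pooled = patches_3d.mean(dim=0)                      # naive pooling across views
        return self.projector(pooled)                        # tokens for the LLM


# Toy usage: 4 views, 576 patches per view.
encoder = Patch3DEncoder()
llm_tokens = encoder(torch.randn(4, 576, 1024), torch.randn(4, 576, 3))
print(llm_tokens.shape)  # torch.Size([576, 4096])
```

The projected tokens play the role of the visual tokens the LLM consumes alongside text, as in the original LLaVA design.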
Quick Start & Requirements
Installation uses conda and pip, and requires PyTorch 2.1.0 with CUDA 11.8 plus torch-scatter. Install the package (pip install -e .) and download the camera parameters. Training requires additional packages (pip install -e ".[train]" and flash-attn). For inference, run llava/eval/run_llava_3d.py with --model-path ChaimZhu/LLaVA-3D-7B and either --image-file for 2D tasks or --video-path for 3D tasks.
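As an illustration, here is a hedged sketch of driving the inference script from Python. Only --model-path, --image-file, and --video-path come from the notes above; the input paths and the --query flag (borrowed from LLaVA's run_llava.py convention) are assumptions, and the repository is assumed to be cloned and installed locally.

```python
# Hypothetical sketch: calling the LLaVA-3D inference script from the repo root.
# Assumes the package is installed (pip install -e .) and the checkpoint is reachable.
import subprocess

# 2D task: pass an image file.
subprocess.run(
    [
        "python", "llava/eval/run_llava_3d.py",
        "--model-path", "ChaimZhu/LLaVA-3D-7B",
        "--image-file", "demo/example.jpg",        # assumed local image path
        "--query", "Describe this image.",         # --query flag assumed, not confirmed
    ],
    check=True,
)

# 3D task: pass a video path (multi-view frames of a scene) instead of an image.
subprocess.run(
    [
        "python", "llava/eval/run_llava_3d.py",
        "--model-path", "ChaimZhu/LLaVA-3D-7B",
        "--video-path", "demo/scene0000_00",       # assumed scene directory
        "--query", "How many chairs are in this room?",
    ],
    check=True,
)
```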
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The current model zoo only provides a 7B parameter version. The Gradio demo and evaluation scripts are still pending release.