LLaVA-3D by ZCMax

LLM for 2D/3D vision-language tasks

created 10 months ago
289 stars

Top 91.9% on sourcepulse

Project Summary

LLaVA-3D empowers Large Multimodal Models (LMMs) with 3D spatial awareness, enabling them to understand and interact with 3D environments. It targets researchers and developers working on 3D vision-language tasks, offering state-of-the-art performance on 3D benchmarks while retaining the model's 2D capabilities.

How It Works

LLaVA-3D builds on the LLaVA architecture by introducing "3D Patches": 3D position embeddings are added to the 2D patch visual tokens extracted from multi-view images. The resulting 3D-aware tokens are compressed with 3D pooling, mapped into the LLM's embedding space by a projection layer, and aligned with language using 3D vision-language data. The design is simple because it reuses LLaVA's 2D visual encoder and adds only position embeddings, pooling, and a projection, yet this is sufficient to give the model 3D awareness.
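The sketch below illustrates the 3D Patch idea in PyTorch. The class name, tensor shapes, MLP position encoder, and mean pooling are assumptions chosen for clarity, not LLaVA-3D's actual implementation.

```python
# Illustrative sketch of "3D Patches": add 3D position embeddings to 2D patch
# tokens, pool across views, and project into the LLM token space.
# All names and shapes here are assumptions, not the repo's actual API.
import torch
import torch.nn as nn

class Patch3DEncoder(nn.Module):
    def __init__(self, dim=1024, llm_dim=4096):
        super().__init__()
        # Maps each patch's back-projected 3D coordinate (x, y, z) to an embedding.
        self.pos_mlp = nn.Sequential(nn.Linear(3, dim), nn.GELU(), nn.Linear(dim, dim))
        # Projects pooled 3D patches into the LLM's embedding space.
        self.projector = nn.Linear(dim, llm_dim)

    def forward(self, patch_tokens, patch_xyz):
        # patch_tokens: (views, num_patches, dim) 2D patch features from the ViT encoder
        # patch_xyz:    (views, num_patches, 3)   3D coordinates per patch
        patches_3d = patch_tokens + self.pos_mlp(patch_xyz)  # "3D Patches"
        # Simple mean pooling across views stands in for the 3D pooling step;
        # the actual pooling strategy in the repo may differ.
        pooled = patches_3d.mean(dim=0)                       # (num_patches, dim)
        return self.projector(pooled)                         # tokens fed to the LLM

# Example: 4 views, 576 patches per view, feature dim 1024.
enc = Patch3DEncoder()
tokens = enc(torch.randn(4, 576, 1024), torch.randn(4, 576, 3))
print(tokens.shape)  # torch.Size([576, 4096])
```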

Quick Start & Requirements

  • Prerequisites: Python 3.10, PyTorch 2.1.0 with CUDA 11.8, torch-scatter.
  • Setup: Clone the repository, create a conda environment, install PyTorch with CUDA 11.8 and torch-scatter, install the package (pip install -e .), and download the camera parameters. Training requires additional packages (pip install -e ".[train]" plus flash-attn).
  • Demo: Run inference via llava/eval/run_llava_3d.py with --model-path ChaimZhu/LLaVA-3D-7B and either --image-file for 2D or --video-path for 3D tasks; a scripted example follows this list.
  • Links: Project Page, Checkpoints
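
Assuming the environment above is set up, the demo script can also be driven from Python. Only --model-path, --image-file, and --video-path are documented by the project; the --query flag, prompts, and file paths below are placeholders borrowed from LLaVA's CLI conventions and not verified against this repo.

```python
# Minimal wrapper around the documented demo entry point.
import subprocess

def run_llava_3d(media_flag: str, media_path: str, query: str) -> None:
    cmd = [
        "python", "llava/eval/run_llava_3d.py",
        "--model-path", "ChaimZhu/LLaVA-3D-7B",
        media_flag, media_path,
        "--query", query,  # assumed prompt flag, mirroring LLaVA's run_llava.py
    ]
    subprocess.run(cmd, check=True)

# 2D task: single image (placeholder path)
run_llava_3d("--image-file", "images/example.jpg", "Describe this image.")

# 3D task: posed RGB-D sequence (placeholder path)
run_llava_3d("--video-path", "scans/example_scene", "How many chairs are in this room?")
```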

Highlighted Details

  • Achieves state-of-the-art performance on 3D benchmarks while maintaining comparable 2D performance to LLaVA-1.5.
  • Demonstrates significantly faster convergence and inference speeds compared to existing 3D LMMs.
  • Supports 2D tasks with single images and 3D tasks with posed RGB-D images.
  • Custom data instruction tuning tutorial available for training on user datasets.

Maintenance & Community

  • Project is actively developed, with recent updates including inference code, checkpoints, and a custom data instruction tuning tutorial.
  • The project acknowledges contributions from 3D-LLM, LLaVA, and ODIN.

Licensing & Compatibility

  • Licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
  • Non-commercial use restriction applies.

Limitations & Caveats

The current model zoo only provides a 7B parameter version. The Gradio demo and evaluation scripts are still pending release.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 45 stars in the last 90 days
