LLM for 2D/3D vision-language tasks
LLaVA-3D empowers Large Multimodal Models (LMMs) with 3D spatial awareness, enabling them to understand and interact with 3D environments. It targets researchers and developers working on 3D vision-and-language tasks, offering state-of-the-art performance on 3D benchmarks while maintaining 2D capabilities.
How It Works
LLaVA-3D builds on the LLaVA architecture by introducing "3D Patches": 2D patch visual tokens derived from multi-view images, augmented with 3D position embeddings. This lets the model process 3D spatial information directly; the 3D Patches are compressed by 3D pooling, mapped into the LLM's embedding space by a projection layer, and aligned with language using 3D vision-language data. The approach is notable for its simplicity and its effectiveness in adding 3D awareness.
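The construction can be pictured with a short PyTorch sketch. This is a minimal illustration under assumed shapes and layer choices, not the repository's implementation: the `ThreeDPatchEncoder` name, the MLP position encoder, the 1024/4096 dimensions, and the fixed-group pooling are all placeholders.

```python
import torch
import torch.nn as nn


class ThreeDPatchEncoder(nn.Module):
    """Hypothetical module: 2D patch tokens + 3D position embeddings -> LLM tokens."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Encode each patch's (x, y, z) center into the visual token space.
        self.pos_embed = nn.Sequential(
            nn.Linear(3, vision_dim), nn.GELU(), nn.Linear(vision_dim, vision_dim)
        )
        # Project pooled 3D Patches into the LLM embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_tokens: torch.Tensor, patch_xyz: torch.Tensor,
                pool_size: int = 4) -> torch.Tensor:
        # patch_tokens: (views, patches, vision_dim) 2D patch visual tokens
        # patch_xyz:    (views, patches, 3) back-projected 3D patch centers
        tokens_3d = patch_tokens + self.pos_embed(patch_xyz)  # "3D Patches"
        v, n, d = tokens_3d.shape
        # Naive fixed-group pooling to shrink the token sequence; the actual
        # 3D pooling would group tokens by their 3D location instead.
        pooled = tokens_3d.reshape(v * n // pool_size, pool_size, d).mean(dim=1)
        return self.projector(pooled)  # (reduced_tokens, llm_dim)


# Dummy usage: 4 views, 576 patches per view.
encoder = ThreeDPatchEncoder()
tokens = torch.randn(4, 576, 1024)
xyz = torch.randn(4, 576, 3)
print(encoder(tokens, xyz).shape)  # torch.Size([576, 4096])
```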
Quick Start & Requirements
Installation uses `conda` and `pip`, and requires PyTorch 2.1.0 with CUDA 11.8 along with `torch-scatter`. Setup: install `torch-scatter`, install the package (`pip install -e .`), and download the camera parameters. Training requires additional packages (`pip install -e ".[train]"` and `flash-attn`). For inference, run `llava/eval/run_llava_3d.py` with `--model-path ChaimZhu/LLaVA-3D-7B` and either `--image-file` for 2D tasks or `--video-path` for 3D tasks, as sketched below.
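For reference, invocations might look like the following. The `--query` flag, prompts, and file paths are illustrative assumptions rather than documented arguments; only `--model-path`, `--image-file`, and `--video-path` are named above.

```bash
# Illustrative 2D inference on a single image (paths and extra flags are assumptions).
python llava/eval/run_llava_3d.py \
    --model-path ChaimZhu/LLaVA-3D-7B \
    --image-file path/to/image.jpg \
    --query "What is shown in this image?"

# Illustrative 3D inference on a scene capture via --video-path.
python llava/eval/run_llava_3d.py \
    --model-path ChaimZhu/LLaVA-3D-7B \
    --video-path path/to/scene \
    --query "How many chairs are in the room?"
```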
Highlighted Details

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The model zoo currently provides only a 7B-parameter checkpoint. The Gradio demo and evaluation scripts are still pending release.