VEGA-3D by H-EmbodVis

3D scene understanding and spatial reasoning for MLLMs

Created 3 weeks ago

336 stars

Top 82.0% on SourcePulse

View on GitHub
Project Summary

VEGA-3D tackles spatial blindness and geometric-reasoning deficits in Multimodal Large Language Models (MLLMs). It provides a plug-and-play framework that extracts the implicit 3D spatial priors learned by pre-trained video diffusion models and uses them to enrich MLLMs with dense geometric cues, improving 3D scene understanding and embodied decision-making without explicit 3D supervision or heavy geometric scaffolding. It is aimed at researchers in embodied AI and 3D scene understanding.

How It Works

VEGA-3D repurposes video generation models as latent world simulators, extracting spatiotemporal features from intermediate noise levels. These features are fused with MLLM semantic representations via token-level adaptive gated fusion. This approach imbues MLLMs with implicit 3D awareness, enabling robust spatial reasoning and geometric understanding without explicit 3D datasets or complex geometric pipelines.
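The fusion step is easiest to see in code. Below is a minimal PyTorch sketch of token-level adaptive gated fusion; the module name, dimensions, and gate design are illustrative assumptions, not the paper's exact architecture. Diffusion features are projected to the MLLM hidden width, a per-token gate decides how much geometry to inject, and the result is added residually to the semantic tokens.

```python
import torch
import torch.nn as nn

class TokenGatedFusion(nn.Module):
    """Illustrative per-token adaptive gated fusion: injects video-diffusion
    geometric features into an MLLM's visual token stream. Each token learns
    its own gate, so geometry is used only where it helps."""

    def __init__(self, sem_dim: int, geo_dim: int):
        super().__init__()
        # Align diffusion features to the MLLM hidden width.
        self.proj = nn.Linear(geo_dim, sem_dim)
        # Per-token scalar gate in [0, 1], conditioned on both streams.
        self.gate = nn.Sequential(
            nn.Linear(sem_dim * 2, sem_dim),
            nn.GELU(),
            nn.Linear(sem_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, sem_tokens: torch.Tensor, geo_feats: torch.Tensor) -> torch.Tensor:
        # sem_tokens: (B, N, sem_dim) MLLM visual tokens
        # geo_feats:  (B, N, geo_dim) diffusion features, resampled to N tokens
        geo = self.proj(geo_feats)
        g = self.gate(torch.cat([sem_tokens, geo], dim=-1))  # (B, N, 1)
        return sem_tokens + g * geo  # gated residual injection

# Toy shapes: 576 visual tokens; the widths here are assumptions.
fusion = TokenGatedFusion(sem_dim=3584, geo_dim=1024)
fused = fusion(torch.randn(2, 576, 3584), torch.randn(2, 576, 1024))
print(fused.shape)  # torch.Size([2, 576, 3584])
```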

Quick Start & Requirements

Installation involves cloning the repo, creating a Python 3.10 Conda environment, and installing pinned versions of PyTorch (2.4.0+cu121) and Flash Attention (2.7.4.post1). Users must prepare datasets following the Video-3D-LLM structure and download the required checkpoints (e.g., LLaVA-Video-7B-Qwen2, SigLIP) with huggingface-cli. Training scripts are provided. Official quick-start details are in the README.
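As a sketch of the checkpoint-download step, the huggingface-cli commands can equivalently be run through huggingface_hub's Python API; the repo IDs and target directories below are assumptions, so confirm the exact checkpoints in the README.

```python
# Equivalent of `huggingface-cli download <repo> --local-dir <dir>`.
from huggingface_hub import snapshot_download

# Assumed base MLLM checkpoint (verify the exact repo ID in the README).
snapshot_download(
    repo_id="lmms-lab/LLaVA-Video-7B-Qwen2",
    local_dir="checkpoints/LLaVA-Video-7B-Qwen2",
)

# Assumed SigLIP vision encoder.
snapshot_download(
    repo_id="google/siglip-so400m-patch14-384",
    local_dir="checkpoints/siglip-so400m-patch14-384",
)
```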

Highlighted Details

  • Leverages implicit 3D priors from video diffusion models as latent world simulators.
  • Enriches MLLMs with dense geometric cues for 3D scene understanding and spatial reasoning.
  • Employs token-level adaptive gated fusion for integrating spatiotemporal and semantic features.
  • Enables embodied decision-making capabilities without explicit 3D supervision.

Maintenance & Community

Released in March 2026 with code and checkpoints for spatial reasoning. The README lacks specific community channels (Discord, Slack), roadmap links, or explicit maintenance signals beyond the initial release. Key affiliations include Huazhong University of Science and Technology and Baidu Inc.

Licensing & Compatibility

The README does not state a software license. This is a significant barrier to assessing compatibility for commercial use or integration into closed-source projects.

Limitations & Caveats

The README does not detail specific limitations or known bugs. Users must manually download and configure multiple large pre-trained models and datasets. Installation requires precise PyTorch/CUDA and Flash Attention versions, potentially complicating setup. The absence of a stated license is a critical adoption caveat.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 397 stars in the last 23 days
