StreamVGGT by wzzheng

Real-time 4D visual geometry perception

Created 3 months ago

692 stars

Top 49.2% on SourcePulse

Project Summary

StreamVGGT addresses the challenge of real-time 4D visual geometry perception from streaming image sequences. It enables efficient, on-the-fly 3D reconstruction for interactive online applications by processing inputs incrementally, unlike offline models that require full scene reprocessing.
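To make the contrast with offline reprocessing concrete, here is a rough counting sketch. It is not taken from the project. It assumes, purely for illustration, that cost is proportional to the number of frame pairs that interact: an offline model re-runs full attention over all t frames each time frame t arrives, while a streaming model with a cache only relates the new frame to the t frames seen so far. Tokens per frame and constant factors are ignored, and the function names are hypothetical.

```python
def reprocess_cost(num_frames):
    """Cumulative frame-pair interactions when every new frame triggers
    full attention over all frames seen so far (t * t pairs at step t)."""
    return sum(t * t for t in range(1, num_frames + 1))


def cached_cost(num_frames):
    """Cumulative frame-pair interactions when only the newest frame
    attends to the t frames held in the cache (t pairs at step t)."""
    return sum(t for t in range(1, num_frames + 1))


if __name__ == "__main__":
    for n in (10, 100):
        print(n, reprocess_cost(n), cached_cost(n))
```

Under these assumptions the cumulative cost of reprocessing grows cubically with sequence length and the cached variant grows quadratically, which is why the gap widens as the stream gets longer.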

How It Works

StreamVGGT uses a causal transformer architecture with temporal causal attention and memory tokens. Each incoming frame attends only to itself and to information cached from previous frames, so earlier frames are never recomputed; this avoids redundant computation and enables real-time incremental reconstruction. The architecture is also compatible with optimized attention operators developed for LLMs, such as FlashAttention, for further speedups.
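The caching idea can be sketched in a few lines of plain Python. This is a toy illustration, not the StreamVGGT implementation. It assumes a single attention head, one token per frame, and identity query/key/value projections, and the class and function names are hypothetical. It shows that attending over a growing cache one frame at a time gives the same outputs as recomputing causal attention over the whole sequence.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over lists of keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


class StreamingCausalAttention:
    """Toy streaming attention: keeps keys/values of past frames in a cache.

    Assumption for illustration: identity projections, so a frame token
    serves as its own query, key, and value.
    """

    def __init__(self):
        self.cached_keys = []
        self.cached_values = []

    def step(self, frame_token):
        # Append the new frame to the cache, then attend over the cache.
        # Past frames are never revisited, which is what makes this causal.
        self.cached_keys.append(frame_token)
        self.cached_values.append(frame_token)
        return attend(frame_token, self.cached_keys, self.cached_values)


def full_causal_attention(tokens):
    """Offline reference: frame t attends to frames 0..t, recomputed in full."""
    return [attend(tokens[t], tokens[: t + 1], tokens[: t + 1])
            for t in range(len(tokens))]


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
stream = StreamingCausalAttention()
streamed = [stream.step(frame) for frame in frames]
offline = full_causal_attention(frames)
```

In this sketch, `streamed` and `offline` agree frame by frame, while the streaming path does work proportional to the cache length per new frame instead of reprocessing the whole sequence.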

Quick Start & Requirements

Installation: Clone the repository, create a conda environment with Python 3.11 (python=3.11), and install the dependencies with pip install -r requirements.txt. The README also specifies installing llvm-openmp<16 via conda.
Prerequisites: Pretrained teacher model checkpoints are required. Datasets need to be downloaded from their respective sources and processed. For camera pose estimation, pycolmap==3.10.0, pyceres==2.3, and LightGlue are needed.
Demo: A Gradio-based demo is available. Install its dependencies with pip install -r requirements_demo.txt, then run python demo_gradio.py.
Links: Project Page, Online Demo, Hugging Face, Tsinghua Cloud

Highlighted Details

Real-time streaming 4D visual geometry perception.
Causal transformer architecture with temporal causal attention.
Efficient incremental reconstruction using cached memory tokens.
Compatible with FlashAttention-2 for accelerated inference.
Supports fine-tuning and training from scratch.
Evaluation scripts for Monodepth, VideoDepth, Multi-view Reconstruction, and Camera Pose Estimation.

Maintenance & Community

The project is relatively new, with code and paper released in July 2025. It builds on several established repositories (DUSt3R, MonST3R, etc.). Checkpoints are linked from Hugging Face and Tsinghua Cloud.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The README notes that while the core reconstruction is fast, 3D point visualization can be significantly slower due to third-party rendering dependencies. The project is very recent, and long-term maintenance and community support are yet to be established.
