StreamVGGT by wzzheng

Real-time 4D visual geometry perception

Created 2 months ago
621 stars

Top 53.1% on SourcePulse

Project Summary

StreamVGGT addresses the challenge of real-time 4D visual geometry perception from streaming image sequences. It enables efficient, on-the-fly 3D reconstruction for interactive online applications by processing inputs incrementally, unlike offline models that require full scene reprocessing.

How It Works

StreamVGGT employs a causal transformer architecture with temporal causal attention and memory tokens. Each incoming frame attends only to itself and to earlier frames, so the model can reuse cached information from previous frames instead of recomputing it. This avoids redundant computation and enables efficient incremental reconstruction in real time. The architecture is also compatible with attention optimizations developed for LLMs, such as FlashAttention, for further speedups.
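
The caching idea can be illustrated with a minimal sketch. This is not StreamVGGT's code: the StreamingAttention class and the full_causal_attention helper are hypothetical names, and the sketch assumes a single attention head with no learned projections, where each token serves as its own query, key, and value. It only shows that attending over a growing cache of past-frame keys and values produces the same result as recomputing causal attention over the whole sequence.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


class StreamingAttention:
    """Hypothetical helper: keeps keys and values of all frames seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, frame_tokens):
        # Assumption: tokens act as their own keys and values (no projections).
        self.keys.extend(frame_tokens)
        self.values.extend(frame_tokens)
        # The new frame attends to the cache: past frames plus itself.
        return [attend(tok, self.keys, self.values) for tok in frame_tokens]


def full_causal_attention(frames):
    """Offline reference: rebuild the visible context from scratch per frame."""
    outputs = []
    for t in range(len(frames)):
        visible = [tok for frame in frames[: t + 1] for tok in frame]
        outputs.append([attend(tok, visible, visible) for tok in frames[t]])
    return outputs


# Three toy "frames", each with two 2-dimensional tokens.
frames = [
    [[1.0, 0.0], [0.5, 0.5]],
    [[0.0, 1.0], [0.2, 0.8]],
    [[1.0, 1.0], [0.3, 0.1]],
]

stream = StreamingAttention()
streamed = [stream.step(frame) for frame in frames]
reference = full_causal_attention(frames)

# Largest element-wise gap between the streaming and offline outputs.
max_diff = max(
    abs(a - b)
    for s_frame, r_frame in zip(streamed, reference)
    for s_tok, r_tok in zip(s_frame, r_frame)
    for a, b in zip(s_tok, r_tok)
)
print(max_diff)
```

In this sketch the streaming path touches each frame once and only appends to the cache, while the offline reference rebuilds the context for every frame. A real model would additionally cache per-layer projected keys and values and, as the summary notes, use memory tokens.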

Quick Start & Requirements

  • Installation: Clone the repository, create a conda environment with python=3.11, and install the dependencies with pip install -r requirements.txt. The README also specifies installing llvm-openmp<16 through conda.
  • Prerequisites: Pretrained teacher model checkpoints are required. Datasets need to be downloaded from their respective sources and processed. For camera pose estimation, pycolmap==3.10.0, pyceres==2.3, and LightGlue are needed.
  • Demo: A Gradio-based demo is available. Install its dependencies with pip install -r requirements_demo.txt, then run python demo_gradio.py.
  • Links: Project Page, Online Demo, Hugging Face, Tsinghua Cloud

Highlighted Details

  • Real-time streaming 4D visual geometry perception.
  • Causal transformer architecture with temporal causal attention.
  • Efficient incremental reconstruction using cached memory tokens.
  • Compatible with FlashAttention-2 for accelerated inference.
  • Supports fine-tuning and training from scratch.
  • Evaluation scripts for Monodepth, VideoDepth, Multi-view Reconstruction, and Camera Pose Estimation.

Maintenance & Community

The project is relatively new, with code and paper released in July 2025. It is based on several established repositories (DUSt3R, MonST3R, etc.). Links to Hugging Face and Tsinghua Cloud for checkpoints are provided.

Licensing & Compatibility

The repository does not explicitly state a license in the README.

Limitations & Caveats

The README notes that while the core reconstruction is fast, 3D point visualization can be significantly slower due to third-party rendering dependencies. The project is very recent, and long-term maintenance and community support are yet to be established.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 6
  • Star history: 91 stars in the last 30 days
