AutoGaze by NVlabs

Efficient video understanding for large-scale multimodal models

Created 4 months ago
256 stars

Top 98.5% on SourcePulse

Project Summary

AutoGaze addresses the scalability challenge of processing high-resolution, long-form videos with Vision Transformers (ViTs) and Multimodal Large Language Models (MLLMs). By automatically identifying and discarding redundant video patches, it reduces the token count by 4x-100x, enabling efficient analysis of complex visual data such as 4K-resolution, 1,000-frame videos. This lets researchers and engineers apply advanced AI models to previously intractable video datasets.

How It Works

The core mechanism is an autoregressive gazing strategy that predicts and selects informative video patches while discarding redundant ones. The predictor is trained on tasks such as VideoMAE reconstruction, using algorithms such as Next Token Prediction (NTP) or GRPO. The resulting gaze predictions guide downstream ViTs/MLLMs, so critical information is retained with significantly fewer input tokens, improving computational efficiency and scalability.
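The selection step can be illustrated with a conceptual sketch (this is not the AutoGaze implementation; the scoring function, keep ratio, and patch layout here are illustrative assumptions):

```python
import numpy as np

# Conceptual sketch of gaze-guided patch selection: a learned predictor
# scores each video patch, and only the top-scoring fraction is
# forwarded to the downstream ViT/MLLM. In AutoGaze the scores come
# from an autoregressive gaze model; random scores stand in here.
rng = np.random.default_rng(0)

def select_patches(patch_scores, keep_ratio=0.25):
    """Keep the highest-scoring patches; discard the rest as redundant."""
    n_keep = max(1, int(len(patch_scores) * keep_ratio))
    keep_idx = np.argsort(patch_scores)[-n_keep:]
    return np.sort(keep_idx)  # indices of retained patches, in order

scores = rng.random(1024)      # stand-in for learned gaze predictions
kept = select_patches(scores)  # 256 of 1024 patches survive (4x fewer)
print(len(kept))
```

Only the retained patch tokens are passed to the vision encoder, which is where the 4x-100x input reduction comes from.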

Quick Start & Requirements

Installation involves creating a Conda environment with Python 3.11, installing a compatible CUDA toolkit (e.g., 12.8) for PyTorch, and then using uv pip install -e . for the AutoGaze package. Key resources include a demo space, official models and data on HuggingFace, and the project's website. Detailed usage instructions are available in QUICK_START.md.
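The steps above can be sketched as follows (environment name and PyTorch wheel index are assumptions based on this summary; QUICK_START.md has the authoritative commands):

```shell
# Hypothetical install sketch; consult QUICK_START.md for the
# repository's authoritative instructions.
conda create -n autogaze python=3.11 -y
conda activate autogaze

# Install PyTorch built against a compatible CUDA toolkit (e.g. 12.8).
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Install the AutoGaze package in editable mode from the repo root.
uv pip install -e .
```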

Highlighted Details

  • Achieves 4x-100x reduction in token count for ViTs/MLLMs.
  • Enables processing of up to 4K-resolution, 1K-frame videos.
  • Accepted to CVPR 2026, indicating strong research validation.
  • Provides pre-trained models (AutoGaze, NVILA-HD-Video), datasets, and benchmarks (HLVid) via HuggingFace.
  • Demonstrates integration with models like NVILA-8B-HD-Video using the SigLIP vision encoder.
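A back-of-envelope count shows why the reduction matters at the scales claimed above (the 16-pixel patch size is an assumption for illustration; AutoGaze's actual tokenization may differ):

```python
# Token count for a 4K, 1,000-frame video under a ViT-style 16x16
# patch grid (patch size is an illustrative assumption).
frames = 1000
width, height = 3840, 2160
patch = 16

patches_per_frame = (width // patch) * (height // patch)  # 240 * 135
total_tokens = frames * patches_per_frame

print(f"Raw tokens: {total_tokens:,}")
for reduction in (4, 100):  # the reported 4x-100x range
    print(f"{reduction}x reduction -> {total_tokens // reduction:,} tokens kept")
```

Even at the mild end (4x), tens of millions of raw patch tokens shrink to a count a ViT/MLLM pipeline can realistically process.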

Maintenance & Community

The project offers extensive resources on HuggingFace for models, data, and benchmarks, alongside an Arxiv paper and a project website. Community interaction channels (e.g., Discord, Slack) or a public roadmap are not explicitly detailed in the README.

Licensing & Compatibility

The project's license is not specified in the provided README, so licensing should be clarified before commercial use or integration into closed-source projects.

Limitations & Caveats

While the project is not explicitly labeled as alpha, it appears research-driven, with a focus on specific integrations (e.g., the SigLIP vision encoder). Users should anticipate adaptation work for use cases beyond those demonstrated. Specific CUDA toolkit versions are noted as prerequisites.

  • Demo: https://huggingface.co/spaces/bfshi/AutoGaze
  • Models, data, and benchmarks: https://huggingface.co/collections/bfshi/autogaze
  • Project website: https://autogaze.github.io/
  • Paper: https://arxiv.org/abs/2603.12254

Health Check
Last Commit

1 week ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
1
Star History
54 stars in the last 30 days
