NVlabs: Efficient video understanding for large-scale multimodal models
Top 98.5% on SourcePulse
Summary
AutoGaze addresses the scalability challenge of processing high-resolution, long-form videos with Vision Transformers (ViTs) and Multimodal Large Language Models (MLLMs). By automatically identifying and removing redundant video patches, it drastically reduces the token count (4x-100x), enabling efficient analysis of complex visual data such as 4K-resolution, 1,000-frame videos. This lets researchers and engineers apply advanced AI models to previously intractable video datasets.
How It Works
The core mechanism is an autoregressive "gazing" strategy that predicts and selects informative video patches while discarding redundant ones. The selector is trained with objectives such as VideoMAE reconstruction and optimized via Next Token Prediction (NTP) or GRPO. The resulting gaze predictions guide downstream ViTs/MLLMs, retaining critical information with significantly fewer input tokens and thereby improving computational efficiency and scalability.
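To make the idea concrete, here is a minimal, hypothetical sketch of redundancy-based patch pruning: greedily keep only patches whose features differ sufficiently from those already kept. All names, the similarity threshold, and the greedy policy are illustrative assumptions, not AutoGaze's actual API or training procedure (which learns the selection autoregressively).

```python
# Hypothetical sketch of gaze-style token pruning (NOT AutoGaze's API):
# greedily keep patches whose features are dissimilar to all kept patches,
# so near-duplicate patches from redundant frames are discarded.
import numpy as np

def prune_patches(patch_feats: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Return indices of patches to keep; the first patch is always kept."""
    keep = [0]
    for i in range(1, len(patch_feats)):
        kept = patch_feats[keep]                          # (k, d) kept features
        f = patch_feats[i]
        # cosine similarity of candidate patch to every kept patch
        sims = kept @ f / (np.linalg.norm(kept, axis=1) * np.linalg.norm(f) + 1e-8)
        if sims.max() < threshold:                        # keep only novel patches
            keep.append(i)
    return np.array(keep)

rng = np.random.default_rng(0)
base = rng.normal(size=(4, 16))
# 64 patches that are noisy copies of 4 distinct "scenes" -> heavy redundancy
feats = np.repeat(base, 16, axis=0) + 0.01 * rng.normal(size=(64, 16))
kept = prune_patches(feats)
print(f"{len(feats)} patches -> {len(kept)} kept")
```

In this toy setting, the 16x duplication collapses to roughly one representative per scene, mirroring the 4x-100x token reduction the project reports for real video.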
Quick Start & Requirements
Installation involves creating a Conda environment with Python 3.11, installing a CUDA toolkit compatible with your PyTorch build (e.g., 12.8), and then running `uv pip install -e .` to install the AutoGaze package. Key resources include a demo space, official models and data on HuggingFace, and the project's website. Detailed usage instructions are available in QUICK_START.md.
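The steps above might look like the following; this is a sketch assuming a standard Conda/uv workflow, and the exact environment name, CUDA/PyTorch pairing, and repository path should be taken from the project's QUICK_START.md rather than from here.

```shell
# Sketch of the described install steps (names and versions are assumptions)
conda create -n autogaze python=3.11 -y
conda activate autogaze

# Install a PyTorch build matching your CUDA toolkit (e.g., CUDA 12.8)
pip install torch --index-url https://download.pytorch.org/whl/cu128

# Editable install of the AutoGaze package from the repository root
uv pip install -e .
```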
Maintenance & Community
The project offers extensive resources on HuggingFace for models, data, and benchmarks, alongside an arXiv paper and a project website. Community interaction channels (e.g., Discord, Slack) and a public roadmap are not explicitly detailed in the README.
Licensing & Compatibility
The project's license is not specified in the provided README; users should verify licensing terms before commercial use or integration into closed-source projects.
Limitations & Caveats
While not explicitly stated as alpha, the project appears research-driven with a focus on specific integrations (e.g., SigLIP). Users should anticipate potential adaptation needs for broader compatibility beyond the demonstrated use cases. Specific CUDA toolkit versions are noted as prerequisites.
Demo: https://huggingface.co/spaces/bfshi/AutoGaze
Models & data: https://huggingface.co/collections/bfshi/autogaze
Project website: https://autogaze.github.io/
Paper: https://arxiv.org/abs/2603.12254