VideoMind by yeliudev

Agent framework for advanced long video reasoning

Created 11 months ago

305 stars

Top 88.0% on SourcePulse

Project Summary

VideoMind is a multi-modal agent framework for reasoning over long videos. It addresses temporal-grounded understanding by emulating human-like cognitive processes such as task decomposition and moment verification. It is aimed at researchers and developers working in AI and computer vision.

How It Works

The core of VideoMind is its "Chain-of-LoRA Agent" architecture, which mimics human reasoning strategies. It breaks a complex video understanding task into progressive steps: localizing the relevant moments, verifying them, and synthesizing an answer. This modular, step-by-step approach is designed to improve accuracy and robustness on the temporal complexity of long video content.
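
The localize, verify, and synthesize steps described above can be pictured as a small pipeline. The following is a minimal sketch under stated assumptions, not VideoMind's actual API: `run_pipeline`, `Candidate`, and the three toy callables are hypothetical names introduced here for illustration, and the real framework uses learned models where this sketch uses stubs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Span = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


@dataclass
class Candidate:
    span: Span
    score: float  # confidence assigned by the verification step


def run_pipeline(
    question: str,
    duration: float,
    ground: Callable[[str, float], List[Span]],
    verify: Callable[[str, Span], float],
    answer: Callable[[str, Span], str],
    top_k: int = 3,
) -> Tuple[str, Candidate]:
    """Hypothetical three-step loop: localize, verify, then answer."""
    # Step 1: localize candidate moments; fall back to the whole video if none.
    spans = ground(question, duration)[:top_k] or [(0.0, duration)]
    # Step 2: score each candidate and keep the most plausible one.
    candidates = [Candidate(span, verify(question, span)) for span in spans]
    best = max(candidates, key=lambda c: c.score)
    # Step 3: synthesize the answer from the verified moment only.
    return answer(question, best.span), best


# Toy stand-ins for the learned components (assumptions for illustration only).
def toy_ground(question: str, duration: float) -> List[Span]:
    return [(0.0, 10.0), (40.0, 55.0), (90.0, 100.0)]


def toy_verify(question: str, span: Span) -> float:
    midpoint = (span[0] + span[1]) / 2
    return 1.0 / (1.0 + abs(midpoint - 45.0))  # prefers spans near 45 s


def toy_answer(question: str, span: Span) -> str:
    return f"answer based on {span[0]:.0f}-{span[1]:.0f}s"
```

Calling `run_pipeline("What happens after the goal?", 120.0, toy_ground, toy_verify, toy_answer)` selects the `(40.0, 55.0)` span, because the toy verifier scores it highest. Keeping each step behind its own callable mirrors the modular design the summary describes: any one step can be swapped without touching the others.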

Quick Start & Requirements

An online Gradio demo is available, with guidelines for local deployment provided in DEMO.md.
Training and evaluation are supported, with detailed guides in TRAIN.md and EVAL.md.
Supported hardware: NVIDIA GPUs or Ascend NPUs, in single-node or multi-node configurations.
Training can use DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, and Liger-Kernel for efficiency.
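
To make two of the listed techniques concrete, here is a minimal sketch of a DeepSpeed configuration that enables ZeRO and BF16. The keys `bf16`, `zero_optimization`, and `train_micro_batch_size_per_gpu` are standard DeepSpeed config keys, but the helper `make_deepspeed_config` is a hypothetical name and every value is an illustrative assumption, not a setting taken from the VideoMind repository. TRAIN.md remains the reference for the project's real settings.

```python
import json


def make_deepspeed_config(zero_stage: int = 2, use_bf16: bool = True,
                          micro_batch_size: int = 1) -> dict:
    """Build an illustrative DeepSpeed config dict (values are assumptions)."""
    if zero_stage not in (0, 1, 2, 3):
        raise ValueError("ZeRO stage must be 0, 1, 2, or 3")
    return {
        # Mixed-precision training in bfloat16.
        "bf16": {"enabled": use_bf16},
        # ZeRO partitions optimizer state (stage 1), gradients (stage 2),
        # and parameters (stage 3) across devices.
        "zero_optimization": {"stage": zero_stage},
        "train_micro_batch_size_per_gpu": micro_batch_size,
    }


def dump_config(config: dict) -> str:
    """Serialize the config as JSON, the format DeepSpeed reads from disk."""
    return json.dumps(config, indent=2)
```

The result of `dump_config(make_deepspeed_config())` can be written to a JSON file and passed to a DeepSpeed-enabled launcher.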

Highlighted Details

Achieves strong performance across numerous video understanding benchmarks (e.g., ZS CG-Bench, ReXTime, NExT-GQA, QVHighlights) with reported results for 2B and 7B parameter models.
Releases the comprehensive VideoMind-SFT dataset, comprising 481K samples across grounding, verification, and planning tasks, along with processed annotations for 27 datasets.
Features a modular agent design incorporating specialized components for Grounding, Verification, and Planning.

Maintenance & Community

The project is authored by researchers from The Hong Kong Polytechnic University and the National University of Singapore. No specific community channels (e.g., Discord, Slack) or detailed maintenance information are provided in the README.

Licensing & Compatibility

The README does not specify a software license. Anyone considering commercial use or integration with proprietary systems should confirm the licensing terms before adopting the project.

Limitations & Caveats

The README does not explicitly list any limitations, known bugs, or caveats. Given the recent release dates (March 2025) mentioned in the "News" section, the project may still be under active development and subject to ongoing changes.

Health Check

Last Commit: 2 weeks ago

Responsiveness: Inactive

Star History: 5 stars in the last 30 days