Vimo by HKUDS

PyTorch code for retrieval-augmented generation with long-context videos

created 6 months ago
802 stars

Top 44.9% on sourcepulse

Project Summary

VideoRAG is a PyTorch framework for retrieval-augmented generation over extremely long-context videos. It lets users query large video collections, and it targets researchers and developers who need structured knowledge extraction and question answering across hours of footage.

How It Works

VideoRAG employs a novel dual-channel architecture. It combines graph-driven textual knowledge grounding to model cross-video semantic relationships with hierarchical multimodal context encoding for preserving spatiotemporal visual patterns. This approach dynamically constructs knowledge graphs to maintain semantic coherence across multiple videos, optimizing retrieval efficiency through adaptive multimodal fusion.
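The dual-channel idea can be illustrated with a toy sketch. This is not VideoRAG's implementation: the names (`Clip`, `graph_score`, `visual_score`, `retrieve`), the entity-overlap scoring, and the hand-written vectors are all assumptions made for illustration. In particular, a fixed weight `alpha` stands in for the adaptive multimodal fusion described above.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Clip:
    """A video segment seen through both channels (hypothetical structure)."""
    clip_id: str
    entities: set[str]       # textual channel: entities from captions/transcripts
    embedding: list[float]   # visual channel: stand-in for a real embedding


def graph_score(query_entities: set[str], clip: Clip) -> float:
    """Textual channel: fraction of query entities that the clip mentions."""
    if not query_entities:
        return 0.0
    return len(query_entities & clip.entities) / len(query_entities)


def visual_score(query_vec: list[float], clip: Clip) -> float:
    """Visual channel: cosine similarity between query and clip embeddings."""
    dot = sum(q * c for q, c in zip(query_vec, clip.embedding))
    norm = sqrt(sum(q * q for q in query_vec)) * sqrt(sum(c * c for c in clip.embedding))
    return dot / norm if norm else 0.0


def retrieve(query_entities, query_vec, clips, alpha=0.5, top_k=2):
    """Rank clips by a weighted sum of the two channel scores.

    alpha is an assumed fixed weight, not an adaptive fusion scheme.
    """
    scored = [
        (alpha * graph_score(query_entities, c)
         + (1 - alpha) * visual_score(query_vec, c), c.clip_id)
        for c in clips
    ]
    scored.sort(reverse=True)
    return [clip_id for _, clip_id in scored[:top_k]]


# Toy data: three clips from three different videos.
clips = [
    Clip("lecture1_00:10", {"transformer", "attention"}, [1.0, 0.0]),
    Clip("lecture2_05:30", {"attention", "gpu"}, [0.0, 1.0]),
    Clip("lecture3_12:00", {"cooking"}, [0.7, 0.7]),
]
ranked = retrieve({"attention", "transformer"}, [1.0, 0.0], clips)
```

With this toy data the first clip ranks highest because it matches on both channels. The third clip outranks the second on visual similarity alone, which shows why the two signals are combined rather than used in isolation.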

Quick Start & Requirements

  • Installation: Requires a conda environment with Python 3.11, PyTorch 2.1.2, and specific versions of libraries like accelerate, bitsandbytes, moviepy, pytorchvideo, timm, fvcore, eva-decord, ctranslate2, faster_whisper, neo4j, hnswlib, xxhash, nano-vectordb, transformers, tiktoken, openai, tenacity, and ImageBind (installed from source).
  • Checkpoints: Requires downloading checkpoints for MiniCPM-V, Whisper, and ImageBind.
  • Hardware: A single NVIDIA RTX 3090 (24 GB VRAM) is sufficient for processing hundreds of hours of video.
  • Documentation: VideoRAG GitHub Repository
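Before running anything, it may help to confirm that the listed dependencies are present in the active conda environment. The sketch below uses only the standard library. The distribution names are copied from the list above, and their exact PyPI spellings (for example `torch` for PyTorch) are assumptions. The helper names `installed_versions` and `missing` are hypothetical, and ImageBind is omitted because it is installed from source.

```python
from importlib import metadata

# Distribution names taken from the requirements list; exact PyPI
# spellings are assumptions and may need adjusting.
REQUIRED = [
    "torch", "accelerate", "bitsandbytes", "moviepy", "pytorchvideo", "timm",
    "fvcore", "eva-decord", "ctranslate2", "faster_whisper", "neo4j",
    "hnswlib", "xxhash", "nano-vectordb", "transformers", "tiktoken",
    "openai", "tenacity",
]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def missing(names):
    """Return the names that are not installed in the current environment."""
    return [n for n, v in installed_versions(names).items() if v is None]


if __name__ == "__main__":
    for name in missing(REQUIRED):
        print(f"missing: {name}")
```

The script only reports whether a package is installed. It does not check that the installed versions match the ones the project pins.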

Highlighted Details

  • Efficiently processes hundreds of hours of video on a single RTX 3090.
  • Distills extensive video content into a structured, multi-modal knowledge graph.
  • Utilizes a multi-modal retrieval paradigm to align text and visual content.
  • Introduces the "LongerVideos" benchmark with over 160 videos totaling 134+ hours.

Maintenance & Community

The project is associated with HKUDS and cites foundational work from nano-graphrag and LightRAG. Further community engagement details are not explicitly provided in the README.

Licensing & Compatibility

The repository lists a LICENSE file, but the specific license type and its implications for commercial use or closed-source linking are not detailed in the README.

Limitations & Caveats

Currently tested only in an English environment; multi-language support requires modification of the WhisperModel. The evaluation process involves uploading requests to OpenAI, which may incur costs and requires API key management.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 2
  • Issues (30d): 8
  • Star History: 189 stars in the last 90 days
