Video comprehension enhanced with RAG
Video-RAG enhances open-source Long Video Language Models (LVLMs) by integrating visually-aligned auxiliary texts (OCR, ASR, object detection) through a retrieval-augmented generation (RAG) pipeline. It targets researchers and developers working with LVLMs, offering a training-free, plug-and-play way to improve video comprehension that achieves state-of-the-art performance without relying on commercial APIs.
How It Works
Video-RAG augments LVLMs by retrieving and incorporating three types of visually-aligned auxiliary texts: Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), and object detection results. These texts are extracted by external tools and then selectively retrieved via a RAG mechanism, enriching the context provided to the LVLM. The method is designed to be versatile and plug-and-play, requiring no additional training of existing LVLMs.
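The README does not spell out how the retrieval step is implemented, so the following is only a minimal sketch of the idea, using faiss-cpu and spacy from the dependency list below. The auxiliary texts, embedding choice, and prompt format here are illustrative assumptions, not the pipeline's actual code.

```python
# Minimal sketch of visually-aligned text retrieval, assuming faiss-cpu and
# spacy (both listed as dependencies). The embedding model, chunking, and
# prompt format used by the real pipeline live in vidrag_pipeline.py and may
# differ from what is shown here.
import faiss
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def embed(texts):
    # Illustrative embedding: spacy doc vectors, L2-normalized so that inner
    # product search behaves like cosine similarity.
    vecs = np.stack([nlp(t).vector for t in texts]).astype("float32")
    faiss.normalize_L2(vecs)
    return vecs

# Auxiliary texts produced by external tools (OCR, ASR, object detection).
aux_texts = [
    "OCR frame 0312: 'FINAL SCORE 3-2'",
    "ASR 00:12:05: 'and the home team takes the lead in the final minute'",
    "DET frame 1200: person, soccer ball, goalpost",
]

aux_vecs = embed(aux_texts)
index = faiss.IndexFlatIP(aux_vecs.shape[1])
index.add(aux_vecs)

# Retrieve the auxiliary snippets most relevant to the question and append
# them to the prompt that goes to the LVLM, alongside the sampled frames.
question = "Who won the match, and by what score?"
_, ids = index.search(embed([question]), 2)
retrieved = [aux_texts[i] for i in ids[0]]
prompt = question + "\n\nAuxiliary context:\n" + "\n".join(retrieved)
print(prompt)
```

Because the retrieved text is injected purely at the prompt level, the underlying LVLM weights never change, which is what makes the approach training-free.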
Quick Start & Requirements
The core pipeline requires Python packages including spacy, faiss-cpu, easyocr, ffmpeg-python, torch==2.1.2, and torchaudio. A separate environment for APE (presumably used for the object-detection step) is also needed, with its dependencies installed via requirements.txt. Additional setup includes downloading the spacy English model (en_core_web_sm) and, potentially, specific versions of PyTorch.
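As a rough illustration of this setup (not an official install script), the snippet below downloads the spacy model and sanity-checks that the listed packages import; the package and model names come from the dependency list above, everything else is an assumption.

```python
# Illustrative environment check only; the authoritative install steps are the
# project's requirements files and README instructions.
import spacy

# Fetch the English model used for text processing
# (equivalent to running `python -m spacy download en_core_web_sm`).
spacy.cli.download("en_core_web_sm")

# Import-check the remaining listed dependencies.
import torch
import torchaudio
import faiss        # provided by faiss-cpu
import easyocr
import ffmpeg       # provided by ffmpeg-python

print("torch:", torch.__version__)           # README pins torch==2.1.2
print("CUDA available:", torch.cuda.is_available())
```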
Highlighted Details
Maintenance & Community
The project is associated with the paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension" by Luo et al. Further community or maintenance details are not explicitly provided in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
The pipeline is built upon LLaVA-NeXT, and adapting it to other LVLMs requires manual modification of specific functions and model loading points within the vidrag_pipeline.py
script. The project's status (e.g., alpha, beta) and potential for breaking changes are not stated.
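The README does not show the internals of vidrag_pipeline.py, so the sketch below is purely hypothetical; it only illustrates the kind of model-loading and generation hooks one would have to edit when targeting an LVLM other than LLaVA-NeXT. Every name in it is a placeholder.

```python
# Hypothetical illustration of the edit points described above; none of these
# names come from the actual vidrag_pipeline.py.
def load_lvlm(model_id: str):
    """Edit point 1: replace the LLaVA-NeXT loading code with your LVLM's loader."""
    if "llava-next" in model_id.lower():
        raise NotImplementedError("original LLaVA-NeXT loading code goes here")
    raise NotImplementedError(f"add a loader for {model_id}")

def generate_answer(model, processor, frames, prompt: str) -> str:
    """Edit point 2: each LVLM has its own prompt template and generate() call."""
    raise NotImplementedError("model-specific inference code goes here")
```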