Video comprehension enhanced with RAG
Video-RAG enhances open-source Long Video Language Models (LVLMs) by integrating visually-aligned auxiliary texts (OCR, ASR, object detection) through a retrieval-augmented generation (RAG) pipeline. It targets researchers and developers working with LVLMs, offering a training-free, plug-and-play way to improve video comprehension that achieves state-of-the-art performance without relying on commercial APIs.
How It Works
Video-RAG augments LVLMs by retrieving and incorporating three types of visually-aligned auxiliary texts: Optical Character Recognition (OCR), Automatic Speech Recognition (ASR), and object detection results. These texts are extracted by external tools and then selectively retrieved via a RAG mechanism, enriching the context provided to the LVLM. The method is designed to be versatile and plug-and-play, requiring no additional training of existing LVLMs.
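The README does not spell out how the retrieval step is implemented, so the following is only a minimal sketch of the idea, using faiss-cpu and spacy from the dependency list below. The auxiliary texts, embedding choice, and prompt format here are illustrative assumptions, not the pipeline's actual code.

```python
# Minimal sketch of visually-aligned text retrieval, assuming faiss-cpu and
# spacy (both listed as dependencies). The embedding model, chunking, and
# prompt format used by the real pipeline live in vidrag_pipeline.py and may
# differ from what is shown here.
import faiss
import numpy as np
import spacy

nlp = spacy.load("en_core_web_sm")

def embed(texts):
    # Illustrative embedding: spacy doc vectors, L2-normalized so that inner
    # product search behaves like cosine similarity.
    vecs = np.stack([nlp(t).vector for t in texts]).astype("float32")
    faiss.normalize_L2(vecs)
    return vecs

# Auxiliary texts produced by external tools (OCR, ASR, object detection).
aux_texts = [
    "OCR frame 0312: 'FINAL SCORE 3-2'",
    "ASR 00:12:05: 'and the home team takes the lead in the final minute'",
    "DET frame 1200: person, soccer ball, goalpost",
]

aux_vecs = embed(aux_texts)
index = faiss.IndexFlatIP(aux_vecs.shape[1])
index.add(aux_vecs)

# Retrieve the auxiliary snippets most relevant to the question and append
# them to the prompt that goes to the LVLM, alongside the sampled frames.
question = "Who won the match, and by what score?"
_, ids = index.search(embed([question]), 2)
retrieved = [aux_texts[i] for i in ids[0]]
prompt = question + "\n\nAuxiliary context:\n" + "\n".join(retrieved)
print(prompt)
```

Because the retrieved text is injected purely at the prompt level, the underlying LVLM weights never change, which is what makes the approach training-free.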
Quick Start & Requirements
The core pipeline requires Python packages including spacy, faiss-cpu, easyocr, ffmpeg-python, torch==2.1.2, and torchaudio. A separate environment for APE (presumably used for the object-detection step) is also needed, with its dependencies installed via requirements.txt. Additional setup includes downloading the spacy English model (en_core_web_sm) and, potentially, specific versions of PyTorch.
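As a rough illustration of this setup (not an official install script), the snippet below downloads the spacy model and sanity-checks that the listed packages import; the package and model names come from the dependency list above, everything else is an assumption.

```python
# Illustrative environment check only; the authoritative install steps are the
# project's requirements files and README instructions.
import spacy

# Fetch the English model used for text processing
# (equivalent to running `python -m spacy download en_core_web_sm`).
spacy.cli.download("en_core_web_sm")

# Import-check the remaining listed dependencies.
import torch
import torchaudio
import faiss        # provided by faiss-cpu
import easyocr
import ffmpeg       # provided by ffmpeg-python

print("torch:", torch.__version__)           # README pins torch==2.1.2
print("CUDA available:", torch.cuda.is_available())
```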
Highlighted Details
Maintenance & Community
The project is associated with the paper "Video-RAG: Visually-aligned Retrieval-Augmented Long Video Comprehension" by Luo et al. Further community or maintenance details are not explicitly provided in the README.
Licensing & Compatibility
The README does not specify a license. Compatibility for commercial use or closed-source linking is not detailed.
Limitations & Caveats
The pipeline is built upon LLaVA-NeXT, and adapting it to other LVLMs requires manual modification of specific functions and model loading points within the vidrag_pipeline.py
script. The project's status (e.g., alpha, beta) and potential for breaking changes are not stated.
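The README does not show the internals of vidrag_pipeline.py, so the sketch below is purely hypothetical; it only illustrates the kind of model-loading and generation hooks one would have to edit when targeting an LVLM other than LLaVA-NeXT. Every name in it is a placeholder.

```python
# Hypothetical illustration of the edit points described above; none of these
# names come from the actual vidrag_pipeline.py.
def load_lvlm(model_id: str):
    """Edit point 1: replace the LLaVA-NeXT loading code with your LVLM's loader."""
    if "llava-next" in model_id.lower():
        raise NotImplementedError("original LLaVA-NeXT loading code goes here")
    raise NotImplementedError(f"add a loader for {model_id}")

def generate_answer(model, processor, frames, prompt: str) -> str:
    """Edit point 2: each LVLM has its own prompt template and generate() call."""
    raise NotImplementedError("model-specific inference code goes here")
```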