Multimodal agent for video understanding
Top 98.7% on SourcePulse
VideoAgent addresses the challenge of video understanding by providing a memory-augmented multimodal agent capable of answering user questions about video content. It is designed for researchers and practitioners in computer vision and natural language processing who need to perform complex video analysis and question-answering tasks. The primary benefit is its ability to systematically process video information into a structured memory, which is then leveraged by a large language model (LLM) for accurate and context-aware responses.
How It Works
VideoAgent operates in two distinct phases: memory construction and inference. During the memory construction phase, it extracts structured information from the input video. This extracted information is then stored in a memory system. In the inference phase, an LLM interacts with this memory using a set of tools. This two-phase approach allows for a separation of concerns, enabling efficient information retrieval and sophisticated reasoning over video content.
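The two phases can be illustrated with a minimal, self-contained sketch. This is not VideoAgent's actual code: the memory schema (Segment, VideoMemory), the two tools (search_captions, segment_at), and the functions build_memory and answer are all hypothetical names chosen for illustration. The LLM is replaced by a trivial stub that always calls one tool, and the per-segment captions are passed in by hand rather than extracted from a video.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Segment:
    """One memory entry: a time span plus a text description (hypothetical schema)."""
    start: float
    end: float
    caption: str


@dataclass
class VideoMemory:
    segments: List[Segment] = field(default_factory=list)


def build_memory(captions: List[str], segment_seconds: float = 2.0) -> VideoMemory:
    """Phase 1 (memory construction): store per-segment descriptions with timestamps.

    A real system would derive the descriptions from the video itself;
    here they are supplied directly so the sketch stays runnable.
    """
    memory = VideoMemory()
    for i, caption in enumerate(captions):
        start = i * segment_seconds
        memory.segments.append(Segment(start, start + segment_seconds, caption))
    return memory


def make_tools(memory: VideoMemory) -> Dict[str, Callable[[str], List[Segment]]]:
    """Expose the memory through named tools (both tools are illustrative)."""
    def search_captions(keyword: str) -> List[Segment]:
        return [s for s in memory.segments if keyword.lower() in s.caption.lower()]

    def segment_at(timestamp: str) -> List[Segment]:
        t = float(timestamp)
        return [s for s in memory.segments if s.start <= t < s.end]

    return {"search_captions": search_captions, "segment_at": segment_at}


def answer(question_keyword: str, memory: VideoMemory) -> str:
    """Phase 2 (inference): a stand-in for the LLM that always calls one tool.

    An actual LLM would choose which tool to call and with what arguments.
    """
    tools = make_tools(memory)
    hits = tools["search_captions"](question_keyword)
    if not hits:
        return "not found in memory"
    first = hits[0]
    return f"{first.caption} ({first.start:.0f}s-{first.end:.0f}s)"


memory = build_memory([
    "a person opens the fridge",
    "the person pours milk",
    "a cat jumps on the table",
])
print(answer("milk", memory))  # the person pours milk (2s-4s)
```

The point of the split is that build_memory runs once per video, while answer can be called many times against the same memory without reprocessing the video.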
Quick Start & Requirements
Create the videoagent environment using conda env create -f environment.yaml. Note that the Video-LLaVA repository is cloned separately, but its environment (videollava) is required. Install dependencies using pip install -e . and pip install -e ".[train]", along with flash-attn, decord, opencv-python, and pytorchvideo.
The archives cache_dir.zip and tool_models.zip go into the VideoAgent directory.
Run Video-LLaVA first (conda activate videollava && python video-llava.py), then run the demo (conda activate videoagent && python demo.py). Batch inference is available via python main.py.
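Because the setup depends on two separate conda environments (videoagent and videollava), a small pre-flight check can catch a missing one before the demo is launched. The sketch below is not part of the project: the function name missing_conda_envs is hypothetical, and the sample text with its paths is made up to mimic the output of conda env list.

```python
from typing import Iterable, List


def missing_conda_envs(env_list_output: str,
                       required: Iterable[str] = ("videoagent", "videollava")) -> List[str]:
    """Return the required environment names absent from `conda env list` output."""
    found = set()
    for line in env_list_output.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' header lines that conda prints.
        if not line or line.startswith("#"):
            continue
        # The first whitespace-separated token is the environment name.
        found.add(line.split()[0])
    return [name for name in required if name not in found]


# Illustrative sample only; the paths are assumptions.
sample = """\
# conda environments:
#
base                  *  /opt/conda
videoagent               /opt/conda/envs/videoagent
"""
print(missing_conda_envs(sample))  # ['videollava']
```

In practice, the input string could come from subprocess.run(["conda", "env", "list"], capture_output=True, text=True).stdout.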
Highlighted Details
Maintenance & Community
The project is associated with ECCV 2024 and cites Yue Fan et al. The repository is hosted on GitHub at YueFan1014/VideoAgent.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use is not specified.
Limitations & Caveats
The project is tested on a specific hardware configuration (Ubuntu 20.04, RTX 4090 24GB), suggesting potential performance or compatibility issues on other systems. The dependency on an external OpenAI API key implies associated costs and potential privacy considerations.