VideoAgent by YueFan1014

Multimodal agent for video understanding

Created 1 year ago
256 stars

Top 98.7% on SourcePulse

View on GitHub
Project Summary

VideoAgent addresses the challenge of video understanding by providing a memory-augmented multimodal agent capable of answering user questions about video content. It is designed for researchers and practitioners in computer vision and natural language processing who need to perform complex video analysis and question-answering tasks. The primary benefit is its ability to systematically process video information into a structured memory, which is then leveraged by a large language model (LLM) for accurate and context-aware responses.

How It Works

VideoAgent operates in two distinct phases: memory construction and inference. During the memory construction phase, it extracts structured information from the input video. This extracted information is then stored in a memory system. In the inference phase, an LLM interacts with this memory using a set of tools. This two-phase approach allows for a separation of concerns, enabling efficient information retrieval and sophisticated reasoning over video content.
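The two-phase design can be pictured roughly as follows. This is a minimal sketch assuming a caption-plus-object-ID memory layout and two illustrative query tools; SegmentRecord, VideoMemory, search_captions, track_object, and answer_question are hypothetical names, not the project's API, and the real system stores richer structured information and lets the LLM choose which tool to call.

    from dataclasses import dataclass, field

    # Hypothetical record for one video segment (illustrative, not the project's schema).
    @dataclass
    class SegmentRecord:
        start: float                                     # segment start time (seconds)
        end: float                                       # segment end time (seconds)
        caption: str                                     # textual description of the segment
        object_ids: list = field(default_factory=list)   # re-identified object IDs seen here

    class VideoMemory:
        """Structured store built once per video during the memory-construction phase."""
        def __init__(self):
            self.segments = []

        def add(self, record):
            self.segments.append(record)

        # Tool 1: retrieve segments whose captions mention a keyword.
        def search_captions(self, keyword):
            return [s for s in self.segments if keyword.lower() in s.caption.lower()]

        # Tool 2: list every segment in which a given object ID appears.
        def track_object(self, object_id):
            return [s for s in self.segments if object_id in s.object_ids]

    def answer_question(memory, question):
        # Inference phase: an LLM would decide which memory tool to call and with what
        # arguments, then compose the answer; a fixed caption lookup stands in for it here.
        hits = memory.search_captions("dog")
        return f"Found {len(hits)} relevant segment(s) for: {question!r}"

    memory = VideoMemory()
    memory.add(SegmentRecord(0.0, 5.0, "A dog runs across the yard", object_ids=[1]))
    print(answer_question(memory, "What does the dog do?"))

The point of the split is that the expensive perception work happens once per video, while the inference phase only issues cheap queries against the resulting memory.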

Quick Start & Requirements

  • Installation: Create a conda environment named videoagent with conda env create -f environment.yaml. The Video-LLaVA repository is cloned separately, and its videollava environment is also required; install dependencies with pip install -e . and pip install -e ".[train]", plus flash-attn, decord, opencv-python, and pytorchvideo (see the consolidated command sequence after this list).
  • Prerequisites: Ubuntu 20.04 with an NVIDIA RTX 4090 (24 GB) is the recommended setup. An OpenAI API key is required.
  • Setup: Download and unzip cache_dir.zip and tool_models.zip into the VideoAgent directory.
  • Usage: Start the Video-LLaVA server (conda activate videollava && python video-llava.py), then run the demo (conda activate videoagent && python demo.py). Batch inference is available via python main.py.
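Putting the bullets together, the setup and run sequence looks roughly like this. It is a sketch assembled from the steps above; the location of the Video-LLaVA clone and the exact flash-attn install invocation are assumptions, so check both projects' READMEs before running.

    # one-time setup: VideoAgent environment
    conda env create -f environment.yaml

    # separate Video-LLaVA clone and its videollava environment (per that project's README)
    pip install -e .
    pip install -e ".[train]"
    pip install flash-attn decord opencv-python pytorchvideo

    # unzip cache_dir.zip and tool_models.zip into the VideoAgent directory

    # run (two terminals)
    conda activate videollava && python video-llava.py    # terminal 1: Video-LLaVA server
    conda activate videoagent && python demo.py            # terminal 2: interactive demo
    conda activate videoagent && python main.py            # or: batch inference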

Highlighted Details

  • Implements a memory-augmented multimodal agent for video understanding.
  • Features a two-phase process: memory construction and LLM-based inference.
  • Outputs the final answer along with an object re-identification replay and chain-of-thought logs.

Maintenance & Community

The project accompanies an ECCV 2024 paper by Yue Fan et al. The repository is hosted on GitHub at YueFan1014/VideoAgent.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use is not specified.

Limitations & Caveats

The project has only been tested on one hardware configuration (Ubuntu 20.04, RTX 4090 with 24 GB), so performance and compatibility on other systems are unverified. The dependency on an external OpenAI API key implies usage costs and potential privacy considerations.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days

Explore Similar Projects

Starred by Matei Zaharia (cofounder of Databricks), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel
Multimodal autoregressive model for long-context video/text
Top 0.1% on SourcePulse · 7k stars
Created 1 year ago · Updated 11 months ago