VideoAgent by YueFan1014

Multimodal agent for video understanding

Created 1 year ago
256 stars

Top 98.7% on SourcePulse

View on GitHub
Project Summary

VideoAgent addresses the challenge of video understanding by providing a memory-augmented multimodal agent capable of answering user questions about video content. It is designed for researchers and practitioners in computer vision and natural language processing who need to perform complex video analysis and question-answering tasks. The primary benefit is its ability to systematically process video information into a structured memory, which is then leveraged by a large language model (LLM) for accurate and context-aware responses.

How It Works

VideoAgent operates in two distinct phases: memory construction and inference. During the memory construction phase, it extracts structured information from the input video. This extracted information is then stored in a memory system. In the inference phase, an LLM interacts with this memory using a set of tools. This two-phase approach allows for a separation of concerns, enabling efficient information retrieval and sophisticated reasoning over video content.
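The two-phase design can be pictured roughly as follows. This is a minimal sketch assuming a caption-plus-object-ID memory layout and two illustrative query tools; SegmentRecord, VideoMemory, search_captions, track_object, and answer_question are hypothetical names, not the project's API, and the real system stores richer structured information and lets the LLM choose which tool to call.

    from dataclasses import dataclass, field

    # Hypothetical record for one video segment (illustrative, not the project's schema).
    @dataclass
    class SegmentRecord:
        start: float                                     # segment start time (seconds)
        end: float                                       # segment end time (seconds)
        caption: str                                     # textual description of the segment
        object_ids: list = field(default_factory=list)   # re-identified object IDs seen here

    class VideoMemory:
        """Structured store built once per video during the memory-construction phase."""
        def __init__(self):
            self.segments = []

        def add(self, record):
            self.segments.append(record)

        # Tool 1: retrieve segments whose captions mention a keyword.
        def search_captions(self, keyword):
            return [s for s in self.segments if keyword.lower() in s.caption.lower()]

        # Tool 2: list every segment in which a given object ID appears.
        def track_object(self, object_id):
            return [s for s in self.segments if object_id in s.object_ids]

    def answer_question(memory, question):
        # Inference phase: an LLM would decide which memory tool to call and with what
        # arguments, then compose the answer; a fixed caption lookup stands in for it here.
        hits = memory.search_captions("dog")
        return f"Found {len(hits)} relevant segment(s) for: {question!r}"

    memory = VideoMemory()
    memory.add(SegmentRecord(0.0, 5.0, "A dog runs across the yard", object_ids=[1]))
    print(answer_question(memory, "What does the dog do?"))

The point of the split is that the expensive perception work happens once per video, while the inference phase only issues cheap queries against the resulting memory.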

Quick Start & Requirements

  • Installation: Create a conda environment named videoagent with conda env create -f environment.yaml. The Video-LLaVA repository is cloned separately, and its videollava environment is also required; install dependencies with pip install -e . and pip install -e ".[train]", plus flash-attn, decord, opencv-python, and pytorchvideo (see the consolidated command sequence after this list).
  • Prerequisites: Ubuntu 20.04 with an NVIDIA RTX 4090 (24 GB) is the recommended setup. An OpenAI API key is required.
  • Setup: Download and unzip cache_dir.zip and tool_models.zip into the VideoAgent directory.
  • Usage: Start the Video-LLaVA server (conda activate videollava && python video-llava.py), then run the demo (conda activate videoagent && python demo.py). Batch inference is available via python main.py.
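Putting the bullets together, the setup and run sequence looks roughly like this. It is a sketch assembled from the steps above; the location of the Video-LLaVA clone and the exact flash-attn install invocation are assumptions, so check both projects' READMEs before running.

    # one-time setup: VideoAgent environment
    conda env create -f environment.yaml

    # separate Video-LLaVA clone and its videollava environment (per that project's README)
    pip install -e .
    pip install -e ".[train]"
    pip install flash-attn decord opencv-python pytorchvideo

    # unzip cache_dir.zip and tool_models.zip into the VideoAgent directory

    # run (two terminals)
    conda activate videollava && python video-llava.py    # terminal 1: Video-LLaVA server
    conda activate videoagent && python demo.py            # terminal 2: interactive demo
    conda activate videoagent && python main.py            # or: batch inference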

Highlighted Details

  • Implements a memory-augmented multimodal agent for video understanding.
  • Features a two-phase process: memory construction and LLM-based inference.
  • Outputs the final answer along with an object re-identification replay and chain-of-thought logs.

Maintenance & Community

The project accompanies an ECCV 2024 paper by Yue Fan et al. The repository is hosted on GitHub at YueFan1014/VideoAgent.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use is not specified.

Limitations & Caveats

The project has only been tested on one hardware configuration (Ubuntu 20.04, RTX 4090 with 24 GB), so performance and compatibility on other systems are unverified. The dependency on an external OpenAI API key implies usage costs and potential privacy considerations.

Health Check

  • Last Commit: 9 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 14 stars in the last 30 days

Explore Similar Projects

Starred by Matei Zaharia (cofounder of Databricks), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 9 more.

LWM by LargeWorldModel
Multimodal autoregressive model for long-context video/text
Top 0.1% on SourcePulse · 7k stars
Created 1 year ago · Updated 11 months ago