YueFan1014/VideoAgent: Multimodal agent for video understanding
VideoAgent addresses the challenge of video understanding by providing a memory-augmented multimodal agent capable of answering user questions about video content. It is designed for researchers and practitioners in computer vision and natural language processing who need to perform complex video analysis and question-answering tasks. The primary benefit is its ability to systematically process video information into a structured memory, which is then leveraged by a large language model (LLM) for accurate and context-aware responses.
How It Works
VideoAgent operates in two distinct phases: memory construction and inference. During the memory construction phase, it extracts structured information from the input video. This extracted information is then stored in a memory system. In the inference phase, an LLM interacts with this memory using a set of tools. This two-phase approach allows for a separation of concerns, enabling efficient information retrieval and sophisticated reasoning over video content.
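The two phases above can be sketched in a few lines of Python. This is only an illustrative sketch, not VideoAgent's actual code: the names `Segment`, `VideoMemory`, `build_memory`, `search_segments`, and `answer` are hypothetical, the memory is assumed to hold one caption per fixed-length segment, and simple keyword overlap stands in for whatever retrieval tools the real system exposes to the LLM.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Segment:
    """One entry of the structured memory (hypothetical schema)."""
    start: float  # segment start time in seconds
    end: float    # segment end time in seconds
    caption: str  # textual description extracted from the video


@dataclass
class VideoMemory:
    segments: list[Segment] = field(default_factory=list)


def build_memory(captions: list[str], segment_seconds: float = 2.0) -> VideoMemory:
    """Phase 1 (memory construction): store one caption per fixed-length segment."""
    memory = VideoMemory()
    for i, caption in enumerate(captions):
        start = i * segment_seconds
        memory.segments.append(Segment(start, start + segment_seconds, caption))
    return memory


def _tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search_segments(memory: VideoMemory, query: str, top_k: int = 1) -> list[Segment]:
    """Tool: rank segments by keyword overlap with the query (stand-in retriever)."""
    query_tokens = _tokens(query)
    scored = [(len(query_tokens & _tokens(s.caption)), s) for s in memory.segments]
    # Stable sort: ties keep their original temporal order.
    scored.sort(key=lambda pair: -pair[0])
    return [s for score, s in scored[:top_k] if score > 0]


def answer(memory: VideoMemory, question: str, llm: Callable[[str], str]) -> str:
    """Phase 2 (inference): the LLM reasons over tool output, not the raw video."""
    hits = search_segments(memory, question)
    context = "\n".join(f"[{s.start:.0f}-{s.end:.0f}s] {s.caption}" for s in hits)
    return llm(f"Context:\n{context}\nQuestion: {question}")
```

Passing `llm` as a plain callable keeps the sketch independent of any particular model API. In a real setup, it would wrap a call to the hosted LLM, and the memory would be built once per video and reused across questions.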
Quick Start & Requirements
Create the videoagent environment using conda env create -f environment.yaml. Note that the Video-LLaVA repository is cloned separately, but its environment (videollava) is required. Install dependencies using pip install -e . and pip install -e ".[train]", along with flash-attn, decord, opencv-python, and pytorchvideo.
Unzip cache_dir.zip and tool_models.zip into the VideoAgent directory.
First run Video-LLaVA (conda activate videollava && python video-llava.py), then run the demo (conda activate videoagent && python demo.py). Batch inference is available via python main.py.
Highlighted Details
Maintenance & Community
The project is associated with an ECCV 2024 paper by Yue Fan et al. The repository is hosted on GitHub at YueFan1014/VideoAgent.
Licensing & Compatibility
The README does not explicitly state the license. Compatibility for commercial use is not specified.
Limitations & Caveats
The project is tested on one hardware configuration (Ubuntu 20.04, RTX 4090 24GB), so performance and compatibility on other systems are unverified. It depends on an external OpenAI API key, which implies usage costs and potential privacy considerations.