VideoAgent by HKUDS

All-in-one agentic framework for video intelligence

Created 3 months ago
250 stars

Top 100.0% on SourcePulse

View on GitHub
Project Summary

VideoAgent is an all-in-one framework designed for comprehensive video intelligence, enabling users to understand, edit, and generate video content through a natural language interface. It targets creators, researchers, and power users seeking to streamline video production and analysis without requiring deep technical expertise. The primary benefit is a seamless, conversational AI experience for complex video manipulation and creation tasks.

How It Works

VideoAgent introduces three core innovations: Intent Analysis, which decomposes user instructions into explicit and implicit sub-intents for nuanced understanding; Autonomous Tool Use & Planning, which employs a graph-powered framework for dynamic workflow generation and adaptive feedback loops to coordinate multiple agents; and Multi-Modal Understanding, which transforms raw input into semantically aligned visual queries for improved retrieval and processing of video content. This integrated approach allows sophisticated video manipulation and generation to be driven purely by dialogue.
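A minimal sketch of how such a pipeline could be wired together, assuming hypothetical names (SubIntent, WorkflowNode, analyze_intent, plan_workflow, execute_with_feedback) that do not come from the VideoAgent codebase:

```python
# Illustrative sketch only -- names and structure are hypothetical,
# not taken from the VideoAgent repository.
from dataclasses import dataclass, field


@dataclass
class SubIntent:
    description: str
    explicit: bool  # explicit vs. implicit sub-intent


@dataclass
class WorkflowNode:
    tool: str                          # e.g. "transcribe", "tts", "video_edit"
    inputs: list = field(default_factory=list)


def analyze_intent(instruction: str) -> list[SubIntent]:
    """Decompose a user instruction into sub-intents (stubbed here)."""
    return [SubIntent(description=instruction, explicit=True)]


def plan_workflow(intents: list[SubIntent]) -> list[WorkflowNode]:
    """Map sub-intents onto a tool graph; a real planner would query the LLM router."""
    return [WorkflowNode(tool="video_retrieval"), WorkflowNode(tool="video_edit")]


def run_tool(node: WorkflowNode) -> bool:
    """Placeholder tool invocation."""
    return True


def execute_with_feedback(workflow: list[WorkflowNode], max_retries: int = 3) -> bool:
    """Run the workflow, retrying on failure (adaptive feedback loop)."""
    for _ in range(max_retries):
        if all(run_tool(node) for node in workflow):
            return True
    return False


if __name__ == "__main__":
    intents = analyze_intent("Cut the highlights and add a voice-over")
    workflow = plan_workflow(intents)
    print("success:", execute_with_feedback(workflow))
```

In the actual framework, planning and the feedback loop are driven by the LLM backbone (the Agentic Graph Router) rather than the stubs above.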

Quick Start & Requirements

  • Installation: Clone the repository (git clone https://github.com/HKUDS/VideoAgent.git), create and activate a Conda environment (conda create --name videoagent python=3.10, conda activate videoagent), install prerequisites (conda install -y -c conda-forge pynini==2.1.5 ffmpeg, pip install -r requirements.txt).
  • Prerequisites: Requires GPU memory of at least 8GB. Supports Linux and Windows. Specific model downloads are necessary for various features (CosyVoice, fish-speech, seed-vc, DiffSinger, Whisper, ImageBind). git-lfs must be installed.
  • LLM Configuration: Requires API keys for Claude (essential for the Agentic Graph Router), GPT, and Gemini, configured in VideoAgent/environment/config/config.yml (see the sketch after this list).
  • Documentation: Links are provided to the demos documentation and the project's Bilibili homepage.
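As a rough illustration of these prerequisites, a pre-flight check might verify available GPU memory and the presence of LLM keys in config.yml. The key names below are assumptions for illustration, not the repository's actual configuration schema:

```python
# Hypothetical pre-flight check; key names are illustrative, not the
# actual schema of VideoAgent/environment/config/config.yml.
import yaml
import torch


def check_gpu(min_gb: float = 8.0) -> None:
    """Verify a CUDA GPU with at least min_gb of memory is present."""
    if not torch.cuda.is_available():
        raise RuntimeError("CUDA GPU required")
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if total_gb < min_gb:
        raise RuntimeError(f"Need >= {min_gb} GB GPU memory, found {total_gb:.1f} GB")


def load_llm_keys(path: str = "VideoAgent/environment/config/config.yml") -> dict:
    """Load the config file and confirm each LLM provider has an API key set."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for provider in ("claude", "gpt", "gemini"):  # assumed key names
        if not cfg.get(provider, {}).get("api_key"):
            raise ValueError(f"Missing API key for {provider}")
    return cfg


if __name__ == "__main__":
    check_gpu()
    load_llm_keys()
    print("environment looks OK")
```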

Highlighted Details

  • Achieves "boundless creativity" through automatic workflow construction, outperforming baselines on audio and video datasets, with superior and stable performance under the Claude 3.7 backbone compared to GPT-4o and DeepSeek-V3.
  • Demonstrates superior multimodal understanding in text-to-video retrieval experiments, accurately retrieving video segments as measured by Recall, Embedding Matching, and temporal Intersection over Union (see the IoU sketch after this list).
  • Exhibits significant self-improvement through iterative refinement, achieving consistent workflow composition success rates of 0.95 across various configurations due to its adaptive reflection mechanism.
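For reference, temporal Intersection over Union on retrieved segments is the standard overlap-over-union of time intervals; this snippet is illustrative and not taken from the repository:

```python
# Temporal IoU between a predicted and ground-truth segment, given as
# (start, end) times in seconds. Standard definition, not repository code.
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Retrieved segment [12.0, 20.0] vs. ground truth [10.0, 18.0] -> 0.6
print(temporal_iou((12.0, 20.0), (10.0, 18.0)))
```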

Maintenance & Community

No specific details regarding maintainers, community channels (like Discord/Slack), sponsorships, or roadmaps are provided in the README. The project acknowledges contributions from the open-source community and various service providers (CosyVoice, Fish Speech, Seed-VC, DiffSinger, VideoRAG, ImageBind, Whisper, Librosa).

Licensing & Compatibility

The README does not explicitly state the software license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The framework relies heavily on external LLM APIs (Claude, GPT, Gemini), requiring valid API keys and potentially incurring costs. All video content used in demos is sourced from the internet for research purposes only, with a note to contact the developers if intellectual property rights are infringed. Specific model dependencies need to be downloaded manually.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 3

Star History

36 stars in the last 30 days
