CutClaw by GVCLab

Agentic video editor that synchronizes hours of footage with music

Created 3 weeks ago


447 stars

Top 67.0% on SourcePulse

View on GitHub
Project Summary

CutClaw is an end-to-end editing system that automates the creation of cinematic montages from hours of raw video and audio footage, synchronized precisely with music. It targets users who need to process large volumes of video efficiently, offering automated deconstruction, instruction-driven editing, smart cropping, and music-aware synchronization for high-quality, rhythmically aligned output.

How It Works

CutClaw deconstructs raw video and audio into structured, searchable assets. A multi-agent pipeline then plans shots, selects optimal clip timestamps based on extracted musical beats and energy signals, and validates quality before rendering. This approach enables rhythm-aware cuts and instruction-guided editing styles, automating complex editing tasks by leveraging large language and vision models.
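The planning-and-selection step described above can be sketched in a few lines. Everything here is illustrative, not CutClaw's actual API: the `Clip` type and `select_clips` helper are hypothetical names for the idea of matching candidate clip timestamps against extracted musical beats.

```python
# Hypothetical sketch of the selection stage (names are illustrative,
# not CutClaw's real interfaces): each deconstructed clip carries a
# timestamp range and a caption; selection snaps clips to nearby beats.
from dataclasses import dataclass

@dataclass
class Clip:
    start: float   # seconds into the source footage
    end: float
    caption: str   # description produced during deconstruction

def select_clips(clips, beats, window=0.5):
    """For each beat, pick the clip whose start time lies closest to it,
    keeping it only if the gap is within `window` seconds."""
    picks = []
    for beat in beats:
        best = min(clips, key=lambda c: abs(c.start - beat))
        if abs(best.start - beat) <= window:
            picks.append((beat, best))
    return picks

clips = [Clip(0.0, 2.1, "wide shot"), Clip(2.4, 4.0, "close-up"),
         Clip(4.2, 6.0, "action"), Clip(6.1, 8.0, "reaction")]
beats = [0.0, 2.5, 4.1, 6.0]
print([(b, c.caption) for b, c in select_clips(clips, beats)])
# → [(0.0, 'wide shot'), (2.5, 'close-up'), (4.1, 'action'), (6.0, 'reaction')]
```

In the real system this matching is driven by the multi-agent planner and validated before rendering; the sketch only shows the rhythm-alignment idea.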

Quick Start & Requirements

  • Primary install / run command: Clone the repository, create and activate a Conda environment (conda create -n CutClaw python=3.12, conda activate CutClaw), and install dependencies (pip install -r requirements.txt). A GPU-accelerated Decord/NVDEC build is strongly recommended for faster video decoding.
  • Non-default prerequisites and dependencies: Python 3.12, a CUDA-capable GPU (recommended for performance), and API keys for the recommended LLMs (e.g., Gemini, Qwen, GPT, MiniMax, Kimi, Claude) are required. Users must provide their own video, audio, and optional subtitle files.
  • Estimated setup time or resource footprint: Initial footage deconstruction can be time-consuming as it involves ASR, captioning, and scene analysis, but results are cached for subsequent edits. Runtime API latency can be significant due to numerous concurrent requests.
  • Links: GitHub Repository, arXiv Paper.
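The install steps listed above, as a single sequence (the repository URL is a placeholder; substitute the project's actual GitHub URL):

```shell
# Clone the repo (URL is illustrative), then set up the Conda environment
git clone https://github.com/<org>/CutClaw.git
cd CutClaw

# Create and activate the Python 3.12 environment
conda create -n CutClaw python=3.12
conda activate CutClaw

# Install Python dependencies; a GPU-accelerated Decord/NVDEC build
# is strongly recommended for faster video decoding
pip install -r requirements.txt
```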

Highlighted Details

  • One-Click Deconstruction: Transforms hours-long raw video and audio into structured, searchable assets with a single click.
  • Instruction Control: Requires only one text instruction to steer the editing style, enabling diverse outputs like fast-paced character montages or slow-paced emotional narratives.
  • Smart Auto-Cropping: Content-aware cropping automatically identifies core subjects and adjusts aspect ratios for various social media platforms.
  • Music-Aware Sync: Extracts musical beats and energy signals to build rhythm-aware cuts that precisely match the music's pacing.
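The music-analysis idea behind Music-Aware Sync can be illustrated with a short-time energy envelope. The sketch below is a minimal, dependency-free approximation, not CutClaw's implementation: it uses synthetic audio in place of a real track and simple peak-picking in place of a proper beat tracker.

```python
import math

def energy_envelope(samples, frame=1024):
    """Short-time energy per frame: the kind of signal a music-aware
    editor can use to locate loud, rhythmically salient moments."""
    return [sum(s * s for s in samples[i:i + frame]) / frame
            for i in range(0, len(samples) - frame + 1, frame)]

def peak_frames(env, threshold=None):
    """Frames that are local maxima above a threshold: crude beat candidates."""
    if threshold is None:
        threshold = sum(env) / len(env)
    return [i for i in range(1, len(env) - 1)
            if env[i] > env[i - 1] and env[i] >= env[i + 1] and env[i] > threshold]

# Synthetic "music": a quiet 220 Hz tone with loud bursts every 8192 samples.
sr = 8000
samples = [0.1 * math.sin(2 * math.pi * 220 * t / sr) for t in range(sr * 4)]
for b in range(0, len(samples), 8192):
    for t in range(b, min(b + 1024, len(samples))):
        samples[t] *= 8.0  # burst -> high-energy frame

env = energy_envelope(samples)
print(peak_frames(env))  # → [8, 16, 24]
```

A production system would instead run a real beat tracker and combine beat times with the energy curve; the point here is only that cut points can be derived from peaks in the music's energy.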

Maintenance & Community

No specific details regarding notable contributors, sponsorships, partnerships, or community channels (e.g., Discord, Slack) are provided in the README.

Licensing & Compatibility

The README does not state the project's license. Until this is clarified, commercial use and closed-source integration cannot be assessed.

Limitations & Caveats

The system can experience slow runtime API latency due to a high volume of concurrent requests to external vision/language APIs. The initial processing of footage for deconstruction is a one-time cost per video and can be lengthy. Video codec compatibility may be an issue; videos encoded with libx264 are reported to work reliably.

Health Check
Last Commit

2 days ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
2
Star History
453 stars in the last 24 days

