livecc by showlab

Video LLM with real-time commentary

Created 6 months ago
269 stars

Top 95.5% on SourcePulse

Project Summary

LiveCC is a video Large Language Model (LLM) designed for real-time commentary and analysis of video content. It addresses the challenge of processing streaming video and audio data efficiently, enabling applications like live video summarization and interactive video exploration. The project targets researchers and developers working with multimodal AI and video understanding.

How It Works

LiveCC introduces a video-ASR streaming method that weaves speech transcription directly into the video token stream, so the model consumes video frames and their corresponding audio transcript segments together; this alignment is what enables real-time understanding and commentary. The architecture builds on Qwen2-VL-7B, relies on flash-attn for attention and the Liger kernel's fused ops for training efficiency, and is trained on large-scale data including the Live-CC-5M corpus and additional SFT datasets.
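
As a rough orientation, the sketch below shows how a Qwen2-VL-compatible checkpoint is typically loaded and queried through Hugging Face Transformers. The checkpoint id, prompt, and fps value are illustrative assumptions, not taken from the README; the repo's inference.md documents the actual entry points.

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    # Illustrative checkpoint id -- check the repo for the released weights.
    model_id = "chenjoya/LiveCC-7B-Instruct"
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto",
        attn_implementation="flash_attention_2",
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # Ask for commentary over a local clip, sampled at 2 frames/second.
    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 2.0},
            {"type": "text", "text": "Provide live commentary for this clip."},
        ],
    }]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    _, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], videos=video_inputs, padding=True,
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)[0])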

Quick Start & Requirements

  • Installation: pip install livecc-utils==0.0.2, plus the other dependencies listed in the README (consolidated into the commands after this list).
  • Prerequisites: Python >= 3.11; PyTorch; transformers <= 4.51.3; accelerate; deepspeed; flash-attn; gradio; opencv; decord; datasets; tensorboard; pillow-heif; gpustat; timm; sentencepiece; openai; av == 12.0.0; qwen_vl_utils; liger_kernel; numpy == 1.24.4. BF16/TF32 support is recommended for training.
  • Demo: Run python demo/app.py for the Gradio demo or python demo/cli.py for CLI inference.
  • Documentation: Inference details are in inference.md.
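
Taken together, the bullets above reduce to a short session; note that livecc-utils is only one pinned dependency, so install the rest from the README first:

    # Streaming utilities pinned in the README
    pip install livecc-utils==0.0.2

    # Browser demo (Gradio)
    python demo/app.py

    # Command-line inference
    python demo/cli.py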

Highlighted Details

  • Achieves State-of-the-Art (SOTA) performance on both streaming and offline video benchmarks.
  • Supports real-time commentary generation.
  • Trained on a novel 5M video-ASR dataset (Live-CC-5M).
  • Uses the Liger kernel's fused Triton ops to cut training memory use and raise throughput (see the patching sketch after this list).
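
For the Liger point above, the snippet below is a generic sketch of how liger-kernel is usually applied to a Qwen2-VL model via its monkey-patching API; it is not code from this repo, and the base checkpoint id is only an example.

    from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl
    from transformers import Qwen2VLForConditionalGeneration

    # Swap Qwen2-VL's modules (RMSNorm, SwiGLU, fused cross-entropy, ...)
    # for Liger's Triton kernels; call this before the model is built.
    apply_liger_kernel_to_qwen2_vl()

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="bfloat16"
    )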

Maintenance & Community

The project is associated with a CVPR 2025 paper. Beyond the repository itself, the README does not document community channels.

Licensing & Compatibility

The README does not state a license. Because LiveCC builds on Qwen2-VL-7B, it inherits whatever terms apply to that base model, so verify those terms before any commercial use.

Limitations & Caveats

The README notes that MVBench and OVOBench evaluations are still pending because the authors have not had time to run them. The provided training scripts target single-node setups and need adjustment for multi-node distributed training. GPT-4o-based evaluation results can vary slightly between runs because the judge model's output is not deterministic.
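
On the multi-node caveat: the usual adjustment is to launch the same training entry point on every node via torchrun with rank and rendezvous flags, as sketched below; the script name and DeepSpeed config are placeholders, not the repo's actual files.

    # Run once per node; RANK and MASTER_ADDR come from your scheduler.
    torchrun \
      --nnodes 2 --nproc_per_node 8 \
      --node_rank "$RANK" \
      --master_addr "$MASTER_ADDR" --master_port 29500 \
      train.py --deepspeed ds_config.json  # placeholder file names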

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 30 days
