livecc by showlab

Video LLM with real-time commentary

Created 6 months ago
269 stars

Top 95.5% on SourcePulse

Project Summary

LiveCC is a video Large Language Model (LLM) designed for real-time commentary and analysis of video content. It addresses the challenge of processing streaming video and audio data efficiently, enabling applications like live video summarization and interactive video exploration. The project targets researchers and developers working with multimodal AI and video understanding.

How It Works

LiveCC introduces a video-ASR streaming method that weaves speech transcription directly into the video token stream, so the model consumes video frames and their corresponding audio transcript segments together; this alignment is what enables real-time understanding and commentary. The architecture builds on Qwen2-VL-7B, relies on flash-attn for attention and the Liger kernel's fused ops for training efficiency, and is trained on large-scale data including the Live-CC-5M corpus and additional SFT datasets.
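
As a rough orientation, the sketch below shows how a Qwen2-VL-compatible checkpoint is typically loaded and queried through Hugging Face Transformers. The checkpoint id, prompt, and fps value are illustrative assumptions, not taken from the README; the repo's inference.md documents the actual entry points.

    from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    # Illustrative checkpoint id -- check the repo for the released weights.
    model_id = "chenjoya/LiveCC-7B-Instruct"
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto",
        attn_implementation="flash_attention_2",
    )
    processor = AutoProcessor.from_pretrained(model_id)

    # Ask for commentary over a local clip, sampled at 2 frames/second.
    messages = [{
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 2.0},
            {"type": "text", "text": "Provide live commentary for this clip."},
        ],
    }]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    _, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], videos=video_inputs, padding=True,
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=256)
    print(processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)[0])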

Quick Start & Requirements

  • Installation: pip install livecc-utils==0.0.2, plus the other dependencies listed in the README (consolidated into the commands after this list).
  • Prerequisites: Python >= 3.11; PyTorch; transformers <= 4.51.3; accelerate; deepspeed; flash-attn; gradio; opencv; decord; datasets; tensorboard; pillow-heif; gpustat; timm; sentencepiece; openai; av == 12.0.0; qwen_vl_utils; liger_kernel; numpy == 1.24.4. BF16/TF32 support is recommended for training.
  • Demo: Run python demo/app.py for the Gradio demo or python demo/cli.py for CLI inference.
  • Documentation: Inference details are in inference.md.
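
Taken together, the bullets above reduce to a short session; note that livecc-utils is only one pinned dependency, so install the rest from the README first:

    # Streaming utilities pinned in the README
    pip install livecc-utils==0.0.2

    # Browser demo (Gradio)
    python demo/app.py

    # Command-line inference
    python demo/cli.py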

Highlighted Details

  • Achieves State-of-the-Art (SOTA) performance on both streaming and offline video benchmarks.
  • Supports real-time commentary generation.
  • Trained on a novel 5M video-ASR dataset (Live-CC-5M).
  • Uses the Liger kernel's fused Triton ops to cut training memory use and raise throughput (see the patching sketch after this list).
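
For the Liger point above, the snippet below is a generic sketch of how liger-kernel is usually applied to a Qwen2-VL model via its monkey-patching API; it is not code from this repo, and the base checkpoint id is only an example.

    from liger_kernel.transformers import apply_liger_kernel_to_qwen2_vl
    from transformers import Qwen2VLForConditionalGeneration

    # Swap Qwen2-VL's modules (RMSNorm, SwiGLU, fused cross-entropy, ...)
    # for Liger's Triton kernels; call this before the model is built.
    apply_liger_kernel_to_qwen2_vl()

    model = Qwen2VLForConditionalGeneration.from_pretrained(
        "Qwen/Qwen2-VL-7B-Instruct", torch_dtype="bfloat16"
    )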

Maintenance & Community

The project is associated with a CVPR 2025 paper. Beyond the repository itself, the README does not document community channels.

Licensing & Compatibility

The README does not state a license. Because LiveCC builds on Qwen2-VL-7B, it inherits whatever terms apply to that base model, so verify those terms before any commercial use.

Limitations & Caveats

The README notes that MVBench and OVOBench evaluations are still pending because the authors have not had time to run them. The provided training scripts target single-node setups and need adjustment for multi-node distributed training. GPT-4o-based evaluation results can vary slightly between runs because the judge model's output is not deterministic.
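
On the multi-node caveat: the usual adjustment is to launch the same training entry point on every node via torchrun with rank and rendezvous flags, as sketched below; the script name and DeepSpeed config are placeholders, not the repo's actual files.

    # Run once per node; RANK and MASTER_ADDR come from your scheduler.
    torchrun \
      --nnodes 2 --nproc_per_node 8 \
      --node_rank "$RANK" \
      --master_addr "$MASTER_ADDR" --master_port 29500 \
      train.py --deepspeed ds_config.json  # placeholder file names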

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 30 days
