Streaming video LLM for online interaction within a video stream
VideoLLM-online addresses the challenge of real-time interaction with streaming video content, enabling LLMs to process and respond to video as it unfolds. This is crucial for applications requiring dynamic understanding and action, such as live monitoring, interactive tutorials, or assistive technologies. The project targets researchers and developers building next-generation multimodal AI systems.
How It Works
The core innovation lies in its "online" processing approach, allowing continuous interaction with video streams rather than requiring full video pre-processing. It achieves high inference speeds (5-15 FPS) through parallelized asynchronous processing of video encoding, frame-based LLM forwarding, and response generation. The model is trained on synthesized streaming dialogue data derived from offline datasets, making the training process scalable and cost-effective.
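The three parallel stages described above can be pictured as a queue-connected pipeline. The following is a minimal sketch under stated assumptions, not the project's implementation: the function names, the string "embeddings", and the should_respond rule are all hypothetical stand-ins, and plain Python threads replace whatever concurrency mechanism the project actually uses.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream on each queue


def encode_frames(frames, encoded_q):
    """Stage 1: hypothetical stand-in for the video encoder."""
    for idx, frame in enumerate(frames):
        encoded_q.put((idx, f"emb({frame})"))
    encoded_q.put(SENTINEL)


def forward_frames(encoded_q, trigger_q, should_respond):
    """Stage 2: hypothetical stand-in for frame-based LLM forwarding.

    Each encoded frame is consumed as soon as it arrives; frames flagged by
    should_respond (an assumed rule) are handed to the generation stage.
    """
    while (item := encoded_q.get()) is not SENTINEL:
        idx, emb = item
        if should_respond(idx, emb):
            trigger_q.put((idx, emb))
    trigger_q.put(SENTINEL)


def generate_responses(trigger_q, responses):
    """Stage 3: hypothetical stand-in for response generation."""
    while (item := trigger_q.get()) is not SENTINEL:
        idx, emb = item
        responses.append(f"frame {idx}: response to {emb}")


def run_pipeline(frames, should_respond):
    """Run the three stages concurrently and collect the responses."""
    encoded_q = queue.Queue(maxsize=8)  # bounded, so encoding cannot run far ahead
    trigger_q = queue.Queue()
    responses = []
    stages = [
        threading.Thread(target=encode_frames, args=(frames, encoded_q)),
        threading.Thread(target=forward_frames, args=(encoded_q, trigger_q, should_respond)),
        threading.Thread(target=generate_responses, args=(trigger_q, responses)),
    ]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()
    return responses
```

Because each stage only blocks on its own input queue, encoding of later frames can proceed while a response to an earlier frame is still being generated, which is the property the throughput claim depends on.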
Quick Start & Requirements
Demo: python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus (or add --attn_implementation sdpa for potential flash-attn issues).
CLI: python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
Dependencies: transformers, accelerate, deepspeed, peft, editdistance, Levenshtein, tensorboard, gradio, moviepy, submitit, flash-attn. For audio streaming, clone and install ChatTTS and its dependencies (omegaconf, vocos, vector_quantize_pytorch, cython). Requires a recent ffmpeg installation.
Data preprocessing: data/preprocess/.
Highlighted Details
Maintenance & Community
The project is associated with CVPR 2024. Further resources and training data are available via the LiveCC Webpage.
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The README mentions potential bugs with flash-attn, suggesting an alternative sdpa implementation. The Hugging Face Spaces demo is noted as potentially too slow. The project relies on synthesized data, and its performance characteristics on real-world, diverse streaming data are not detailed.