videollm-online by showlab

Streaming video LLM for online interaction within a video stream

created 1 year ago
512 stars

Top 61.9% on sourcepulse

Project Summary

VideoLLM-online addresses the challenge of real-time interaction with streaming video content, enabling LLMs to process and respond to video as it unfolds. This is crucial for applications requiring dynamic understanding and action, such as live monitoring, interactive tutorials, or assistive technologies. The project targets researchers and developers building next-generation multimodal AI systems.

How It Works

The core innovation lies in its "online" processing approach, allowing continuous interaction with video streams rather than requiring full video pre-processing. It achieves high inference speeds (5-15 FPS) through parallelized asynchronous processing of video encoding, frame-based LLM forwarding, and response generation. The model is trained on synthesized streaming dialogue data derived from offline datasets, making the training process scalable and cost-effective.
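
The sketch below is a minimal illustration of that producer/consumer pattern, not the project's actual code: encode_frame, forward_frame, and maybe_respond are hypothetical stubs standing in for the visual encoder, the per-frame LLM forward pass, and the streaming response head.

    # Minimal sketch of the parallelized streaming pattern described above (hypothetical
    # stubs, not the repository's API): an encoder thread keeps ingesting frames while an
    # LLM thread forwards them against a running KV cache and replies whenever triggered.
    import queue
    import threading
    import time

    def encode_frame(frame):
        # Stand-in for the visual encoder: turn a raw frame into frame tokens/embeddings.
        return {"frame_id": frame, "embedding": [0.0] * 8}

    def forward_frame(kv_cache, frame_tokens):
        # Stand-in for appending one frame's tokens to the LLM's KV cache.
        kv_cache.append(frame_tokens["frame_id"])
        return kv_cache

    def maybe_respond(kv_cache):
        # Stand-in for the streaming head deciding whether to speak after this frame.
        return f"[reply after frame {kv_cache[-1]}]" if kv_cache[-1] % 3 == 0 else None

    def encoder_worker(frame_source, token_q):
        # Producer: encode frames as they arrive, independently of LLM speed.
        for frame in frame_source:
            token_q.put(encode_frame(frame))
            time.sleep(0.01)  # simulate a live stream
        token_q.put(None)     # end-of-stream sentinel

    def llm_worker(token_q, out_q):
        # Consumer: forward the LLM frame by frame, emitting a response only when triggered.
        kv_cache = []
        while (frame_tokens := token_q.get()) is not None:
            kv_cache = forward_frame(kv_cache, frame_tokens)
            if (reply := maybe_respond(kv_cache)) is not None:
                out_q.put(reply)
        out_q.put(None)

    if __name__ == "__main__":
        token_q, out_q = queue.Queue(), queue.Queue()
        threading.Thread(target=encoder_worker, args=(range(1, 10), token_q), daemon=True).start()
        threading.Thread(target=llm_worker, args=(token_q, out_q), daemon=True).start()
        while (reply := out_q.get()) is not None:
            print(reply)  # responses stream out while later frames are still being encoded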

Quick Start & Requirements

  • Demo: python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus (add --attn_implementation sdpa if flash-attn causes issues).
  • CLI: python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
  • Prerequisites: Python >= 3.10, Miniconda, PyTorch with CUDA 12.1, transformers, accelerate, deepspeed, peft, editdistance, Levenshtein, tensorboard, gradio, moviepy, submitit, flash-attn. For audio streaming, clone and install ChatTTS and its dependencies (omegaconf, vocos, vector_quantize_pytorch, cython). Requires a recent ffmpeg installation.
  • Resources: Demo available at Hugging Face Spaces.
  • Setup: Installation involves conda and pip commands. Data preprocessing is detailed in data/preprocess/.

Highlighted Details

  • Achieves 5-15 FPS on NVIDIA 3090/A100 GPUs for 10-minute videos.
  • Supports online interaction, proactively updating responses during a stream.
  • Trained entirely on Llama-synthesized streaming dialogue data.
  • Can be adapted for Mistral models by changing the base LLM class (see the sketch below).
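
As a hedged illustration of the Mistral note above: the mixin and class names below are assumptions, not the repository's real identifiers; they only show the idea that the streaming logic sits on top of a Hugging Face causal-LM class, so re-targeting Mistral amounts to swapping the parent class.

    # Hypothetical sketch only: StreamingMixin and the Live* class names are assumed,
    # not the repo's actual API. The point is that "changing the base LLM class" means
    # inheriting the same streaming logic from a different Hugging Face causal LM.
    from transformers import LlamaForCausalLM, MistralForCausalLM

    class StreamingMixin:
        """Placeholder for the shared online/streaming dialogue logic."""
        pass

    class LiveLlamaForCausalLM(StreamingMixin, LlamaForCausalLM):
        """Streaming model built on Llama, as in the released 8B checkpoint."""
        pass

    class LiveMistralForCausalLM(StreamingMixin, MistralForCausalLM):
        """The same streaming logic re-based on Mistral."""
        pass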

Maintenance & Community

The accompanying paper was published at CVPR 2024. Further resources and training data are available via the LiveCC webpage.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions potential bugs with flash-attn and suggests the sdpa attention implementation as a workaround. The hosted Hugging Face Spaces demo is noted as potentially too slow. The model is trained entirely on synthesized streaming dialogue data, and its performance on real-world, diverse streaming data is not detailed.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 62 stars in the last 90 days
