videollm-online by showlab

Streaming video LLM for online interaction within a video stream

created 1 year ago
512 stars

Top 61.9% on sourcepulse

Project Summary

VideoLLM-online addresses the challenge of real-time interaction with streaming video content, enabling LLMs to process and respond to video as it unfolds. This is crucial for applications requiring dynamic understanding and action, such as live monitoring, interactive tutorials, or assistive technologies. The project targets researchers and developers building next-generation multimodal AI systems.

How It Works

The core innovation lies in its "online" processing approach, allowing continuous interaction with video streams rather than requiring full video pre-processing. It achieves high inference speeds (5-15 FPS) through parallelized asynchronous processing of video encoding, frame-based LLM forwarding, and response generation. The model is trained on synthesized streaming dialogue data derived from offline datasets, making the training process scalable and cost-effective.
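
The sketch below is a minimal illustration of that producer/consumer pattern, not the project's actual code: encode_frame, forward_frame, and maybe_respond are hypothetical stubs standing in for the visual encoder, the per-frame LLM forward pass, and the streaming response head.

    # Minimal sketch of the parallelized streaming pattern described above (hypothetical
    # stubs, not the repository's API): an encoder thread keeps ingesting frames while an
    # LLM thread forwards them against a running KV cache and replies whenever triggered.
    import queue
    import threading
    import time

    def encode_frame(frame):
        # Stand-in for the visual encoder: turn a raw frame into frame tokens/embeddings.
        return {"frame_id": frame, "embedding": [0.0] * 8}

    def forward_frame(kv_cache, frame_tokens):
        # Stand-in for appending one frame's tokens to the LLM's KV cache.
        kv_cache.append(frame_tokens["frame_id"])
        return kv_cache

    def maybe_respond(kv_cache):
        # Stand-in for the streaming head deciding whether to speak after this frame.
        return f"[reply after frame {kv_cache[-1]}]" if kv_cache[-1] % 3 == 0 else None

    def encoder_worker(frame_source, token_q):
        # Producer: encode frames as they arrive, independently of LLM speed.
        for frame in frame_source:
            token_q.put(encode_frame(frame))
            time.sleep(0.01)  # simulate a live stream
        token_q.put(None)     # end-of-stream sentinel

    def llm_worker(token_q, out_q):
        # Consumer: forward the LLM frame by frame, emitting a response only when triggered.
        kv_cache = []
        while (frame_tokens := token_q.get()) is not None:
            kv_cache = forward_frame(kv_cache, frame_tokens)
            if (reply := maybe_respond(kv_cache)) is not None:
                out_q.put(reply)
        out_q.put(None)

    if __name__ == "__main__":
        token_q, out_q = queue.Queue(), queue.Queue()
        threading.Thread(target=encoder_worker, args=(range(1, 10), token_q), daemon=True).start()
        threading.Thread(target=llm_worker, args=(token_q, out_q), daemon=True).start()
        while (reply := out_q.get()) is not None:
            print(reply)  # responses stream out while later frames are still being encoded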

Quick Start & Requirements

  • Demo: python -m demo.app --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus (add --attn_implementation sdpa if flash-attn causes issues).
  • CLI: python -m demo.cli --resume_from_checkpoint chenjoya/videollm-online-8b-v1plus
  • Prerequisites: Python >= 3.10, Miniconda, PyTorch with CUDA 12.1, transformers, accelerate, deepspeed, peft, editdistance, Levenshtein, tensorboard, gradio, moviepy, submitit, flash-attn. For audio streaming, clone and install ChatTTS and its dependencies (omegaconf, vocos, vector_quantize_pytorch, cython). Requires a recent ffmpeg installation.
  • Resources: Demo available at Hugging Face Spaces.
  • Setup: Installation involves conda and pip commands. Data preprocessing is detailed in data/preprocess/.

Highlighted Details

  • Achieves 5-15 FPS on NVIDIA 3090/A100 GPUs for 10-minute videos.
  • Supports online interaction, proactively updating responses during a stream.
  • Trained entirely on Llama-synthesized streaming dialogue data.
  • Can be adapted for Mistral models by changing the base LLM class (see the sketch below).
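
As a hedged illustration of the Mistral note above: the mixin and class names below are assumptions, not the repository's real identifiers; they only show the idea that the streaming logic sits on top of a Hugging Face causal-LM class, so re-targeting Mistral amounts to swapping the parent class.

    # Hypothetical sketch only: StreamingMixin and the Live* class names are assumed,
    # not the repo's actual API. The point is that "changing the base LLM class" means
    # inheriting the same streaming logic from a different Hugging Face causal LM.
    from transformers import LlamaForCausalLM, MistralForCausalLM

    class StreamingMixin:
        """Placeholder for the shared online/streaming dialogue logic."""
        pass

    class LiveLlamaForCausalLM(StreamingMixin, LlamaForCausalLM):
        """Streaming model built on Llama, as in the released 8B checkpoint."""
        pass

    class LiveMistralForCausalLM(StreamingMixin, MistralForCausalLM):
        """The same streaming logic re-based on Mistral."""
        pass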

Maintenance & Community

The accompanying paper was published at CVPR 2024. Further resources and training data are available via the LiveCC webpage.

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The README mentions potential bugs with flash-attn and suggests the sdpa attention implementation as a workaround. The hosted Hugging Face Spaces demo is noted as potentially too slow. The model is trained entirely on synthesized streaming dialogue data, and its performance on real-world, diverse streaming data is not detailed.

Health Check

  • Last commit: 3 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 62 stars in the last 90 days
