VideoLLaMA2 by DAMO-NLP-SG

Video-LLM research paper advancing multimodal understanding

Created 1 year ago
1,266 stars

Top 31.2% on SourcePulse

Project Summary

VideoLLaMA 2 is a multimodal large language model designed for advanced video understanding, including spatial-temporal reasoning and audio comprehension. It targets researchers and developers working on video analysis, question answering, and captioning, and reports strong results on video understanding benchmarks such as MLVU and VideoMME.

How It Works

VideoLLaMA 2 integrates visual and audio information with large language models. It employs a modular architecture, leveraging vision encoders such as CLIP and SigLIP and audio encoders such as BEATs. The model processes video frames and audio streams, projects their features into a shared embedding space, and feeds the resulting tokens into a language decoder (e.g., Mistral, Qwen2). This design enables comprehensive understanding of video content, supporting tasks such as zero-shot video question answering and captioning.
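The flow can be pictured with a short, illustrative PyTorch sketch; this is not the repository's actual code, and all module names and dimensions below are assumptions chosen for illustration. It shows the pattern described above: encoder features are projected into the decoder's embedding space and concatenated with text embeddings before decoding.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the modular pattern described above, NOT the
# repository's actual code: encoders produce per-frame / per-clip features,
# small projectors map them into the LLM's embedding space, and the
# projected tokens are joined with the text embeddings.

class MultimodalProjector(nn.Module):
    """Maps encoder features into the language model's hidden size."""
    def __init__(self, in_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

# Hypothetical dimensions: 1152-d vision features, 768-d audio features,
# 4096-d hidden size for a Mistral-7B-class decoder.
vision_proj = MultimodalProjector(in_dim=1152, llm_dim=4096)
audio_proj = MultimodalProjector(in_dim=768, llm_dim=4096)

vision_feats = torch.randn(1, 8 * 256, 1152)  # 8 frames x 256 patch tokens
audio_feats = torch.randn(1, 128, 768)        # pooled audio frames
text_embeds = torch.randn(1, 32, 4096)        # embedded prompt tokens

# Concatenate projected multimodal tokens with text embeddings; the
# combined sequence would be fed to the language decoder.
inputs = torch.cat(
    [vision_proj(vision_feats), audio_proj(audio_feats), text_embeds], dim=1
)
print(inputs.shape)  # torch.Size([1, 2208, 4096])
```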

Quick Start & Requirements

  • Installation: Clone the repository and install dependencies via pip install -r requirements.txt and pip install flash-attn==2.5.8 --no-build-isolation. An editable install is also available via pip install -e .
  • Prerequisites: Python >= 3.8, PyTorch >= 2.2.0, CUDA >= 11.8. Specific versions of transformers (4.40.0) and tokenizers (0.19.1) are recommended for reproducibility.
  • Resources: Requires significant GPU memory, especially for larger variants such as the 72B model.
  • Demos & Checkpoints: Online demos are available on Hugging Face Spaces, and checkpoints for various model sizes and configurations are hosted on Hugging Face; they can be fetched programmatically, as shown in the sketch after this list.
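As a minimal sketch, a checkpoint can be pulled with the huggingface_hub client. The repo ID below is an assumption based on the project's Hugging Face organization; verify the exact ID on the model hub before use.

```python
from huggingface_hub import snapshot_download

# Assumed repo ID; confirm on the DAMO-NLP-SG Hugging Face page before use.
local_dir = snapshot_download(repo_id="DAMO-NLP-SG/VideoLLaMA2-7B")
print(f"Checkpoint files downloaded to: {local_dir}")
```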

Highlighted Details

  • Achieves top-1 performance among ~7B-parameter models on the MLVU and VideoMME leaderboards.
  • Supports both vision-only and audio-visual models.
  • Offers multiple model sizes, including 7B, 8x7B (Mixtral), and 72B parameter variants.
  • Provides comprehensive training and evaluation scripts for custom datasets and benchmarks.

Maintenance & Community

The project is actively maintained by DAMO-NLP-SG. Recent updates include new model checkpoints (e.g., VideoLLaMA2.1 series) and the release of VideoLLaMA3. Links to Hugging Face checkpoints and demos are provided.

Licensing & Compatibility

Released under the Apache 2.0 license. However, the project states that the model and demo are intended for non-commercial use ONLY, subject to the licenses of the underlying models (LLaMA, Mistral) and data sources (OpenAI, ShareGPT).

Limitations & Caveats

The non-commercial use restriction is a significant limitation for enterprise adoption. The project relies on external model licenses and data terms of use, which may have further restrictions.

Health Check

  • Last Commit: 11 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 12 stars in the last 30 days
