VideoLLaMA2 by DAMO-NLP-SG

Video-LLM research paper advancing multimodal understanding

Created 1 year ago · 1,200 stars · Top 33.3% on sourcepulse

View on GitHub
Project Summary

VideoLLaMA 2 is a multimodal large language model designed for advanced video understanding, including spatial-temporal reasoning and audio comprehension. It targets researchers and developers working on video analysis, question answering, and captioning, offering state-of-the-art performance on various benchmarks.

How It Works

VideoLLaMA 2 integrates visual and audio information with large language models. It employs a modular architecture, leveraging powerful vision encoders like CLIP and SigLIP, and audio encoders such as BEATs. The model processes video frames and audio streams, projecting them into a shared embedding space that is then fed into a language decoder (e.g., Mistral, Qwen2). This approach allows for comprehensive understanding of video content, enabling tasks like zero-shot video question answering and captioning.
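As a rough illustration of this modular design, the sketch below shows how per-frame encoder features might be projected into a language decoder's embedding space. It is a conceptual simplification: the dimensions, the two-layer MLP, and the `MultimodalProjector` name are illustrative assumptions, not the project's actual connector.

```python
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    """Illustrative projector: maps encoder features into the LLM embedding space."""
    def __init__(self, feat_dim: int, llm_dim: int):
        super().__init__()
        # A simple two-layer MLP; VideoLLaMA 2's actual connector is more elaborate.
        self.proj = nn.Sequential(
            nn.Linear(feat_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.proj(features)

# Pretend output of a vision encoder: 8 sampled frames x 256 patch tokens x 1024 dims.
frame_feats = torch.randn(8, 256, 1024)
projector = MultimodalProjector(feat_dim=1024, llm_dim=4096)  # 4096 ~ a 7B decoder
video_tokens = projector(frame_feats)          # -> (8, 256, 4096)
# Flattened into a single token sequence, these embeddings are prepended to the
# text embeddings before being fed to the language decoder.
print(video_tokens.flatten(0, 1).shape)        # torch.Size([2048, 4096])
```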

Quick Start & Requirements

  • Installation: Clone the repository, then install dependencies with pip install -r requirements.txt and pip install flash-attn==2.5.8 --no-build-isolation. An editable install is also available via pip install -e .
  • Prerequisites: Python >= 3.8, PyTorch >= 2.2.0, CUDA >= 11.8. Specific versions of transformers (4.40.0) and tokenizers (0.19.1) are recommended for reproducibility.
  • Resources: Requires significant GPU memory, especially for the larger variants (e.g., the 72B model).
  • Demos & Checkpoints: Online demos are available on Hugging Face Spaces. Checkpoints for various model sizes and configurations are also hosted on Hugging Face.

Highlighted Details

  • Achieves top-1 performance among ~7B models on the MLVU and VideoMME leaderboards.
  • Supports both vision-only and audio-visual models.
  • Offers multiple model sizes, including 7B, 8x7B (Mixtral), and 72B parameter variants.
  • Provides comprehensive training and evaluation scripts for custom datasets and benchmarks.

Maintenance & Community

The project is actively maintained by DAMO-NLP-SG. Recent updates include new model checkpoints (e.g., VideoLLaMA2.1 series) and the release of VideoLLaMA3. Links to Hugging Face checkpoints and demos are provided.

Licensing & Compatibility

The code is released under the Apache 2.0 license. However, the project is intended for non-commercial use only, subject to the licenses of the underlying models (LLaMA, Mistral) and the terms of use of its data sources (OpenAI, ShareGPT).

Limitations & Caveats

The non-commercial use restriction is a significant limitation for enterprise adoption. The project relies on external model licenses and data terms of use, which may have further restrictions.

Health Check

  • Last commit: 6 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 2
  • Star History: 52 stars in the last 90 days
