Video-LLM research paper advancing multimodal understanding
Top 33.3% on sourcepulse
VideoLLaMA 2 is a multimodal large language model designed for advanced video understanding, including spatial-temporal reasoning and audio comprehension. It targets researchers and developers working on video analysis, question answering, and captioning, offering state-of-the-art performance on various benchmarks.
How It Works
VideoLLaMA 2 integrates visual and audio information with large language models. It employs a modular architecture, leveraging powerful vision encoders like CLIP and SigLIP, and audio encoders such as BEATs. The model processes video frames and audio streams, projecting them into a shared embedding space that is then fed into a language decoder (e.g., Mistral, Qwen2). This approach allows for comprehensive understanding of video content, enabling tasks like zero-shot video question answering and captioning.
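A minimal sketch of this fusion step is shown below. It is illustrative only: the module names, feature dimensions, and plain linear projections are assumptions for clarity (the actual model uses a spatial-temporal convolution connector rather than simple linear layers), but it captures how encoder outputs are projected into the decoder's embedding space and concatenated with text tokens.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative stand-in for VideoLLaMA 2's connector (not the real STC module).

    Projects vision and audio encoder features into the LLM embedding space
    and concatenates them with the text token embeddings.
    """
    def __init__(self, vision_dim=1024, audio_dim=768, llm_dim=4096):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, llm_dim)  # e.g. CLIP/SigLIP frame features
        self.audio_proj = nn.Linear(audio_dim, llm_dim)    # e.g. BEATs audio features

    def forward(self, vision_feats, audio_feats, text_embeds):
        # vision_feats: (B, T_v, vision_dim); audio_feats: (B, T_a, audio_dim)
        # text_embeds:  (B, T_t, llm_dim), taken from the decoder's token embedding table
        v = self.vision_proj(vision_feats)
        a = self.audio_proj(audio_feats)
        # The concatenated sequence is what the language decoder (e.g. Mistral, Qwen2) attends over.
        return torch.cat([v, a, text_embeds], dim=1)
```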
Quick Start & Requirements
Install the dependencies with:
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation
An editable install is also available: pip install -e .
For reproducibility, transformers 4.40.0 and tokenizers 0.19.1 are recommended.
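After installation, inference roughly follows the repository's example scripts. The sketch below assumes the videollama2 helpers model_init and mm_infer and uses a placeholder checkpoint ID and video path; verify the exact function names, signatures, and checkpoint names against the README before use.

```python
import sys
sys.path.append('./')

# Assumed helper names, modeled on the repository's inference examples;
# check the current README before relying on them.
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

disable_torch_init()

# Checkpoint ID and video path are placeholders.
model, processor, tokenizer = model_init("DAMO-NLP-SG/VideoLLaMA2.1-7B-AV")
video_tensor = processor["video"]("assets/sample_demo.mp4")

answer = mm_infer(
    video_tensor,
    "Describe what happens in this video.",
    model=model,
    tokenizer=tokenizer,
    modal="video",
    do_sample=False,
)
print(answer)
```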
Highlighted Details
Maintenance & Community
The project is actively maintained by DAMO-NLP-SG. Recent updates include new model checkpoints (e.g., VideoLLaMA2.1 series) and the release of VideoLLaMA3. Links to Hugging Face checkpoints and demos are provided.
Licensing & Compatibility
Released under the Apache 2.0 license. However, the service is intended for non-commercial use ONLY, subject to the licenses of underlying models (LLaMA, Mistral) and data sources (OpenAI, ShareGPT).
Limitations & Caveats
The non-commercial use restriction is a significant limitation for enterprise adoption. The project relies on external model licenses and data terms of use, which may have further restrictions.