VideoLLaMA3 by DAMO-NLP-SG

Multimodal foundation model for image/video understanding

Created 6 months ago · 914 stars · Top 40.7% on sourcepulse

Project Summary

VideoLLaMA 3 is a family of advanced multimodal foundation models for comprehensive image and video understanding. It targets researchers and developers working with visual data, supporting detailed analysis of and interaction with visual content.

How It Works

VideoLLaMA 3 couples a SigLIP vision encoder with a Qwen2.5 large language model to process and understand visual information. It employs a novel approach to spatio-temporal modeling and audio understanding, enabling tasks such as video captioning, question answering, and referring expression comprehension. The architecture is designed to process video frames and their temporal relationships efficiently.
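
The released checkpoints load through Hugging Face's trust_remote_code path. Below is a minimal inference sketch; the conversation/processor schema (including the video sampling options) is defined by the repo's remote code, so treat those keys as assumptions and consult the repo's inference notebooks for the authoritative version.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load a released checkpoint via trust_remote_code
# (2B and 7B variants are available on Hugging Face).
model_id = "DAMO-NLP-SG/VideoLLaMA3-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",  # recommended for inference
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Assumed conversation schema (video path and frame-sampling options are
# illustrative); see the repo's notebooks for the exact keys.
conversation = [
    {"role": "user", "content": [
        {"type": "video",
         "video": {"video_path": "assets/demo.mp4", "fps": 1, "max_frames": 128}},
        {"type": "text", "text": "Describe what happens in this video."},
    ]},
]

inputs = processor(conversation=conversation, return_tensors="pt")
inputs = {k: (v.to(model.device) if hasattr(v, "to") else v)
          for k, v in inputs.items()}

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```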

Quick Start & Requirements

  • Installation: pip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https://download.pytorch.org/whl/cu118 (torchvision 0.19.0 is the release paired with torch 2.4.0), then pip install flash-attn --no-build-isolation, pip install transformers==4.46.3 accelerate==1.0.1, and pip install decord ffmpeg-python imageio opencv-python. For training, clone the repo and run pip install -r requirements.txt.
  • Prerequisites: Python >= 3.10, PyTorch >= 2.4.0, CUDA >= 11.8, transformers >= 4.46.3. Flash Attention 2 is recommended for inference. A quick environment sanity check appears after this list.
  • Resources: Inference requires a GPU with CUDA 11.8+. Training requires significant GPU resources.
  • Demos & Docs: Online demo available at huggingface.co/spaces/lixin4ever/VideoLLaMA3. Inference examples and notebooks are in the repo.
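
As referenced above, a minimal sketch for confirming the environment before running inference (the version floors mirror the prerequisites listed here):

```python
from packaging import version

import torch
import transformers

# Verify the minimum versions stated in the prerequisites.
assert version.parse(torch.__version__.split("+")[0]) >= version.parse("2.4.0"), torch.__version__
assert version.parse(transformers.__version__) >= version.parse("4.46.3"), transformers.__version__

# Inference needs a CUDA 11.8+ GPU; torch.version.cuda reports the
# toolkit the installed wheel was built against (e.g. "11.8").
assert torch.cuda.is_available(), "No CUDA GPU visible"
print("CUDA toolkit:", torch.version.cuda)

# Flash Attention 2 is optional but recommended for inference.
try:
    import flash_attn  # noqa: F401
    print("flash-attn available")
except ImportError:
    print("flash-attn not installed; standard attention will be used")
```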

Highlighted Details

  • Achieves state-of-the-art performance among 7B-scale models on benchmarks such as LVBench and Video-MME.
  • Supports both image and video understanding tasks, including general understanding, chart analysis, and temporal grounding.
  • Offers pre-trained models of various sizes (2B and 7B) and a dedicated vision encoder.
  • Provides comprehensive inference examples and training scripts for customization.

Maintenance & Community

The project is actively maintained by DAMO-NLP-SG. Related projects like VideoLLaMA 2 and VideoRefer Suite are also available. Links to Hugging Face checkpoints and arXiv papers are provided.

Licensing & Compatibility

Released under the Apache 2.0 license. However, the online service is intended for non-commercial use ONLY, subject to the Qwen model license, the terms of use of OpenAI and Gemini, and the privacy practices of ShareGPT.

Limitations & Caveats

The non-commercial use restriction is a significant limitation for many applications. Training is memory-intensive and can hit CUDA out-of-memory (OOM) errors; suggested mitigations include DeepSpeed and reduced sequence lengths, as sketched below.
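
For reference, a sketch of the kind of memory-saving DeepSpeed configuration those mitigations point to. This is illustrative only: the key names follow standard DeepSpeed ZeRO-3 conventions rather than this project's own training scripts, which define the actual configs.

```python
import json

# Illustrative DeepSpeed ZeRO-3 config, expressed as a Python dict
# and written out as JSON for the launcher. Standard DeepSpeed keys;
# "auto" values are resolved by the Hugging Face Trainer integration.
deepspeed_zero3 = {
    "zero_optimization": {
        "stage": 3,                              # shard params, grads, optimizer state
        "offload_optimizer": {"device": "cpu"},  # move optimizer state to CPU RAM
        "offload_param": {"device": "cpu"},      # move parameters to CPU RAM
    },
    "bf16": {"enabled": True},                   # halve weight/activation memory vs fp32
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

with open("zero3_offload.json", "w") as f:
    json.dump(deepspeed_zero3, f, indent=2)
```

Shorter sequences cut activation memory roughly linearly, e.g. by lowering the maximum number of video frames sampled per clip.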

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 6
  • Star History: 155 stars in the last 90 days
