Multimodal foundation model for image/video understanding
Top 40.7% on sourcepulse
VideoLLaMA 3 offers advanced multimodal foundation models for comprehensive image and video understanding. It targets researchers and developers working with visual data, enabling detailed analysis of and interaction with visual content.
How It Works
VideoLLaMA 3 integrates a vision encoder (SigLIP) with a large language model (Qwen2.5) to process and understand visual information. It employs a novel approach to spatio-temporal modeling and audio understanding, enabling tasks such as video captioning, question answering, and referring expression comprehension. The architecture is designed for efficient processing of video frames and their temporal relationships.
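The pipeline can be exercised end to end through the released Hugging Face checkpoints. Below is a minimal inference sketch, assuming the checkpoint name DAMO-NLP-SG/VideoLLaMA3-7B and a chat-style processor input; the exact conversation schema expected by the custom processor is an assumption here and should be checked against the repo's official examples.

```python
# Hedged sketch: load a VideoLLaMA 3 checkpoint via trust_remote_code and ask a
# question about a video. The conversation/processor format below is assumed,
# not copied from the repo; consult the official inference examples for the
# exact schema.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "DAMO-NLP-SG/VideoLLaMA3-7B"  # assumed checkpoint name on Hugging Face
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Assumed chat-style input: one video plus a text question.
conversation = [
    {"role": "user", "content": [
        {"type": "video", "video": {"video_path": "example.mp4", "fps": 1, "max_frames": 128}},
        {"type": "text", "text": "Describe what happens in this video."},
    ]},
]
inputs = processor(conversation=conversation, return_tensors="pt")
# Move tensors to the model device; depending on the processor output, pixel
# tensors may also need casting to the model dtype.
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}
output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```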
Quick Start & Requirements
Install the inference dependencies:
pip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https://download.pytorch.org/whl/cu118
pip install flash-attn --no-build-isolation
pip install transformers==4.46.3 accelerate==1.0.1
pip install decord ffmpeg-python imageio opencv-python
For training, clone the repo and run pip install -r requirements.txt.
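As a quick sanity check that the installed video backend works, the snippet below samples frames with decord; the file path and frame count are illustrative, not taken from the repo.

```python
# Sanity check for the video-reading stack: uniformly sample frames with decord.
# "example.mp4" and num_frames are illustrative values.
import numpy as np
from decord import VideoReader, cpu

vr = VideoReader("example.mp4", ctx=cpu(0))               # decode on CPU
num_frames = 8                                            # uniform temporal sampling
indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
frames = vr.get_batch(indices).asnumpy()                  # (num_frames, H, W, 3) uint8
print(frames.shape)
```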
Highlighted Details
Maintenance & Community
The project is actively maintained by DAMO-NLP-SG. Related projects like VideoLLaMA 2 and VideoRefer Suite are also available. Links to Hugging Face checkpoints and arXiv papers are provided.
Licensing & Compatibility
Released under the Apache 2.0 license. However, the service is intended for non-commercial use only, subject to the licenses of Qwen, OpenAI, and Gemini, and the privacy practices of ShareGPT.
Limitations & Caveats
The non-commercial use restriction is a significant limitation for many applications. Training can run into CUDA out-of-memory (OOM) errors; suggested mitigations include DeepSpeed or reduced sequence lengths.
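As an illustration, a ZeRO-3 configuration along the following lines is one common way to relieve memory pressure. The keys are standard DeepSpeed options, not copied from the VideoLLaMA 3 training scripts, so adapt them to the repo's own configs.

```python
# Hedged sketch of a DeepSpeed ZeRO-3 config for OOM-prone fine-tuning runs.
# These are standard DeepSpeed/Hugging Face integration keys, not the repo's
# own settings; combine with gradient accumulation and fewer sampled frames
# or shorter sequences if memory is still tight.
zero3_config = {
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}

# With the Hugging Face Trainer, this dict (or an equivalent JSON file) can be
# passed via TrainingArguments(deepspeed=zero3_config).
```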