Multimodal model for video understanding research
Video-LLaMA is an instruction-tuned audio-visual language model designed for video understanding tasks. It empowers large language models with the ability to process and interpret both visual and auditory information from videos, making it suitable for researchers and developers working on multimodal AI.
How It Works
Video-LLaMA builds upon BLIP-2 and MiniGPT-4, integrating both a Vision-Language (VL) branch and an Audio-Language (AL) branch. The VL branch uses a ViT-G/14 vision encoder with a BLIP-2 Q-Former, enhanced by a two-layer video Q-Former and frame embedding layer for video representation. The AL branch employs the ImageBind-Huge audio encoder with a similar two-layer audio Q-Former and segment embedding layer. Both branches are pre-trained on large video-caption and image-caption datasets, then fine-tuned with instruction-tuning data from various multimodal chat datasets.
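The following is an illustrative sketch (not the official implementation) of how the frame embedding layer and two-layer video Q-Former described above might pool per-frame features into a fixed set of video tokens for the LLM; the class and parameter names are assumptions.

```python
# Sketch of a frame-aware video Q-Former: per-frame features from the vision
# encoder + BLIP-2 Q-Former are tagged with a frame embedding, then compressed
# into a fixed number of learnable query tokens by a two-layer transformer.
import torch
import torch.nn as nn

class VideoQFormerSketch(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_frames=8):
        super().__init__()
        self.frame_pos = nn.Embedding(num_frames, dim)            # frame embedding layer
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable video queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)  # two-layer video Q-Former

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, tokens_per_frame, dim)
        b, t, n, d = frame_feats.shape
        pos = self.frame_pos(torch.arange(t, device=frame_feats.device))
        feats = (frame_feats + pos[None, :, None, :]).reshape(b, t * n, d)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Output: (batch, num_queries, dim), later projected into the LLM's embedding space.
        return self.qformer(q, feats)

feats = torch.randn(2, 8, 32, 768)
print(VideoQFormerSketch()(feats).shape)  # torch.Size([2, 32, 768])
```

The AL branch follows the same pattern, with ImageBind-encoded audio segments and a segment embedding layer in place of frames and the frame embedding layer.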
Quick Start & Requirements
Create the conda environment from environment.yml and activate it; ffmpeg must also be installed. Full model weights for Video-LLaMA-2 (7B and 13B variants) are available. Update the paths in eval_configs/video_llama_eval_withaudio.yaml to point at the downloaded weights, then run python demo_audiovideo.py --cfg-path eval_configs/video_llama_eval_withaudio.yaml --model_type llama_v2 --gpu-id 0.
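Before launching the demo, it can help to confirm that the checkpoint and data paths referenced by the eval config actually exist. The snippet below is a small sketch (not part of the repository) that loads eval_configs/video_llama_eval_withaudio.yaml and flags missing path-like values; it assumes only that the config is plain YAML.

```python
# Sanity-check the eval config: list every leaf value that looks like a
# filesystem path and report whether it exists on disk.
import os
import yaml

CFG = "eval_configs/video_llama_eval_withaudio.yaml"

def walk(node, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf in a nested dict/list."""
    if isinstance(node, dict):
        for k, v in node.items():
            yield from walk(v, f"{prefix}{k}.")
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from walk(v, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), node

with open(CFG) as f:
    cfg = yaml.safe_load(f)

for key, value in walk(cfg):
    if isinstance(value, str) and ("/" in value or value.endswith((".pth", ".bin"))):
        status = "OK" if os.path.exists(value) else "MISSING"
        print(f"[{status}] {key} = {value}")
```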
Highlighted Details
Maintenance & Community
The project has released VideoLLaMA2 with an updated codebase. News and updates are posted on the repository. Links to Hugging Face and ModelScope demos are provided.
Licensing & Compatibility
The project is intended for non-commercial research use only.
Limitations & Caveats
The online demo is primarily for English and may not perform well with Chinese questions. Audio support was initially limited to the Vicuna-7B checkpoint, though newer versions may have broader compatibility. Running the framework on certain GPUs (e.g., A10-24G) has caused issues in the past.