Multimodal model for video understanding research
Video-LLaMA is an instruction-tuned audio-visual language model designed for video understanding tasks. It empowers large language models with the ability to process and interpret both visual and auditory information from videos, making it suitable for researchers and developers working on multimodal AI.
How It Works
Video-LLaMA builds upon BLIP-2 and MiniGPT-4, integrating both a Vision-Language (VL) branch and an Audio-Language (AL) branch. The VL branch uses a ViT-G/14 vision encoder with a BLIP-2 Q-Former, enhanced by a two-layer video Q-Former and frame embedding layer for video representation. The AL branch employs the ImageBind-Huge audio encoder with a similar two-layer audio Q-Former and segment embedding layer. Both branches are pre-trained on large video-caption and image-caption datasets, then fine-tuned with instruction-tuning data from various multimodal chat datasets.
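The following is an illustrative sketch (not the official implementation) of how the frame embedding layer and two-layer video Q-Former described above might pool per-frame features into a fixed set of video tokens for the LLM; the class and parameter names are assumptions.

```python
# Sketch of a frame-aware video Q-Former: per-frame features from the vision
# encoder + BLIP-2 Q-Former are tagged with a frame embedding, then compressed
# into a fixed number of learnable query tokens by a two-layer transformer.
import torch
import torch.nn as nn

class VideoQFormerSketch(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_frames=8):
        super().__init__()
        self.frame_pos = nn.Embedding(num_frames, dim)            # frame embedding layer
        self.queries = nn.Parameter(torch.randn(num_queries, dim))  # learnable video queries
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.qformer = nn.TransformerDecoder(layer, num_layers=2)  # two-layer video Q-Former

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, tokens_per_frame, dim)
        b, t, n, d = frame_feats.shape
        pos = self.frame_pos(torch.arange(t, device=frame_feats.device))
        feats = (frame_feats + pos[None, :, None, :]).reshape(b, t * n, d)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Output: (batch, num_queries, dim), later projected into the LLM's embedding space.
        return self.qformer(q, feats)

feats = torch.randn(2, 8, 32, 768)
print(VideoQFormerSketch()(feats).shape)  # torch.Size([2, 32, 768])
```

The AL branch follows the same pattern, with ImageBind-encoded audio segments and a segment embedding layer in place of frames and the frame embedding layer.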
Quick Start & Requirements
Create the conda environment from environment.yml and activate it; ffmpeg must also be installed. Full model weights for Video-LLaMA-2 (7B and 13B variants) are available. Update the paths in eval_configs/video_llama_eval_withaudio.yaml to point at the downloaded weights, then run python demo_audiovideo.py --cfg-path eval_configs/video_llama_eval_withaudio.yaml --model_type llama_v2 --gpu-id 0.
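Before launching the demo, it can help to confirm that the checkpoint and data paths referenced by the eval config actually exist. The snippet below is a small sketch (not part of the repository) that loads eval_configs/video_llama_eval_withaudio.yaml and flags missing path-like values; it assumes only that the config is plain YAML.

```python
# Sanity-check the eval config: list every leaf value that looks like a
# filesystem path and report whether it exists on disk.
import os
import yaml

CFG = "eval_configs/video_llama_eval_withaudio.yaml"

def walk(node, prefix=""):
    """Yield (dotted_key, value) pairs for every leaf in a nested dict/list."""
    if isinstance(node, dict):
        for k, v in node.items():
            yield from walk(v, f"{prefix}{k}.")
    elif isinstance(node, list):
        for i, v in enumerate(node):
            yield from walk(v, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), node

with open(CFG) as f:
    cfg = yaml.safe_load(f)

for key, value in walk(cfg):
    if isinstance(value, str) and ("/" in value or value.endswith((".pth", ".bin"))):
        status = "OK" if os.path.exists(value) else "MISSING"
        print(f"[{status}] {key} = {value}")
```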
Highlighted Details
Maintenance & Community
The project has released VideoLLaMA2 with an updated codebase. News and updates are posted on the repository. Links to Hugging Face and ModelScope demos are provided.
Licensing & Compatibility
The project is intended for non-commercial research use only.
Limitations & Caveats
The online demo is primarily for English and may not perform well with Chinese questions. Audio support was initially limited to the Vicuna-7B checkpoint, though newer versions may have broader compatibility. Running the framework on certain GPUs (e.g., A10-24G) has caused issues in the past.