Video-LLaMA by DAMO-NLP-SG

Multimodal model for video understanding research

created 2 years ago
3,046 stars

Top 16.1% on sourcepulse

Project Summary

Video-LLaMA is an instruction-tuned audio-visual language model designed for video understanding tasks. It empowers large language models with the ability to process and interpret both visual and auditory information from videos, making it suitable for researchers and developers working on multimodal AI.

How It Works

Video-LLaMA builds on BLIP-2 and MiniGPT-4 and integrates two branches: a Vision-Language (VL) branch and an Audio-Language (AL) branch. The VL branch uses a ViT-G/14 vision encoder with the BLIP-2 Q-Former, followed by a two-layer video Q-Former and a frame embedding layer that turn per-frame features into a video representation. The AL branch uses the ImageBind-Huge audio encoder with an analogous two-layer audio Q-Former and a segment embedding layer. Both branches are pre-trained on large video-caption and image-caption datasets, then fine-tuned on instruction-tuning data drawn from several multimodal chat datasets.
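
The description above maps onto roughly the following PyTorch-style sketch of a single branch. This is a simplified illustration, not the project's actual code: the module names, dimensions, and the plain self-attention stand-in for the BLIP-2-style Q-Former are assumptions made for readability.

    # Minimal sketch of one branch (VL or AL): frozen encoder features ->
    # two-layer temporal Q-Former -> projection into the LLM embedding space.
    # NOT the actual Video-LLaMA code; names and sizes are illustrative.
    import torch
    import torch.nn as nn

    class BranchSketch(nn.Module):
        def __init__(self, enc_dim=1408, hidden=768, n_query=32, llm_dim=4096, max_pos=64):
            super().__init__()
            # Positional embedding over video frames (VL branch) or audio segments (AL branch).
            self.pos_emb = nn.Embedding(max_pos, hidden)
            self.proj_in = nn.Linear(enc_dim, hidden)  # stand-in for per-frame/per-segment Q-Former output
            layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
            self.temporal_qformer = nn.TransformerEncoder(layer, num_layers=2)  # the "two-layer" Q-Former
            self.query_tokens = nn.Parameter(torch.zeros(1, n_query, hidden))
            self.proj_out = nn.Linear(hidden, llm_dim)  # linear projection into the LLM's embedding space

        def forward(self, enc_feats):
            # enc_feats: (batch, time, enc_dim) -- one feature vector per frame or audio segment.
            bsz, t, _ = enc_feats.shape
            x = self.proj_in(enc_feats) + self.pos_emb(torch.arange(t)).unsqueeze(0)
            x = torch.cat([self.query_tokens.expand(bsz, -1, -1), x], dim=1)
            x = self.temporal_qformer(x)
            # The query outputs act as soft prompts prepended to the LLM's text embeddings.
            return self.proj_out(x[:, : self.query_tokens.shape[1]])

    # Toy usage: a batch of 2 clips, 8 frames of pooled visual features each.
    soft_prompts = BranchSketch()(torch.randn(2, 8, 1408))
    print(soft_prompts.shape)  # torch.Size([2, 32, 4096])

In the paper's setup the frozen encoders and the language model stay fixed; the newly added video/audio Q-Formers, embedding layers, and linear projections are the main trainable components during pre-training and instruction tuning.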

Quick Start & Requirements

  • Install: Create a conda environment using environment.yml and activate it.
  • Prerequisites: ffmpeg must be installed. Full model weights for the Video-LLaMA-2 variants (7B and 13B) are available for download.
  • Demo: Configure the checkpoint paths in eval_configs/video_llama_eval_withaudio.yaml, then run python demo_audiovideo.py --cfg-path eval_configs/video_llama_eval_withaudio.yaml --model_type llama_v2 --gpu-id 0 (the full command sequence is sketched after this list).
  • Resources: Inference requires at least 1xA100 (40G/80G) or 1xA6000 GPU. Pre-training and fine-tuning recommend 8xA100 (80G) GPUs.
  • Docs: Official Demo
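
As a concrete setup sketch (the conda environment name and the checkpoint paths inside the eval config depend on your local files, so treat them as placeholders):

    # Create and activate the environment, install ffmpeg, then launch the demo.
    conda env create -f environment.yml
    conda activate videollama            # use whatever name environment.yml declares
    sudo apt-get install ffmpeg          # needed for video/audio decoding
    # Point eval_configs/video_llama_eval_withaudio.yaml at the downloaded checkpoints, then:
    python demo_audiovideo.py \
        --cfg-path eval_configs/video_llama_eval_withaudio.yaml \
        --model_type llama_v2 \
        --gpu-id 0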

Highlighted Details

  • Supports both video-only and audio-visual understanding.
  • Offers instruction-tuned checkpoints for enhanced conversational capabilities.
  • Provides pre-trained checkpoints for further customization.
  • Includes variants with Llama-2-7B/13B-Chat as the language decoder.

Maintenance & Community

The team has since released VideoLLaMA2 with an updated codebase; news and updates are posted on the repository, and links to Hugging Face and ModelScope demos are provided.

Licensing & Compatibility

The project is intended for non-commercial research use only.

Limitations & Caveats

The online demo is primarily for English and may not perform well on Chinese questions. Audio support was initially limited to the Vicuna-7B variant, though newer checkpoints may have broader compatibility. Running the model on smaller GPUs (e.g., A10 24G) has been problematic in the past.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 1

Star History

56 stars in the last 90 days
