LLaMA-VID by dvlab-research

Multimodal LLM for long videos, based on LLaVA

created 1 year ago
824 stars

Top 44.0% on sourcepulse

View on GitHub
Project Summary

LLaMA-VID extends existing Large Language Model (LLM) frameworks to process hour-long videos by representing visual information efficiently. Targeting researchers and developers in multimodal AI, it enables LLMs to understand and reason about extended video content, significantly increasing their contextual capacity.

How It Works

LLaMA-VID builds on the LLaVA architecture and introduces a "context token" strategy for long videos. A tailored token generation scheme distills each frame into just two tokens: a context token produced by an encoder-decoder setup from text-guided (query-conditioned) features, and a content token drawn from the frame's visual features. Cutting the per-frame token cost this aggressively lets the LLM take in far more visual information (an extended context of up to 64K tokens) than standard models, enabling comprehension of hour-long video sequences.
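
To make the token budget concrete, here is a minimal back-of-the-envelope sketch; the 1 fps sampling rate and the 256-patch-token figure for a conventional LLaVA-style encoder are illustrative assumptions, not values taken from the repository:

    # Rough token-budget comparison for an hour-long video.
    # Assumptions (illustrative, not from the repo): 1 fps sampling and
    # 256 patch tokens per frame for a conventional LLaVA-style encoder.
    FRAMES_PER_SECOND = 1
    VIDEO_SECONDS = 60 * 60          # one hour
    CONTEXT_WINDOW = 64_000          # extended context reported for LLaMA-VID

    frames = FRAMES_PER_SECOND * VIDEO_SECONDS      # 3,600 frames

    llama_vid_tokens = frames * 2        # context token + content token per frame
    conventional_tokens = frames * 256   # full patch-token representation

    print(f"LLaMA-VID:    {llama_vid_tokens:,} tokens "
          f"(fits 64K context: {llama_vid_tokens <= CONTEXT_WINDOW})")
    print(f"Conventional: {conventional_tokens:,} tokens "
          f"(fits 64K context: {conventional_tokens <= CONTEXT_WINDOW})")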

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install -e . within a Python 3.10 conda environment. Additional packages like ninja and flash-attn are recommended for training.
  • Prerequisites: Requires PyTorch, Transformers, and the relevant model weights (Vicuna, EVA-G). Training requires 8x A100 (80GB) GPUs; inference can run on fewer GPUs with 4-bit or 8-bit quantization (see the sketch after this list).
  • Data: Requires downloading and organizing various datasets (LLaVA, WebVid, MovieNet, etc.) as per the specified structure.
  • Links: Project Page, Online Demo
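
As a rough illustration of the low-memory inference path mentioned above, the sketch below shows a standard transformers/bitsandbytes 4-bit setup; the checkpoint path is a placeholder, and the real LLaMA-VID weights are loaded through the repository's own builder code rather than a plain AutoModelForCausalLM call:

    # Sketch of 4-bit quantized loading with the transformers/bitsandbytes stack
    # the project depends on. MODEL_PATH is a placeholder, not a real model id.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    MODEL_PATH = "path/to/llama-vid-checkpoint"   # placeholder

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                        # or load_in_8bit=True
        bnb_4bit_compute_dtype=torch.float16,     # compute in fp16 to save memory
    )

    tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_PATH,
        quantization_config=bnb_config,
        device_map="auto",                        # shard layers across available GPUs
    )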

Highlighted Details

  • Supports up to 64K tokens for hour-long video processing.
  • Offers pre-trained and fine-tuned models for image-only, short video, and long video tasks.
  • Provides CLI inference and a Gradio Web UI for user-friendly interaction (a minimal interface sketch follows this list).
  • Achieves competitive performance on image and video benchmarks.
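
For a sense of the Web UI interaction pattern, here is a minimal, illustrative Gradio skeleton; it is not the project's actual demo app, and answer_video is a hypothetical stand-in for the model call:

    # Illustrative video-in, text-out Gradio skeleton (not the project's demo).
    import gradio as gr

    def answer_video(video_path: str, question: str) -> str:
        # Hypothetical stand-in: the real UI would encode the video into
        # context/content tokens and query the LLM here.
        return f"(model answer about {video_path!r} for: {question})"

    demo = gr.Interface(
        fn=answer_video,
        inputs=[gr.Video(label="Video"), gr.Textbox(label="Question")],
        outputs=gr.Textbox(label="Answer"),
        title="LLaMA-VID demo (sketch)",
    )

    if __name__ == "__main__":
        demo.launch()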

Maintenance & Community

The project is associated with dvlab-research and is an extension of the LLaVA project. Further community interaction details are not explicitly provided in the README.

Licensing & Compatibility

The data and checkpoints are licensed for research use only. They are subject to the licenses of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0, restricting commercial use.

Limitations & Caveats

The project's data and models are strictly for research purposes and prohibit commercial use due to licensing restrictions. Training requires substantial GPU resources (8x A100 80GB).

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 20 stars in the last 90 days

Explore Similar Projects

Starred by Travis Fischer (Founder of Agentic), Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), and 9 more.

LLaVA by haotian-liu

Multimodal assistant with GPT-4 level capabilities
Top 0.2% on sourcepulse
23k stars
created 2 years ago, updated 11 months ago