Multimodal LLM for long videos, based on LLaVA
LLaMA-VID extends existing Large Language Model (LLM) frameworks to process hour-long videos by representing visual information efficiently. Targeting researchers and developers in multimodal AI, it enables LLMs to understand and reason about extended video content, significantly increasing their contextual capacity.
How It Works
LLaMA-VID builds on the LLaVA architecture and handles long videos by representing each frame with just two tokens: a context token, which encodes the frame's overall context as guided by the user's text query, and a content token, which preserves the frame's own visual cues. This dual-token compression drastically reduces the number of visual tokens per frame while retaining the critical information, and together with an extended context window (up to 64K tokens) it lets the LLM comprehend hour-long video sequences.
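The dual-token idea can be illustrated with a short, self-contained sketch. The tensor shapes, pooling choices, and function names below are illustrative assumptions for clarity, not the project's actual implementation:

```python
# Conceptual sketch of a dual-token (context + content) frame representation.
# Shapes and pooling choices are assumptions, not LLaMA-VID's real code.
import torch
import torch.nn.functional as F

def frame_to_tokens(frame_features: torch.Tensor,
                    text_query: torch.Tensor) -> torch.Tensor:
    """Compress one frame's patch features into two tokens.

    frame_features: (num_patches, dim) visual features from the image encoder.
    text_query:     (num_text_tokens, dim) embedded user instruction.
    """
    # Context token: text-guided attention pools the patches into a single
    # vector that reflects what the query cares about.
    scale = frame_features.shape[-1] ** 0.5
    attn = F.softmax(text_query @ frame_features.T / scale, dim=-1)
    context_token = (attn @ frame_features).mean(dim=0, keepdim=True)  # (1, dim)

    # Content token: a query-agnostic summary (here, simple mean pooling)
    # that keeps the frame's own visual cues.
    content_token = frame_features.mean(dim=0, keepdim=True)           # (1, dim)

    # Two tokens per frame instead of hundreds of patch tokens.
    return torch.cat([context_token, content_token], dim=0)            # (2, dim)

# Toy example: 8 frames, 256 patches each -> only 16 visual tokens total.
frames = [torch.randn(256, 4096) for _ in range(8)]  # stand-in encoder output
query = torch.randn(16, 4096)                        # stand-in embedded prompt
video_tokens = torch.cat([frame_to_tokens(f, query) for f in frames], dim=0)
print(video_tokens.shape)  # torch.Size([16, 4096])
```

With two tokens per frame, an hour of video sampled at 1 fps yields roughly 7,200 visual tokens, which fits within the extended context window described above.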
Quick Start & Requirements
Install the project with pip install -e . inside a Python 3.10 conda environment. Additional packages such as ninja and flash-attn are recommended for training.
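After installation, a quick sanity check for the optional training dependencies might look like the following. This is a minimal sketch; the import names for the packages mentioned above are assumptions:

```python
# Check that the recommended packages from the Quick Start are importable.
import importlib.util
import sys

def check(pkg: str) -> None:
    status = "found" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg:12s} {status}")

print(f"python {sys.version.split()[0]}  (3.10 recommended)")
for pkg in ("torch", "ninja", "flash_attn"):
    check(pkg)
```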
Maintenance & Community
The project is associated with dvlab-research and is an extension of the LLaVA project. Further community interaction details are not explicitly provided in the README.
Licensing & Compatibility
The data and checkpoints are licensed for research use only. They are subject to the licenses of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0, restricting commercial use.
Limitations & Caveats
The project's data and models are restricted to research use; commercial use is prohibited by the licensing terms noted above. Training requires substantial GPU resources (8x A100 80GB GPUs).