Multimodal LLM for long videos, based on LLaVA
LLaMA-VID extends existing Large Language Model (LLM) frameworks to process hour-long videos by representing visual information efficiently. Targeting researchers and developers in multimodal AI, it enables LLMs to understand and reason about extended video content, significantly increasing their contextual capacity.
How It Works
LLaMA-VID builds on the LLaVA architecture and handles long videos by representing each frame with just two tokens: a context token, which encodes the frame's overall context as guided by the user's text query, and a content token, which preserves the frame's own visual cues. This dual-token compression drastically reduces the number of visual tokens per frame while retaining the critical information, and together with an extended context window (up to 64K tokens) it lets the LLM comprehend hour-long video sequences.
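The dual-token idea can be illustrated with a short, self-contained sketch. The tensor shapes, pooling choices, and function names below are illustrative assumptions for clarity, not the project's actual implementation:

```python
# Conceptual sketch of a dual-token (context + content) frame representation.
# Shapes and pooling choices are assumptions, not LLaMA-VID's real code.
import torch
import torch.nn.functional as F

def frame_to_tokens(frame_features: torch.Tensor,
                    text_query: torch.Tensor) -> torch.Tensor:
    """Compress one frame's patch features into two tokens.

    frame_features: (num_patches, dim) visual features from the image encoder.
    text_query:     (num_text_tokens, dim) embedded user instruction.
    """
    # Context token: text-guided attention pools the patches into a single
    # vector that reflects what the query cares about.
    scale = frame_features.shape[-1] ** 0.5
    attn = F.softmax(text_query @ frame_features.T / scale, dim=-1)
    context_token = (attn @ frame_features).mean(dim=0, keepdim=True)  # (1, dim)

    # Content token: a query-agnostic summary (here, simple mean pooling)
    # that keeps the frame's own visual cues.
    content_token = frame_features.mean(dim=0, keepdim=True)           # (1, dim)

    # Two tokens per frame instead of hundreds of patch tokens.
    return torch.cat([context_token, content_token], dim=0)            # (2, dim)

# Toy example: 8 frames, 256 patches each -> only 16 visual tokens total.
frames = [torch.randn(256, 4096) for _ in range(8)]  # stand-in encoder output
query = torch.randn(16, 4096)                        # stand-in embedded prompt
video_tokens = torch.cat([frame_to_tokens(f, query) for f in frames], dim=0)
print(video_tokens.shape)  # torch.Size([16, 4096])
```

With two tokens per frame, an hour of video sampled at 1 fps yields roughly 7,200 visual tokens, which fits within the extended context window described above.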
Quick Start & Requirements
Install the project with pip install -e . inside a Python 3.10 conda environment. Additional packages such as ninja and flash-attn are recommended for training.
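After installation, a quick sanity check for the optional training dependencies might look like the following. This is a minimal sketch; the import names for the packages mentioned above are assumptions:

```python
# Check that the recommended packages from the Quick Start are importable.
import importlib.util
import sys

def check(pkg: str) -> None:
    status = "found" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg:12s} {status}")

print(f"python {sys.version.split()[0]}  (3.10 recommended)")
for pkg in ("torch", "ninja", "flash_attn"):
    check(pkg)
```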
Maintenance & Community
The project is associated with dvlab-research and is an extension of the LLaVA project. Further community interaction details are not explicitly provided in the README.
Licensing & Compatibility
The data and checkpoints are licensed for research use only. They are subject to the licenses of LLaVA, LLaMA, Vicuna, and GPT-4. The dataset is licensed under CC BY-NC 4.0, restricting commercial use.
Limitations & Caveats
The project's data and models are restricted to research use; commercial use is prohibited by the licensing terms noted above. Training requires substantial GPU resources (8x A100 80GB GPUs).