Video-LLaVA: Multimodal model for video/image understanding via LLM
Top 15.0% on sourcepulse
Video-LLaVA addresses the challenge of unified visual representation for both images and videos, enabling a single Large Language Model (LLM) to perform reasoning across both modalities. It targets researchers and developers working on multimodal AI, offering a powerful tool for video understanding and interaction.
How It Works
The core innovation lies in aligning visual features from both images and videos before projecting them into the LLM's feature space. This "alignment before projection" strategy creates a unified visual representation, allowing the LLM to process and reason about both modalities simultaneously without requiring explicit image-video pairs during training. This approach leverages the strengths of both image and video data, leading to superior performance compared to models specialized for a single modality.
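As an illustration of the idea, the toy PyTorch sketch below uses placeholder modules and toy dimensions (ImageEncoder, VideoEncoder, and the shared projector here are illustrative, not the project's actual classes): because the image and video encoders already emit features in a shared space, a single projector can map either modality into the LLM's input embedding space.

```python
import torch
import torch.nn as nn

DIM, LLM_HIDDEN = 256, 512   # toy sizes; the real model uses much larger ones

# Stand-ins for LanguageBind-style encoders whose outputs already live in a
# shared ("aligned") visual embedding space.
class ImageEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(3 * 32 * 32, DIM)
    def forward(self, x):                          # (B, 3, 32, 32)
        return self.net(x.flatten(1)).unsqueeze(1)  # (B, 1, DIM)

class VideoEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(3 * 32 * 32, DIM)
    def forward(self, x):                          # (B, T, 3, 32, 32)
        return self.net(x.flatten(2))              # (B, T, DIM)

# Because alignment happens *before* projection, one projector maps
# either modality into the LLM's hidden size.
projector = nn.Sequential(nn.Linear(DIM, LLM_HIDDEN), nn.GELU(),
                          nn.Linear(LLM_HIDDEN, LLM_HIDDEN))

image_tokens = projector(ImageEncoder()(torch.randn(2, 3, 32, 32)))     # (2, 1, 512)
video_tokens = projector(VideoEncoder()(torch.randn(2, 8, 3, 32, 32)))  # (2, 8, 512)
# Both token sequences live in the same space and can be interleaved with
# text embeddings as LLM input for joint image/video reasoning.
```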
Quick Start & Requirements
Install the package from source with pip install -e . and, for training dependencies, pip install -e ".[train]". Additional packages such as flash-attn, decord, opencv-python, and pytorchvideo are required.
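Once installed, inference can be run in a few lines. The sketch below is one possible path, assuming the Hugging Face Transformers integration (VideoLlavaProcessor / VideoLlavaForConditionalGeneration) and the LanguageBind/Video-LLaVA-7B-hf checkpoint, with decord used for frame sampling; the video path is a placeholder, and the repository also provides its own CLI and Gradio demo.

```python
import numpy as np
from decord import VideoReader, cpu
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

# Hugging Face integration; model id is the HF-format checkpoint.
model_id = "LanguageBind/Video-LLaVA-7B-hf"
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Sample 8 frames uniformly from a local video (path is a placeholder).
vr = VideoReader("sample_demo.mp4", ctx=cpu(0))
indices = np.linspace(0, len(vr) - 1, 8).astype(int)
clip = vr.get_batch(indices).asnumpy()          # (8, H, W, 3) uint8 frames

prompt = "USER: <video>\nWhy is this video funny? ASSISTANT:"
inputs = processor(text=prompt, videos=clip, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```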
Highlighted Details
Maintenance & Community
The project is actively maintained by the PKU-YuanGroup, with recent updates including EMNLP 2024 acceptance and community contributions. Related projects like LanguageBind and MoE-LLaVA are also available.
Licensing & Compatibility
The majority of the project is released under the Apache 2.0 license. However, the service is intended for non-commercial use only, subject to the LLaMA model license, OpenAI's Terms of Use, and ShareGPT's Privacy Practices.
Limitations & Caveats
The service is a research preview and has non-commercial use restrictions due to underlying model licenses. Specific details on data usage and privacy are tied to third-party services.