Video-LLaVA by PKU-YuanGroup

Video-LLaVA: Multimodal model for video/image understanding via LLM

created 1 year ago
3,319 stars

Top 15.0% on sourcepulse

View on GitHub
Project Summary

Video-LLaVA addresses the challenge of unified visual representation for both images and videos, enabling a single Large Language Model (LLM) to perform reasoning across both modalities. It targets researchers and developers working on multimodal AI, offering a powerful tool for video understanding and interaction.

How It Works

The core innovation lies in aligning visual features from both images and videos before projecting them into the LLM's feature space. This "alignment before projection" strategy creates a unified visual representation, allowing the LLM to process and reason about both modalities simultaneously without requiring explicit image-video pairs during training. This approach leverages the strengths of both image and video data, leading to superior performance compared to models specialized for a single modality.
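
Conceptually, the front end can be pictured as two pre-aligned encoders feeding one shared projector into the LLM. The sketch below is illustrative only, not the project's actual code; module names and shapes are assumptions:

```python
import torch
import torch.nn as nn


class UnifiedVisualFrontEnd(nn.Module):
    """Illustrative sketch of "alignment before projection" (hypothetical module names)."""

    def __init__(self, image_encoder: nn.Module, video_encoder: nn.Module,
                 visual_dim: int, llm_dim: int):
        super().__init__()
        # The encoders are assumed to be pre-aligned (LanguageBind-style), so their
        # outputs already live in one common visual feature space.
        self.image_encoder = image_encoder
        self.video_encoder = video_encoder
        # A single projector is shared by both modalities.
        self.shared_projector = nn.Sequential(
            nn.Linear(visual_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, images: torch.Tensor = None, videos: torch.Tensor = None) -> torch.Tensor:
        tokens = []
        if images is not None:                       # (B, 3, H, W) -> (B, N_img, visual_dim)
            tokens.append(self.image_encoder(images))
        if videos is not None:                       # (B, T, 3, H, W) -> (B, N_vid, visual_dim)
            tokens.append(self.video_encoder(videos))
        visual_tokens = torch.cat(tokens, dim=1)     # unified token sequence
        return self.shared_projector(visual_tokens)  # ready to prepend to the LLM's text embeddings
```

Because the projector never has to reconcile two different feature spaces, image-only and video-only training batches can both improve the same visual interface to the LLM.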

Quick Start & Requirements

  • Install: Clone the repository and install dependencies using pip install -e . and pip install -e ".[train]". Additional packages like flash-attn, decord, opencv-python, and pytorchvideo are required.
  • Prerequisites: Python >= 3.10, PyTorch == 2.0.1, CUDA >= 11.7.
  • Demos: Online demos are available on Hugging Face Spaces and OpenXLab.
  • Documentation: Installation and inference details are provided in the README; a minimal inference sketch follows below.
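
For orientation, here is a minimal zero-shot video-QA sketch via the Hugging Face Transformers integration. The checkpoint name (LanguageBind/Video-LLaVA-7B-hf), prompt template, and video path are assumptions to adapt to your setup; frame sampling uses decord, one of the packages listed above:

```python
import numpy as np
from decord import VideoReader
from transformers import VideoLlavaForConditionalGeneration, VideoLlavaProcessor

model_id = "LanguageBind/Video-LLaVA-7B-hf"  # assumed Hub checkpoint name
processor = VideoLlavaProcessor.from_pretrained(model_id)
model = VideoLlavaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto"  # device_map requires the accelerate package
)

# Sample 8 evenly spaced frames from a local clip (path is a placeholder).
vr = VideoReader("example.mp4")
indices = np.linspace(0, len(vr) - 1, 8).astype(int)
frames = vr.get_batch(indices).asnumpy()  # (8, H, W, 3) uint8 frames

prompt = "USER: <video>\nWhat is happening in this video? ASSISTANT:"
inputs = processor(text=prompt, videos=frames, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=80)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The same processor also accepts images, so image and video prompts go through one interface.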

Highlighted Details

  • Achieves state-of-the-art results on zero-shot video question-answering benchmarks (MSRVTT-QA, MSVD-QA, TGIF-QA).
  • Demonstrates remarkable interactive capabilities between images and videos, even without explicit image-video pair training data.
  • Supports LoRA fine-tuning for efficient adaptation (see the sketch after this list).
  • Integrated into the Hugging Face Transformers library.
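
The repository ships its own LoRA training scripts; purely as an illustration of the idea, a hedged sketch using the peft library on the Transformers checkpoint might look like the following (target modules, ranks, and the checkpoint name are assumptions):

```python
from peft import LoraConfig, get_peft_model
from transformers import VideoLlavaForConditionalGeneration

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

# Low-rank adapters on the attention projections; values are illustrative defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```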

Maintenance & Community

The project is actively maintained by the PKU-YuanGroup, with recent updates including EMNLP 2024 acceptance and community contributions. Related projects like LanguageBind and MoE-LLaVA are also available.

Licensing & Compatibility

The majority of the project is released under the Apache 2.0 license. However, the service is intended for non-commercial use only, subject to the LLaMA model license, OpenAI's Terms of Use, and ShareGPT's Privacy Practices.

Limitations & Caveats

The service is a research preview and has non-commercial use restrictions due to underlying model licenses. Specific details on data usage and privacy are tied to third-party services.

Health Check

  • Last commit: 8 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 96 stars in the last 90 days
