Chat-UniVi by PKU-YuanGroup

Research paper for multimodal image and video understanding with LLMs

created 1 year ago
944 stars

Top 39.7% on sourcepulse

View on GitHub
Project Summary

Chat-UniVi provides a unified framework for large language models (LLMs) to understand both images and videos. It addresses the challenge of efficiently processing and integrating visual information from diverse media types into a single model, benefiting researchers and developers working on multimodal AI applications.

How It Works

Chat-UniVi employs a novel unified visual representation using dynamic visual tokens. This approach allows a limited number of tokens to capture both the spatial details of images and the temporal relationships within videos. The model is jointly trained on a mixed dataset of images and videos, enabling it to handle both modalities without architectural changes. This unified representation and joint training strategy lead to superior performance compared to models specialized for single modalities.
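The exact merging procedure is defined in the paper and the repository; purely as intuition, the hypothetical sketch below greedily merges the most similar vision-encoder patch tokens and averages them, yielding a small set of "dynamic" tokens. Function and tensor names are illustrative, not the project's API.

```python
import torch
import torch.nn.functional as F

def merge_visual_tokens(tokens: torch.Tensor, num_merged: int) -> torch.Tensor:
    """Illustrative only: greedily merge the most similar patch tokens until
    `num_merged` clusters remain, averaging each cluster into one token.

    tokens: (N, D) patch features from a vision encoder.
    returns: (num_merged, D) merged "dynamic" tokens.
    """
    clusters = [[i] for i in range(tokens.shape[0])]    # every token starts alone
    feats = tokens.clone()                              # current cluster centroids
    while len(clusters) > num_merged:
        normed = F.normalize(feats, dim=-1)
        sim = normed @ normed.T                         # cosine similarity
        sim.fill_diagonal_(-float("inf"))               # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])  # most similar pair
        i, j = min(i, j), max(i, j)
        clusters[i].extend(clusters.pop(j))             # merge cluster j into i
        feats = torch.stack([tokens[c].mean(dim=0) for c in clusters])
    return feats

# e.g. compress a 24x24 grid of ViT patch tokens into 64 dynamic tokens
patch_tokens = torch.randn(576, 1024)
dynamic_tokens = merge_visual_tokens(patch_tokens, num_merged=64)
print(dynamic_tokens.shape)  # torch.Size([64, 1024])
```

In the paper, a similar merging is applied across video frames so that temporally redundant content shares tokens; consult the paper and repository for the actual algorithm and token budgets.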

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n chatunivi python=3.10), activate it (conda activate chatunivi), and install dependencies from the repository root with pip install -e .
  • Prerequisites: Python >= 3.10. For training, ninja and flash-attn are recommended. For Windows users, comment out deepspeed in pyproject.toml to avoid installation errors.
  • Demo: A Hugging Face demo is available at https://huggingface.co/spaces/Chat-UniVi/Chat-UniVi (a checkpoint-download sketch follows this list).
  • Documentation: Installation and usage details are in TRAIN_AND_VALIDATE.md and VISUALIZATION.md.
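The list above covers installation; to fetch a released checkpoint before following TRAIN_AND_VALIDATE.md, a minimal sketch using huggingface_hub is shown below. The repo id is an assumption inferred from the demo URL, so confirm the exact model identifiers in the project README.

```python
# Sketch: download a pre-trained checkpoint with huggingface_hub.
# NOTE: the repo id below is an assumption inferred from the demo URL
# (https://huggingface.co/spaces/Chat-UniVi/Chat-UniVi); check the project
# README for the exact identifiers of Chat-UniVi-7B / Chat-UniVi-13B.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="Chat-UniVi/Chat-UniVi")
print("Checkpoint downloaded to:", local_dir)
```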

Highlighted Details

  • Achieves state-of-the-art performance on various image and video understanding benchmarks, outperforming specialized models.
  • Supports variable-length videos by eliminating zero-filling, significantly boosting performance.
  • Offers pre-trained models like Chat-UniVi-7B and Chat-UniVi-13B, with the latter trainable on 8 A100 GPUs in 3 days.
  • Selected as a Highlight paper at CVPR 2024.

Maintenance & Community

The project is actively maintained by the PKU-YuanGroup. Updates and discussions can be followed on the GitHub repository. Related projects like Video-LLaVA are also linked.

Licensing & Compatibility

The code is released under the Apache 2.0 license. However, the service is a research preview intended for non-commercial use only, subject to the LLaMA model license, OpenAI's data terms, and ShareGPT's privacy practices.

Limitations & Caveats

The project is primarily intended for research and non-commercial use due to licensing restrictions tied to underlying models and data. A recent revision corrected video evaluation performance figures, indicating ongoing refinement.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 16 stars in the last 90 days
