Research code and paper for multimodal image and video understanding with LLMs
Chat-UniVi provides a unified framework for large language models (LLMs) to understand both images and videos. It addresses the challenge of efficiently processing and integrating visual information from diverse media types into a single model, benefiting researchers and developers working on multimodal AI applications.
How It Works
Chat-UniVi employs a novel unified visual representation using dynamic visual tokens. This approach allows a limited number of tokens to capture both the spatial details of images and the temporal relationships within videos. The model is jointly trained on a mixed dataset of images and videos, enabling it to handle both modalities without architectural changes. This unified representation and joint training strategy lead to superior performance compared to models specialized for single modalities.
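The sketch below illustrates the token-merging idea in a simplified form. It is not the official implementation: Chat-UniVi uses a parameter-free clustering step to merge patch tokens, whereas this example stands in a plain k-means-style merge; all function and variable names here are illustrative assumptions.

```python
# Illustrative sketch: compress a dense grid of patch tokens into a smaller set
# of "dynamic" visual tokens by clustering similar tokens and averaging each
# cluster. The real method differs; this only conveys the shape of the idea.
import torch

def merge_visual_tokens(patch_tokens: torch.Tensor, num_merged: int, iters: int = 10) -> torch.Tensor:
    """patch_tokens: (N, D) features from a vision encoder; returns (num_merged, D)."""
    # Initialize cluster centers from evenly spaced patch tokens.
    idx = torch.linspace(0, patch_tokens.size(0) - 1, num_merged).long()
    centers = patch_tokens[idx].clone()
    for _ in range(iters):
        # Assign each patch token to its nearest center.
        assign = torch.cdist(patch_tokens, centers).argmin(dim=1)  # (N,)
        # Recompute each center as the mean of its assigned tokens.
        for k in range(num_merged):
            members = patch_tokens[assign == k]
            if members.numel() > 0:
                centers[k] = members.mean(dim=0)
    return centers  # compact set of visual tokens passed to the LLM

# Example: 576 patch tokens of dim 1024 from one image reduced to 64 tokens;
# for video, frame-level tokens could additionally be merged across time.
image_tokens = torch.randn(576, 1024)
dynamic_tokens = merge_visual_tokens(image_tokens, num_merged=64)
print(dynamic_tokens.shape)  # torch.Size([64, 1024])
```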
Quick Start & Requirements
Create a conda environment (conda create -n chatunivi python=3.10), activate it, and install dependencies with pip install -e . The ninja and flash-attn packages are recommended. For Windows users, comment out deepspeed in pyproject.toml to avoid installation errors. See TRAIN_AND_VALIDATE.md and VISUALIZATION.md for training and visualization instructions.
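After installation, a quick sanity check (a minimal sketch, not part of the repository) can confirm that PyTorch sees a GPU and that the recommended optional packages imported cleanly before launching training or the demo:

```python
# Minimal environment check: reports CUDA availability and whether the
# recommended/optional packages mentioned above are installed.
import importlib.util
import torch

print("CUDA available:", torch.cuda.is_available())
for pkg in ("flash_attn", "deepspeed"):  # deepspeed may be skipped on Windows
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'not installed'}")
```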
Highlighted Details
Maintenance & Community
The project is actively maintained by the PKU-YuanGroup. Updates and discussions can be followed on the GitHub repository. Related projects like Video-LLaVA are also linked.
Licensing & Compatibility
The code is released under the Apache 2.0 license. However, the service is a research preview intended for non-commercial use only, subject to the LLaMA model license, OpenAI's data terms, and ShareGPT's privacy practices.
Limitations & Caveats
The project is primarily intended for research and non-commercial use due to licensing restrictions tied to underlying models and data. A recent revision corrected video evaluation performance figures, indicating ongoing refinement.