Research code and paper for multimodal image and video understanding with LLMs
Chat-UniVi provides a unified framework for large language models (LLMs) to understand both images and videos. It addresses the challenge of efficiently processing and integrating visual information from diverse media types into a single model, benefiting researchers and developers working on multimodal AI applications.
How It Works
Chat-UniVi employs a novel unified visual representation using dynamic visual tokens. This approach allows a limited number of tokens to capture both the spatial details of images and the temporal relationships within videos. The model is jointly trained on a mixed dataset of images and videos, enabling it to handle both modalities without architectural changes. This unified representation and joint training strategy lead to superior performance compared to models specialized for single modalities.
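The sketch below illustrates the token-merging idea in a simplified form. It is not the official implementation: Chat-UniVi uses a parameter-free clustering step to merge patch tokens, whereas this example stands in a plain k-means-style merge; all function and variable names here are illustrative assumptions.

```python
# Illustrative sketch: compress a dense grid of patch tokens into a smaller set
# of "dynamic" visual tokens by clustering similar tokens and averaging each
# cluster. The real method differs; this only conveys the shape of the idea.
import torch

def merge_visual_tokens(patch_tokens: torch.Tensor, num_merged: int, iters: int = 10) -> torch.Tensor:
    """patch_tokens: (N, D) features from a vision encoder; returns (num_merged, D)."""
    # Initialize cluster centers from evenly spaced patch tokens.
    idx = torch.linspace(0, patch_tokens.size(0) - 1, num_merged).long()
    centers = patch_tokens[idx].clone()
    for _ in range(iters):
        # Assign each patch token to its nearest center.
        assign = torch.cdist(patch_tokens, centers).argmin(dim=1)  # (N,)
        # Recompute each center as the mean of its assigned tokens.
        for k in range(num_merged):
            members = patch_tokens[assign == k]
            if members.numel() > 0:
                centers[k] = members.mean(dim=0)
    return centers  # compact set of visual tokens passed to the LLM

# Example: 576 patch tokens of dim 1024 from one image reduced to 64 tokens;
# for video, frame-level tokens could additionally be merged across time.
image_tokens = torch.randn(576, 1024)
dynamic_tokens = merge_visual_tokens(image_tokens, num_merged=64)
print(dynamic_tokens.shape)  # torch.Size([64, 1024])
```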
Quick Start & Requirements
Create a conda environment (conda create -n chatunivi python=3.10), activate it, and install dependencies with pip install -e . The ninja and flash-attn packages are recommended. For Windows users, comment out deepspeed in pyproject.toml to avoid installation errors. See TRAIN_AND_VALIDATE.md and VISUALIZATION.md for training and visualization instructions.
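After installation, a quick sanity check (a minimal sketch, not part of the repository) can confirm that PyTorch sees a GPU and that the recommended optional packages imported cleanly before launching training or the demo:

```python
# Minimal environment check: reports CUDA availability and whether the
# recommended/optional packages mentioned above are installed.
import importlib.util
import torch

print("CUDA available:", torch.cuda.is_available())
for pkg in ("flash_attn", "deepspeed"):  # deepspeed may be skipped on Windows
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'installed' if found else 'not installed'}")
```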
Highlighted Details
Maintenance & Community
The project is actively maintained by the PKU-YuanGroup. Updates and discussions can be followed on the GitHub repository. Related projects like Video-LLaVA are also linked.
Licensing & Compatibility
The code is released under the Apache 2.0 license. However, the service is a research preview intended for non-commercial use only, subject to the LLaMA model license, OpenAI's data terms, and ShareGPT's privacy practices.
Limitations & Caveats
The project is primarily intended for research and non-commercial use due to licensing restrictions tied to underlying models and data. A recent revision corrected video evaluation performance figures, indicating ongoing refinement.