TimeChat: a multimodal LLM for long video understanding (CVPR 2024 research paper)
TimeChat is a multimodal large language model designed for long video understanding, with an emphasis on temporal grounding. It targets researchers and developers working on video analysis and aims to provide accurate temporal localization, dense captioning, and highlight detection by integrating timestamp information directly into the model's architecture.
How It Works
TimeChat employs a timestamp-aware frame encoder to bind visual content with its corresponding timestamp. A sliding video Q-Former generates a variable-length video token sequence, allowing the model to efficiently process videos of diverse durations. This approach enables a more nuanced understanding of temporal relationships within video content.
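To make the mechanism concrete, below is a minimal PyTorch sketch of the sliding video Q-Former idea: a fixed set of learnable queries cross-attends to one window of timestamp-conditioned frame features at a time, so the number of output video tokens grows with video length. All names, dimensions, and the crude additive timestamp embedding here are illustrative assumptions, not TimeChat's actual implementation.

```python
import torch
import torch.nn as nn

class SlidingVideoQFormer(nn.Module):
    """Toy sliding Q-Former: compresses frame features window by window
    into a variable-length video token sequence (illustrative only)."""

    def __init__(self, dim=768, window=32, queries_per_window=32, heads=8):
        super().__init__()
        self.window = window
        # Learnable queries that attend to one window of frames at a time.
        self.queries = nn.Parameter(torch.randn(queries_per_window, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats, timestamps):
        # frame_feats: (T, dim) per-frame features from the vision encoder
        # timestamps:  (T,) seconds; fused via a simple additive embedding as a
        # stand-in for TimeChat's timestamp-aware frame encoder.
        time_emb = torch.sin(timestamps.unsqueeze(-1) * torch.ones_like(frame_feats))
        feats = frame_feats + 0.1 * time_emb

        video_tokens = []
        for start in range(0, feats.size(0), self.window):
            chunk = feats[start:start + self.window].unsqueeze(0)  # (1, w, dim)
            q = self.queries.unsqueeze(0)                          # (1, Q, dim)
            out, _ = self.attn(q, chunk, chunk)                    # cross-attention
            video_tokens.append(out.squeeze(0))
        # Token count = num_windows * queries_per_window, so it scales with duration.
        return torch.cat(video_tokens, dim=0)

frame_feats = torch.randn(96, 768)                  # 96 sampled frames
timestamps = torch.arange(96, dtype=torch.float32)  # one frame per second
tokens = SlidingVideoQFormer()(frame_feats, timestamps)
print(tokens.shape)  # torch.Size([96, 768]): 3 windows x 32 queries each
```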
Quick Start & Requirements
Create the conda environment from environment.yml, activate it (conda activate timechat), and install PyTorch with CUDA 11.3 support (pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113). Additional requirements include ffmpeg and the pre-trained EVA ViT-g, InstructBLIP Q-Former, LLaMA-2-7B, and Video-LLaMA-2-7B checkpoints.
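After installation, a quick sanity check such as the one below can confirm the environment matches these requirements. This is an optional helper, not part of the repository, and the checkpoint paths are placeholders for wherever the downloads are stored.

```python
import shutil
from pathlib import Path

import torch

# Verify the PyTorch / CUDA 11.3 build requested above.
print("torch:", torch.__version__)            # expect 1.12.1+cu113
print("CUDA available:", torch.cuda.is_available())

# ffmpeg must be on PATH for video decoding.
assert shutil.which("ffmpeg"), "ffmpeg not found on PATH"

# Placeholder paths for the pre-trained checkpoints listed above;
# adjust them to your actual download locations.
checkpoints = [
    "ckpt/eva_vit_g.pth",              # EVA ViT-g
    "ckpt/instruct_blip_qformer.pth",  # InstructBLIP Q-Former
    "ckpt/llama-2-7b",                 # LLaMA-2-7B weights
    "ckpt/video_llama_2_7b.pth",       # Video-LLaMA-2-7B
]
for path in checkpoints:
    status = "ok" if Path(path).exists() else "MISSING"
    print(f"{status:7s} {path}")
```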
Maintenance & Community
The project accompanies a CVPR 2024 paper. Links to an FAQ and evaluation details are provided.
Licensing & Compatibility
The project is intended for non-commercial research use only.
Limitations & Caveats
The model is released as a research preview; its use for illegal, harmful, violent, racist, or sexual purposes is strictly prohibited.