TimeChat: a multimodal LLM for long video understanding (CVPR 2024 research paper)
TimeChat is a multimodal large language model designed for long video understanding, with an emphasis on temporal grounding. It targets researchers and developers working on video analysis and aims to provide accurate temporal localization, dense captioning, and highlight detection by integrating timestamp information directly into the model's architecture.
How It Works
TimeChat employs a timestamp-aware frame encoder to bind visual content with its corresponding timestamp. A sliding video Q-Former generates a variable-length video token sequence, allowing the model to efficiently process videos of diverse durations. This approach enables a more nuanced understanding of temporal relationships within video content.
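To make the mechanism concrete, below is a minimal PyTorch sketch of the sliding video Q-Former idea: a fixed set of learnable queries cross-attends to one window of timestamp-conditioned frame features at a time, so the number of output video tokens grows with video length. All names, dimensions, and the crude additive timestamp embedding here are illustrative assumptions, not TimeChat's actual implementation.

```python
import torch
import torch.nn as nn

class SlidingVideoQFormer(nn.Module):
    """Toy sliding Q-Former: compresses frame features window by window
    into a variable-length video token sequence (illustrative only)."""

    def __init__(self, dim=768, window=32, queries_per_window=32, heads=8):
        super().__init__()
        self.window = window
        # Learnable queries that attend to one window of frames at a time.
        self.queries = nn.Parameter(torch.randn(queries_per_window, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, frame_feats, timestamps):
        # frame_feats: (T, dim) per-frame features from the vision encoder
        # timestamps:  (T,) seconds; fused via a simple additive embedding as a
        # stand-in for TimeChat's timestamp-aware frame encoder.
        time_emb = torch.sin(timestamps.unsqueeze(-1) * torch.ones_like(frame_feats))
        feats = frame_feats + 0.1 * time_emb

        video_tokens = []
        for start in range(0, feats.size(0), self.window):
            chunk = feats[start:start + self.window].unsqueeze(0)  # (1, w, dim)
            q = self.queries.unsqueeze(0)                          # (1, Q, dim)
            out, _ = self.attn(q, chunk, chunk)                    # cross-attention
            video_tokens.append(out.squeeze(0))
        # Token count = num_windows * queries_per_window, so it scales with duration.
        return torch.cat(video_tokens, dim=0)

frame_feats = torch.randn(96, 768)                  # 96 sampled frames
timestamps = torch.arange(96, dtype=torch.float32)  # one frame per second
tokens = SlidingVideoQFormer()(frame_feats, timestamps)
print(tokens.shape)  # torch.Size([96, 768]): 3 windows x 32 queries each
```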
Quick Start & Requirements
Create the conda environment from environment.yml, activate it (conda activate timechat), and install PyTorch with CUDA 11.3 support (pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113). Additional requirements include ffmpeg and the pre-trained EVA ViT-g, InstructBLIP Q-Former, LLaMA-2-7B, and Video-LLaMA-2-7B checkpoints.
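After installation, a quick sanity check such as the one below can confirm the environment matches these requirements. This is an optional helper, not part of the repository, and the checkpoint paths are placeholders for wherever the downloads are stored.

```python
import shutil
from pathlib import Path

import torch

# Verify the PyTorch / CUDA 11.3 build requested above.
print("torch:", torch.__version__)            # expect 1.12.1+cu113
print("CUDA available:", torch.cuda.is_available())

# ffmpeg must be on PATH for video decoding.
assert shutil.which("ffmpeg"), "ffmpeg not found on PATH"

# Placeholder paths for the pre-trained checkpoints listed above;
# adjust them to your actual download locations.
checkpoints = [
    "ckpt/eva_vit_g.pth",              # EVA ViT-g
    "ckpt/instruct_blip_qformer.pth",  # InstructBLIP Q-Former
    "ckpt/llama-2-7b",                 # LLaMA-2-7B weights
    "ckpt/video_llama_2_7b.pth",       # Video-LLaMA-2-7B
]
for path in checkpoints:
    status = "ok" if Path(path).exists() else "MISSING"
    print(f"{status:7s} {path}")
```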
Maintenance & Community
The project accompanies a CVPR 2024 paper. Links to an FAQ and evaluation details are provided.
Licensing & Compatibility
The project is intended for non-commercial research use only.
Limitations & Caveats
The model is released as a research preview; its use for illegal, harmful, violent, racist, or sexual purposes is strictly prohibited.