Video LLM for fine-grained video moment understanding
VTimeLLM is a PyTorch implementation for fine-grained video moment understanding and temporal reasoning, targeting researchers and developers in video-language modeling. It offers enhanced temporal awareness and intent alignment for LLMs processing video content.
How It Works
VTimeLLM employs a novel boundary-aware three-stage training strategy. It first aligns features using image-text pairs, then enhances temporal-boundary awareness with multi-event videos and temporal QA, and finally refines temporal understanding and human intent alignment through instruction tuning on high-quality dialogue datasets. This approach aims to outperform existing Video LLMs in fine-grained temporal tasks.
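The staged curriculum can be pictured as three sequential fine-tuning passes, each over a different data source and with a different goal. The sketch below is purely illustrative: the Stage dataclass, dataset labels, and the fine_tune call are hypothetical stand-ins for exposition, not VTimeLLM's actual training code (see train.md for that).

```python
# Illustrative sketch of a boundary-aware three-stage training schedule.
# All names here (Stage, SCHEDULE, fine_tune, dataset labels) are hypothetical
# stand-ins, not VTimeLLM's actual training API.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str    # stage identifier
    data: str    # data source driving this stage
    goal: str    # what the stage is meant to teach the model

SCHEDULE = [
    Stage("feature_alignment", "image_text_pairs",
          "map visual features into the LLM's token space"),
    Stage("boundary_awareness", "multi_event_videos_with_temporal_qa",
          "localize event boundaries and answer when-questions"),
    Stage("instruction_tuning", "high_quality_dialogue",
          "follow human intent in temporal dialogue"),
]

for stage in SCHEDULE:
    # fine_tune(model, dataset=stage.data)  # hypothetical trainer call
    print(f"{stage.name}: train on {stage.data} -> {stage.goal}")
```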
Quick Start & Requirements
Within a conda environment (python=3.10), install the dependencies:

```
pip install -r requirements.txt
pip install ninja flash-attn --no-build-isolation
```

See offline_demo.md for running the demo and train.md for training instructions.
Maintenance & Community
Development appears inactive; the last update was about a year ago.
Licensing & Compatibility
The project is released under a non-commercial license, restricting its use in commercial products.
Limitations & Caveats
Specific hardware requirements for training or running the models are not detailed in the README.