VTimeLLM by huangb23

Video LLM for fine-grained video moment understanding

created 1 year ago
282 stars

Top 93.5% on sourcepulse

Project Summary

VTimeLLM is a PyTorch implementation of a Video LLM for fine-grained video moment understanding and temporal reasoning, aimed at researchers and developers working on video-language modeling. It is designed to give LLMs stronger temporal awareness and better intent alignment when processing video content.

How It Works

VTimeLLM employs a novel boundary-aware three-stage training strategy. It first aligns features using image-text pairs, then enhances temporal-boundary awareness with multi-event videos and temporal QA, and finally refines temporal understanding and human intent alignment through instruction tuning on high-quality dialogue datasets. This approach aims to outperform existing Video LLMs in fine-grained temporal tasks.
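The three-stage curriculum described above can be outlined as a small data structure. This is only an illustrative sketch: the class name TrainingStage, its fields, and the stage labels are hypothetical and are not taken from the repository, whose actual training entry points are documented in train.md.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingStage:
    """One step of the curriculum (hypothetical structure, not the repo's API)."""
    order: int
    name: str  # illustrative label, not an identifier from the codebase
    data: str  # kind of training data, as described in the summary
    goal: str  # what the stage is meant to achieve


STAGES = [
    TrainingStage(1, "feature_alignment", "image-text pairs",
                  "align features"),
    TrainingStage(2, "boundary_perception", "multi-event videos and temporal QA",
                  "enhance temporal-boundary awareness"),
    TrainingStage(3, "instruction_tuning", "high-quality dialogue datasets",
                  "refine temporal understanding and human intent alignment"),
]


def describe(stages):
    """Return one human-readable line per stage, sorted into training order."""
    ordered = sorted(stages, key=lambda s: s.order)
    return [f"Stage {s.order} ({s.name}): train on {s.data} to {s.goal}"
            for s in ordered]


if __name__ == "__main__":
    for line in describe(STAGES):
        print(line)
```

The ordering matters in this outline: each stage assumes the capabilities built by the one before it, so the sketch sorts by the order field rather than relying on list position.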

Quick Start & Requirements

  • Install dependencies with pip install -r requirements.txt inside a conda environment (python=3.10).
  • Training requires additional packages: pip install ninja flash-attn --no-build-isolation.
  • Offline demo instructions are available in offline_demo.md.
  • Training instructions are in train.md.
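The setup steps above can be checked with a small dry-run helper before anything is installed. This is a minimal sketch: the function names python_matches and install_commands are hypothetical, and the commands are copied from the list above rather than from the repository itself. Nothing is executed; the helper only reports whether the interpreter is Python 3.10 and prints the commands.

```python
import sys

REQUIRED_PYTHON = (3, 10)  # from the summary: conda environment with python=3.10


def python_matches(version_info=None, required=REQUIRED_PYTHON):
    """Return True if the major.minor version equals the required one."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == tuple(required)


def install_commands(training=False):
    """Return the pip commands from the summary as argument lists (nothing is run)."""
    commands = [["pip", "install", "-r", "requirements.txt"]]
    if training:
        # Extra packages listed for training in the summary.
        commands.append(
            ["pip", "install", "ninja", "flash-attn", "--no-build-isolation"]
        )
    return commands


if __name__ == "__main__":
    if not python_matches():
        print("Warning: this interpreter is not Python 3.10")
    for cmd in install_commands(training=True):
        print(" ".join(cmd))
```

To actually install, each argument list could be passed to subprocess.run(cmd, check=True) from the repository root, inside the activated conda environment.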

Highlighted Details

  • Official PyTorch implementation of the CVPR 2024 Highlight paper "VTimeLLM: Empower LLM to Grasp Video Moments".
  • Supports LLaMA and ChatGLM3 architectures, with a Chinese version fine-tuned on ChatGLM3-6b.
  • Claims superior performance over existing Video LLMs in fine-grained temporal tasks.
  • Released models, datasets, and extracted features.

Maintenance & Community

  • Recent updates include a code refactor to support both LLaMA and ChatGLM3, and a Chinese fine-tuned version.
  • Builds on and acknowledges several foundational projects: LLaVA, FastChat, Video-ChatGPT, LLaMA, Vid2Seq, and InternVid.

Licensing & Compatibility

  • Licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License (CC BY-NC-ND 4.0).
  • The NonCommercial clause rules out commercial use, and the NoDerivs clause restricts sharing modified versions.

Limitations & Caveats

The project is released under a non-commercial license, restricting its use in commercial products. Specific hardware requirements for training or running the models are not detailed in the README.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 12 stars in the last 90 days
