LongVU by Vision-CAIR

Video-language model for long video understanding

Created 9 months ago · 392 stars · Top 74.5% on sourcepulse

Project Summary

LongVU addresses the challenge of understanding long videos by introducing a spatiotemporal adaptive compression technique. This method enables efficient processing of extended video content for language-based understanding tasks, targeting researchers and developers working with video-language models.

How It Works

LongVU employs a spatiotemporal adaptive compression strategy to handle long videos. It leverages a combination of vision encoders (SigLIP, DINOv2) and language backbones (Qwen2, Llama3.2), inspired by LLaVA and Cambrian architectures. The adaptive compression allows the model to focus on salient temporal and spatial information, reducing computational overhead while preserving crucial details for accurate video-language understanding.
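
To make the temporal side of this concrete, adjacent frames in a long video are often nearly identical, so frames whose features closely match the last retained frame can be dropped before the language model ever sees them. The sketch below illustrates that idea with generic per-frame feature vectors; the function name, the 0.9 threshold, and the use of plain cosine similarity are illustrative assumptions, not LongVU's exact procedure.

```python
import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_features: torch.Tensor, threshold: float = 0.9):
    """Illustrative temporal pruning: keep a frame only if its feature
    vector differs enough from the last kept frame.

    frame_features: (num_frames, feature_dim) tensor, e.g. one pooled
    vision-encoder embedding per frame. The 0.9 threshold is a
    placeholder, not a value from the paper.
    """
    keep = [0]  # always keep the first frame
    for i in range(1, frame_features.shape[0]):
        sim = F.cosine_similarity(frame_features[i], frame_features[keep[-1]], dim=0)
        if sim < threshold:  # frame is sufficiently novel, so retain it
            keep.append(i)
    return keep

# Toy usage: 8 random "frame" embeddings of dimension 768.
feats = torch.randn(8, 768)
print(prune_redundant_frames(feats))
```
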

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (conda create -n longvu python=3.10), activate it (conda activate longvu), and install requirements (pip install -r requirements.txt).
  • Prerequisites: PyTorch, decord (see the frame-sampling sketch after this list). Requires a minimum of 40GB of GPU VRAM for local demo inference.
  • Demo: Run python app.py locally.
  • Resources: Download checkpoints for Qwen2 or Llama3.2 models.
  • Links: HF Demo, Windows Instructions
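
The decord prerequisite is what reads video frames off disk. A minimal sketch of the typical first preprocessing step, uniform frame sampling with decord, is shown below; the frame count and video path are placeholders, and the actual preprocessing LongVU applies (including its adaptive compression) lives in the repo's own utilities.

```python
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 16) -> np.ndarray:
    """Uniformly sample frames from a video file with decord.

    Returns an array of shape (num_frames, H, W, 3). num_frames=16 is
    a placeholder; LongVU's pipeline decides how much temporal detail
    to keep via its adaptive compression.
    """
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(indices).asnumpy()

frames = sample_frames("example.mp4")  # placeholder path
print(frames.shape)
```
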

Highlighted Details

  • Official PyTorch implementation of the ICML 2025 paper "LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding".
  • Supports Qwen2-7B and Llama3.2-3B language backbones.
  • Provides scripts for image and video fine-tuning.
  • Detailed evaluation code available in eval.md.

Maintenance & Community

  • The project is associated with Vision-CAIR.
  • A citation for the research paper is provided.

Licensing & Compatibility

  • The repository does not explicitly state a license in the README.

Limitations & Caveats

  • The README does not specify a license, which may impact commercial use or closed-source integration.
  • Training scripts are optimized for 64 H100-96G GPUs, suggesting significant hardware requirements for custom training.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 26 stars in the last 90 days
