Video-language model for long video understanding
Top 74.5% on sourcepulse
LongVU addresses the challenge of understanding long videos by introducing a spatiotemporal adaptive compression technique. This method enables efficient processing of extended video content for language-based understanding tasks, targeting researchers and developers working with video-language models.
How It Works
LongVU employs a spatiotemporal adaptive compression strategy to handle long videos. It leverages a combination of vision encoders (SigLIP, DINOv2) and language backbones (Qwen2, Llama3.2), inspired by LLaVA and Cambrian architectures. The adaptive compression allows the model to focus on salient temporal and spatial information, reducing computational overhead while preserving crucial details for accurate video-language understanding.
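The temporal side of this compression can be illustrated with a minimal sketch: drop frames whose vision-encoder features are nearly identical to the last kept frame, so redundant content never reaches the language backbone. This is a toy illustration under assumed names and thresholds (`prune_redundant_frames`, the cosine-similarity criterion, and the 0.95 cutoff are assumptions, not LongVU's actual algorithm):

```python
import numpy as np

def prune_redundant_frames(features: np.ndarray, sim_threshold: float = 0.95) -> list[int]:
    """Keep a frame only if its feature vector is sufficiently different
    (cosine similarity below sim_threshold) from the last kept frame.
    Illustrative temporal reduction, not LongVU's real implementation."""
    keep = [0]  # always keep the first frame
    for i in range(1, len(features)):
        a, b = features[keep[-1]], features[i]
        cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if cos < sim_threshold:
            keep.append(i)
    return keep

# Toy example: frames 0-1 are near-duplicates, frame 2 introduces new
# content, and frame 3 duplicates frame 2.
feats = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 0.98]])
print(prune_redundant_frames(feats))  # -> [0, 2]
```

In the real model the features would come from the SigLIP/DINOv2 encoders, and spatial compression would further reduce tokens within each kept frame.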
Quick Start & Requirements
Create a conda environment (`conda create -n longvu python=3.10`), activate it (`conda activate longvu`), and install the requirements (`pip install -r requirements.txt`). Run the demo locally with `python app.py`.
Highlighted Details
Evaluation instructions are provided in eval.md.
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats