Video-XL by VectorSpaceLab

VLM for hour-scale video understanding (research paper)

Created 1 year ago

610 stars

Top 53.7% on SourcePulse

Project Summary

This repository provides Video-XL, a family of efficient Vision-Language Models (VLMs) designed for understanding extremely long videos, including hour-scale content. It targets researchers and practitioners in video analysis and multimodal AI who need to process videos far longer than typical VLM context windows allow.

How It Works

Video-XL employs a reconstructive token compression strategy to efficiently process thousands of video frames. This method, detailed in the Video-XL-Pro paper, reduces the computational and memory footprint, enabling models with relatively few parameters (e.g., 3B) to achieve strong performance on long-form video understanding tasks.
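To make the footprint argument concrete, the following is a rough back-of-the-envelope sketch of how token compression shrinks the sequence a language model has to attend over. It is not taken from the repository: the function name `context_tokens`, the 144 tokens per frame, and the 16x compression ratio are all hypothetical values chosen only for illustration.

```python
import math


def context_tokens(num_frames: int, tokens_per_frame: int, compression_ratio: int = 1) -> int:
    """Estimate how many visual tokens reach the language model.

    All parameters are illustrative assumptions, not values from the
    Video-XL README. compression_ratio=1 means no compression.
    """
    if num_frames < 0 or tokens_per_frame < 0 or compression_ratio < 1:
        raise ValueError("counts must be non-negative and ratio must be >= 1")
    raw = num_frames * tokens_per_frame
    # Round up: a partially filled compressed slot still costs one token.
    return math.ceil(raw / compression_ratio)


if __name__ == "__main__":
    frames = 10_000  # frame count quoted for Video-XL-Pro
    uncompressed = context_tokens(frames, tokens_per_frame=144)
    compressed = context_tokens(frames, tokens_per_frame=144, compression_ratio=16)
    print(f"uncompressed: {uncompressed:,} tokens")
    print(f"compressed:   {compressed:,} tokens")
```

Under these assumed numbers, the visual sequence drops from 1,440,000 tokens to 90,000, which is the kind of reduction that makes thousands of frames tractable on one accelerator. The actual per-frame token count and compression ratio used by Video-XL may differ.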

Quick Start & Requirements

Installation: The codebase is provided, but the README does not give specific installation commands.
Prerequisites: Access to the model weights and, for training or evaluation, potentially large datasets.
Resources: Video-XL-Pro can process 10,000 frames on an 80GB GPU.
Links:
- Citation: https://arxiv.org/abs/2409.14485, https://arxiv.org/abs/2503.18478
- Base Codebase: LongVA
- Evaluation Codebase: LMMs-Eval
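Because the README does not document a usage pipeline, the snippet below is only a generic sketch of how a fixed frame budget (such as the 10,000 frames quoted above) might be spread evenly over a long video. The helper `sample_frame_indices` is hypothetical and is not part of the Video-XL, LongVA, or LMMs-Eval codebases; uniform sampling is an assumption, not the project's confirmed strategy.

```python
def sample_frame_indices(total_frames: int, budget: int) -> list[int]:
    """Pick up to `budget` evenly spaced frame indices from a video.

    Hypothetical helper for illustration; not the repository's sampler.
    """
    if total_frames < 0 or budget <= 0:
        raise ValueError("total_frames must be >= 0 and budget must be > 0")
    if total_frames <= budget:
        # Short video: keep every frame.
        return list(range(total_frames))
    # Integer arithmetic avoids floating-point rounding drift.
    return [i * total_frames // budget for i in range(budget)]


if __name__ == "__main__":
    one_hour_at_30fps = 60 * 60 * 30  # 108,000 source frames
    indices = sample_frame_indices(one_hour_at_30fps, budget=10_000)
    print(len(indices), indices[:5])
```

For scale: at one sampled frame per second, an hour of video is 3,600 frames, so a 10,000-frame budget covers roughly 2.8 hours at that rate.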

Highlighted Details

Achieves hour-scale video understanding capabilities.
Video-XL-Pro processes 10,000 frames on an 80GB GPU with a 3B parameter model.
Selected for Oral presentation at CVPR 2025.
Training data for Video-XL-Pro is released.

Maintenance & Community

The project is actively developed, with recent updates in April 2025.
The README notes the project's acceptance at CVPR 2025.
No explicit community links (Discord, Slack) are provided in the README.

Licensing & Compatibility

Project content is licensed under Apache License 2.0.
Utilizes datasets and checkpoints subject to their original licenses; users must comply with these.
Apache 2.0 is generally permissive for commercial use and closed-source linking.

Limitations & Caveats

The README indicates that the datasets and checkpoints carry their own licensing terms, which users must follow and which may be more restrictive than the Apache 2.0 license on the project content. The README does not provide detailed installation or usage instructions.

Explore Similar Projects

Video-T1 by liuff19

VideoChat-Flash by OpenGVLab

LongVA by EvolvingLMMs-Lab

Flash-VStream by IVGSZ

Long-VITA by VITA-MLLM

LLaVA-Mini by ictnlp

MovieChat by rese1f

CogVLM2 by zai-org

FastVideo by hao-ai-lab

Pyramid-Flow by jy0205

Step-Video-T2V by stepfun-ai

HunyuanVideo by Tencent-Hunyuan