Video understanding model with hierarchical compression for long contexts
VideoChat-Flash is a multimodal large language model designed for advanced video understanding, with a particular focus on long-context scenarios. It targets researchers and developers working on video analysis, supporting tasks such as question answering, grounding, and detailed captioning over extended video inputs. Its primary benefit is the ability to process and understand videos up to three hours long with high accuracy and efficiency.
How It Works
VideoChat-Flash employs a hierarchical compression strategy that encodes each video frame into just 16 tokens, sharply reducing the compute and memory cost of long video sequences. By compressing temporal information this aggressively, the model achieves state-of-the-art results on benchmarks such as AuroraCap, scores 99.1% on a needle-in-a-haystack evaluation over 10,000 frames, and runs inference 5-10x faster than prior models.
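To make the compression concrete, here is a back-of-the-envelope token-budget sketch in Python. The 16 tokens per frame figure comes from the description above; the 1 fps sampling rate and the 576-token uncompressed baseline are illustrative assumptions, not figures from the README.

```python
def visual_token_budget(duration_s: float, fps_sampled: float = 1.0,
                        tokens_per_frame: int = 16) -> int:
    """Visual tokens consumed by a video at a given sampling rate."""
    num_frames = int(duration_s * fps_sampled)
    return num_frames * tokens_per_frame

# A 3-hour video sampled at an assumed 1 fps yields 10,800 frames:
print(visual_token_budget(3 * 3600))                        # 172,800 tokens at 16/frame
print(visual_token_budget(3 * 3600, tokens_per_frame=576))  # 6,220,800 tokens at a
                                                            # typical uncompressed budget
```

At 16 tokens per frame, even a 10,000-frame haystack fits in roughly 160K visual tokens, which is what makes hour-scale contexts tractable.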
Quick Start & Requirements
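The README's exact setup steps are not reproduced here; below is a minimal loading sketch assuming the checkpoint is distributed via the Hugging Face Hub with custom modeling code, the standard pattern for this model family. The repo ID is a hypothetical placeholder, not taken from the README.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Hypothetical repo ID; check the project's README or Hub page for the real one.
model_id = "OpenGVLab/VideoChat-Flash-Qwen2-7B_res448"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumes a bf16-capable GPU
    trust_remote_code=True,
).eval().cuda()
```

Frame sampling, prompt formatting, and the chat entry point are defined by the model's remote code, so consult the repository for the supported inference interface.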
Highlighted Details
Maintenance & Community
The project acknowledges contributions from and references several open-source projects, including InternVideo, UMT, Qwen, and LLaVA-VL. The README does not list community channels (Discord/Slack) or active maintainer information.
Licensing & Compatibility
The README does not explicitly state the license type or any compatibility notes for commercial use or closed-source linking.
Limitations & Caveats
Detailed hardware and software prerequisites for setup and training are not comprehensively listed in the README. The project appears to be actively updated, with new releases and benchmark results indicating ongoing development.