VideoChat-Flash by OpenGVLab

Video modeling research paper with hierarchical compression for long contexts

created 1 year ago
451 stars

Top 67.8% on sourcepulse

Project Summary

VideoChat-Flash is a multimodal large language model designed for advanced video understanding, particularly excelling in long-context scenarios. It targets researchers and developers working with video analysis, enabling tasks like question answering, grounding, and detailed captioning on extended video inputs. The primary benefit is its ability to process and understand videos up to three hours long with high accuracy and efficiency.

How It Works

VideoChat-Flash employs a hierarchical compression strategy, encoding each video frame into a mere 16 tokens. This approach significantly reduces the computational burden and memory footprint associated with long video sequences. By compressing temporal information efficiently, the model achieves state-of-the-art performance on benchmarks like AuroraCap and demonstrates exceptional accuracy (99.1%) in needle-in-a-haystack evaluations for 10,000 frames, while maintaining inference speeds 5-10 times faster than previous models.
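To make the compression idea concrete, here is a minimal sketch of reducing a frame's visual patch tokens to a fixed budget of 16 by mean-pooling contiguous groups. This is purely illustrative: the actual model uses a learned hierarchical compressor, and the patch count (576) and embedding size (1024) below are assumed example values, not the model's real dimensions.

```python
import numpy as np

def compress_frame_tokens(patch_tokens: np.ndarray, target_tokens: int = 16) -> np.ndarray:
    """Compress a frame's patch tokens to `target_tokens` by mean-pooling
    contiguous groups (illustrative stand-in for a learned compressor)."""
    # Split the patch tokens into `target_tokens` roughly equal groups
    groups = np.array_split(patch_tokens, target_tokens, axis=0)
    # One pooled token per group
    return np.stack([g.mean(axis=0) for g in groups])

# Example: a ViT-style frame of 576 patch tokens (24x24 grid), 1024-dim each
frame = np.random.randn(576, 1024)
compressed = compress_frame_tokens(frame)
print(compressed.shape)  # (16, 1024)
```

Whatever the compression mechanism, the payoff is the same: the LLM's context grows by only 16 tokens per frame instead of hundreds, which is what makes multi-hour inputs tractable.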

Quick Start & Requirements

  • Inference details are available via the Hugging Face README.
  • Evaluation can be performed using the provided code or the lmms-eval framework.
  • Training code is available, based on LLaVA for VideoChat-Flash and on XTuner for fine-tuning InternVideo2.5.
  • Specific hardware requirements (e.g., GPU, CUDA versions) are not explicitly detailed in the README but are implied for deep learning model execution.

Highlighted Details

  • Achieves 99.1% accuracy on 10,000-frame needle-in-a-haystack evaluations.
  • Processes videos up to three hours long.
  • Encodes each video frame into only 16 tokens for high efficiency.
  • Offers multiple model variants, including a 7B model supporting 1 million tokens for ultra-long inputs.
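A quick back-of-envelope check shows how these figures fit together. The 16-tokens-per-frame and 1-million-token figures come from the points above; the 1 fps sampling rate is an assumption for illustration only.

```python
TOKENS_PER_FRAME = 16        # per the hierarchical compression scheme
CONTEXT_TOKENS = 1_000_000   # context length of the 7B ultra-long variant
FPS_SAMPLED = 1              # assumed sampling rate (illustrative)

# How many compressed frames fit in the 1M-token context
max_frames = CONTEXT_TOKENS // TOKENS_PER_FRAME
print(max_frames)            # 62500

# A 3-hour video at 1 fps costs well under that budget
three_hour_frames = 3 * 3600 * FPS_SAMPLED
print(three_hour_frames * TOKENS_PER_FRAME)  # 172800 tokens
```

Under these assumptions, even the 10,000-frame needle-in-a-haystack setting consumes only 160,000 tokens, comfortably inside the 1M-token window.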

Maintenance & Community

The project acknowledges contributions and references several open-source projects including InternVideo, UMT, Qwen, and LLaVA-VL. Specific community channels (Discord/Slack) or active maintainer information are not provided in the README.

Licensing & Compatibility

The README does not explicitly state the license type or any compatibility notes for commercial use or closed-source linking.

Limitations & Caveats

Detailed hardware and software prerequisites for setup and training are not comprehensively listed, and the license is unspecified. On the positive side, the project appears to be actively updated with new releases and benchmark results, suggesting ongoing development.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 5
  • Star History: 52 stars in the last 90 days
