Long-context visual language model for million-token processing
Long-VITA is a large multi-modal model designed to process extremely long contexts, exceeding one million tokens, for both image and video understanding. It targets researchers and developers working with extensive visual data and reports state-of-the-art results on benchmarks such as Video-MME among models under 20B parameters.
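For intuition about what a million-token visual context implies, here is a rough back-of-envelope calculation. The tokens-per-frame and text-reserve figures are illustrative assumptions, not numbers published by the Long-VITA project.

```python
# Back-of-envelope: how many sampled video frames fit in a ~1M-token context.
# TOKENS_PER_FRAME and TEXT_RESERVE are illustrative assumptions, not figures
# taken from the Long-VITA README.

CONTEXT_BUDGET = 1_000_000    # advertised visual-token context length
TOKENS_PER_FRAME = 256        # assumed visual tokens per sampled frame
TEXT_RESERVE = 8_192          # assumed tokens reserved for the text prompt

frames = (CONTEXT_BUDGET - TEXT_RESERVE) // TOKENS_PER_FRAME
hours_at_1fps = frames / 3600.0

print(f"~{frames} frames fit in the budget")
print(f"~{hours_at_1fps:.1f} hours of video at 1 frame per second")
```

Under these assumptions, the budget covers a few thousand frames, i.e. roughly an hour of video sampled at one frame per second.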
How It Works
Long-VITA scales to contexts of more than one million visual tokens, although the README does not detail the specific architectural changes that make this possible. The model is trained on 17 million publicly available samples, with a focus on open-source data. A Logits-Masked LM Head is highlighted as a key component of its effectiveness.
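The README does not spell out how the Logits-Masked LM Head works. One common way to implement such a head is to compute vocabulary logits only at positions that actually need them (for example, text positions contributing to the loss), which keeps the logits tensor small when most of a million-token sequence is visual. The sketch below is a minimal illustration under that assumption; the class name `MaskedLMHead` and its interface are hypothetical, not taken from the Long-VITA code.

```python
import torch
import torch.nn as nn

class MaskedLMHead(nn.Module):
    """Hypothetical sketch: project hidden states to vocabulary logits only
    at masked-in positions, avoiding a full [batch, seq, vocab] tensor when
    most of the sequence (e.g. visual tokens) never needs logits."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor, logit_mask: torch.Tensor):
        # hidden_states: [batch, seq, hidden]; logit_mask: [batch, seq] (bool),
        # True where logits (and loss) are actually required.
        selected = hidden_states[logit_mask]   # [num_selected, hidden]
        return self.proj(selected)             # [num_selected, vocab]

# Toy usage: only 8 text positions out of a 4096-token (mostly visual) sequence.
head = MaskedLMHead(hidden_size=64, vocab_size=32000)
h = torch.randn(1, 4096, 64)
mask = torch.zeros(1, 4096, dtype=torch.bool)
mask[0, -8:] = True                            # only the text tail needs logits
logits = head(h, mask)
print(logits.shape)                            # torch.Size([8, 32000])
```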
Quick Start & Requirements
Highlighted Details
Maintenance & Community
The project has recently added an online demo and support in VLMEvalKit (OpenCompass). Training and inference code, logs, deployment code, and model weights are released.
Licensing & Compatibility
The README does not explicitly state the license type or any compatibility notes for commercial use.
Limitations & Caveats
The project primarily targets Ascend NPU and NVIDIA GPU hardware, with framework support for MindSpeed, Megatron, and DeepSpeed. Compatibility with other hardware or frameworks is not documented.
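As a small illustration of this hardware split, the snippet below checks whether an NVIDIA GPU or an Ascend NPU backend is available in a PyTorch environment. The `torch_npu` import follows the Ascend PyTorch adapter convention and is an assumption about the reader's setup; this is not code from the Long-VITA repository.

```python
import torch

def detect_backend() -> str:
    """Best-effort check for the accelerators Long-VITA targets.
    Illustrative sketch only, not code from the project."""
    if torch.cuda.is_available():
        return f"NVIDIA GPU ({torch.cuda.get_device_name(0)})"
    try:
        import torch_npu  # noqa: F401  (Ascend PyTorch adapter, assumed installed on NPU hosts)
        if torch.npu.is_available():
            return "Ascend NPU"
    except ImportError:
        pass
    return "CPU only (neither CUDA nor Ascend NPU detected)"

if __name__ == "__main__":
    print(detect_backend())
```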