Vision-language model for long-context understanding
LongVA is a multimodal large language model designed to process exceptionally long visual contexts, enabling zero-shot transfer of long-context capabilities from language to vision. It is suitable for researchers and practitioners working with extended video or image sequences, offering state-of-the-art performance on benchmarks like Video-MME.
How It Works
LongVA pairs a vision encoder with a long-context large language model to handle thousands of visual tokens. It processes long inputs by sampling frames from videos (or splitting images into segments), encoding them into visual tokens, and feeding the resulting sequence to the LLM. This allows it to process up to 2,000 frames, or more than 200K visual tokens, while maintaining context over extended durations.
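The frame-sampling step can be illustrated with a short sketch. The sampling below uses OpenCV and is only a minimal illustration; `visual_tokenize` and `llm_generate` are hypothetical placeholders for LongVA's vision encoder and language model, not functions from the repository.

```python
# Minimal sketch of uniform frame sampling from a long video, assuming OpenCV.
import cv2
import numpy as np


def sample_frames(video_path: str, num_frames: int = 128) -> list[np.ndarray]:
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames, dtype=int)

    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if not ok:
            break
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames


# Hypothetical downstream usage: each sampled frame is encoded into visual
# tokens and concatenated with the text prompt before LLM decoding.
# frames = sample_frames("demo.mp4", num_frames=2000)
# visual_tokens = [visual_tokenize(f) for f in frames]   # placeholder
# answer = llm_generate(prompt, visual_tokens)            # placeholder
```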
Quick Start & Requirements
The lmms-lab/LongVA-7B-DPO model is available on Hugging Face, and evaluation is supported through lmms-eval for both image and video tasks.
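As a rough quick-start sketch, the released checkpoint can be fetched with the huggingface_hub client; the final loading call is a hypothetical placeholder, since the exact loader depends on the LongVA codebase.

```python
# Fetch the DPO-tuned 7B checkpoint referenced above (huggingface_hub client).
from huggingface_hub import snapshot_download

ckpt_dir = snapshot_download(repo_id="lmms-lab/LongVA-7B-DPO")
print(f"Checkpoint downloaded to: {ckpt_dir}")

# Hypothetical follow-up: load with the repository's own utilities, e.g.
# tokenizer, model, image_processor = load_longva(ckpt_dir)   # placeholder
```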
Highlighted Details
Maintenance & Community
The project is actively developed, with recent updates including the release of training code for vision-text alignment. Further details and community interaction can be found via the provided blog and Hugging Face links.
Licensing & Compatibility
The project is released under a permissive license, allowing for commercial use and integration with closed-source applications.
Limitations & Caveats
The codebase is tested on a specific setup (CUDA 11.8, A100-SXM-80G GPUs), and performance may vary on other configurations. Processing very long sequences requires significant GPU memory.
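As a rough illustration of why memory becomes a bottleneck, the sketch below estimates KV-cache size for the visual tokens alone; every number in it (tokens per frame, layer count, KV heads, head dimension, precision) is an illustrative assumption rather than a measured figure for LongVA.

```python
# Back-of-the-envelope estimate of KV-cache memory for long visual contexts.
# All parameter defaults below are illustrative assumptions.

def kv_cache_gib(num_frames: int,
                 tokens_per_frame: int = 144,   # assumed visual tokens per frame
                 num_layers: int = 28,          # assumed decoder layers
                 num_kv_heads: int = 4,         # assumed GQA key/value heads
                 head_dim: int = 128,           # assumed per-head dimension
                 bytes_per_value: int = 2) -> float:  # fp16/bf16
    """Estimate KV-cache size in GiB for the visual tokens alone."""
    tokens = num_frames * tokens_per_frame
    # Keys and values are cached for every layer and every KV head.
    total_bytes = tokens * num_layers * num_kv_heads * head_dim * 2 * bytes_per_value
    return total_bytes / (1024 ** 3)


for frames in (128, 1000, 2000):
    print(f"{frames:>5} frames -> ~{kv_cache_gib(frames):.1f} GiB of KV cache")
```

Under these assumptions the cache alone grows from roughly 1 GiB at 128 frames to over 15 GiB at 2,000 frames, before counting model weights and activations.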