Multimodal LLM for text-rich visual question answering (AAAI 2024 paper)
Ranked in the 97.5th percentile on SourcePulse
BLIVA is a multimodal large language model designed to excel at visual question answering, particularly for images containing significant amounts of text. It targets researchers and developers working with visual understanding tasks, offering improved performance on text-rich benchmarks.
How It Works
BLIVA builds on the BLIP-2/InstructBLIP architecture, pairing a frozen visual encoder with a language model through a Q-Former. Its key addition is a second visual stream: alongside the Q-Former's learned query embeddings, BLIVA projects the encoder's patch embeddings directly into the LLM's input space. The query embeddings supply instruction-aware visual features, while the raw patch embeddings preserve fine-grained detail such as text embedded in the image, which a fixed set of queries can otherwise lose. This combination lets BLIVA extract and interpret textual information within images more reliably than general-purpose multimodal models.
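The sketch below makes the dual-stream idea concrete in plain PyTorch: both visual streams are projected into the LLM's embedding space and concatenated before being prepended to the text tokens. It is an illustrative sketch, not BLIVA's actual code; the class name DualVisualProjector and the dimensions (1408 for the ViT, 768 for the Q-Former, 4096 for the LLM) are assumptions modeled on common BLIP-2-family configurations.

# Illustrative sketch of a BLIVA-style dual visual projection.
# All names and dimensions are assumptions, not the project's code.
import torch
import torch.nn as nn

class DualVisualProjector(nn.Module):
    def __init__(self, vit_dim=1408, qformer_dim=768, llm_dim=4096):
        super().__init__()
        # Projects the Q-Former's learned query outputs into the LLM space.
        self.query_proj = nn.Linear(qformer_dim, llm_dim)
        # Projects raw encoded patch embeddings directly into the LLM space,
        # preserving fine-grained detail (e.g., rendered text) that a fixed
        # set of queries may miss.
        self.patch_proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, query_embeds, patch_embeds):
        q = self.query_proj(query_embeds)   # (B, num_queries, llm_dim)
        p = self.patch_proj(patch_embeds)   # (B, num_patches, llm_dim)
        # Concatenate both streams; the result is prepended to the text
        # token embeddings fed to the LLM.
        return torch.cat([q, p], dim=1)

projector = DualVisualProjector()
queries = torch.randn(1, 32, 768)    # 32 learned queries, as in BLIP-2
patches = torch.randn(1, 257, 1408)  # ViT patch tokens incl. [CLS]
print(projector(queries, patches).shape)  # torch.Size([1, 289, 4096])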
Quick Start & Requirements
Clone the repository, then install it in editable mode:
pip install -e .
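Once installed, inference likely follows the LAVIS-style API common to BLIP-2-family repositories. The snippet below is a hypothetical sketch under that assumption: the bliva.models import path, the load_model_and_preprocess helper, and the "bliva_vicuna7b" model name are unverified guesses; check the repository README for the actual entry point.

# Hypothetical usage sketch; the import path, helper name, and model
# identifier below are assumptions (LAVIS-style), not a verified API.
import torch
from PIL import Image
from bliva.models import load_model_and_preprocess  # assumed entry point

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model, vis_processors, _ = load_model_and_preprocess(
    name="bliva_vicuna7b",  # assumed model name
    model_type="vicuna7b", is_eval=True, device=device,
)
# Preprocess an image and ask a text-rich question about it.
image = vis_processors["eval"](
    Image.open("sign.png").convert("RGB")
).unsqueeze(0).to(device)
print(model.generate({"image": image, "prompt": "What does the sign say?"}))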
Highlighted Details
Maintenance & Community
The project is associated with UC San Diego and Coinbase. Key updates are announced via GitHub releases.
Licensing & Compatibility
The Vicuna-based weights inherit LLaMA's license, which restricts commercial use.
Limitations & Caveats
While BLIVA is strong on text-rich VQA, its performance on general-purpose VQA benchmarks may vary.