Vision-language model research paper implementation
Top 24.1% on sourcepulse
Vary provides official code for scaling the vision vocabulary of Large Vision-Language Models (LVLMs), enabling enhanced perception capabilities. It targets researchers and developers working with multimodal AI, offering a method to improve LVLM performance on diverse visual tasks.
How It Works
Vary scales the vision vocabulary by training an additional vision encoder (a "vocabulary network") on dense perception data such as documents and charts, then merging its features with the original CLIP vision vocabulary of the LVLM. This expanded vocabulary improves the model's understanding and reasoning over fine-grained visual inputs such as document pages and chart images. The implementation builds on the LLaVA framework and uses Qwen as the base LLM, supporting both English and Chinese.
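A minimal sketch of the idea in PyTorch, under the assumption that visual tokens from the new encoder are projected and concatenated with the CLIP tokens before the LLM; the module and parameter names (NewVocabEncoder, clip_proj, etc.) are illustrative, not the repository's actual API:

```python
# Sketch of "scaling the vision vocabulary": features from a newly trained,
# high-resolution vision encoder are projected and concatenated with the
# original CLIP features, and the merged token sequence is fed to the LLM.
# Module names are illustrative only.
import torch
import torch.nn as nn

class ScaledVisionVocabulary(nn.Module):
    def __init__(self, clip_encoder: nn.Module, new_vocab_encoder: nn.Module,
                 clip_dim: int, new_dim: int, llm_dim: int):
        super().__init__()
        self.clip_encoder = clip_encoder            # original vision vocabulary
        self.new_vocab_encoder = new_vocab_encoder  # newly trained vocabulary network
        self.clip_proj = nn.Linear(clip_dim, llm_dim)
        self.new_proj = nn.Linear(new_dim, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Both encoders are assumed to return (batch, num_tokens, feature_dim).
        clip_tokens = self.clip_proj(self.clip_encoder(image))
        new_tokens = self.new_proj(self.new_vocab_encoder(image))
        # Concatenate both token streams; the LLM attends over the merged vocabulary.
        return torch.cat([clip_tokens, new_tokens], dim=1)
```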
Quick Start & Requirements
Install with pip install -e . within a Conda environment. Flash-Attention can be installed via pip install ninja and pip install flash-attn --no-build-isolation. Run the demo with python vary/demo/run_qwen_vary.py.
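The quick-start steps can also be scripted; the wrapper below is a sketch only, mirroring the commands above (it assumes it is run from a Conda environment inside a checkout of the repository, and the demo script's own command-line arguments should be taken from vary/demo/run_qwen_vary.py):

```python
# Convenience wrapper around the quick-start commands: editable install,
# optional Flash-Attention, then the Qwen-based demo script.
import subprocess
import sys

def setup_and_run_demo(repo_dir: str = ".") -> None:
    # Editable install of the Vary package.
    subprocess.run([sys.executable, "-m", "pip", "install", "-e", "."],
                   cwd=repo_dir, check=True)
    # Optional: Flash-Attention for faster attention kernels.
    subprocess.run([sys.executable, "-m", "pip", "install", "ninja"], check=True)
    subprocess.run([sys.executable, "-m", "pip", "install", "flash-attn",
                    "--no-build-isolation"], check=True)
    # Launch the demo; see the script itself for its expected arguments.
    subprocess.run([sys.executable, "vary/demo/run_qwen_vary.py"],
                   cwd=repo_dir, check=True)
```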
Highlighted Details
Maintenance & Community
The project is actively developed, with recent updates including releases for OCR models (GOT-OCR2.0) and chart parsing (OneChart). Contact is available via email for questions. The project acknowledges LLaVA and Qwen.
Licensing & Compatibility
The data, code, and checkpoints are licensed for research use only. Usage is restricted by the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.
Limitations & Caveats
Intermediate model weights are not open-sourced. The project's licensing restricts commercial use and integration into closed-source products.