Code for a research paper on vision-language models
Vary-toy provides an open-source implementation for a Small Language Model (SLM) enhanced with a reinforced vision vocabulary, targeting researchers and developers in the multimodal AI space. It aims to scale vision capabilities within SLMs, enabling advanced visual understanding tasks.
How It Works
Vary-toy integrates a vision encoder with a language model, creating a "reinforced vision vocabulary" that expands the SLM's ability to process and understand visual information. This approach allows for efficient scaling of visual features within smaller language models, offering a competitive alternative to larger, more resource-intensive models.
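The sketch below illustrates the general pattern: patch features from a vision encoder are projected into the language model's embedding space so that image patches behave like extra vocabulary tokens. This is an illustrative assumption about the architecture, not the project's actual classes; all module names and dimensions are placeholders.

# Minimal sketch (illustrative, not the project's actual code): vision-encoder
# patch features are projected to the SLM's hidden size so they can be
# concatenated with text embeddings as extra "vision vocabulary" tokens.
import torch
import torch.nn as nn

class VisionVocabularyAdapter(nn.Module):
    """Projects vision-encoder patch features to the SLM's hidden size."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, vision_dim)
        return self.proj(patch_feats)  # (batch, num_patches, lm_dim)

# Toy shapes: a ViT-style encoder with 1024-d patches feeding a 2048-d SLM.
adapter = VisionVocabularyAdapter(vision_dim=1024, lm_dim=2048)
patch_feats = torch.randn(1, 256, 1024)   # dummy encoder output
vision_tokens = adapter(patch_feats)      # pseudo-token embeddings
text_embeds = torch.randn(1, 32, 2048)    # dummy text token embeddings
# The combined sequence is what the language model attends over.
inputs_embeds = torch.cat([vision_tokens, text_embeds], dim=1)
print(inputs_embeds.shape)  # torch.Size([1, 288, 2048])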
Quick Start & Requirements
Install the package and the ninja build tool:
pip install -e .
pip install ninja
Run the demo:
python vary/demo/run_qwen_vary.py
Launch training with DeepSpeed:
deepspeed Vary/train/train_qwen_vary.py
Both the demo and training scripts take additional arguments as specified in the repository README.
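For programmatic use, a checkpoint in this family would typically load through the Hugging Face transformers trust_remote_code path, as in the minimal sketch below; the model path, prompt, and generation settings are placeholders (not a real repository id), and vary/demo/run_qwen_vary.py remains the supported entry point.

# Minimal sketch, assuming the checkpoint exposes a standard Hugging Face
# interface via trust_remote_code; path and prompt are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/vary-toy-checkpoint"  # placeholder, not a real repo id
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map="auto"
)

inputs = tokenizer("Describe the attached document image.", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))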
Maintenance & Community
The project is actively updated; the associated papers were recently accepted to ECCV 2024 and ACM MM 2024. An email address is provided for questions.
Licensing & Compatibility
The data, code, and checkpoints are licensed for research use only. Usage is restricted by the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.
Limitations & Caveats
The project is explicitly licensed for research purposes only, which may preclude commercial use. The README notes that users who previously built the original Vary repository should rebuild it for Vary-toy.