Vary-toy by Ucas-HaoranWei

Code for a research paper on vision-language models

created 1 year ago
622 stars

Top 53.9% on sourcepulse

Project Summary

Vary-toy provides an open-source implementation for a Small Language Model (SLM) enhanced with a reinforced vision vocabulary, targeting researchers and developers in the multimodal AI space. It aims to scale vision capabilities within SLMs, enabling advanced visual understanding tasks.

How It Works

Vary-toy integrates a vision encoder with a language model, creating a "reinforced vision vocabulary" that expands the SLM's ability to process and understand visual information. This approach allows for efficient scaling of visual features within smaller language models, offering a competitive alternative to larger, more resource-intensive models.
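The summary above does not spell out how visual features reach the language model. As a rough illustration only, the sketch below shows one common pattern in vision-language models: per-patch features from two vision sources are concatenated, projected to the language model's embedding width, and prepended to the text embeddings. Everything here is an assumption made for illustration (the two feature sources, the function names linear and build_input_sequence, and all dimensions); it is not taken from the Vary-toy code.

```python
import random


def linear(vectors, weight):
    """Multiply each row vector by a weight matrix of shape (in_dim, out_dim)."""
    return [
        [sum(v[i] * weight[i][j] for i in range(len(v))) for j in range(len(weight[0]))]
        for v in vectors
    ]


def build_input_sequence(feats_a, feats_b, text_embeds, weight):
    """Fuse two sets of per-patch image features and prepend them to text embeddings.

    Hypothetical helper: the fusion strategy is an assumption, not Vary-toy's code.
    """
    # Concatenate the two feature sets along the channel dimension, patch by patch.
    fused = [a + b for a, b in zip(feats_a, feats_b)]
    # Project the fused features to the language model's embedding width.
    image_tokens = linear(fused, weight)
    # The language model then sees image tokens followed by text tokens.
    return image_tokens + text_embeds


def rand_matrix(rows, cols):
    """Random stand-in for real features or weights."""
    return [[random.random() for _ in range(cols)] for _ in range(rows)]


random.seed(0)
# Toy sizes chosen for readability; real models use far larger dimensions.
num_patches, dim_a, dim_b, lm_dim, num_text = 4, 3, 3, 5, 2
sequence = build_input_sequence(
    rand_matrix(num_patches, dim_a),
    rand_matrix(num_patches, dim_b),
    rand_matrix(num_text, lm_dim),
    rand_matrix(dim_a + dim_b, lm_dim),
)
print(len(sequence), len(sequence[0]))  # 6 tokens, each of width 5
```

The point of the sketch is only the data flow: visual information becomes extra tokens in the language model's input sequence, so the cost of adding vision depends on the number of image tokens rather than on the size of the language model.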

Quick Start & Requirements

  • Install: Clone the repository, then install dependencies from the repository root with pip install -e . (editable mode).
  • Prerequisites: Python 3.10, PyTorch, DeepSpeed, Flash-Attention (requires ninja).
  • Weights: Requires downloading Vary-toy weights and CLIP-VIT-L weights.
  • Demo: Run python vary/demo/run_qwen_vary.py.
  • Training: Launch deepspeed Vary/train/train_qwen_vary.py with the training arguments specified in the README.
  • Resources: The README claims that a single 1080Ti GPU is sufficient for all features.
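Before installing, it can help to confirm that the listed prerequisites are present. The following is a minimal sketch, not part of the repository: the import names torch, deepspeed, and flash_attn are assumed to correspond to the PyTorch, DeepSpeed, and Flash-Attention packages named above, and check_environment is a hypothetical helper.

```python
import importlib.util
import shutil
import sys

# Assumed import names for the prerequisites listed above.
REQUIRED_MODULES = {
    "PyTorch": "torch",
    "DeepSpeed": "deepspeed",
    "Flash-Attention": "flash_attn",
}


def check_environment(required=REQUIRED_MODULES, python_version=(3, 10)):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    found = tuple(sys.version_info[:2])
    if found != python_version:
        problems.append(
            "Python %d.%d expected, found %d.%d" % (python_version + found)
        )
    for label, module in required.items():
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(module) is None:
            problems.append(f"{label} (module '{module}') is not installed")
    # ninja may be available as a system binary or as a pip-installed package.
    if shutil.which("ninja") is None and importlib.util.find_spec("ninja") is None:
        problems.append("ninja not found (needed to build Flash-Attention)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("MISSING:", problem)
```

The script only checks that packages can be located; it does not verify versions, CUDA availability, or that the Vary-toy and CLIP-VIT-L weights have been downloaded.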

Highlighted Details

  • Accepted by ECCV2024.
  • Released a LAVIS codebase and Vary-600k dataset for training from scratch.
  • Supports both English and Chinese chart parsing (OneChart).
  • Developed GOT-OCR2.0, a comprehensive OCR model.

Maintenance & Community

The project is actively updated, with recent acceptances at ECCV2024 and ACM MM 2024. An email address is provided for questions.

Licensing & Compatibility

The data, code, and checkpoints are licensed for research use only. Usage is restricted by the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.

Limitations & Caveats

The research-only license may restrict commercial applications. The README also notes that users who have already built the original Vary should rebuild the repository.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

5 stars in the last 90 days
