Vary by Ucas-HaoranWei

Vision-language model research paper implementation

created 1 year ago
1,837 stars

Top 24.1% on sourcepulse

Project Summary

Vary provides official code for scaling the vision vocabulary of Large Vision-Language Models (LVLMs), enabling enhanced perception capabilities. It targets researchers and developers working with multimodal AI, offering a method to improve LVLM performance on diverse visual tasks.

How It Works

Vary expands an LVLM's "vision vocabulary": a new vision encoder is trained on dense-perception data such as documents and charts, and its features are fused with those of the original CLIP vocabulary before being handed to the LLM. This widens the range of visual inputs the model can parse accurately, particularly text-dense content such as document pages and chart structure. The implementation builds on the LLaVA framework and uses Qwen as the base LLM, supporting both English and Chinese.
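
As a rough sketch of this fusion idea (not the repository's actual code; class names, token counts, and dimensions below are illustrative assumptions), the new vocabulary's tokens can be concatenated channel-wise with CLIP's and projected into the LLM embedding space:

```python
# Illustrative sketch of Vary-style vocabulary fusion; all names and
# dimensions are assumptions, not symbols from the official repository.
import torch
import torch.nn as nn

class VaryStyleVisionTower(nn.Module):
    """Fuses a frozen CLIP vocabulary with a newly trained vision vocabulary."""

    def __init__(self, clip_encoder: nn.Module, new_encoder: nn.Module,
                 clip_dim: int = 1024, new_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.clip_encoder = clip_encoder  # original vocabulary (e.g. CLIP-ViT-L)
        self.new_encoder = new_encoder    # new vocabulary trained on documents/charts
        # Project the concatenated visual tokens into the LLM embedding space.
        self.projector = nn.Linear(clip_dim + new_dim, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        clip_tokens = self.clip_encoder(image)  # (B, N, clip_dim)
        new_tokens = self.new_encoder(image)    # (B, N, new_dim)
        fused = torch.cat([clip_tokens, new_tokens], dim=-1)
        return self.projector(fused)            # visual tokens for the LLM (e.g. Qwen)

if __name__ == "__main__":
    # Stand-in encoders that emit dummy token grids, just to exercise the fusion.
    class DummyEncoder(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.dim = dim

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return torch.randn(x.shape[0], 256, self.dim)

    tower = VaryStyleVisionTower(DummyEncoder(1024), DummyEncoder(1024))
    print(tower(torch.randn(2, 3, 1024, 1024)).shape)  # torch.Size([2, 256, 4096])
```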

Quick Start & Requirements

  • Install: Clone the repository and install dependencies with pip install -e . inside a Conda environment. Flash-Attention can be installed via pip install ninja followed by pip install flash-attn --no-build-isolation.
  • Prerequisites: Python 3.10, Conda, DeepSpeed, and the relevant model weights (e.g., CLIP-ViT-L for Vary-toy) are required.
  • Demo: Run inference with python vary/demo/run_qwen_vary.py; a hedged end-to-end sketch follows this list.
  • Training: Training scripts for Vary-base and Vary-tiny are provided, utilizing DeepSpeed for distributed training.
  • Resources: Training requires significant computational resources, as indicated by the DeepSpeed configuration.
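
Putting the steps above together, a minimal quick-start sketch might look like the following. The demo flags (--model-name, --image-file) and all paths are assumptions modeled on LLaVA-style demo scripts, so verify them against the script's own argument parser.

```python
"""Hedged quick-start sketch: editable install, optional Flash-Attention,
then the demo. Paths and demo flags are assumptions for illustration."""
import subprocess

def run(cmd: list[str]) -> None:
    """Echo and execute a command, failing fast on a non-zero exit."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Editable install from the repository root (note the '-e' flag).
run(["pip", "install", "-e", "."])

# Optional: Flash-Attention, as suggested in the install notes.
run(["pip", "install", "ninja"])
run(["pip", "install", "flash-attn", "--no-build-isolation"])

# Run the documented demo entry point; the flags below are assumed.
run([
    "python", "vary/demo/run_qwen_vary.py",
    "--model-name", "/path/to/vary-weights",
    "--image-file", "/path/to/image.png",
])
```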

Highlighted Details

  • Official ECCV 2024 code release.
  • Supports both English and Chinese languages.
  • Includes training code for custom datasets.
  • Offers a "Vary-tiny" version for smaller-scale experimentation.

Maintenance & Community

The project is actively developed, with recent updates including releases for OCR models (GOT-OCR2.0) and chart parsing (OneChart). Contact is available via email for questions. The project acknowledges LLaVA and Qwen.

Licensing & Compatibility

The data, code, and checkpoints are licensed for research use only. Usage is restricted by the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.

Limitations & Caveats

Intermediate model weights are not open-sourced. The project's licensing restricts commercial use and integration into closed-source products.

Health Check

  • Last commit: 7 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

17 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

Top 0.1% on sourcepulse · 4k stars · created 2 years ago · updated 11 months ago