Vision-language model research paper implementation
Top 24.1% on sourcepulse
Vary provides official code for scaling the vision vocabulary of Large Vision-Language Models (LVLMs), enabling enhanced perception capabilities. It targets researchers and developers working with multimodal AI, offering a method to improve LVLM performance on diverse visual tasks.
How It Works
Vary scales the vision vocabulary by training an additional vision encoder (a "vocabulary network") on dense perception data such as documents and charts, then merging its features with the original CLIP vision vocabulary of the LVLM. This expanded vocabulary improves the model's understanding and reasoning over fine-grained visual inputs such as document pages and chart images. The implementation builds on the LLaVA framework and uses Qwen as the base LLM, supporting both English and Chinese.
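A minimal sketch of the idea in PyTorch, under the assumption that visual tokens from the new encoder are projected and concatenated with the CLIP tokens before the LLM; the module and parameter names (NewVocabEncoder, clip_proj, etc.) are illustrative, not the repository's actual API:

```python
# Sketch of "scaling the vision vocabulary": features from a newly trained,
# high-resolution vision encoder are projected and concatenated with the
# original CLIP features, and the merged token sequence is fed to the LLM.
# Module names are illustrative only.
import torch
import torch.nn as nn

class ScaledVisionVocabulary(nn.Module):
    def __init__(self, clip_encoder: nn.Module, new_vocab_encoder: nn.Module,
                 clip_dim: int, new_dim: int, llm_dim: int):
        super().__init__()
        self.clip_encoder = clip_encoder            # original vision vocabulary
        self.new_vocab_encoder = new_vocab_encoder  # newly trained vocabulary network
        self.clip_proj = nn.Linear(clip_dim, llm_dim)
        self.new_proj = nn.Linear(new_dim, llm_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Both encoders are assumed to return (batch, num_tokens, feature_dim).
        clip_tokens = self.clip_proj(self.clip_encoder(image))
        new_tokens = self.new_proj(self.new_vocab_encoder(image))
        # Concatenate both token streams; the LLM attends over the merged vocabulary.
        return torch.cat([clip_tokens, new_tokens], dim=1)
```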
Quick Start & Requirements
Install with pip install -e . within a Conda environment. Flash-Attention can be installed via pip install ninja and pip install flash-attn --no-build-isolation. Run the demo with python vary/demo/run_qwen_vary.py.
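The quick-start steps can also be scripted; the wrapper below is a sketch only, mirroring the commands above (it assumes it is run from a Conda environment inside a checkout of the repository, and the demo script's own command-line arguments should be taken from vary/demo/run_qwen_vary.py):

```python
# Convenience wrapper around the quick-start commands: editable install,
# optional Flash-Attention, then the Qwen-based demo script.
import subprocess
import sys

def setup_and_run_demo(repo_dir: str = ".") -> None:
    # Editable install of the Vary package.
    subprocess.run([sys.executable, "-m", "pip", "install", "-e", "."],
                   cwd=repo_dir, check=True)
    # Optional: Flash-Attention for faster attention kernels.
    subprocess.run([sys.executable, "-m", "pip", "install", "ninja"], check=True)
    subprocess.run([sys.executable, "-m", "pip", "install", "flash-attn",
                    "--no-build-isolation"], check=True)
    # Launch the demo; see the script itself for its expected arguments.
    subprocess.run([sys.executable, "vary/demo/run_qwen_vary.py"],
                   cwd=repo_dir, check=True)
```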
Highlighted Details
Maintenance & Community
The project is actively developed, with recent updates including releases for OCR models (GOT-OCR2.0) and chart parsing (OneChart). Contact is available via email for questions. The project acknowledges LLaVA and Qwen.
Licensing & Compatibility
The data, code, and checkpoints are licensed for research use only. Usage is restricted by the license agreements of LLaMA, Vicuna, GPT-4, Qwen, and LLaVA.
Limitations & Caveats
Intermediate model weights are not open-sourced. The project's licensing restricts commercial use and integration into closed-source products.