ALLaVA by FreedomIntelligence

Vision-language model research using GPT4V-synthesized data

created 1 year ago
267 stars

Top 96.7% on sourcepulse

Project Summary

ALLaVA provides a large-scale dataset (1.4M samples) and pre-trained models for training Lite Vision-Language Models (LVLMs). It addresses the scarcity of high-quality, diverse training data by synthesizing it with GPT-4V, enabling the development of more capable and efficient multimodal AI systems. The project targets researchers and developers working on vision-language tasks.

How It Works

ALLaVA leverages GPT-4V to generate captions and complex reasoning question-answer pairs from image datasets like LAION and Vision-FLAN. This approach aims to create a rich dataset that captures nuanced visual understanding and reasoning capabilities, which are crucial for advanced LVLMs. The synthesized data is structured into distinct subsets for captioning and instruction-following tasks.
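As a rough illustration, one record from each subset might look like the following. The field names, IDs, and conversation format are assumptions modeled on LLaVA-style data, not the dataset's confirmed schema; check the released JSON files for the actual layout.

```python
# Hypothetical shape of one record per subset (illustrative only).
caption_record = {
    "id": "allava_laion_cap_000001",   # hypothetical ID
    "image": "images/000001.jpg",      # source image (e.g., from LAION)
    "caption": "A fine-grained, GPT-4V-written description of the image ...",
}

instruction_record = {
    "id": "allava_laion_inst_000001",  # hypothetical ID
    "image": "images/000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhy might the scene appear ...?"},
        {"from": "gpt", "value": "A step-by-step reasoning answer ..."},
    ],
}
```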

Quick Start & Requirements

  • Dataset Loading: Models and datasets can be loaded from HuggingFace via the .from_pretrained() method (see the sketch after this list).
  • Data Preparation: Scripts are provided for downloading the LAION, Vision-FLAN, and Evol-Instruct-GPT4-Turbo data.
  • Inference: Example scripts are available for inference.
  • Training: Code is based on LLaVA and requires a standard deep-learning environment (Python, PyTorch).
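A minimal loading sketch, assuming the checkpoints expose a causal-LM interface through trust_remote_code and that the repository and subset IDs below match the hub listings; both are assumptions to verify on the FreedomIntelligence HuggingFace page.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

# Checkpoint ID assumed from the models listed under Highlighted Details;
# confirm the exact repository name on the HuggingFace hub.
model_id = "FreedomIntelligence/ALLaVA-Phi3-mini-128k"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # custom LLaVA-style modeling code ships with the repo
    device_map="auto",       # requires the `accelerate` package
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Dataset and subset names are likewise assumptions; see the dataset card.
captions = load_dataset(
    "FreedomIntelligence/ALLaVA-4V", "allava_laion_caption", split="train"
)
```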

Highlighted Details

  • 1.4M GPT-4V-synthesized data samples covering captioning and instruction-following.
  • Multiple pre-trained models released, including ALLaVA-Phi3-mini-128k, ALLaVA-StableLM2-1.6B, and ALLaVA-Phi2-2.7B.
  • Competitive benchmark results reported across 17 diverse vision-language tasks, outperforming other 4B-scale LVLMs on several metrics.
  • Dataset includes image files and structured JSON data for easy integration.

Maintenance & Community

The project is led by Guiming Hardy Chen, with contributors from The Chinese University of Hong Kong, Shenzhen, and the Shenzhen Research Institute of Big Data. A public arXiv paper and HuggingFace repositories for the models and datasets accompany the release.

Licensing & Compatibility

The project's license is not explicitly stated in the README. However, the use of LLaVA as a base and the availability of models on HuggingFace suggest a permissive open-source license, likely compatible with commercial use. Users should verify the specific license for each component.

Limitations & Caveats

The project states it does not own the rights to the images included in the images.zip file, which are provided to facilitate data preparation. Users should be mindful of potential image usage restrictions.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 7 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

Open-source framework for training large multimodal models

  • Top 0.1% on sourcepulse
  • 4k stars
  • created 2 years ago
  • updated 11 months ago