Vision-language model research using GPT4V-synthesized data
ALLaVA provides a large-scale dataset (1.4M samples) and pre-trained models for training Lite Vision-Language Models (LVLMs). It addresses the need for high-quality, diverse data synthesized using GPT-4V, enabling the development of more capable and efficient multimodal AI systems. The project targets researchers and developers working on vision-language tasks.
How It Works
ALLaVA leverages GPT-4V to generate captions and complex reasoning question-answer pairs from image datasets like LAION and Vision-FLAN. This approach aims to create a rich dataset that captures nuanced visual understanding and reasoning capabilities, which are crucial for advanced LVLMs. The synthesized data is structured into distinct subsets for captioning and instruction-following tasks.
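A minimal sketch of how the synthesized subsets can be inspected with the HuggingFace `datasets` library; the repository id `FreedomIntelligence/ALLaVA-4V` and the config names `allava_laion` and `allava_vflan` are assumptions based on the project's HuggingFace hosting and should be checked against the dataset card.

```python
# Sketch: browsing the GPT-4V-synthesized subsets with HuggingFace `datasets`.
# Repository and config names are assumptions; verify them on the dataset card.
from datasets import load_dataset

laion = load_dataset("FreedomIntelligence/ALLaVA-4V", "allava_laion")   # LAION-derived data (assumed config)
vflan = load_dataset("FreedomIntelligence/ALLaVA-4V", "allava_vflan")   # Vision-FLAN-derived data (assumed config)

print(laion)  # prints the available splits (captioning vs. instruction-following)
print(vflan)
```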
Quick Start & Requirements
Pre-trained models and the dataset are hosted on HuggingFace; checkpoints can be loaded with the .from_pretrained() method.
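A minimal loading sketch using HuggingFace Transformers; the repository id `FreedomIntelligence/ALLaVA-3B-Longer` is an assumption used for illustration and should be replaced with the checkpoint named on the project's model cards.

```python
# Sketch: loading an ALLaVA checkpoint with HuggingFace Transformers.
# The repository id is an assumption; substitute the id from the model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FreedomIntelligence/ALLaVA-3B-Longer"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # half precision to fit on a single GPU
    trust_remote_code=True,      # the repo ships custom multimodal model code
    device_map="auto",
)
model.eval()
```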
Highlighted Details
Maintenance & Community
The project is led by Guiming Hardy Chen and involves contributors from The Chinese University of Hong Kong, Shenzhen, and the Shenzhen Research Institute of Big Data. It has a public arXiv paper and HuggingFace repositories for the models and datasets.
Licensing & Compatibility
The project's license is not explicitly stated in the README. However, the use of LLaVA as a base and the availability of models on HuggingFace suggest a permissive open-source license, likely compatible with commercial use. Users should verify the specific license for each component.
Limitations & Caveats
The project states it does not own the rights to the images included in the images.zip file, which is provided only to facilitate data preparation. Users should be mindful of potential image usage restrictions.