ShareGPT4V by ShareGPT4Omni

Improving large multi-modal models with enhanced image-text understanding

Created 1 year ago
253 stars

Top 99.3% on SourcePulse

Summary

ShareGPT4V addresses the challenge of improving Large Multi-modal Models (LMMs) by introducing a large-scale, highly descriptive image-text dataset and a novel training methodology. Aimed at AI researchers and engineers, the project delivers a superior LMM and a capable general-purpose image captioner with enhanced visual understanding.

How It Works

The project leverages the ShareGPT4V dataset, comprising over 1.2 million high-quality captions, including 100K generated by GPT-4-Vision. Training proceeds in two stages: feature alignment, which aligns the visual and textual modalities, followed by visual instruction tuning, which equips the model with multimodal instruction-following capabilities. This recipe yields a superior LMM, ShareGPT4V-7B, and a general image captioner, ShareCaptioner, that approaches GPT-4-Vision's captioning performance.
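The two-stage recipe can be sketched as a freezing schedule over the model's components. This is a minimal illustration assuming a LLaVA-style architecture (vision encoder → projector → language model); the component names and the exact unfreezing schedule used by ShareGPT4V may differ.

```python
# Illustrative sketch of a two-stage LMM training schedule, assuming a
# LLaVA-style architecture. Names are hypothetical, not the repo's API.

MODULES = ("vision_encoder", "projector", "language_model")

def trainable_modules(stage: str) -> set[str]:
    """Return which components receive gradient updates in each stage."""
    schedule = {
        # Stage 1: feature alignment -- only the projector learns to map
        # visual features into the language model's embedding space.
        "feature_alignment": {"projector"},
        # Stage 2: visual instruction tuning -- projector and LLM are
        # trained jointly on multimodal instruction-following data.
        "instruction_tuning": {"projector", "language_model"},
    }
    return schedule[stage]
```

The design choice behind the split: stage 1 cheaply aligns modalities without disturbing the pretrained LLM, while stage 2 spends the expensive compute on instruction-following behavior.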

Quick Start & Requirements

Installation requires cloning the repository, setting up a Conda environment with Python 3.10, and installing the project with its training dependencies (pip install -e . and pip install -e ".[train]"). GPU acceleration with CUDA is a prerequisite, as the reference training setup uses A100 GPUs. Pre-trained models and demos are available on HuggingFace and OpenXLab.
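The steps above amount to a short setup script. The repository URL and environment name below are assumptions inferred from the project and author names, and may differ from the actual repo.

```shell
# Setup sketch based on the Quick Start description; URL and env name assumed.
git clone https://github.com/ShareGPT4Omni/ShareGPT4V.git
cd ShareGPT4V

conda create -n share4v python=3.10 -y
conda activate share4v

pip install -e .            # base package
pip install -e ".[train]"   # extra training-time dependencies
```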

Highlighted Details

  • ShareGPT4V Dataset: A large-scale, highly descriptive image-text dataset featuring 1.2M high-quality captions.
  • ShareCaptioner: A general image captioner achieving capabilities close to GPT-4-Vision.
  • ShareGPT4V-7B Model: A superior LMM demonstrating strong performance across various benchmarks like LLaVA-Bench-Wild, MME, and MMBench.
  • ECCV 2024 Publication: The work is accepted for presentation at ECCV 2024.

Maintenance & Community

The project is associated with authors from the University of Science and Technology of China and Shanghai AI Laboratory. While specific community channels like Discord/Slack are not listed, demos and model checkpoints are available on HuggingFace and OpenXLab, facilitating interaction and adoption.

Licensing & Compatibility

The data and checkpoints are licensed strictly for research purposes only. Usage is further restricted by the licenses of base models like LLaMA and Vicuna. The ShareGPT4V dataset itself is licensed under CC BY-NC 4.0, permitting only non-commercial use. Consequently, models trained on this dataset are also restricted to research applications and are not compatible with commercial use.

Limitations & Caveats

The primary limitation is the strict non-commercial, research-only usage restriction imposed by the dataset's CC BY-NC 4.0 license and the underlying base model licenses. This prohibits any deployment or use in commercial products or services.

Health Check
Last Commit

1 year ago

Responsiveness

Inactive

Pull Requests (30d)
0
Issues (30d)
0
Star History
2 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Travis Fischer (Founder of Agentic), and 5 more.

fromage by kohjingyu

485 stars
Multimodal model for grounding language models to images
Created 3 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems") and Omar Sanseviero (DevRel at Google DeepMind).

gill by kohjingyu

471 stars
Multimodal LLM for generating/retrieving images and generating text
Created 2 years ago
Updated 2 years ago
Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Yaowei Zheng (Author of LLaMA-Factory), and 1 more.

CLIP_prefix_caption by rmokady

1k stars
Image captioning model using CLIP embeddings as a prefix
Created 4 years ago
Updated 1 year ago