ShareGPT4Omni
Improving large multi-modal models with enhanced image-text understanding
Summary
ShareGPT4V aims to improve Large Multi-modal Models (LMMs) by introducing a large-scale, highly descriptive image-text dataset and a training methodology built around it. Aimed at AI researchers and engineers, the project provides an LMM and a general-purpose image captioner, both intended to strengthen visual understanding.
How It Works
The project builds on the ShareGPT4V dataset, which contains over 1.2 million high-quality captions, 100K of them generated by GPT-4-Vision. Training runs in two stages: feature alignment, which aligns the visual and textual modalities, followed by visual instruction tuning, which teaches the model to follow multimodal instructions. The result is an LMM, ShareGPT4V-7B, and a general image captioner, ShareCaptioner, that approaches GPT-4-Vision's performance.
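The ordering of the two stages can be sketched in a few lines of Python. This is only an illustration of the sequencing described above, not the project's training code: the Stage class, run_pipeline, and fake_train are hypothetical names, and the "training" function is a stand-in that records which checkpoint each stage started from.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Stage:
    name: str  # stage label taken from the summary above
    goal: str  # what the stage is meant to achieve


# The two stages named in the summary, in the order they run.
STAGES: List[Stage] = [
    Stage("feature_alignment", "align visual and textual modalities"),
    Stage("visual_instruction_tuning", "teach multimodal instruction following"),
]

# Hypothetical signature: (stage, checkpoint_in) -> checkpoint_out.
TrainFn = Callable[[Stage, Optional[str]], str]


def run_pipeline(stages: List[Stage], train_fn: TrainFn) -> List[str]:
    """Run stages in order, feeding each stage the previous stage's checkpoint."""
    checkpoints: List[str] = []
    previous: Optional[str] = None  # the first stage starts from the base weights
    for stage in stages:
        previous = train_fn(stage, previous)
        checkpoints.append(previous)
    return checkpoints


def fake_train(stage: Stage, checkpoint_in: Optional[str]) -> str:
    """Stand-in for real training: only records what it was asked to do."""
    start = checkpoint_in or "base"
    return f"{start}->{stage.name}"


if __name__ == "__main__":
    for checkpoint in run_pipeline(STAGES, fake_train):
        print(checkpoint)
```

The point of the sketch is the dependency: visual instruction tuning starts from the checkpoint produced by feature alignment rather than from the base weights.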
Quick Start & Requirements
Installation involves cloning the repository, creating a Conda environment with Python 3.10, and installing the project along with its training dependencies (pip install -e . followed by pip install -e ".[train]"). Training was run on A100 GPUs, so a CUDA-capable GPU should be treated as a prerequisite. Pre-trained models and demos are available on HuggingFace and OpenXLab.
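Before installing, it can help to confirm that the machine meets the stated requirements. The following is a minimal preflight sketch and not part of the project: python_matches and preflight are hypothetical helpers, and finding nvidia-smi on the PATH is used only as a rough hint that an NVIDIA driver is present, not as proof that CUDA works.

```python
import shutil
import sys
from typing import Dict, Tuple


def python_matches(required: Tuple[int, int] = (3, 10),
                   actual: Tuple[int, int] = tuple(sys.version_info[:2])) -> bool:
    """Return True when the interpreter's major.minor version equals `required`."""
    return actual == required


def preflight() -> Dict[str, bool]:
    """Collect simple yes/no checks for the requirements listed above."""
    return {
        "python_3_10": python_matches(),
        # Heuristic only: nvidia-smi on PATH suggests an NVIDIA driver is installed.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        # Conda is needed to create the environment described above.
        "conda_found": shutil.which("conda") is not None,
    }


if __name__ == "__main__":
    for check, ok in preflight().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```

Run it inside the environment you plan to install into; a "missing" result for python_3_10 usually means the Conda environment has not been activated yet.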
Highlighted Details
Maintenance & Community
The project's authors are affiliated with the University of Science and Technology of China and Shanghai AI Laboratory. No community channels such as Discord or Slack are listed; demos and model checkpoints are available on HuggingFace and OpenXLab.
Licensing & Compatibility
The data and checkpoints are licensed for research use only. Usage is further restricted by the licenses of the base models, such as LLaMA and Vicuna. The ShareGPT4V dataset itself is licensed under CC BY-NC 4.0, which permits only non-commercial use. Consequently, models trained on this dataset are also restricted to research applications and cannot be used commercially.
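The reasoning above amounts to a simple rule: a derived model is commercially usable only if every component it depends on permits commercial use. A minimal sketch of that rule follows, with hypothetical names (COMPONENTS, commercial_use_allowed, blockers). The base-model entry is marked None because this summary does not spell out those terms, and unknown terms are treated as blocking. This illustrates the logic only and is not legal advice.

```python
from typing import Dict, List, Optional

# True = commercial use permitted, False = not permitted, None = not stated here.
COMPONENTS: Dict[str, Optional[bool]] = {
    "ShareGPT4V dataset (CC BY-NC 4.0)": False,   # non-commercial per the summary
    "data and checkpoints (research only)": False,
    "base model licenses (LLaMA, Vicuna)": None,  # check the upstream terms
}


def commercial_use_allowed(components: Dict[str, Optional[bool]]) -> bool:
    """Allowed only if there is at least one component and all explicitly permit it."""
    return bool(components) and all(flag is True for flag in components.values())


def blockers(components: Dict[str, Optional[bool]]) -> List[str]:
    """List the components that forbid commercial use or whose terms are unknown."""
    return [name for name, flag in components.items() if flag is not True]


if __name__ == "__main__":
    print("commercial use allowed:", commercial_use_allowed(COMPONENTS))
    for name in blockers(COMPONENTS):
        print("blocked by:", name)
```

Treating unknown terms as blocking is a deliberately conservative design choice: a single restrictive or unverified license is enough to rule out commercial deployment.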
Limitations & Caveats
The primary limitation is the strict non-commercial, research-only usage restriction imposed by the dataset's CC BY-NC 4.0 license and the underlying base model licenses. This prohibits any deployment or use in commercial products or services.
1 year ago
Inactive