VisCPM by OpenBMB

Multimodal model for visual-language tasks in both Chinese and English

Created 2 years ago
1,067 stars

Top 35.5% on SourcePulse

Project Summary

VisCPM is an open-source family of bilingual (Chinese and English) large multimodal models, offering both conversational (VisCPM-Chat) and text-to-image generation (VisCPM-Paint) capabilities. It targets researchers and developers who need strong bilingual multimodal performance, and it is built on the 10B-parameter CPM-Bee LLM.

How It Works

VisCPM integrates a visual encoder (Muffin) and a visual decoder (Diffusion-UNet) with the CPM-Bee LLM. VisCPM-Chat is trained on English multimodal data and then fine-tuned on English and translated Chinese instruction data, which gives it strong cross-lingual generalization. VisCPM-Paint uses CPM-Bee as the text encoder and a UNet as the image decoder. The UNet is initialized with Stable Diffusion 2.1 parameters and trained on English data, yet the model also performs well on Chinese prompts.
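
The two directions described above can be sketched as simple compositions. This is a toy illustration of the wiring only, not the project's code: every class and function name below is hypothetical, the "encoders" and "decoders" are stubs, and how the real model feeds visual features into CPM-Bee is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]  # toy stand-in for an embedding vector


@dataclass
class ChatPipeline:
    """Image + question -> answer (the VisCPM-Chat direction)."""
    visual_encoder: Callable[[bytes], List[Vector]]  # role played by Muffin
    llm: Callable[[List[Vector], str], str]          # role played by CPM-Bee

    def chat(self, image: bytes, question: str) -> str:
        visual_features = self.visual_encoder(image)
        return self.llm(visual_features, question)


@dataclass
class PaintPipeline:
    """Prompt -> image (the VisCPM-Paint direction)."""
    text_encoder: Callable[[str], List[Vector]]     # role played by CPM-Bee
    image_decoder: Callable[[List[Vector]], bytes]  # role played by the UNet

    def generate(self, prompt: str) -> bytes:
        text_features = self.text_encoder(prompt)
        return self.image_decoder(text_features)


# Hypothetical stubs so the wiring can be exercised without model weights.
def toy_visual_encoder(image: bytes) -> List[Vector]:
    return [[float(b)] for b in image]


def toy_llm(features: List[Vector], question: str) -> str:
    return f"{len(features)} visual features; question: {question}"


def toy_text_encoder(prompt: str) -> List[Vector]:
    return [[float(len(word))] for word in prompt.split()]


def toy_image_decoder(features: List[Vector]) -> bytes:
    return bytes(int(f[0]) for f in features)


chat = ChatPipeline(toy_visual_encoder, toy_llm)
paint = PaintPipeline(toy_text_encoder, toy_image_decoder)
```

The point of the sketch is that CPM-Bee sits on the output side in one direction (generating text from visual features) and on the input side in the other (encoding the prompt for the image decoder).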

Quick Start & Requirements

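  • Install: Clone the repository, create a conda environment (python=3.10), and install dependencies (pip install "torch>=1.10", then pip install -r requirements.txt). Quote the torch requirement so the shell does not treat >= as an output redirect.
  • Prerequisites: Python 3.10 and PyTorch 1.10 or later.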
  • Resources: Low-resource inference is supported through BMInf, which brings the requirement down to as little as 5 GB of VRAM for VisCPM-Chat and 17 GB for VisCPM-Paint.
  • Demos & Docs: Online demos and API usage guides are available.
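
The install steps and resource figures above can be captured in a small helper script. This is a sketch under assumptions: the environment name `viscpm`, the helper names, and the repository URL (inferred from the project and owner names) are illustrative and should be checked against the project's README. The VRAM figures are the low-resource numbers quoted in the list.

```python
import shlex
from typing import Dict, List

# Install steps as argument lists. The env name "viscpm" is an assumption.
INSTALL_STEPS: List[List[str]] = [
    ["git", "clone", "https://github.com/OpenBMB/VisCPM.git"],
    ["conda", "create", "-n", "viscpm", "python=3.10", "-y"],
    ["conda", "activate", "viscpm"],
    ["pip", "install", "torch>=1.10"],
    ["pip", "install", "-r", "requirements.txt"],
]

# Low-resource (BMInf) VRAM figures quoted in the summary, in GB.
LOW_RESOURCE_VRAM_GB: Dict[str, int] = {
    "VisCPM-Chat": 5,
    "VisCPM-Paint": 17,
}


def shell_lines(steps: List[List[str]] = INSTALL_STEPS) -> List[str]:
    """Render each step as a shell-safe command line.

    shlex.join quotes any argument containing shell metacharacters,
    so "torch>=1.10" is emitted with quotes and ">" is not a redirect.
    """
    return [shlex.join(step) for step in steps]


def fits_low_resource(model: str, available_vram_gb: float) -> bool:
    """Return True if the GPU meets the quoted low-resource figure."""
    return available_vram_gb >= LOW_RESOURCE_VRAM_GB[model]


if __name__ == "__main__":
    for line in shell_lines():
        print(line)
    for name in LOW_RESOURCE_VRAM_GB:
        print(name, "fits in 8 GB:", fits_low_resource(name, 8))
```

Rendering the commands through `shlex.join` rather than typing them by hand makes the quoting of the version specifier explicit, which is the one step in the list that is easy to get wrong in a shell.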

Highlighted Details

  • Reports state-of-the-art results among Chinese open-source multimodal models.
  • Supports both multimodal dialogue (image-to-text) and text-to-image generation.
  • Demonstrates strong bilingual capabilities, with models trained primarily on English data generalizing well to Chinese.
  • Offers fine-tuning scripts for adapting models to specific use cases.

Maintenance & Community

The project's update log lists follow-up releases including MiniCPM-V 2.0 and OmniLMM, although the repository itself shows no commits in the past year (see Health Check). The VisCPM paper was accepted as a spotlight at ICLR 2024. Community support channels are not explicitly mentioned, but Hugging Face integration is provided.

Licensing & Compatibility

VisCPM models are licensed under a "General Model License Agreement - Source Attribution - Publicity Restriction - Non-Commercial", which allows personal and research use. Commercial use requires contacting cpm@modelbest.cn for licensing. The CPM-Bee base model is available under a commercial license with a similar contact requirement.

Limitations & Caveats

The safety modules are not perfect and may have false positives or negatives. Fine-tuning code is currently tested only on Linux. The project roadmap indicates planned support for model quantization.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 30 days
