VisCPM by OpenBMB

Multimodal models for vision-language tasks in both Chinese and English

created 2 years ago
1,063 stars

Top 36.1% on sourcepulse

Project Summary

VisCPM is an open-source family of Chinese and English multimodal large models, offering both conversational (VisCPM-Chat) and text-to-image generation (VisCPM-Paint) capabilities. It targets researchers and developers seeking state-of-the-art bilingual multimodal performance, leveraging the 10B parameter CPM-Bee LLM.

How It Works

VisCPM integrates a visual encoder (Muffin) and a visual decoder (Diffusion-UNet) with the CPM-Bee LLM. VisCPM-Chat is pretrained on English multimodal data and fine-tuned on English and machine-translated Chinese instruction data, which yields strong cross-lingual generalization. VisCPM-Paint uses CPM-Bee as the text encoder and a UNet, initialized from Stable Diffusion 2.1 parameters, as the image decoder; although trained on English data, it also performs well in Chinese.

Quick Start & Requirements

  • Install: Clone the repository, create a conda environment (Python 3.10), and install dependencies (pip install "torch>=1.10", then pip install -r requirements.txt).
  • Prerequisites: PyTorch, Python 3.10+.
  • Resources: Low-resource inference is supported, with VisCPM-Chat requiring as little as 5GB VRAM and VisCPM-Paint 17GB VRAM using BMInf.
  • Demos & Docs: Online demos and API usage guides are available.
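The install steps above can be sketched as the following shell session (the repository URL and environment name are assumptions inferred from the project and organization names, not taken from this page):

```shell
# Clone the repository and set up an isolated Python 3.10 environment
# (the env name "viscpm" is illustrative)
git clone https://github.com/OpenBMB/VisCPM.git
cd VisCPM
conda create -n viscpm python=3.10 -y
conda activate viscpm

# Install PyTorch first, then the remaining dependencies
pip install "torch>=1.10"   # quoted so the shell doesn't parse >= as redirection
pip install -r requirements.txt
```

Quoting the `torch>=1.10` specifier matters: unquoted, most shells would interpret `>=` as an output redirection rather than a version constraint.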

Highlighted Details

  • Achieves state-of-the-art performance among Chinese open-source multimodal models.
  • Supports both multimodal dialogue (image-to-text) and text-to-image generation.
  • Demonstrates strong bilingual capabilities, with models trained primarily on English data generalizing well to Chinese.
  • Offers fine-tuning scripts for adapting models to specific use cases.

Maintenance & Community

The project is actively updated; recent releases include MiniCPM-V 2.0 and OmniLMM. The VisCPM paper was accepted as a spotlight at ICLR 2024. Community support channels are not explicitly mentioned, but Hugging Face integration is provided.

Licensing & Compatibility

VisCPM models are licensed under a "General Model License Agreement - Source Attribution - Publicity Restriction - Non-Commercial" allowing personal and research use. Commercial use requires contacting cpm@modelbest.cn for licensing. The CPM-Bee base model has commercial licensing with similar contact requirements.

Limitations & Caveats

The safety modules are not perfect and may have false positives or negatives. Fine-tuning code is currently tested only on Linux. The project roadmap indicates planned support for model quantization.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star history: 13 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems).

X-LLM by phellonchen

0.3%
312
Multimodal LLM research paper
created 2 years ago
updated 2 years ago
Starred by Chip Huyen (author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 4 more.

open_flamingo by mlfoundations

0.1%
4k
Open-source framework for training large multimodal models
created 2 years ago
updated 11 months ago