Multimodal models for vision-language tasks in both Chinese and English
Top 36.1% on sourcepulse
VisCPM is an open-source family of Chinese-English bilingual multimodal large models, offering both multimodal conversation (VisCPM-Chat) and text-to-image generation (VisCPM-Paint). Built on the 10B-parameter CPM-Bee LLM, it targets researchers and developers who need state-of-the-art bilingual multimodal performance.
How It Works
VisCPM couples a visual encoder (Muffin) and a visual decoder (a diffusion UNet) with the CPM-Bee LLM. VisCPM-Chat is pretrained on English multimodal data and fine-tuned on English plus translated Chinese instruction data, which yields strong cross-lingual generalization. VisCPM-Paint uses CPM-Bee as the text encoder and a UNet image decoder initialized from Stable Diffusion 2.1 parameters; although trained on English data, it also performs well on Chinese prompts.
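For illustration, multimodal conversation runs through a single chat-style call. The sketch below follows the interface shown in the repository's examples; the checkpoint path is a placeholder, and exact argument names may differ between releases:

```python
from PIL import Image
from VisCPM import VisCPMChat  # conversational interface from the repo's examples

# Placeholder path to a downloaded VisCPM-Chat checkpoint.
viscpm_chat = VisCPMChat('/path/to/viscpm_chat_checkpoint', image_safety_checker=True)

image = Image.open('figures/example.png').convert('RGB')

# Questions may be asked in Chinese or English thanks to cross-lingual transfer.
answer, context, vision_hidden_states = viscpm_chat.chat(image, 'What is unusual about this image?')
print(answer)

# The returned dialogue context and vision states support multi-turn conversation.
follow_up, context, _ = viscpm_chat.chat(image, 'Why might that be?', context,
                                         vision_hidden_states=vision_hidden_states)
print(follow_up)
```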
Quick Start & Requirements
Clone the repository, create a Python 3.10 conda environment (conda create -n viscpm python=3.10), and install dependencies (pip install torch>=1.10, then pip install -r requirements.txt).
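A minimal text-to-image sketch, following the usage shown in the project's README; the checkpoint path is a placeholder and constructor arguments may vary by release:

```python
from VisCPM import VisCPMPaint  # text-to-image interface from the repo's examples

# Placeholder path to a downloaded VisCPM-Paint checkpoint.
painter = VisCPMPaint('/path/to/viscpm_paint_checkpoint',
                      image_safety_checker=True,   # filter unsafe generated images
                      prompt_safety_checker=True,  # filter unsafe input prompts
                      add_ranker=True)             # rerank candidates for quality

# Prompts can be Chinese or English; CPM-Bee encodes the text for the UNet decoder.
image = painter.generate('A boat drifting on a misty mountain lake at dawn')
image.save('output.png')
```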
Highlighted Details
Maintenance & Community
The project is actively maintained; recent releases include MiniCPM-V 2.0 and OmniLMM, and the VisCPM paper was accepted as a spotlight at ICLR 2024. No community support channels are explicitly listed, but Hugging Face integration is provided.
Licensing & Compatibility
VisCPM models are released under a "General Model License Agreement - Source Attribution - Publicity Restriction - Non-Commercial" license that permits personal and research use. Commercial use requires contacting cpm@modelbest.cn for a license; the CPM-Bee base model supports commercial use under a similar contact-based arrangement.
Limitations & Caveats
The safety modules are imperfect and can produce false positives or false negatives. The fine-tuning code is currently tested only on Linux, and model quantization is on the roadmap but not yet available.
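If the safety filter's false positives block legitimate research prompts, the checkers can be turned off at construction time. This assumes the constructor flags shown in the sketch above, so verify against your installed version:

```python
# Research use only: disabling both checkers returns unfiltered outputs.
painter = VisCPMPaint('/path/to/viscpm_paint_checkpoint',
                      image_safety_checker=False,
                      prompt_safety_checker=False,
                      add_ranker=True)
```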