MiniCPM-o by OpenBMB

MLLM for vision, speech, and multimodal live streaming on your phone

created 1 year ago
19,900 stars

Top 2.3% on sourcepulse

Project Summary

MiniCPM-o 2.6 is an 8B parameter multimodal large language model (MLLM) designed for end-side deployment, capable of processing vision, speech, and text inputs to generate text and speech outputs. It aims to provide GPT-4o-level performance for applications like multimodal live streaming on mobile devices, offering advanced capabilities in vision, speech, and integrated multimodal understanding.

How It Works

MiniCPM-o 2.6 uses an end-to-end omni-modal architecture in which modality-specific encoders and decoders are connected to the LLM backbone and trained together. It introduces an omni-modal live streaming mechanism based on time-division multiplexing, which splits parallel audio and video streams into short time slices and processes them as a single sequential stream. The model also supports configurable speech modeling, enabling flexible voice customization, end-to-end voice cloning, and role-playing through multimodal system prompts.
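One rough way to picture the time-division multiplexing step (the types and function below are purely illustrative, not MiniCPM-o's actual code): each modality's stream is cut into short time windows, and the per-modality slices are interleaved by timestamp into one sequence for the backbone to consume.

```python
# Illustrative sketch only: hypothetical types and helper, not the project's real API.
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    modality: str    # "audio" or "video"
    t_start: float   # start of the time slice, in seconds
    tokens: List[int]

def time_division_multiplex(audio: List[Chunk], video: List[Chunk]) -> List[Chunk]:
    """Interleave per-modality chunks by time slice so the model sees one
    sequential stream instead of two parallel ones."""
    return sorted(audio + video, key=lambda c: c.t_start)

# Example: 1-second slices from each stream become a single ordered sequence.
audio = [Chunk("audio", t, tokens=[]) for t in (0.0, 1.0, 2.0)]
video = [Chunk("video", t, tokens=[]) for t in (0.0, 1.0, 2.0)]
stream = time_division_multiplex(audio, video)
print([(c.modality, c.t_start) for c in stream])
```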

Quick Start & Requirements

  • Installation: primarily via the Hugging Face transformers library; demo- and fork-specific requirements are listed in the repository (a minimal loading sketch follows this list).
  • Prerequisites: Python 3.10+, PyTorch (>=2.0), transformers (==4.44.2), decord, librosa, moviepy. A GPU with sufficient VRAM (e.g., 18 GB for the full model, 9 GB for int4 quantization) is recommended for optimal performance.
  • Demos & Docs: An online demo is available, along with local WebUI demos for a chatbot and for real-time voice/video calls. Technical reports and detailed usage examples are provided.
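A minimal loading sketch, assuming the Hugging Face checkpoint ID openbmb/MiniCPM-o-2_6 and the chat interface exposed through the repository's remote code via trust_remote_code; check the project README for the exact, current API and any extra initialization flags.

```python
# Minimal sketch: assumes the openbmb/MiniCPM-o-2_6 checkpoint and the chat()
# method provided by the repo's remote code; verify against the official README.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-2_6"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,       # loads the project's custom model code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Single-image question answering; messages mix PIL images and text.
image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "What is in this image?"]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```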

Highlighted Details

  • Achieves GPT-4o-level performance in vision, speech, and multimodal live streaming, outperforming proprietary models such as GPT-4o-202405 on benchmarks including OpenCompass and OCRBench.
  • Supports bilingual real-time speech conversation with configurable voices, voice cloning, and emotion/speed/style control.
  • Enables multimodal live streaming on end-side devices such as iPads thanks to superior token density: a 1.8-megapixel image is encoded into only 640 tokens (see the short calculation after this list).
  • Offers efficient inference options via llama.cpp, ollama, and vLLM, with quantized models (int4, GGUF) available.
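To put the token-density figure in perspective, the quoted ratio works out to roughly 2,800 pixels per visual token; the snippet below simply reproduces that arithmetic from the numbers stated above.

```python
# Reproduce the token-density figure quoted above: 1.8M pixels -> 640 visual tokens.
pixels = 1_800_000
visual_tokens = 640
pixels_per_token = pixels / visual_tokens
print(f"{pixels_per_token:.0f} pixels per visual token")  # ~2812
```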

Maintenance & Community

  • Developed by THUNLP and ModelBest.
  • Active development with frequent updates and releases.
  • Community channels include WeChat and Discord.

Licensing & Compatibility

  • Released under Apache-2.0 License.
  • Model weights are free for academic research and available for commercial use after registration via a questionnaire.

Limitations & Caveats

  • Speech output can be unstable and may include background noise or meaningless sounds.
  • The model may exhibit repetitive responses to similar user queries.
  • Web demos hosted on overseas servers can experience high latency; local deployment is recommended.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 11
  • Star History: 688 stars in the last 90 days
