MiniCPM-o by OpenBMB

MLLM for vision, speech, and multimodal live streaming on your phone

created 1 year ago
19,900 stars

Top 2.3% on sourcepulse

Project Summary

MiniCPM-o 2.6 is an 8B parameter multimodal large language model (MLLM) designed for end-side deployment, capable of processing vision, speech, and text inputs to generate text and speech outputs. It aims to provide GPT-4o-level performance for applications like multimodal live streaming on mobile devices, offering advanced capabilities in vision, speech, and integrated multimodal understanding.

How It Works

MiniCPM-o 2.6 uses an end-to-end omni-modal architecture in which modality-specific encoders and decoders are connected to the LLM backbone and trained together. It introduces an omni-modal live streaming mechanism based on time-division multiplexing, which splits parallel audio and video streams into short time slices and processes them as a single sequential stream. The model also supports configurable speech modeling, enabling flexible voice customization, end-to-end voice cloning, and role-playing through multimodal system prompts.
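One rough way to picture the time-division multiplexing step (the types and function below are purely illustrative, not MiniCPM-o's actual code): each modality's stream is cut into short time windows, and the per-modality slices are interleaved by timestamp into one sequence for the backbone to consume.

```python
# Illustrative sketch only: hypothetical types and helper, not the project's real API.
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    modality: str    # "audio" or "video"
    t_start: float   # start of the time slice, in seconds
    tokens: List[int]

def time_division_multiplex(audio: List[Chunk], video: List[Chunk]) -> List[Chunk]:
    """Interleave per-modality chunks by time slice so the model sees one
    sequential stream instead of two parallel ones."""
    return sorted(audio + video, key=lambda c: c.t_start)

# Example: 1-second slices from each stream become a single ordered sequence.
audio = [Chunk("audio", t, tokens=[]) for t in (0.0, 1.0, 2.0)]
video = [Chunk("video", t, tokens=[]) for t in (0.0, 1.0, 2.0)]
stream = time_division_multiplex(audio, video)
print([(c.modality, c.t_start) for c in stream])
```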

Quick Start & Requirements

  • Installation: primarily via the Hugging Face transformers library; demo- and fork-specific requirements are listed in the repository (a minimal loading sketch follows this list).
  • Prerequisites: Python 3.10+, PyTorch (>=2.0), transformers (==4.44.2), decord, librosa, moviepy. A GPU with sufficient VRAM (e.g., 18 GB for the full model, 9 GB for int4 quantization) is recommended for optimal performance.
  • Demos & Docs: An online demo is available, along with local WebUI demos for a chatbot and for real-time voice/video calls. Technical reports and detailed usage examples are provided.
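A minimal loading sketch, assuming the Hugging Face checkpoint ID openbmb/MiniCPM-o-2_6 and the chat interface exposed through the repository's remote code via trust_remote_code; check the project README for the exact, current API and any extra initialization flags.

```python
# Minimal sketch: assumes the openbmb/MiniCPM-o-2_6 checkpoint and the chat()
# method provided by the repo's remote code; verify against the official README.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-o-2_6"
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,       # loads the project's custom model code
    torch_dtype=torch.bfloat16,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Single-image question answering; messages mix PIL images and text.
image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "What is in this image?"]}]
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```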

Highlighted Details

  • Achieves GPT-4o-level performance in vision, speech, and multimodal live streaming, outperforming proprietary models such as GPT-4o-202405 on benchmarks including OpenCompass and OCRBench.
  • Supports bilingual real-time speech conversation with configurable voices, voice cloning, and emotion/speed/style control.
  • Enables multimodal live streaming on end-side devices such as iPads thanks to superior token density: a 1.8-megapixel image is encoded into only 640 tokens (see the short calculation after this list).
  • Offers efficient inference options via llama.cpp, ollama, and vLLM, with quantized models (int4, GGUF) available.
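To put the token-density figure in perspective, the quoted ratio works out to roughly 2,800 pixels per visual token; the snippet below simply reproduces that arithmetic from the numbers stated above.

```python
# Reproduce the token-density figure quoted above: 1.8M pixels -> 640 visual tokens.
pixels = 1_800_000
visual_tokens = 640
pixels_per_token = pixels / visual_tokens
print(f"{pixels_per_token:.0f} pixels per visual token")  # ~2812
```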

Maintenance & Community

  • Developed by THUNLP and ModelBest.
  • Active development with frequent updates and releases.
  • Community channels include WeChat and Discord.

Licensing & Compatibility

  • Released under Apache-2.0 License.
  • Model weights are free for academic research and available for commercial use after registration via a questionnaire.

Limitations & Caveats

  • Speech output can be unstable and may include background noise or meaningless sounds.
  • The model may exhibit repetitive responses to similar user queries.
  • Web demos hosted on overseas servers can experience high latency; local deployment is recommended.

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 11
  • Star History: 688 stars in the last 90 days
