MiniCPM-V by OpenBMB

MLLM for vision, speech, and multimodal live streaming on your phone

Created 1 year ago
21,758 stars

Top 2.0% on SourcePulse

Project Summary

MiniCPM-o 2.6 is an 8B-parameter multimodal large language model (MLLM) designed for end-side deployment. It takes vision, speech, and text inputs and generates text and speech outputs, aiming to deliver GPT-4o-level performance for applications such as multimodal live streaming on mobile devices, with strong capabilities in vision, speech, and integrated multimodal understanding.

How It Works

MiniCPM-o 2.6 uses an end-to-end omni-modal architecture in which the modality encoders and decoders are connected and trained jointly. It introduces an omni-modal live streaming mechanism that splits parallel input streams into sequential chunks via time-division multiplexing. The model also supports configurable speech modeling, enabling flexible voice customization, end-to-end voice cloning, and role-playing through multimodal system prompts.
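
To make the multimodal system prompt idea concrete, below is a minimal, illustrative sketch of a message list whose system turn carries both a reference voice clip and a text persona. The message layout mirrors the chat format used by the MiniCPM-V family's transformers interface; the file name and the exact field arrangement are assumptions for illustration, not the project's documented API.

    # Illustrative only: a multimodal system prompt combining a reference voice
    # clip and a text persona (message layout assumed from the MiniCPM-V chat format).
    import librosa

    # Hypothetical 16 kHz clip supplying the target voice for cloning / role-play
    ref_audio, _ = librosa.load("reference_voice.wav", sr=16000, mono=True)

    system_turn = {
        "role": "system",
        "content": [
            ref_audio,  # audio part of the system prompt: the voice to imitate
            "You are a calm, friendly assistant. Speak in the voice given above.",
        ],
    }
    user_turn = {"role": "user", "content": ["Please introduce yourself briefly."]}

    # msgs would then be passed to the model's chat interface, as in the
    # quick-start sketch in the section below.
    msgs = [system_turn, user_turn]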

Quick Start & Requirements

  • Installation: primarily via the Hugging Face transformers library; additional requirements for the demos and forks are listed in the repository (see the quick-start sketch after this list).
  • Prerequisites: Python 3.10+, PyTorch >= 2.0, transformers == 4.44.2, decord, librosa, moviepy. A GPU with sufficient VRAM (e.g., 18 GB for the full model, 9 GB for int4 quantization) is recommended.
  • Demos & Docs: an online demo is available, along with local WebUI demos for the chatbot and real-time voice/video calls; technical reports and detailed usage examples are provided.
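
A minimal quick-start sketch using the transformers remote-code interface is shown below. It assumes the chat-style API published on the MiniCPM-o 2.6 model card; the checkpoint name openbmb/MiniCPM-o-2_6 and the model.chat(msgs=..., tokenizer=...) call follow the MiniCPM-V family's documented usage and should be verified against the current model card.

    # Quick-start sketch: single-image chat via the transformers remote-code API.
    # Checkpoint name and chat() signature assumed from the MiniCPM-o 2.6 model card.
    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer

    model_name = "openbmb/MiniCPM-o-2_6"
    model = AutoModel.from_pretrained(
        model_name,
        trust_remote_code=True,        # the checkpoint ships its own chat/vision code
        torch_dtype=torch.bfloat16,
    ).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    image = Image.open("example.jpg").convert("RGB")   # any local test image
    msgs = [{"role": "user", "content": [image, "Describe this image."]}]

    # chat() is the entry point exposed by the model's remote code
    answer = model.chat(msgs=msgs, tokenizer=tokenizer)
    print(answer)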

Highlighted Details

  • Achieves GPT-4o-level performance in vision, speech, and multimodal live streaming, outperforming proprietary models such as GPT-4o-202405 on benchmarks including OpenCompass and OCRBench.
  • Supports bilingual real-time speech conversation with configurable voices, voice cloning, and control over emotion, speed, and style.
  • Enables multimodal live streaming on end-side devices such as iPads thanks to its high token density, encoding a 1.8-megapixel image into only 640 tokens.
  • Offers efficient inference via llama.cpp, ollama, and vLLM, with quantized models (int4, GGUF) available (see the sketch after this list).
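
For the VRAM figures quoted above (roughly 9 GB with int4 quantization), a hedged loading sketch follows. The repository name openbmb/MiniCPM-o-2_6-int4 is assumed from the project's usual naming convention and should be checked on the model hub; the int4 weights load through the same remote-code interface as the full model.

    # Sketch: loading an int4-quantized checkpoint to fit in roughly 9 GB of VRAM.
    # Repository name assumed from the project's naming pattern; verify before use.
    from transformers import AutoModel, AutoTokenizer

    model_name = "openbmb/MiniCPM-o-2_6-int4"
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True).eval().cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    # The chat interface matches the full-precision model. GGUF exports for
    # llama.cpp and ollama are distributed separately and use those runtimes' tools.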

Maintenance & Community

  • Developed by THUNLP and ModelBest.
  • Active development with frequent updates and releases.
  • Community channels include WeChat and Discord.

Licensing & Compatibility

  • Released under Apache-2.0 License.
  • Model weights are free for academic research and available for commercial use after registration via a questionnaire.

Limitations & Caveats

  • Speech output can be unstable, occasionally producing background noise or non-meaningful sounds.
  • The model may exhibit repetitive responses to similar user queries.
  • Web demos hosted on overseas servers can experience high latency; local deployment is recommended.
Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 6
  • Issues (30d): 72
  • Star History: 1,752 stars in the last 30 days
