MLLM for vision, speech, and multimodal live streaming on your phone
MiniCPM-o 2.6 is an 8B-parameter multimodal large language model (MLLM) designed for end-side deployment. It processes vision, speech, and text inputs and generates text and speech outputs, aiming for GPT-4o-level performance in applications such as multimodal live streaming on mobile devices.
How It Works
MiniCPM-o 2.6 employs an end-to-end omni-modal architecture, integrating the modality encoders and decoders with the language model and training them jointly. For live streaming, it introduces an omni-modal streaming mechanism that uses time-division multiplexing to turn parallel vision and audio streams into a sequential stream of periodic time slices. The model also incorporates configurable speech modeling, allowing flexible voice customization, end-to-end voice cloning, and role-playing through multimodal system prompts.
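The time-division multiplexing idea can be pictured as slicing each incoming modality stream into short, fixed-length time windows and interleaving the slices into one time-ordered sequence for the language model. The sketch below is only a conceptual illustration of that scheduling step, not MiniCPM-o's actual implementation; all names (Chunk, slice_stream, interleave) are made up for the example.

```python
from dataclasses import dataclass
from typing import Iterator, List

# Conceptual illustration of time-division multiplexing (TDM):
# parallel audio/video streams are cut into short time slices and
# interleaved into a single sequential stream for the LLM backbone.
# All names here are illustrative, not MiniCPM-o APIs.

@dataclass
class Chunk:
    modality: str    # "video" or "audio"
    t_start: float   # start of the time slice (seconds)
    payload: bytes   # encoded frames or audio samples for this slice


def slice_stream(modality: str, duration_s: float, slice_s: float = 1.0) -> List[Chunk]:
    """Cut one modality stream into fixed-length time slices."""
    n = int(duration_s / slice_s)
    return [Chunk(modality, i * slice_s, payload=b"") for i in range(n)]


def interleave(streams: List[List[Chunk]]) -> Iterator[Chunk]:
    """Merge per-modality slices into one time-ordered sequence."""
    merged = sorted((c for s in streams for c in s), key=lambda c: (c.t_start, c.modality))
    yield from merged


if __name__ == "__main__":
    video = slice_stream("video", duration_s=3.0)
    audio = slice_stream("audio", duration_s=3.0)
    for chunk in interleave([video, audio]):
        # Each slice would be encoded by its modality encoder and fed
        # to the language model in this interleaved order.
        print(f"{chunk.t_start:>4.1f}s  {chunk.modality}")
```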
Quick Start & Requirements
Inference uses the Hugging Face transformers library; specific requirements for demos and forks are listed separately. Dependencies: transformers (==4.44.2), decord, librosa, moviepy. A GPU with sufficient VRAM (e.g., 18 GB for the full model, 9 GB for int4 quantization) is recommended for optimal performance.
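As a minimal loading sketch with transformers: the model id openbmb/MiniCPM-o-2_6 and the chat() call exposed through trust_remote_code follow MiniCPM-V-series conventions and are assumptions here, so check the official model card for the exact arguments.

```python
import torch
from transformers import AutoModel, AutoTokenizer
from PIL import Image

# Model id and chat() signature are assumptions based on MiniCPM-V/o
# series conventions; verify against the official model card.
model_id = "openbmb/MiniCPM-o-2_6"

model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,       # model code ships with the checkpoint
    torch_dtype=torch.bfloat16,   # full model needs roughly 18 GB VRAM
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Describe this image."]}]

# chat() is provided by the remote model code, not by transformers itself.
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```

The int4 checkpoint can be loaded the same way and brings the VRAM requirement down to roughly the 9 GB figure noted above.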
Highlighted Details
Deployment is supported via llama.cpp, ollama, and vLLM, with quantized models (int4, GGUF) available.
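For the vLLM path, a text-only offline-inference sketch with the vLLM Python API might look like the following; it assumes a vLLM build that supports this model and reuses the openbmb/MiniCPM-o-2_6 model id as an assumption, and multimodal inputs need additional setup per the vLLM documentation.

```python
from vllm import LLM, SamplingParams

# Assumes a vLLM build with MiniCPM-o 2.6 support; the model id and
# max_model_len value are assumptions -- adjust to your environment.
llm = LLM(
    model="openbmb/MiniCPM-o-2_6",
    trust_remote_code=True,
    max_model_len=4096,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Give a one-sentence summary of MiniCPM-o 2.6."], params)
print(outputs[0].outputs[0].text)
```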
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats