MiniCPM by OpenBMB

Ultra-efficient LLMs for end devices, achieving 5x+ speedup

Created 1 year ago
8,346 stars

Top 6.2% on SourcePulse

View on GitHub
Project Summary

MiniCPM offers a suite of highly efficient, small-parameter language models designed for deployment on end devices. It addresses the need for powerful AI on resource-constrained hardware, targeting developers and researchers who need performant LLMs for edge computing and low-latency applications. The models demonstrate competitive performance against larger counterparts, enabling advanced AI features on consumer hardware.

How It Works

MiniCPM models leverage a novel architecture and training strategies to achieve high efficiency. Key innovations include the development of the MiniCPM-S variant, which achieves significant FLOP reduction through sparse FFN layers (up to 87.89% sparsity), and the introduction of LLMxMapReduce for theoretically infinite context length processing. These techniques allow the models to maintain strong performance while drastically reducing computational requirements.
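As a toy illustration of the sparse-FFN idea (not MiniCPM-S's actual kernels), masking the zero activations produced by a ReLU-style FFN lets the down projection skip inactive units entirely; the dimensions and tolerance below are illustrative:

    # Toy sketch of FFN activation sparsity; not OpenBMB's implementation.
    import torch

    d_model, d_ff = 1024, 4096
    x = torch.randn(d_model)
    W_up = torch.randn(d_ff, d_model)
    W_down = torch.randn(d_model, d_ff)

    h = torch.relu(W_up @ x)               # many hidden units are exactly zero
    active = h.nonzero(as_tuple=True)[0]   # indices of the non-zero units

    # Dense cost: d_model * d_ff multiply-adds. Sparse cost: d_model * len(active).
    y_dense = W_down @ h
    y_sparse = W_down[:, active] @ h[active]

    # ~50% sparsity for random weights; a trained MiniCPM-S layer reaches
    # far higher (up to 87.89% per the README).
    print(f"sparsity: {1 - len(active) / d_ff:.2%}")
    print(torch.allclose(y_dense, y_sparse, rtol=1e-4))  # same output, fewer FLOPs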

Quick Start & Requirements

  • Hugging Face: pip install transformers accelerate torch, then load the model with AutoModelForCausalLM.from_pretrained('openbmb/MiniCPM3-4B', torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) (see the runnable example after this list).
  • SGLang: pip install sglang (building from source is recommended for the latest optimizations).
  • vLLM: pip install "vllm>=0.6.2" (quote the requirement so the shell does not treat >= as a redirect).
  • llama.cpp: GGUF models available; requires make to build.
  • Hardware: GPU recommended for optimal performance (e.g., CUDA).
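
Expanding the Hugging Face bullet above into a runnable script (the chat-template call is the standard transformers API; the prompt and sampling settings are illustrative):

    # Minimal Hugging Face quick start for MiniCPM3-4B.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = "openbmb/MiniCPM3-4B"
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
    )

    messages = [{"role": "user", "content": "Write a haiku about edge devices."}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(input_ids, max_new_tokens=128, do_sample=True,
                            temperature=0.7, top_p=0.9)
    # Decode only the newly generated tokens.
    print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))

For vLLM, the standard offline-inference API should apply once vllm>=0.6.2 is installed (sampling settings again illustrative):

    # Offline batch inference with vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model="openbmb/MiniCPM3-4B", trust_remote_code=True)
    params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
    for out in llm.generate(["Explain MiniCPM in one sentence."], params):
        print(out.outputs[0].text)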

Highlighted Details

  • MiniCPM3-4B surpasses Phi-3.5-mini, rivals GPT-3.5-Turbo, and is competitive with 7B–9B models such as Llama3.1-8B-Instruct.
  • Features tool calling (SOTA on BFCL for models <9B) and code interpreter capabilities.
  • Achieves state-of-the-art performance on long-context benchmarks (InfiniteBench) with LLMxMapReduce, outperforming GPT-4 and KimiChat.
  • Offers RAG capabilities: MiniCPM-Embedding and MiniCPM-Reranker achieve SOTA in cross-lingual retrieval (see the retrieval sketch after this list).
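
A sketch of dense retrieval with MiniCPM-Embedding, assuming a standard transformers encoder with masked mean pooling; the exact pooling and query-instruction format are defined on the model card, so treat those details here as assumptions:

    # Hypothetical dense-retrieval sketch; the pooling is an assumption --
    # consult the openbmb/MiniCPM-Embedding model card for the exact recipe.
    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    name = "openbmb/MiniCPM-Embedding"
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModel.from_pretrained(name, trust_remote_code=True).eval()

    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state      # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)      # masked mean pooling
        return F.normalize(pooled, dim=-1)                 # unit-norm embeddings

    query = embed(["What is MiniCPM?"])
    docs = embed(["MiniCPM is a family of efficient edge LLMs.",
                  "Bananas are yellow."])
    print(query @ docs.T)   # cosine similarities; higher = more relevant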

Maintenance & Community

The project is actively developed by OpenBMB, with notable integrations landed in SGLang and llama.cpp. Community engagement is encouraged via Discord and WeChat groups.

Licensing & Compatibility

Code is licensed under Apache-2.0. Model weights require adherence to the MiniCPM Model Commercial License Agreement, with free commercial use granted after registration via a questionnaire. Academic research use is fully open.

Limitations & Caveats

While MiniCPM models are highly efficient, performance varies with the specific hardware and inference framework. The README also notes that benchmark comparisons used vLLM 0.2.7 for non-MiniCPM models while the MiniCPM implementation was based on vLLM 0.2.2, so expect version-compatibility nuances when reproducing results.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 5
  • Issues (30d): 7
  • Star History: 173 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC

4k stars
Top 0.5% on SourcePulse
Python framework for LLM inference and serving
Created 2 years ago
Updated 12 hours ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

6k stars
Top 0.3% on SourcePulse
LLM inference engine for blazing fast performance
Created 1 year ago
Updated 22 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

6k stars
Top 0.1% on SourcePulse
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 weeks ago