MiniCPM by OpenBMB

Ultra-efficient LLMs for end devices, achieving 5x+ speedup

Created 1 year ago · 8,142 stars · Top 6.5% on sourcepulse

Project Summary

MiniCPM offers a suite of highly efficient, small-parameter language models designed for deployment on end devices. It addresses the need for capable AI under tight resource constraints, targeting developers and researchers who need performant LLMs for edge computing and low-latency applications. The models are competitive with much larger counterparts, enabling advanced AI features on consumer hardware.

How It Works

MiniCPM models combine architectural changes with targeted training strategies to achieve high efficiency. Key innovations include the MiniCPM-S variant, which cuts FFN compute via sparse feed-forward layers (up to 87.89% activation sparsity), and LLMxMapReduce, which handles theoretically unbounded context lengths by processing a long input in chunks and aggregating the intermediate results. Together these techniques preserve strong task performance while drastically reducing computational requirements.
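
To make the sparsity argument concrete, here is a minimal NumPy sketch (illustrative only, not the MiniCPM-S implementation): with a ReLU-style activation, most FFN hidden units are exactly zero for a given token, so the down-projection only needs the columns that fired. Random weights give roughly 50% sparsity here; the 87.89% figure comes from MiniCPM-S's training recipe.

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff = 1024, 4096
    x = rng.standard_normal(d_model)
    W_up = rng.standard_normal((d_ff, d_model))
    W_down = rng.standard_normal((d_model, d_ff))

    h = np.maximum(W_up @ x, 0.0)   # ReLU zeroes out many hidden units
    active = np.flatnonzero(h)      # indices of the units that fired

    # Dense down-projection costs d_model * d_ff multiply-adds; the sparse
    # version touches only the active columns and gives the same result.
    y_dense = W_down @ h
    y_sparse = W_down[:, active] @ h[active]
    assert np.allclose(y_dense, y_sparse)

    print(f"activation sparsity: {1 - active.size / d_ff:.1%}")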

Quick Start & Requirements

  • Hugging Face: pip install transformers accelerate torch, then load with AutoModelForCausalLM.from_pretrained('openbmb/MiniCPM3-4B', torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True); a full example follows this list.
  • SGLang: pip install sglang (from source recommended for latest optimizations).
  • vLLM: pip install "vllm>=0.6.2" (quote the specifier so the shell does not treat >= as a redirection).
  • llama.cpp: GGUF models available; requires make to build.
  • Hardware: GPU recommended for optimal performance (e.g., CUDA).
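
Putting the Hugging Face bullet into a complete script, a minimal generation example might look like this (the model ID is from the quick start; the prompt and sampling parameters are illustrative, not prescribed by the project):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = "openbmb/MiniCPM3-4B"
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
    )

    # Build a chat-formatted prompt with the model's own template.
    messages = [{"role": "user", "content": "Write a haiku about edge devices."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(
        inputs, max_new_tokens=256, do_sample=True, temperature=0.7, top_p=0.7
    )
    # Decode only the newly generated tokens, not the echoed prompt.
    print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))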

Highlighted Details

  • MiniCPM3-4B surpasses Phi-3.5-mini, is comparable to GPT-3.5-Turbo, and competes with 7B-9B models such as Llama3.1-8B-Instruct.
  • Features tool calling (SOTA on BFCL for models <9B) and code interpreter capabilities; see the sketch after this list.
  • Achieves state-of-the-art performance on long-context benchmarks (InfiniteBench) with LLMxMapReduce, outperforming GPT-4 and KimiChat.
  • Offers RAG capabilities with MiniCPM-Embedding and MiniCPM-Reranker achieving SOTA in cross-lingual retrieval.
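
As a sketch of the tool-calling interface: recent transformers releases can serialize Python functions into the prompt through apply_chat_template's tools argument, provided the model's bundled chat template supports it (an assumption here). The get_weather function is a hypothetical stub:

    from transformers import AutoTokenizer

    def get_weather(city: str) -> str:
        """Get the current weather for a city.

        Args:
            city: Name of the city to look up.
        """
        return "sunny"  # hypothetical stub for illustration

    tokenizer = AutoTokenizer.from_pretrained(
        "openbmb/MiniCPM3-4B", trust_remote_code=True
    )
    messages = [{"role": "user", "content": "What's the weather in Beijing?"}]

    # The function signature and docstring are converted to a JSON schema and
    # rendered into the prompt by the model's chat template.
    prompt = tokenizer.apply_chat_template(
        messages, tools=[get_weather], add_generation_prompt=True, tokenize=False
    )
    print(prompt)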

Maintenance & Community

The project is actively developed by OpenBMB, with upstream integrations in SGLang and llama.cpp. Community support is available via Discord and WeChat groups.

Licensing & Compatibility

Code is licensed under Apache-2.0. Model weights require adherence to the MiniCPM Model Commercial License Agreement, with free commercial use granted after registration via a questionnaire. Academic research use is fully open.

Limitations & Caveats

While MiniCPM models are highly efficient, performance varies with the specific hardware and inference framework. The README also notes a version split: non-MiniCPM models were run on vLLM 0.2.7, while the MiniCPM implementation is based on vLLM 0.2.2, so expect compatibility nuances when reproducing comparisons.

Health Check

  • Last commit: 3 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 6
  • Star History: 852 stars in the last 90 days

Explore Similar Projects

  • JittorLLMs by Jittor: Low-resource LLM inference library. 0% · 2k stars; created 2 years ago, updated 5 months ago. Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems).
  • LightLLM by ModelTC: Python framework for LLM inference and serving. 0.7% · 3k stars; created 2 years ago, updated 15 hours ago. Starred by Chip Huyen, Philipp Schmid (DevRel at Google DeepMind), and 2 more.
  • ktransformers by kvcache-ai: Framework for LLM inference optimization experimentation. 0.4% · 15k stars; created 1 year ago, updated 2 days ago. Starred by Patrick von Platen (core contributor to Hugging Face Transformers and Diffusers), Michael Han (cofounder of Unsloth), and 1 more.