MiniCPM by OpenBMB

Ultra-efficient LLMs for end devices, achieving 5x+ speedup

Created 1 year ago
8,346 stars

Top 6.2% on SourcePulse

View on GitHub
Project Summary

MiniCPM offers a suite of highly efficient, small-parameter language models designed for deployment on end devices. It addresses the need for powerful AI on resource-constrained hardware, targeting developers and researchers who need performant LLMs for edge computing and low-latency applications. The models demonstrate competitive performance against larger counterparts, enabling advanced AI features on consumer hardware.

How It Works

MiniCPM models leverage a novel architecture and training strategies to achieve high efficiency. Key innovations include the development of the MiniCPM-S variant, which achieves significant FLOP reduction through sparse FFN layers (up to 87.89% sparsity), and the introduction of LLMxMapReduce for theoretically infinite context length processing. These techniques allow the models to maintain strong performance while drastically reducing computational requirements.
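As a toy illustration of the sparse-FFN idea (not MiniCPM-S's actual kernels), masking the zero activations produced by a ReLU-style FFN lets the down projection skip inactive units entirely; the dimensions and tolerance below are illustrative:

    # Toy sketch of FFN activation sparsity; not OpenBMB's implementation.
    import torch

    d_model, d_ff = 1024, 4096
    x = torch.randn(d_model)
    W_up = torch.randn(d_ff, d_model)
    W_down = torch.randn(d_model, d_ff)

    h = torch.relu(W_up @ x)               # many hidden units are exactly zero
    active = h.nonzero(as_tuple=True)[0]   # indices of the non-zero units

    # Dense cost: d_model * d_ff multiply-adds. Sparse cost: d_model * len(active).
    y_dense = W_down @ h
    y_sparse = W_down[:, active] @ h[active]

    # ~50% sparsity for random weights; a trained MiniCPM-S layer reaches
    # far higher (up to 87.89% per the README).
    print(f"sparsity: {1 - len(active) / d_ff:.2%}")
    print(torch.allclose(y_dense, y_sparse, rtol=1e-4))  # same output, fewer FLOPs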

Quick Start & Requirements

  • Hugging Face: pip install transformers accelerate torch, then load the model with AutoModelForCausalLM.from_pretrained('openbmb/MiniCPM3-4B', torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True) (see the runnable example after this list).
  • SGLang: pip install sglang (building from source is recommended for the latest optimizations).
  • vLLM: pip install "vllm>=0.6.2" (quote the requirement so the shell does not treat >= as a redirect).
  • llama.cpp: GGUF models available; requires make to build.
  • Hardware: GPU recommended for optimal performance (e.g., CUDA).
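
Expanding the Hugging Face bullet above into a runnable script (the chat-template call is the standard transformers API; the prompt and sampling settings are illustrative):

    # Minimal Hugging Face quick start for MiniCPM3-4B.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    path = "openbmb/MiniCPM3-4B"
    tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        path, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
    )

    messages = [{"role": "user", "content": "Write a haiku about edge devices."}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output = model.generate(input_ids, max_new_tokens=128, do_sample=True,
                            temperature=0.7, top_p=0.9)
    # Decode only the newly generated tokens.
    print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))

For vLLM, the standard offline-inference API should apply once vllm>=0.6.2 is installed (sampling settings again illustrative):

    # Offline batch inference with vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model="openbmb/MiniCPM3-4B", trust_remote_code=True)
    params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
    for out in llm.generate(["Explain MiniCPM in one sentence."], params):
        print(out.outputs[0].text)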

Highlighted Details

  • MiniCPM3-4B surpasses Phi-3.5-mini, rivals GPT-3.5-Turbo, and is competitive with 7B–9B models such as Llama3.1-8B-Instruct.
  • Features tool calling (SOTA on BFCL for models <9B) and code interpreter capabilities.
  • Achieves state-of-the-art performance on long-context benchmarks (InfiniteBench) with LLMxMapReduce, outperforming GPT-4 and KimiChat.
  • Offers RAG capabilities: MiniCPM-Embedding and MiniCPM-Reranker achieve SOTA in cross-lingual retrieval (see the retrieval sketch after this list).
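
A sketch of dense retrieval with MiniCPM-Embedding, assuming a standard transformers encoder with masked mean pooling; the exact pooling and query-instruction format are defined on the model card, so treat those details here as assumptions:

    # Hypothetical dense-retrieval sketch; the pooling is an assumption --
    # consult the openbmb/MiniCPM-Embedding model card for the exact recipe.
    import torch
    import torch.nn.functional as F
    from transformers import AutoModel, AutoTokenizer

    name = "openbmb/MiniCPM-Embedding"
    tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
    model = AutoModel.from_pretrained(name, trust_remote_code=True).eval()

    def embed(texts):
        batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**batch).last_hidden_state      # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)       # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)      # masked mean pooling
        return F.normalize(pooled, dim=-1)                 # unit-norm embeddings

    query = embed(["What is MiniCPM?"])
    docs = embed(["MiniCPM is a family of efficient edge LLMs.",
                  "Bananas are yellow."])
    print(query @ docs.T)   # cosine similarities; higher = more relevant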

Maintenance & Community

The project is actively developed by OpenBMB, with notable integrations landed in SGLang and llama.cpp. Community engagement is encouraged via Discord and WeChat groups.

Licensing & Compatibility

Code is licensed under Apache-2.0. Model weights require adherence to the MiniCPM Model Commercial License Agreement, with free commercial use granted after registration via a questionnaire. Academic research use is fully open.

Limitations & Caveats

While MiniCPM models are highly efficient, performance varies with the specific hardware and inference framework. The README also notes that benchmark comparisons used vLLM 0.2.7 for non-MiniCPM models while the MiniCPM implementation was based on vLLM 0.2.2, so expect version-compatibility nuances when reproducing results.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 5
  • Issues (30d): 7
  • Star History: 173 stars in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 9 more.

LightLLM by ModelTC

4k stars
Top 0.5% on SourcePulse
Python framework for LLM inference and serving
Created 2 years ago
Updated 12 hours ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 11 more.

mistral.rs by EricLBuehler

6k stars
Top 0.3% on SourcePulse
LLM inference engine for blazing fast performance
Created 1 year ago
Updated 22 hours ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

6k stars
Top 0.1% on SourcePulse
Inference optimization for LLMs on low-resource hardware
Created 2 years ago
Updated 2 weeks ago