mini-infer  by psmarter

LLM inference engine built from scratch

Created 5 months ago
269 stars

Top 95.5% on SourcePulse

GitHubView on GitHub
Project Summary

Summary

psmarter/mini-infer is an LLM inference engine built from scratch, meticulously implementing and benchmarking advanced optimization techniques. It targets engineers and researchers seeking deep insights into inference performance, offering a foundational engine with an OpenAI-compatible serving layer that achieves high throughput and efficiency.

How It Works

The engine employs a modular design, integrating key optimizations like PagedAttention (via flash_attn block tables), Continuous Batching, Chunked Prefill, Prefix Caching, Speculative Decoding, CUDA Graphs, and Tensor Parallelism. This from-scratch approach allows for precise measurement and understanding of each mechanism's impact, aiming to match baseline performance while pushing efficiency boundaries.

Quick Start & Requirements

  • Install: pip install -e ".[serve,dev]"
  • Run: mini-infer-serve --dry-run --port 8000 (verification) or mini-infer-serve --model /path/to/model --port 8000 (with model).
  • Prerequisites: Python 3.10+, PyTorch 2.1.2+cu121, transformers 4.43.4, flash-attn 2.5.9.post1 (block_size multiple of 256), CUDA 12.1. Benchmarks often use RTX 4090.
  • Docs: architecture.md, benchmarks.md, faq.md, roadmap.md.

Highlighted Details

  • Core serving path (PagedAttention + Continuous Batching + OpenAI API) achieves 100% HF Transformers throughput on Qwen2.5-7B.
  • Concurrent HTTP throughput scales 3.9x (1->8 clients, 55.7->219.1 tok/s) on RTX 4090.
  • PagedAttention matches HF baseline throughput at batch=8.
  • Chunked Prefill reduces ITL spikes by 57%-67%.
  • Prefix Caching cuts TTFT by 22%.
  • Speculative Decoding yields a 55.85% acceptance rate.
  • CUDA Graph reduces decode latency by 28.9%.
  • Flash Decoding offers 3.31x latency improvement for seq=4096.
  • TP=2 greedy output matches single-card, useful for large models.

Maintenance & Community

The project includes a roadmap (docs/roadmap.md) and extensive test suites. Specific community channels or contributor details are not detailed in the provided README.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

This is a prototype implementation focused on correctness and benchmarking individual mechanisms. Tensor Parallelism (TP=2) on a 1.5B model shows communication overhead, making it suitable for memory scaling rather than speedup. Model coverage is limited to Qwen2.5/DeepSeek-V2, unlike vLLM's broader support, and its scheduler is manually implemented, lacking vLLM's advanced SLO and KV-aware features.

Health Check
Last Commit

1 month ago

Responsiveness

Inactive

Pull Requests (30d)
1
Issues (30d)
6
Star History
69 stars in the last 30 days

Explore Similar Projects

Starred by Jason Knight Jason Knight(Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero Omar Sanseviero(DevRel at Google DeepMind), and
12 more.

mistral.rs by EricLBuehler

0.4%
7k
LLM inference engine for blazing fast performance
Created 2 years ago
Updated 1 day ago
Starred by Andrej Karpathy Andrej Karpathy(Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Clement Delangue Clement Delangue(Cofounder of Hugging Face), and
62 more.

vllm by vllm-project

0.5%
82k
LLM serving engine for high-throughput, memory-efficient inference
Created 3 years ago
Updated 12 hours ago
Feedback? Help us improve.