mini-infer by psmarter

LLM inference engine built from scratch

Created 6 months ago

278 stars

Top 93.1% on SourcePulse

Project Summary

Summary

psmarter/mini-infer is an LLM inference engine built from scratch, meticulously implementing and benchmarking advanced optimization techniques. It targets engineers and researchers seeking deep insights into inference performance, offering a foundational engine with an OpenAI-compatible serving layer that achieves high throughput and efficiency.

How It Works

The engine employs a modular design, integrating key optimizations like PagedAttention (via flash_attn block tables), Continuous Batching, Chunked Prefill, Prefix Caching, Speculative Decoding, CUDA Graphs, and Tensor Parallelism. This from-scratch approach allows for precise measurement and understanding of each mechanism's impact, aiming to match baseline performance while pushing efficiency boundaries.

Quick Start & Requirements

Install: pip install -e ".[serve,dev]"
Run: mini-infer-serve --dry-run --port 8000 (verification) or mini-infer-serve --model /path/to/model --port 8000 (with model).
Prerequisites: Python 3.10+, PyTorch 2.1.2+cu121, transformers 4.43.4, flash-attn 2.5.9.post1 (block_size multiple of 256), CUDA 12.1. Benchmarks often use RTX 4090.
Docs: architecture.md, benchmarks.md, faq.md, roadmap.md.

Highlighted Details

Core serving path (PagedAttention + Continuous Batching + OpenAI API) achieves 100% HF Transformers throughput on Qwen2.5-7B.
Concurrent HTTP throughput scales 3.9x (1->8 clients, 55.7->219.1 tok/s) on RTX 4090.
PagedAttention matches HF baseline throughput at batch=8.
Chunked Prefill reduces ITL spikes by 57%-67%.
Prefix Caching cuts TTFT by 22%.
Speculative Decoding yields a 55.85% acceptance rate.
CUDA Graph reduces decode latency by 28.9%.
Flash Decoding offers 3.31x latency improvement for seq=4096.
TP=2 greedy output matches single-card, useful for large models.

Maintenance & Community

The project includes a roadmap (docs/roadmap.md) and extensive test suites. Specific community channels or contributor details are not detailed in the provided README.

Licensing & Compatibility

License: MIT.
Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

This is a prototype implementation focused on correctness and benchmarking individual mechanisms. Tensor Parallelism (TP=2) on a 1.5B model shows communication overhead, making it suitable for memory scaling rather than speedup. Model coverage is limited to Qwen2.5/DeepSeek-V2, unlike vLLM's broader support, and its scheduler is manually implemented, lacking vLLM's advanced SLO and KV-aware features.

mini-infer by psmarter

Explore Similar Projects

Awesome_LLM_System-PaperList by galeselee

ntransformer by xaskasdf

dash-infer by modelscope

MoE-Infinity by EfficientMoE

tokasaurus by ScalingIntelligence

ScaleLLM by vectorch-ai

xinfer by guoqingbao

ssd by tanishqkumar

candle-vllm by EricLBuehler

mistral.rs by EricLBuehler

dynamo by ai-dynamo

vllm by vllm-project