psmarter/mini-infer: LLM inference engine built from scratch
Top 95.5% on SourcePulse
Summary
psmarter/mini-infer is an LLM inference engine built from scratch, meticulously implementing and benchmarking advanced optimization techniques. It targets engineers and researchers seeking deep insights into inference performance, offering a foundational engine with an OpenAI-compatible serving layer that achieves high throughput and efficiency.
How It Works
The engine employs a modular design, integrating key optimizations like PagedAttention (via flash_attn block tables), Continuous Batching, Chunked Prefill, Prefix Caching, Speculative Decoding, CUDA Graphs, and Tensor Parallelism. This from-scratch approach allows for precise measurement and understanding of each mechanism's impact, aiming to match baseline performance while pushing efficiency boundaries.
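As a rough illustration of how two of these mechanisms interact, the sketch below simulates continuous batching on top of a paged KV cache. It is not mini-infer's code: `BlockAllocator`, `Request`, `run`, and the `BLOCK_SIZE` value are hypothetical names chosen for this example, no model or flash_attn call is involved, and preemption is left out. Each request keeps a block table that grows one block at a time, and a waiting request is admitted as soon as a batch slot and enough free blocks are available.

```python
from __future__ import annotations

from collections import deque
from dataclasses import dataclass, field

BLOCK_SIZE = 4  # tokens per KV-cache block (illustrative value, an assumption)


class BlockAllocator:
    """Hands out fixed-size KV-cache block ids from a free list."""

    def __init__(self, num_blocks: int) -> None:
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free:
            # A real engine would preempt or swap a request here; this sketch just fails.
            raise RuntimeError("out of KV-cache blocks")
        return self.free.pop()

    def release(self, blocks: list[int]) -> None:
        self.free.extend(blocks)


@dataclass
class Request:
    name: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0
    block_table: list[int] = field(default_factory=list)  # logical index -> physical block id

    @property
    def num_tokens(self) -> int:
        return self.prompt_len + self.generated

    @property
    def finished(self) -> bool:
        return self.generated >= self.max_new_tokens


def blocks_needed(num_tokens: int) -> int:
    return -(-num_tokens // BLOCK_SIZE)  # ceiling division


def run(requests: list[Request], num_blocks: int, max_batch: int) -> list[str]:
    """Simulate decode steps and return request names in completion order."""
    allocator = BlockAllocator(num_blocks)
    waiting = deque(requests)
    running: list[Request] = []
    done: list[str] = []
    while waiting or running:
        # Continuous batching: admit new requests between decode steps,
        # as soon as a batch slot and enough blocks for the prompt are free.
        while waiting and len(running) < max_batch:
            need = blocks_needed(waiting[0].prompt_len)
            if need > len(allocator.free):
                break
            req = waiting.popleft()
            req.block_table = [allocator.allocate() for _ in range(need)]
            running.append(req)
        if not running:
            raise RuntimeError("not enough KV-cache blocks for the next request")
        # One decode step: each running request appends one token's KV entry,
        # growing its block table only when the current last block is full.
        for req in running:
            if blocks_needed(req.num_tokens + 1) > len(req.block_table):
                req.block_table.append(allocator.allocate())
            req.generated += 1
        # Retire finished requests immediately so their blocks can be reused.
        still_running = []
        for req in running:
            if req.finished:
                allocator.release(req.block_table)
                req.block_table = []
                done.append(req.name)
            else:
                still_running.append(req)
        running = still_running
    return done


if __name__ == "__main__":
    order = run(
        [Request("A", 5, 2), Request("B", 3, 5), Request("C", 2, 1)],
        num_blocks=8,
        max_batch=2,
    )
    print(order)
```

With a batch limit of two, request C joins the batch the moment A finishes rather than waiting for B, so it completes before B does. A static batch would have held C back until both A and B were done.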
Quick Start & Requirements
Install: pip install -e ".[serve,dev]"
Verify the install: mini-infer-serve --dry-run --port 8000
Serve a model: mini-infer-serve --model /path/to/model --port 8000
Highlighted Details
Maintenance & Community
The project includes a roadmap (docs/roadmap.md) and extensive test suites. The README does not list community channels or contributors.
Licensing & Compatibility
Limitations & Caveats
This is a prototype implementation focused on correctness and benchmarking individual mechanisms. Tensor Parallelism (TP=2) on a 1.5B model shows communication overhead, making it suitable for memory scaling rather than speedup. Model coverage is limited to Qwen2.5/DeepSeek-V2, unlike vLLM's broader support, and its scheduler is manually implemented, lacking vLLM's advanced SLO and KV-aware features.
1 month ago
Inactive