mini-sglang by sgl-project

Lightweight LLM inference framework with advanced optimizations

Created 4 months ago
2,874 stars

Top 16.4% on SourcePulse

View on GitHub
Project Summary

Mini-SGLang provides a lightweight, high-performance inference framework for Large Language Models (LLMs), serving as a compact (~5,000 lines of Python) and transparent implementation of SGLang. It targets researchers and developers who want to understand how complex LLM serving systems work without giving up state-of-the-art throughput and latency.

How It Works

The framework employs advanced optimizations for efficient LLM serving. Key techniques include Radix Cache for KV cache reuse across requests, Chunked Prefill to reduce peak memory usage for long contexts, and Overlap Scheduling to hide CPU scheduling overhead with GPU computation. It integrates highly optimized kernels like FlashAttention and FlashInfer, and supports Tensor Parallelism for scaling inference across multiple GPUs.
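
To make the KV-cache reuse concrete, here is a minimal conceptual sketch of radix-style prefix caching in plain Python. It illustrates the general technique only; the class names, `kv_handle` placeholder, and methods are hypothetical, not mini-sglang's actual data structures or API.

    class RadixNode:
        """One node per token; a path from the root spells out a cached prefix."""
        def __init__(self):
            self.children = {}     # token id -> RadixNode
            self.kv_handle = None  # stand-in for the KV block cached for this prefix

    class RadixCache:
        def __init__(self):
            self.root = RadixNode()

        def insert(self, tokens, kv_handles):
            """Record the KV blocks of a finished request so later requests can reuse them."""
            node = self.root
            for tok, kv in zip(tokens, kv_handles):
                node = node.children.setdefault(tok, RadixNode())
                node.kv_handle = kv

        def match_prefix(self, tokens):
            """Return KV handles for the longest cached prefix; only the rest needs prefill."""
            node, reused = self.root, []
            for tok in tokens:
                if tok not in node.children:
                    break
                node = node.children[tok]
                reused.append(node.kv_handle)
            return reused

    cache = RadixCache()
    cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])  # first request fills the cache
    print(len(cache.match_prefix([1, 2, 3, 9])))              # 3 tokens reused; prefill only 1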

Quick Start & Requirements

  • Installation: The recommended environment setup uses uv with Python 3.10+ (3.12 shown in the example). Clone the repository (git clone https://github.com/sgl-project/mini-sglang.git), change into the directory, and run uv pip install -e . from the repository root.
  • Prerequisites: Requires the NVIDIA CUDA Toolkit, as Mini-SGLang relies on JIT-compiled CUDA kernels. Ensure the toolkit version matches your driver's CUDA capability (check with nvidia-smi).
  • Usage: Launch an OpenAI-compatible API server with python -m minisgl --model "MODEL_NAME". Options include Tensor Parallelism (--tp) and an interactive shell (--shell); see the client example after this list.
  • Documentation: Links to "Detailed Features" and "System Architecture" are mentioned but not provided.
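
Because the server exposes an OpenAI-compatible API, it can be queried with the standard openai Python client. The snippet below is a hedged sketch: the base URL and port are assumptions (use whatever address the server prints at startup), and "MODEL_NAME" stands in for the model passed to --model.

    from openai import OpenAI

    # Assumed local address; replace with the host/port your server actually reports.
    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="MODEL_NAME",  # same value passed to --model when launching the server
        messages=[{"role": "user", "content": "Summarize what radix caching does."}],
    )
    print(response.choices[0].message.content)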

Highlighted Details

  • Achieves state-of-the-art throughput and latency.
  • Codebase is approximately 5,000 lines of Python, designed for readability and ease of modification.
  • Features advanced optimizations: Radix Cache, Chunked Prefill, and Overlap Scheduling (a chunked-prefill sketch follows this list).
  • Integrates optimized kernels (FlashAttention, FlashInfer) and supports Tensor Parallelism.
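
As an illustration of the chunked-prefill idea, the sketch below splits a long prompt into fixed-size chunks and prefills them sequentially, so peak activation memory is bounded by the chunk size rather than the full prompt length. The function name, chunk size, and forward-pass stand-in are all hypothetical; this is not mini-sglang's interface.

    def chunked_prefill(prompt_tokens, chunk_size, forward_fn, kv_cache):
        """Prefill `prompt_tokens` chunk by chunk, growing the KV cache as we go."""
        for start in range(0, len(prompt_tokens), chunk_size):
            chunk = prompt_tokens[start:start + chunk_size]
            # Each pass attends over the new chunk plus everything already cached,
            # so only one chunk's activations are live at a time.
            kv_cache = forward_fn(chunk, kv_cache)
        return kv_cache

    # Stand-in forward function that just appends tokens to a list-based "KV cache".
    kv = chunked_prefill(list(range(10_000)), chunk_size=2048,
                         forward_fn=lambda chunk, kv: kv + chunk, kv_cache=[])
    print(len(kv))  # 10000 tokens prefilled across 5 chunks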

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.

Licensing & Compatibility

The license type is not explicitly stated in the README. This requires clarification for commercial use or integration into closed-source projects.

Limitations & Caveats

The README does not detail known limitations, alpha status, or specific bugs. The absence of an explicit license is a significant adoption blocker.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 42
  • Issues (30d): 15

Star History

2,900 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

0% · 475 stars
CLI tool for LLM latency/memory analysis during training/inference
Created 2 years ago · Updated 8 months ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

0.4% · 6k stars
LLM inference engine for blazing-fast performance
Created 1 year ago · Updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 22k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 1 day ago