mini-sglang by sgl-project

Lightweight LLM inference framework with advanced optimizations

Created 4 months ago
2,874 stars

Top 16.4% on SourcePulse

View on GitHub
Project Summary

Mini-SGLang provides a lightweight, high-performance inference framework for Large Language Models (LLMs), serving as a compact (~5,000 lines of Python) and transparent implementation of SGLang. It targets researchers and developers who want to understand how complex LLM serving systems work without giving up state-of-the-art throughput and latency.

How It Works

The framework employs advanced optimizations for efficient LLM serving. Key techniques include Radix Cache for KV cache reuse across requests, Chunked Prefill to reduce peak memory usage for long contexts, and Overlap Scheduling to hide CPU scheduling overhead with GPU computation. It integrates highly optimized kernels like FlashAttention and FlashInfer, and supports Tensor Parallelism for scaling inference across multiple GPUs.
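
To make the KV-cache reuse concrete, here is a minimal conceptual sketch of radix-style prefix caching in plain Python. It illustrates the general technique only; the class names, `kv_handle` placeholder, and methods are hypothetical, not mini-sglang's actual data structures or API.

    class RadixNode:
        """One node per token; a path from the root spells out a cached prefix."""
        def __init__(self):
            self.children = {}     # token id -> RadixNode
            self.kv_handle = None  # stand-in for the KV block cached for this prefix

    class RadixCache:
        def __init__(self):
            self.root = RadixNode()

        def insert(self, tokens, kv_handles):
            """Record the KV blocks of a finished request so later requests can reuse them."""
            node = self.root
            for tok, kv in zip(tokens, kv_handles):
                node = node.children.setdefault(tok, RadixNode())
                node.kv_handle = kv

        def match_prefix(self, tokens):
            """Return KV handles for the longest cached prefix; only the rest needs prefill."""
            node, reused = self.root, []
            for tok in tokens:
                if tok not in node.children:
                    break
                node = node.children[tok]
                reused.append(node.kv_handle)
            return reused

    cache = RadixCache()
    cache.insert([1, 2, 3, 4], ["kv1", "kv2", "kv3", "kv4"])  # first request fills the cache
    print(len(cache.match_prefix([1, 2, 3, 9])))              # 3 tokens reused; prefill only 1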

Quick Start & Requirements

  • Installation: The recommended environment setup uses uv with Python 3.10+ (3.12 shown in the example). Clone the repository (git clone https://github.com/sgl-project/mini-sglang.git), change into the directory, and run uv pip install -e . from the repository root.
  • Prerequisites: Requires the NVIDIA CUDA Toolkit, as Mini-SGLang relies on JIT-compiled CUDA kernels. Ensure the toolkit version matches your driver's CUDA capability (check with nvidia-smi).
  • Usage: Launch an OpenAI-compatible API server with python -m minisgl --model "MODEL_NAME". Options include Tensor Parallelism (--tp) and an interactive shell (--shell); see the client example after this list.
  • Documentation: Links to "Detailed Features" and "System Architecture" are mentioned but not provided.
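
Because the server exposes an OpenAI-compatible API, it can be queried with the standard openai Python client. The snippet below is a hedged sketch: the base URL and port are assumptions (use whatever address the server prints at startup), and "MODEL_NAME" stands in for the model passed to --model.

    from openai import OpenAI

    # Assumed local address; replace with the host/port your server actually reports.
    client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="MODEL_NAME",  # same value passed to --model when launching the server
        messages=[{"role": "user", "content": "Summarize what radix caching does."}],
    )
    print(response.choices[0].message.content)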

Highlighted Details

  • Achieves state-of-the-art throughput and latency.
  • Codebase is approximately 5,000 lines of Python, designed for readability and ease of modification.
  • Features advanced optimizations: Radix Cache, Chunked Prefill, and Overlap Scheduling (a chunked-prefill sketch follows this list).
  • Integrates optimized kernels (FlashAttention, FlashInfer) and supports Tensor Parallelism.
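
As an illustration of the chunked-prefill idea, the sketch below splits a long prompt into fixed-size chunks and prefills them sequentially, so peak activation memory is bounded by the chunk size rather than the full prompt length. The function name, chunk size, and forward-pass stand-in are all hypothetical; this is not mini-sglang's interface.

    def chunked_prefill(prompt_tokens, chunk_size, forward_fn, kv_cache):
        """Prefill `prompt_tokens` chunk by chunk, growing the KV cache as we go."""
        for start in range(0, len(prompt_tokens), chunk_size):
            chunk = prompt_tokens[start:start + chunk_size]
            # Each pass attends over the new chunk plus everything already cached,
            # so only one chunk's activations are live at a time.
            kv_cache = forward_fn(chunk, kv_cache)
        return kv_cache

    # Stand-in forward function that just appends tokens to a list-based "KV cache".
    kv = chunked_prefill(list(range(10_000)), chunk_size=2048,
                         forward_fn=lambda chunk, kv: kv + chunk, kv_cache=[])
    print(len(kv))  # 10000 tokens prefilled across 5 chunks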

Maintenance & Community

The provided README does not contain specific details regarding maintainers, community channels (e.g., Discord, Slack), or project roadmaps.

Licensing & Compatibility

The license type is not explicitly stated in the README. This requires clarification for commercial use or integration into closed-source projects.

Limitations & Caveats

The README does not detail known limitations, alpha status, or specific bugs. The absence of an explicit license is a significant adoption blocker.

Health Check

  • Last Commit: 5 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 42
  • Issues (30d): 15

Star History

2,900 stars in the last 30 days

Explore Similar Projects

Starred by Ying Sheng (Coauthor of SGLang) and Stas Bekman (Author of "Machine Learning Engineering Open Book"; Research Engineer at Snowflake).

llm-analysis by cli99

0% · 475 stars
CLI tool for LLM latency/memory analysis during training/inference
Created 2 years ago · Updated 8 months ago
Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler

0.4% · 6k stars
LLM inference engine for blazing-fast performance
Created 1 year ago · Updated 2 days ago
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Jiayi Pan (Author of SWE-Gym; MTS at xAI), and 34 more.

flash-attention by Dao-AILab

0.6% · 22k stars
Fast, memory-efficient attention implementation
Created 3 years ago · Updated 1 day ago