simple-llm by naklecha

Minimal, extensible LLM inference engine

Created 2 weeks ago

397 stars

Top 72.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

SimpleLLM is a minimal (~950 lines), extensible LLM inference engine built from scratch, targeting researchers, students, and developers needing a readable foundation for experimenting with state-of-the-art inference techniques. It offers a performant starting point for modifying core components, demonstrating the viability of building such systems from the ground up.

How It Works

The engine employs an asynchronous, continuous batching architecture designed to maximize GPU throughput by keeping the hardware saturated. Key optimizations include CUDA graphs for eliminating kernel launch overhead during decode steps, a slot-based KV cache for zero-copy sequence management, and fused Triton kernels for operations like QKV projections, RMSNorm, and RoPE, reducing memory bandwidth requirements. It integrates Flash Attention 2 for memory-efficient attention computation and Grouped Query Attention (GQA) for faster decoding, all within a highly readable codebase (~760 lines for core inference logic) that facilitates modification and extension.
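
To make this concrete, below is a minimal sketch of a continuous-batching loop over a slot-based KV cache. Every name in it (ContinuousBatcher, Sequence, _decode_step, the constants) is hypothetical, written for illustration rather than taken from SimpleLLM's code; the real engine runs an asynchronous variant of this loop on the GPU.

```python
from collections import deque
from dataclasses import dataclass, field

EOS, MAX_NEW_TOKENS = 0, 128      # illustrative constants

@dataclass
class Sequence:
    prompt: list[int]
    generated: list[int] = field(default_factory=list)
    slot: int = -1                # index into preallocated KV-cache slots

class ContinuousBatcher:
    """Toy scheduler: admits work whenever a KV-cache slot frees up."""

    def __init__(self, num_slots: int):
        # Each slot owns a fixed region of the KV cache, so admitting or
        # retiring a sequence is just handing out an index -- no copies.
        self.free_slots = deque(range(num_slots))
        self.waiting: deque[Sequence] = deque()
        self.active: list[Sequence] = []

    def submit(self, seq: Sequence) -> None:
        self.waiting.append(seq)

    def step(self) -> None:
        # Top up the running batch instead of waiting for every sequence
        # to finish -- the essence of continuous batching.
        while self.waiting and self.free_slots:
            seq = self.waiting.popleft()
            seq.slot = self.free_slots.popleft()
            self.active.append(seq)
        if not self.active:
            return
        next_tokens = self._decode_step([s.slot for s in self.active])
        survivors = []
        for seq, tok in zip(self.active, next_tokens):
            seq.generated.append(tok)
            if tok == EOS or len(seq.generated) >= MAX_NEW_TOKENS:
                self.free_slots.append(seq.slot)   # retire; slot is reused
            else:
                survivors.append(seq)
        self.active = survivors

    def _decode_step(self, slots: list[int]) -> list[int]:
        # Stand-in for the real batched forward pass (CUDA-graph replay,
        # fused Triton kernels, Flash Attention over the slot cache).
        return [EOS] * len(slots)
```

The payoff of the slot scheme is visible in step(): when a sequence finishes, its slot index goes straight back to the free list, so a waiting request can join the very next decode step without repacking the batch or copying cache memory.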

Quick Start & Requirements

  • Prerequisites: Python 3.12+, NVIDIA GPU with CUDA 12.8+.
  • Installation: Execute ./setup.sh and activate the environment (source ./venv/bin/activate).
  • Usage: Instantiate LLM with the model path and call generate (see the sketch after this list).
  • Documentation: The codebase itself serves as the primary documentation.
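
For orientation, here is a hypothetical usage snippet matching the bullets above; the import path and generate signature are assumptions, so check the repository for the real API:

```python
# Hypothetical usage inferred from this summary; the actual module name,
# constructor, and generate() signature may differ in SimpleLLM.
from simplellm import LLM   # assumed import path

llm = LLM("openai/gpt-oss-120b")   # instantiate with the model path
outputs = llm.generate(["Explain continuous batching in one sentence."])
print(outputs[0])
```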

Highlighted Details

  • Performance: On a single NVIDIA H100 80GB, throughput is competitive with vLLM at batch size 1 (135 tok/s vs 138 tok/s) and exceeds it at batch size 64 (4,041 tok/s vs 3,846 tok/s), reflecting the batching and kernel optimizations described above.
  • Codebase: Minimalist design (~950 lines total, ~563 for the engine) prioritizes readability and extensibility, making it an ideal starting point for researchers and students to understand and modify state-of-the-art inference techniques.
  • Advanced Features: Implements cutting-edge techniques such as continuous batching, CUDA graphs, quantized Mixture-of-Experts (MoE) support, Flash Attention 2, and Grouped Query Attention (GQA); a generic CUDA-graph capture/replay sketch follows this list.
  • Model Support: Currently supports a single model, OpenAI/gpt-oss-120b, which serves as the concrete reference for the engine's capabilities.
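
Of these features, CUDA graphs benefit most from an illustration. The snippet below shows the generic PyTorch capture/replay pattern that such engines use for the decode step, built from standard torch.cuda APIs with a stand-in layer; it is not SimpleLLM's internal code:

```python
import torch

# Stand-in for one decode step of a real model.
model = torch.nn.Linear(4096, 4096, device="cuda").eval()
static_in = torch.zeros(1, 4096, device="cuda")

with torch.no_grad():
    # Warm up on a side stream so capture sees steady-state allocations.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Record every kernel of one decode step, exactly once.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)

    for _ in range(8):  # decode loop
        # Copy new data into the captured buffer; replay() relaunches all
        # recorded kernels in one call, with no per-kernel launch overhead.
        static_in.copy_(torch.randn(1, 4096, device="cuda"))
        g.replay()
        next_token = static_out.argmax(dim=-1)
```

Replay requires static input and output buffers, which is why new data is copied into static_in each step rather than allocating fresh tensors.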

Maintenance & Community

No specific details regarding community channels, active contributors, or roadmap were provided in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license generally suitable for commercial use, though specific integration details are not elaborated.

Limitations & Caveats

The engine is currently restricted to a single NVIDIA H100 GPU and the OpenAI/gpt-oss-120b model. Features like paged attention and multi-GPU tensor parallelism are planned for future implementation.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 402 stars in the last 19 days

Explore Similar Projects

Starred by Edward Sun (Research Scientist at Meta Superintelligence Lab), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 4 more.

batch_invariant_ops by thinking-machines-lab

0.1%
951
Enhance LLM inference determinism
Created 4 months ago
Updated 2 months ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

0.4%
918
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 1 day ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

0.8%
2k
Speculative decoding research paper for faster LLM inference
Created 2 years ago
Updated 2 weeks ago