simple-llm by naklecha

Minimal, extensible LLM inference engine

Created 2 weeks ago

397 stars

Top 72.9% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

SimpleLLM is a minimal (~950 lines), extensible LLM inference engine built from scratch, targeting researchers, students, and developers needing a readable foundation for experimenting with state-of-the-art inference techniques. It offers a performant starting point for modifying core components, demonstrating the viability of building such systems from the ground up.

How It Works

The engine employs an asynchronous, continuous batching architecture designed to maximize GPU throughput by keeping the hardware saturated. Key optimizations include CUDA graphs for eliminating kernel launch overhead during decode steps, a slot-based KV cache for zero-copy sequence management, and fused Triton kernels for operations like QKV projections, RMSNorm, and RoPE, reducing memory bandwidth requirements. It integrates Flash Attention 2 for memory-efficient attention computation and Grouped Query Attention (GQA) for faster decoding, all within a highly readable codebase (~760 lines for core inference logic) that facilitates modification and extension.
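
To make this concrete, below is a minimal sketch of a continuous-batching loop over a slot-based KV cache. Every name in it (ContinuousBatcher, Sequence, _decode_step, the constants) is hypothetical, written for illustration rather than taken from SimpleLLM's code; the real engine runs an asynchronous variant of this loop on the GPU.

```python
from collections import deque
from dataclasses import dataclass, field

EOS, MAX_NEW_TOKENS = 0, 128      # illustrative constants

@dataclass
class Sequence:
    prompt: list[int]
    generated: list[int] = field(default_factory=list)
    slot: int = -1                # index into preallocated KV-cache slots

class ContinuousBatcher:
    """Toy scheduler: admits work whenever a KV-cache slot frees up."""

    def __init__(self, num_slots: int):
        # Each slot owns a fixed region of the KV cache, so admitting or
        # retiring a sequence is just handing out an index -- no copies.
        self.free_slots = deque(range(num_slots))
        self.waiting: deque[Sequence] = deque()
        self.active: list[Sequence] = []

    def submit(self, seq: Sequence) -> None:
        self.waiting.append(seq)

    def step(self) -> None:
        # Top up the running batch instead of waiting for every sequence
        # to finish -- the essence of continuous batching.
        while self.waiting and self.free_slots:
            seq = self.waiting.popleft()
            seq.slot = self.free_slots.popleft()
            self.active.append(seq)
        if not self.active:
            return
        next_tokens = self._decode_step([s.slot for s in self.active])
        survivors = []
        for seq, tok in zip(self.active, next_tokens):
            seq.generated.append(tok)
            if tok == EOS or len(seq.generated) >= MAX_NEW_TOKENS:
                self.free_slots.append(seq.slot)   # retire; slot is reused
            else:
                survivors.append(seq)
        self.active = survivors

    def _decode_step(self, slots: list[int]) -> list[int]:
        # Stand-in for the real batched forward pass (CUDA-graph replay,
        # fused Triton kernels, Flash Attention over the slot cache).
        return [EOS] * len(slots)
```

The payoff of the slot scheme is visible in step(): when a sequence finishes, its slot index goes straight back to the free list, so a waiting request can join the very next decode step without repacking the batch or copying cache memory.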

Quick Start & Requirements

  • Prerequisites: Python 3.12+, NVIDIA GPU with CUDA 12.8+.
  • Installation: Execute ./setup.sh and activate the environment (source ./venv/bin/activate).
  • Usage: Instantiate LLM with the model path and call generate (see the sketch after this list).
  • Documentation: The codebase itself serves as the primary documentation.
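
For orientation, here is a hypothetical usage snippet matching the bullets above; the import path and generate signature are assumptions, so check the repository for the real API:

```python
# Hypothetical usage inferred from this summary; the actual module name,
# constructor, and generate() signature may differ in SimpleLLM.
from simplellm import LLM   # assumed import path

llm = LLM("openai/gpt-oss-120b")   # instantiate with the model path
outputs = llm.generate(["Explain continuous batching in one sentence."])
print(outputs[0])
```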

Highlighted Details

  • Performance: On a single NVIDIA H100 80GB, throughput is competitive with vLLM at batch size 1 (135 tok/s vs 138 tok/s) and exceeds it at batch size 64 (4,041 tok/s vs 3,846 tok/s), reflecting the batching and kernel optimizations described above.
  • Codebase: Minimalist design (~950 lines total, ~563 for the engine) prioritizes readability and extensibility, making it an ideal starting point for researchers and students to understand and modify state-of-the-art inference techniques.
  • Advanced Features: Implements cutting-edge techniques such as continuous batching, CUDA graphs, quantized Mixture-of-Experts (MoE) support, Flash Attention 2, and Grouped Query Attention (GQA); a generic CUDA-graph capture/replay sketch follows this list.
  • Model Support: Currently supports a single model, OpenAI/gpt-oss-120b, which serves as the concrete reference for the engine's capabilities.
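
Of these features, CUDA graphs benefit most from an illustration. The snippet below shows the generic PyTorch capture/replay pattern that such engines use for the decode step, built from standard torch.cuda APIs with a stand-in layer; it is not SimpleLLM's internal code:

```python
import torch

# Stand-in for one decode step of a real model.
model = torch.nn.Linear(4096, 4096, device="cuda").eval()
static_in = torch.zeros(1, 4096, device="cuda")

with torch.no_grad():
    # Warm up on a side stream so capture sees steady-state allocations.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s):
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    # Record every kernel of one decode step, exactly once.
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        static_out = model(static_in)

    for _ in range(8):  # decode loop
        # Copy new data into the captured buffer; replay() relaunches all
        # recorded kernels in one call, with no per-kernel launch overhead.
        static_in.copy_(torch.randn(1, 4096, device="cuda"))
        g.replay()
        next_token = static_out.argmax(dim=-1)
```

Replay requires static input and output buffers, which is why new data is copied into static_in each step rather than allocating fresh tensors.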

Maintenance & Community

No specific details regarding community channels, active contributors, or roadmap were provided in the README.

Licensing & Compatibility

  • License: Apache 2.0.
  • Compatibility: Permissive license generally suitable for commercial use, though specific integration details are not elaborated.

Limitations & Caveats

The engine is currently restricted to a single NVIDIA H100 GPU and the OpenAI/gpt-oss-120b model. Features like paged attention and multi-GPU tensor parallelism are planned for future implementation.

Health Check

  • Last Commit: 2 weeks ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 402 stars in the last 19 days

Explore Similar Projects

Starred by Edward Sun (Research Scientist at Meta Superintelligence Lab), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 4 more.

batch_invariant_ops by thinking-machines-lab

0.1%
951
Enhance LLM inference determinism
Created 4 months ago
Updated 2 months ago
Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), Nikola Borisov (Founder and CEO of DeepInfra), and 3 more.

tensorrtllm_backend by triton-inference-server

0.4%
918
Triton backend for serving TensorRT-LLM models
Created 2 years ago
Updated 1 day ago
Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab

0.8%
2k
Speculative decoding research paper for faster LLM inference
Created 2 years ago
Updated 2 weeks ago