naklecha/SimpleLLM
Minimal, extensible LLM inference engine
SimpleLLM is a minimal (~950 lines), extensible LLM inference engine built from scratch. It targets researchers, students, and developers who need a readable foundation for experimenting with state-of-the-art inference techniques, offering a performant starting point for modifying core components and demonstrating that such a system can be built from the ground up.
How It Works
The engine employs an asynchronous, continuous-batching architecture designed to keep the GPU saturated and maximize throughput. Key optimizations include CUDA graphs to eliminate kernel launch overhead during decode steps, a slot-based KV cache for zero-copy sequence management (sketched below), and fused Triton kernels for operations like QKV projections, RMSNorm, and RoPE that reduce memory bandwidth requirements. It integrates Flash Attention 2 for memory-efficient attention computation and Grouped Query Attention (GQA) for faster decoding, all within a highly readable codebase (~760 lines of core inference logic) that is easy to modify and extend.
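The slot-based KV cache is the piece most easily shown in code. The following is a rough Python/PyTorch sketch of the idea, not SimpleLLM's implementation; the class and method names (SlotKVCache, allocate, release, append, advance) and the cache layout are illustrative assumptions.

```python
import torch

class SlotKVCache:
    """Illustrative slot-based KV cache: each sequence owns a fixed slot
    in one preallocated tensor, so sequences can join or leave the
    running batch without copying any cached keys/values."""

    def __init__(self, num_slots, num_layers, num_kv_heads, max_len, head_dim,
                 device="cuda", dtype=torch.bfloat16):
        # Layout: [layer, K/V, slot, kv_head, position, head_dim].
        self.cache = torch.empty(
            num_layers, 2, num_slots, num_kv_heads, max_len, head_dim,
            device=device, dtype=dtype)
        self.free_slots = list(range(num_slots))
        self.seq_lens = [0] * num_slots

    def allocate(self) -> int:
        # O(1): a new request simply claims a free slot index.
        return self.free_slots.pop()

    def release(self, slot: int) -> None:
        # Finished sequences hand back their slot; no tensors are moved.
        self.seq_lens[slot] = 0
        self.free_slots.append(slot)

    def append(self, layer: int, slot: int,
               k: torch.Tensor, v: torch.Tensor) -> None:
        # In-place write of one token's keys/values ([kv_heads, head_dim]).
        pos = self.seq_lens[slot]
        self.cache[layer, 0, slot, :, pos] = k
        self.cache[layer, 1, slot, :, pos] = v

    def advance(self, slot: int) -> None:
        # Called once per decode step, after every layer has appended.
        self.seq_lens[slot] += 1
```

Because sequences are identified by slot index rather than by position in a packed tensor, joining or leaving the batch is pure bookkeeping, which is what makes continuous batching cheap.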
Quick Start & Requirements
Run ./setup.sh and activate the environment (source ./venv/bin/activate).
Instantiate LLM with the model path and call generate (see the sketch below).
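Based on the quick-start description, usage looks roughly like the following; the module name and the exact generate() signature are assumptions, since the README only names the LLM class, the model path, and the generate call.

```python
# Rough usage sketch. The import path `simplellm` and the generate()
# signature are assumed; only LLM, the model path, and generate come
# from the quick-start notes.
from simplellm import LLM

llm = LLM("openai/gpt-oss-120b")  # the single model currently supported
outputs = llm.generate(["Explain continuous batching in one sentence."])
print(outputs[0])
```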
Highlighted Details
Maintenance & Community
No specific details regarding community channels, active contributors, or roadmap were provided in the README.
Licensing & Compatibility
Limitations & Caveats
The engine is currently restricted to a single NVIDIA H100 GPU and the OpenAI/gpt-oss-120b model. Features like paged attention and multi-GPU tensor parallelism are planned for future implementation.