Static, single-batch CUDA inference engine for QWEN3-0.6B
Top 66.4% on SourcePulse
yassa9/qwen600 is a static, suckless, single-batch inference engine for the QWEN3-0.6B LLM, built entirely in CUDA C/C++. It targets developers and researchers seeking a minimalist, high-performance, and educational tool for understanding LLM inference on NVIDIA GPUs, offering significant speedups over popular alternatives.
How It Works
Adhering to the suckless philosophy, qwen600 prioritizes simplicity, minimal dependencies (cuBLAS, CUB), and compile-time optimizations. The CUDA-only C/C++ codebase eliminates any Python runtime dependency for inference. Key design elements include static compile-time constants, an efficient memory pipeline built on mmap and asynchronous copies, and zero-cost pointer-based weight management on the GPU. Together these choices yield a highly performant, resource-efficient inference engine.
Quick Start & Requirements

Clone the qwen600 repository, convert the Hugging Face tokenizer with python export.py <model_dir>, then build with mkdir build && cd build && cmake .. && make -j$(nproc). Building requires CMake, an NVIDIA GPU, and the CUDA toolkit (cuBLAS, CUB). Repository: https://github.com/yassa9/qwen600
Highlighted Details

- Performance: qwen600 achieves ~116 tk/s, outperforming llama.cpp (~107 tk/s) by ~8.5% and hf + flash-attn (~29 tk/s) by ~292%.
- Configuration via compile-time constants in config.h.
- Reduced dependencies.

Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
Last updated: 1 week ago. Activity status: Inactive.