qwen600 by yassa9

Static, single-batch CUDA inference engine for QWEN3-0.6B

Created 1 week ago

456 stars

Top 66.4% on SourcePulse

View on GitHub: https://github.com/yassa9/qwen600
Project Summary

yassa9/qwen600 is a static, suckless, single-batch inference engine for the QWEN3-0.6B LLM, built entirely in CUDA C/C++. It targets developers and researchers who want a minimalist, high-performance, and educational tool for understanding LLM inference on NVIDIA GPUs, and its own benchmarks show measurable speedups over llama.cpp and Hugging Face Transformers with flash-attention.

How It Works

Adhering to the suckless philosophy, qwen600 prioritizes simplicity, minimal dependencies (cuBLAS, CUB), and compile-time optimizations. It is a CUDA-only C/C++ codebase with no Python runtime dependency at inference time. Key design elements include static compile-time constants, a weight-loading pipeline built on mmap and asynchronous host-to-device copies, and zero-cost pointer-based weight management on the GPU. Together these choices yield a compact, resource-efficient inference path.
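
The loading path described above can be pictured with a short CUDA C sketch. This illustrates the general mmap + single-allocation + async-copy pattern, not code taken from qwen600; the function and struct names (load_weights, WeightView) and the tensor layout are assumptions.

    // Hedged sketch: mmap a checkpoint, stage it into one device block with an
    // asynchronous copy, then treat individual tensors as pointer offsets.
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    struct WeightView {                 // non-owning views into one device block
        float *tok_embedding;           // [vocab, dim]
        float *layer_base;              // remaining per-layer tensors
    };

    static void check(cudaError_t e, const char *what) {
        if (e != cudaSuccess) {
            fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(e));
            exit(1);
        }
    }

    WeightView load_weights(const char *path, size_t vocab, size_t dim) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        size_t bytes = (size_t)st.st_size;

        // mmap: the file becomes addressable host memory without a read() copy.
        void *host = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);

        // Page-lock the mapping so the async copy can overlap with other work.
        check(cudaHostRegister(host, bytes, cudaHostRegisterReadOnly), "register");

        // One device allocation holds every tensor in the checkpoint.
        void *dev = nullptr;
        check(cudaMalloc(&dev, bytes), "malloc");

        cudaStream_t stream;
        check(cudaStreamCreate(&stream), "stream");
        check(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream), "copy");
        check(cudaStreamSynchronize(stream), "sync");

        cudaStreamDestroy(stream);
        cudaHostUnregister(host);
        munmap(host, bytes);
        close(fd);

        // "Zero-cost" weight management: each tensor is just an offset.
        float *base = (float *)dev;
        WeightView w;
        w.tok_embedding = base;                 // assumed first tensor in the file
        w.layer_base    = base + vocab * dim;   // everything after the embedding
        return w;
    }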

Quick Start & Requirements

  • Installation: Clone the QWEN3-0.6B model weights and verify their checksums, clone the qwen600 repository, convert the Hugging Face tokenizer with python export.py <model_dir>, then build with mkdir build && cd build && cmake .. && make -j$(nproc).
  • Requirements: CUDA toolkit (>= 13.0 is cited in the benchmarks), nvcc, cuBLAS, CUB, Python (for the tokenizer export), and an NVIDIA GPU with enough VRAM (the benchmarks use an RTX 3050 8 GB).
  • Resources: Requires model weights and build tools. Setup involves cloning, conversion, and compilation.
  • Docs: Project repository: https://github.com/yassa9/qwen600.

Highlighted Details

  • Performance Claims: Benchmarks on an RTX 3050 8GB show qwen600 achieving ~116 tk/s, outperforming llama.cpp (~107 tk/s) by ~8.5% and hf + flash-attn (~29 tk/s) by ~292%.
  • Suckless Design: Minimalist C/C++ codebase, compile-time configuration via config.h (see the sketch after this list), and reduced dependencies.
  • CUDA-Native: Fully CUDA C/C++ implementation for inference, avoiding Python overhead.
  • Memory Management: Employs mmap, single GPU block allocation, and asynchronous copy operations.
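
The compile-time configuration style can be sketched as a header of constants. This is a hypothetical config.h for illustration only; the names are invented and the values merely approximate the published QWEN3-0.6B shape, so they should be checked against the real file.

    // Illustrative config.h (not qwen600's actual header).
    #pragma once
    #include <cuda_fp16.h>

    constexpr int VOCAB_SIZE  = 151936;  // tokenizer vocabulary (approximate)
    constexpr int DIM         = 1024;    // hidden size
    constexpr int N_LAYERS    = 28;      // transformer blocks
    constexpr int N_HEADS     = 16;      // query heads
    constexpr int N_KV_HEADS  = 8;       // key/value heads (grouped-query attention)
    constexpr int HEAD_DIM    = 128;     // per-head dimension
    constexpr int FFN_DIM     = 3072;    // MLP intermediate size
    constexpr int MAX_SEQ_LEN = 4096;    // KV-cache capacity fixed at build time

    // With every dimension known at compile time, buffer sizes and kernel
    // launch shapes are baked into the binary; no model-config parsing
    // happens on the inference path.
    constexpr size_t KV_CACHE_BYTES =
        2ull * N_LAYERS * MAX_SEQ_LEN * N_KV_HEADS * HEAD_DIM * sizeof(__half);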

Maintenance & Community

  • Status: Described as an educational project.
  • TODOs: Includes fixing the Softmax kernel & dispatcher and exploring pre-computed RoPE values (the pre-computation idea is sketched after this section).
  • Community: No specific community links (Discord, Slack) or contributor information are provided in the README.
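
The RoPE item on that TODO list refers to a standard optimization: computing the rotary cos/sin tables once at startup instead of re-deriving them for every token. The sketch below shows the generic technique, not qwen600's kernel; HEAD_DIM, MAX_SEQ_LEN, and the theta base are assumed values.

    // Hedged sketch of RoPE table pre-computation.
    #include <cuda_runtime.h>
    #include <math.h>

    #define HEAD_DIM    128
    #define MAX_SEQ_LEN 4096
    #define ROPE_THETA  1000000.0f   // assumed base; check the model config

    // Fill cos/sin tables of shape [MAX_SEQ_LEN, HEAD_DIM / 2] once at startup.
    __global__ void precompute_rope(float *cos_tab, float *sin_tab) {
        int pos = blockIdx.x;        // sequence position
        int i   = threadIdx.x;       // frequency index, 0 .. HEAD_DIM/2 - 1
        if (pos >= MAX_SEQ_LEN || i >= HEAD_DIM / 2) return;

        float freq  = powf(ROPE_THETA, -2.0f * i / (float)HEAD_DIM);
        float angle = pos * freq;
        cos_tab[pos * (HEAD_DIM / 2) + i] = cosf(angle);
        sin_tab[pos * (HEAD_DIM / 2) + i] = sinf(angle);
    }

    void init_rope_tables(float *d_cos, float *d_sin) {
        precompute_rope<<<MAX_SEQ_LEN, HEAD_DIM / 2>>>(d_cos, d_sin);
        cudaDeviceSynchronize();
    }

    // At decode time, applying RoPE then reduces to table lookups plus the 2-D
    // rotation x0' = x0*cos - x1*sin, x1' = x0*sin + x1*cos, instead of
    // re-evaluating powf/cosf/sinf for every token, head, and layer.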

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Scope: Primarily an educational tool; it is not aimed at production deployment and targets NVIDIA GPUs only.
  • Features: Supports single-batch inference only. Ongoing work includes fixing the softmax kernel and dispatcher and exploring further optimizations such as pre-computed RoPE values.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 1

Star History

  • 461 stars in the last 13 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

  • vllm by vllm-project: LLM serving engine for high-throughput, memory-efficient inference. 58k stars (top 1.1% on SourcePulse). Created 2 years ago; updated 19 hours ago.