qwen600 by yassa9

Static, single-batch CUDA inference engine for QWEN3-0.6B

Created 1 week ago

456 stars

Top 66.4% on SourcePulse

View on GitHub: https://github.com/yassa9/qwen600
Project Summary

yassa9/qwen600 is a static, suckless, single-batch inference engine for the QWEN3-0.6B LLM, built entirely in CUDA C/C++. It targets developers and researchers who want a minimalist, high-performance, and educational tool for understanding LLM inference on NVIDIA GPUs, and its own benchmarks show measurable speedups over llama.cpp and Hugging Face Transformers with flash-attention.

How It Works

Adhering to the suckless philosophy, qwen600 prioritizes simplicity, minimal dependencies (cuBLAS, CUB), and compile-time optimizations. It is a CUDA-only C/C++ codebase with no Python runtime dependency at inference time. Key design elements include static compile-time constants, a weight-loading pipeline built on mmap and asynchronous host-to-device copies, and zero-cost pointer-based weight management on the GPU. Together these choices yield a compact, resource-efficient inference path.
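
The loading path described above can be pictured with a short CUDA C sketch. This illustrates the general mmap + single-allocation + async-copy pattern, not code taken from qwen600; the function and struct names (load_weights, WeightView) and the tensor layout are assumptions.

    // Hedged sketch: mmap a checkpoint, stage it into one device block with an
    // asynchronous copy, then treat individual tensors as pointer offsets.
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>

    struct WeightView {                 // non-owning views into one device block
        float *tok_embedding;           // [vocab, dim]
        float *layer_base;              // remaining per-layer tensors
    };

    static void check(cudaError_t e, const char *what) {
        if (e != cudaSuccess) {
            fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(e));
            exit(1);
        }
    }

    WeightView load_weights(const char *path, size_t vocab, size_t dim) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        size_t bytes = (size_t)st.st_size;

        // mmap: the file becomes addressable host memory without a read() copy.
        void *host = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);

        // Page-lock the mapping so the async copy can overlap with other work.
        check(cudaHostRegister(host, bytes, cudaHostRegisterReadOnly), "register");

        // One device allocation holds every tensor in the checkpoint.
        void *dev = nullptr;
        check(cudaMalloc(&dev, bytes), "malloc");

        cudaStream_t stream;
        check(cudaStreamCreate(&stream), "stream");
        check(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream), "copy");
        check(cudaStreamSynchronize(stream), "sync");

        cudaStreamDestroy(stream);
        cudaHostUnregister(host);
        munmap(host, bytes);
        close(fd);

        // "Zero-cost" weight management: each tensor is just an offset.
        float *base = (float *)dev;
        WeightView w;
        w.tok_embedding = base;                 // assumed first tensor in the file
        w.layer_base    = base + vocab * dim;   // everything after the embedding
        return w;
    }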

Quick Start & Requirements

  • Installation: Clone the QWEN3-0.6B model weights and verify their checksums, clone the qwen600 repository, convert the Hugging Face tokenizer with python export.py <model_dir>, then build with mkdir build && cd build && cmake .. && make -j$(nproc).
  • Requirements: CUDA toolkit (>= 13.0 is cited in the benchmarks), nvcc, cuBLAS, CUB, Python (for the tokenizer export), and an NVIDIA GPU with enough VRAM (the benchmarks use an RTX 3050 8 GB).
  • Resources: Requires model weights and build tools. Setup involves cloning, conversion, and compilation.
  • Docs: Project repository: https://github.com/yassa9/qwen600.

Highlighted Details

  • Performance Claims: Benchmarks on an RTX 3050 8GB show qwen600 achieving ~116 tk/s, outperforming llama.cpp (~107 tk/s) by ~8.5% and hf + flash-attn (~29 tk/s) by ~292%.
  • Suckless Design: Minimalist C/C++ codebase, compile-time configuration via config.h (see the sketch after this list), and reduced dependencies.
  • CUDA-Native: Fully CUDA C/C++ implementation for inference, avoiding Python overhead.
  • Memory Management: Employs mmap, single GPU block allocation, and asynchronous copy operations.
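
The compile-time configuration style can be sketched as a header of constants. This is a hypothetical config.h for illustration only; the names are invented and the values merely approximate the published QWEN3-0.6B shape, so they should be checked against the real file.

    // Illustrative config.h (not qwen600's actual header).
    #pragma once
    #include <cuda_fp16.h>

    constexpr int VOCAB_SIZE  = 151936;  // tokenizer vocabulary (approximate)
    constexpr int DIM         = 1024;    // hidden size
    constexpr int N_LAYERS    = 28;      // transformer blocks
    constexpr int N_HEADS     = 16;      // query heads
    constexpr int N_KV_HEADS  = 8;       // key/value heads (grouped-query attention)
    constexpr int HEAD_DIM    = 128;     // per-head dimension
    constexpr int FFN_DIM     = 3072;    // MLP intermediate size
    constexpr int MAX_SEQ_LEN = 4096;    // KV-cache capacity fixed at build time

    // With every dimension known at compile time, buffer sizes and kernel
    // launch shapes are baked into the binary; no model-config parsing
    // happens on the inference path.
    constexpr size_t KV_CACHE_BYTES =
        2ull * N_LAYERS * MAX_SEQ_LEN * N_KV_HEADS * HEAD_DIM * sizeof(__half);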

Maintenance & Community

  • Status: Described as an educational project.
  • TODOs: Includes fixing the Softmax kernel & dispatcher and exploring pre-computed RoPE values (the pre-computation idea is sketched after this section).
  • Community: No specific community links (Discord, Slack) or contributor information are provided in the README.
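
The RoPE item on that TODO list refers to a standard optimization: computing the rotary cos/sin tables once at startup instead of re-deriving them for every token. The sketch below shows the generic technique, not qwen600's kernel; HEAD_DIM, MAX_SEQ_LEN, and the theta base are assumed values.

    // Hedged sketch of RoPE table pre-computation.
    #include <cuda_runtime.h>
    #include <math.h>

    #define HEAD_DIM    128
    #define MAX_SEQ_LEN 4096
    #define ROPE_THETA  1000000.0f   // assumed base; check the model config

    // Fill cos/sin tables of shape [MAX_SEQ_LEN, HEAD_DIM / 2] once at startup.
    __global__ void precompute_rope(float *cos_tab, float *sin_tab) {
        int pos = blockIdx.x;        // sequence position
        int i   = threadIdx.x;       // frequency index, 0 .. HEAD_DIM/2 - 1
        if (pos >= MAX_SEQ_LEN || i >= HEAD_DIM / 2) return;

        float freq  = powf(ROPE_THETA, -2.0f * i / (float)HEAD_DIM);
        float angle = pos * freq;
        cos_tab[pos * (HEAD_DIM / 2) + i] = cosf(angle);
        sin_tab[pos * (HEAD_DIM / 2) + i] = sinf(angle);
    }

    void init_rope_tables(float *d_cos, float *d_sin) {
        precompute_rope<<<MAX_SEQ_LEN, HEAD_DIM / 2>>>(d_cos, d_sin);
        cudaDeviceSynchronize();
    }

    // At decode time, applying RoPE then reduces to table lookups plus the 2-D
    // rotation x0' = x0*cos - x1*sin, x1' = x0*sin + x1*cos, instead of
    // re-evaluating powf/cosf/sinf for every token, head, and layer.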

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive license suitable for commercial use and integration into closed-source projects.

Limitations & Caveats

  • Scope: Primarily an educational tool; it is not aimed at production deployment and targets NVIDIA GPUs only.
  • Features: Supports single-batch inference only. Ongoing work includes fixing the softmax kernel and dispatcher and exploring further optimizations such as pre-computed RoPE values.

Health Check

  • Last Commit: 1 week ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 2
  • Issues (30d): 1

Star History

  • 461 stars in the last 13 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (Cofounder of Hugging Face), and 58 more.

  • vllm by vllm-project: LLM serving engine for high-throughput, memory-efficient inference. 58k stars (top 1.1% on SourcePulse). Created 2 years ago; updated 19 hours ago.