r1-ktransformers-guide by ubergarm

Local inference for large language models using ktransformers

created 6 months ago
250 stars

Top 100.0% on SourcePulse

View on GitHub
Project Summary

This repository provides a guide for running DeepSeek-R1 671B models using ktransformers, a framework designed for faster local inference on systems with CUDA-enabled GPUs. It targets users seeking significant performance gains over traditional solutions such as llama.cpp, particularly on hardware with substantial RAM and VRAM.

How It Works

The ktransformers framework leverages optimizations such as Flash Attention and Multi-head Latent Attention (MLA) to accelerate inference. It supports several GGUF quantization types, allowing users to balance model quality against resource requirements. It also exposes experimental features such as selective layer offloading and CUDA Graphs for further performance tuning, though some of these features conflict with one another or require specific configurations.
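To get a feel for the quality-versus-resources trade-off, here is a rough back-of-the-envelope footprint estimate (illustrative arithmetic only; ~4.5 bits per weight is an assumed figure for a Q4_K_M-class quant, not a value measured by this guide):

    # Approximate footprint of a 671B-parameter GGUF at an assumed ~4.5 bits/weight
    python3 -c "params = 671e9; bpw = 4.5; print(f'{params * bpw / 8 / 1e9:.0f} GB')"
    # prints: 377 GB -- which is why substantial RAM, or mmap to fast NVMe, matters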

Quick Start & Requirements

  • Installation: Clone the repository, initialize submodules, check out the pinned commit, and install dependencies using uv; pre-built wheels are available for easier installation (a full sketch follows this list).
  • Prerequisites: NVIDIA Driver Version 570.86.1x, CUDA Version 12.8, Python 3.11. uv is recommended for package management.
  • Model Download: Use huggingface-cli download to fetch GGUF models.
  • Running: Start the local chat API endpoint using python3 ktransformers/server/main.py with various command-line arguments for model path, quantization, and optimization.
  • uv installation guide: https://docs.astral.sh/uv/getting-started/installation/
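A minimal end-to-end sketch of the steps above, assuming the upstream kvcache-ai/ktransformers repository. The commit hash and GGUF repository id are placeholders, and the flag names are taken from upstream ktransformers usage, so they may differ between versions:

    # Clone, pin to the guide's commit (placeholder below), and install with uv
    git clone https://github.com/kvcache-ai/ktransformers.git
    cd ktransformers
    git submodule update --init --recursive
    git checkout <pinned-commit>
    uv venv --python 3.11
    uv pip install .               # or install the matching pre-built wheel instead

    # Fetch GGUF weights (repository id is a placeholder)
    huggingface-cli download <org>/<DeepSeek-R1-GGUF-repo> \
        --include "*.gguf" --local-dir ./models

    # Start the local chat API endpoint (flags assumed from upstream ktransformers docs)
    python3 ktransformers/server/main.py \
        --model_path deepseek-ai/DeepSeek-R1 \
        --gguf_path ./models \
        --port 8080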

Highlighted Details

  • Claims roughly 2x the performance of llama.cpp on systems with a CUDA GPU with 16 GB+ VRAM and sufficient system RAM.
  • Supports memory-mapped files (mmap) for offloading to fast NVMe drives when RAM is insufficient.
  • Benchmarks show ktransformers achieving significantly higher tokens/sec than llama.cpp, especially with CUDA Graphs enabled (a quick local spot-check sketch follows this list).
  • Explores further optimizations such as quantizing the KV cache and potential context-shift capabilities.
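For a quick throughput spot-check on your own hardware, assuming the server exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen at launch (the endpoint path and request schema here are assumptions, not confirmed by this guide):

    # Time one completion, then divide completion tokens by wall-clock seconds
    time curl -s http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "DeepSeek-R1",
             "messages": [{"role": "user", "content": "Briefly explain MLA."}],
             "max_tokens": 256}'
    # tokens/sec ~= usage.completion_tokens from the JSON response / elapsed time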

Maintenance & Community

  • The project is actively developed, with specific commit hashes referenced for stability.
  • Discussions and further benchmarks can be found on the Level1Techs forum.
  • Links to related discussions on NUMA nodes and the ktransformers GitHub repository are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README.
  • Dependencies like llama.cpp and flash-attention have their own licenses.
  • Suitability for commercial use is not specified.

Limitations & Caveats

The ktransformers framework is described as "rough around the edges" and not production-ready. Some experimental features, such as FlashInfer, are not recommended for general users. Source builds may fail with newer NVIDIA toolkits (nvcc). CUDA Graphs and layer offloading interact, and the combination requires careful configuration to avoid performance degradation.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

  • nunchaku by nunchaku-tech (3k stars, 2.6%): High-performance 4-bit diffusion model inference engine. Created 9 months ago, updated 1 day ago. Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jaret Burkett (Founder of Ostris), and 1 more.

  • ktransformers by kvcache-ai (15k stars, 0.3%): Framework for LLM inference optimization experimentation. Created 1 year ago, updated 2 weeks ago. Starred by Patrick von Platen (Research Engineer at Mistral; Author of Hugging Face Diffusers), Junyang Lin (Core Maintainer of Alibaba Qwen), and 2 more.