r1-ktransformers-guide by ubergarm

Local inference for large language models using ktransformers

created 6 months ago
250 stars

Top 100.0% on SourcePulse

View on GitHub
Project Summary

This repository provides a guide for running DeepSeek-R1 671B models using ktransformers, a framework designed for faster local inference on systems with CUDA-enabled GPUs. It targets users seeking significant performance gains over traditional solutions such as llama.cpp, particularly on hardware with substantial RAM and VRAM.

How It Works

The ktransformers framework leverages optimizations such as Flash Attention and Multi-head Latent Attention (MLA) to accelerate inference. It supports several GGUF quantization types, allowing users to balance model quality against resource requirements. It also exposes experimental features such as selective layer offloading and CUDA Graphs for further performance tuning, though some of these features conflict with one another or require specific configurations.
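To get a feel for the quality-versus-resources trade-off, here is a rough back-of-the-envelope footprint estimate (illustrative arithmetic only; ~4.5 bits per weight is an assumed figure for a Q4_K_M-class quant, not a value measured by this guide):

    # Approximate footprint of a 671B-parameter GGUF at an assumed ~4.5 bits/weight
    python3 -c "params = 671e9; bpw = 4.5; print(f'{params * bpw / 8 / 1e9:.0f} GB')"
    # prints: 377 GB -- which is why substantial RAM, or mmap to fast NVMe, matters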

Quick Start & Requirements

  • Installation: Clone the repository, initialize submodules, check out the pinned commit, and install dependencies using uv; pre-built wheels are available for easier installation (a full sketch follows this list).
  • Prerequisites: NVIDIA Driver Version 570.86.1x, CUDA Version 12.8, Python 3.11. uv is recommended for package management.
  • Model Download: Use huggingface-cli download to fetch GGUF models.
  • Running: Start the local chat API endpoint using python3 ktransformers/server/main.py with various command-line arguments for model path, quantization, and optimization.
  • uv installation guide: https://docs.astral.sh/uv/getting-started/installation/
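A minimal end-to-end sketch of the steps above, assuming the upstream kvcache-ai/ktransformers repository. The commit hash and GGUF repository id are placeholders, and the flag names are taken from upstream ktransformers usage, so they may differ between versions:

    # Clone, pin to the guide's commit (placeholder below), and install with uv
    git clone https://github.com/kvcache-ai/ktransformers.git
    cd ktransformers
    git submodule update --init --recursive
    git checkout <pinned-commit>
    uv venv --python 3.11
    uv pip install .               # or install the matching pre-built wheel instead

    # Fetch GGUF weights (repository id is a placeholder)
    huggingface-cli download <org>/<DeepSeek-R1-GGUF-repo> \
        --include "*.gguf" --local-dir ./models

    # Start the local chat API endpoint (flags assumed from upstream ktransformers docs)
    python3 ktransformers/server/main.py \
        --model_path deepseek-ai/DeepSeek-R1 \
        --gguf_path ./models \
        --port 8080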

Highlighted Details

  • Claims roughly 2x the performance of llama.cpp on systems with a CUDA GPU with 16 GB+ VRAM and sufficient system RAM.
  • Supports memory-mapped files (mmap) for offloading to fast NVMe drives when RAM is insufficient.
  • Benchmarks show ktransformers achieving significantly higher tokens/sec than llama.cpp, especially with CUDA Graphs enabled (a quick local spot-check sketch follows this list).
  • Explores further optimizations such as quantizing the KV cache and potential context-shift capabilities.
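For a quick throughput spot-check on your own hardware, assuming the server exposes an OpenAI-compatible /v1/chat/completions endpoint on the port chosen at launch (the endpoint path and request schema here are assumptions, not confirmed by this guide):

    # Time one completion, then divide completion tokens by wall-clock seconds
    time curl -s http://localhost:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "DeepSeek-R1",
             "messages": [{"role": "user", "content": "Briefly explain MLA."}],
             "max_tokens": 256}'
    # tokens/sec ~= usage.completion_tokens from the JSON response / elapsed time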

Maintenance & Community

  • The project is actively developed, with specific commit hashes referenced for stability.
  • Discussions and further benchmarks can be found on the Level1Techs forum.
  • Links to related discussions on NUMA nodes and the ktransformers GitHub repository are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README.
  • Dependencies like llama.cpp and flash-attention have their own licenses.
  • Suitability for commercial use is not specified.

Limitations & Caveats

The ktransformers framework is described as "rough around the edges" and not production-ready. Some experimental features, such as FlashInfer, are not recommended for general users. Source builds may fail with newer NVIDIA toolkits (nvcc). CUDA Graphs and layer offloading interact, and the combination requires careful configuration to avoid performance degradation.

Health Check

  • Last commit: 5 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 1
  • Star History: 8 stars in the last 30 days

Explore Similar Projects

  • nunchaku by nunchaku-tech (3k stars, 2.6%): High-performance 4-bit diffusion model inference engine. Created 9 months ago, updated 1 day ago. Starred by Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), Jaret Burkett (Founder of Ostris), and 1 more.

  • ktransformers by kvcache-ai (15k stars, 0.3%): Framework for LLM inference optimization experimentation. Created 1 year ago, updated 2 weeks ago. Starred by Patrick von Platen (Research Engineer at Mistral; Author of Hugging Face Diffusers), Junyang Lin (Core Maintainer of Alibaba Qwen), and 2 more.