r1-ktransformers-guide by ubergarm

Local inference for large language models using ktransformers

Created 8 months ago
252 stars

Top 99.5% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

This repository provides a guide to running DeepSeek-R1 671B models with ktransformers, a framework designed for faster local inference on systems with CUDA-enabled GPUs. It targets users looking for significant performance gains over traditional solutions such as llama.cpp, particularly on hardware with substantial RAM and VRAM.

How It Works

Ktransformers leverages optimizations such as Flash Attention and Multi-head Latent Attention (MLA) to accelerate inference. It supports various quantization methods for GGUF models, allowing users to balance model quality with resource requirements. The framework also explores experimental features like selective layer offloading and CUDA Graphs for further performance tuning, though some advanced features may conflict or require specific configurations.
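
To make these knobs concrete, below is a minimal launch sketch in the style of the guide's server invocation. The flag names (--cpu_infer, --cache_lens, --use_cuda_graph) and the model/GGUF paths are assumptions drawn from common ktransformers invocations rather than the guide's exact command, and they vary between versions; verify against python3 ktransformers/server/main.py --help.

    # Sketch only: flag names and paths are assumptions, not the guide's exact command.
    #   --cpu_infer       CPU threads running the MoE experts kept in system RAM
    #   --cache_lens      tokens of KV cache reserved on the GPU (MLA keeps this compact)
    #   --use_cuda_graph  capture decode steps as CUDA Graphs for higher tokens/sec
    python3 ktransformers/server/main.py \
        --model_path deepseek-ai/DeepSeek-R1 \
        --gguf_path ./DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL \
        --cpu_infer 16 \
        --cache_lens 32768 \
        --use_cuda_graph \
        --host 127.0.0.1 --port 8080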

Quick Start & Requirements

  • Installation: Clone the repository, initialize its submodules, check out the commit pinned in the guide, and install dependencies with uv; pre-built wheels are available for easier installation (see the install sketch after this list).
  • Prerequisites: NVIDIA driver version 570.86.1x, CUDA 12.8, and Python 3.11; uv is recommended for package management.
  • Model Download: Use huggingface-cli download to fetch GGUF models (example after this list).
  • Running: Start the local chat API endpoint with python3 ktransformers/server/main.py, passing command-line arguments for the model path, quantization, and optimization settings (launch sketch under How It Works; endpoint smoke test after this list).
  • uv installation docs: https://docs.astral.sh/uv/getting-started/installation/
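
A condensed install sketch following the Installation bullet above. The commit hash is a placeholder for the one pinned in the guide, and the requirements file and build step are assumptions to verify against the guide (which also links pre-built wheels).

    # Install flow sketch; commit, requirements file, and build step are assumptions.
    git clone https://github.com/kvcache-ai/ktransformers.git
    cd ktransformers
    git checkout <commit-pinned-in-guide>
    git submodule update --init --recursive
    uv venv --python 3.11                              # guide recommends uv + Python 3.11
    source .venv/bin/activate
    uv pip install -r requirements-local_chat.txt      # or use a pre-built wheel instead
    uv pip install .                                   # builds CUDA extensions; needs nvcc from CUDA 12.8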
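Likewise, a hedged example of fetching one GGUF quantization with huggingface-cli. The unsloth/DeepSeek-R1-GGUF repo and the UD-Q2_K_XL quant shown here are one commonly used combination, not necessarily the exact one the guide specifies.

    # Download a single quantization variant; pick the quant that fits your RAM/VRAM budget.
    huggingface-cli download unsloth/DeepSeek-R1-GGUF \
        --include "DeepSeek-R1-UD-Q2_K_XL/*" \
        --local-dir ./DeepSeek-R1-GGUF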
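Once the server from the launch sketch under How It Works is up, a quick smoke test might look like the following. The /v1/chat/completions route assumes the server exposes an OpenAI-style API, which may differ across ktransformers versions.

    # Smoke-test the running endpoint (OpenAI-style route assumed; adjust path/port if needed).
    curl -s http://127.0.0.1:8080/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "DeepSeek-R1", "messages": [{"role": "user", "content": "Say hello in one sentence."}]}'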

Highlighted Details

  • Claims ~2x faster performance than llama.cpp on systems with a 16GB+ VRAM CUDA GPU and sufficient RAM.
  • Supports memory-mapped model files (mmap), spilling to a fast NVMe drive when system RAM alone cannot hold the model (see the sizing check after this list).
  • Benchmarks show ktransformers achieving significantly higher tokens/sec compared to llama.cpp, especially when CUDA Graphs are enabled.
  • Explores further optimizations such as quantizing the KV cache and potential context-shift capabilities.
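
To gauge whether mmap will spill to NVMe on your machine, compare the on-disk size of the GGUF weights with free memory. This is a generic sizing check under the paths used in the examples above, not something the guide prescribes.

    # Rough sizing check: if the GGUF weights exceed free RAM, expect mmap to page
    # from NVMe and throughput to depend heavily on drive read speed.
    du -sh ./DeepSeek-R1-GGUF/DeepSeek-R1-UD-Q2_K_XL   # model size on disk
    free -h                                            # available system RAM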

Maintenance & Community

  • The project is actively developed, with specific commit hashes referenced for stability.
  • Discussions and further benchmarks can be found on the Level1Techs forum.
  • Links to related discussions on NUMA nodes and the ktransformers GitHub repository are provided.

Licensing & Compatibility

  • The repository itself does not explicitly state a license in the README.
  • Dependencies like llama.cpp and flash-attention have their own licenses.
  • Compatibility for commercial use is not specified.

Limitations & Caveats

The ktransformers framework is described as "rough around the edges" and not production-ready. Some experimental features, such as FlashInfer, are not recommended for general users. Building from source may fail with newer NVIDIA CUDA toolkits (nvcc). The interaction between CUDA Graphs and layer offloading requires careful configuration to avoid performance degradation.
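
Before building from source, it can help to confirm the toolchain matches the versions listed under Quick Start, since newer nvcc releases are reported to break the build. The checks below are generic commands, not steps from the guide itself.

    # Confirm driver/toolkit versions (the guide targets driver 570.86.1x, CUDA 12.8, Python 3.11).
    nvidia-smi --query-gpu=driver_version,memory.total --format=csv
    nvcc --version
    python3 --version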

Health Check

  • Last Commit: 7 months ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days

Explore Similar Projects

Starred by Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI).

dots.llm1 by rednote-hilab
0.2% · 466 stars
MoE model for research
Created 5 months ago · Updated 1 month ago

Starred by Shizhe Diao (Author of LMFlow; Research Scientist at NVIDIA), Yineng Zhang (Inference Lead at SGLang; Research Scientist at Together AI), and 8 more.

EAGLE by SafeAILab
1.3% · 2k stars
Speculative decoding research paper for faster LLM inference
Created 1 year ago · Updated 2 days ago

Starred by Jason Knight (Director AI Compilers at NVIDIA; Cofounder of OctoML), Omar Sanseviero (DevRel at Google DeepMind), and 12 more.

mistral.rs by EricLBuehler
0.1% · 6k stars
LLM inference engine for blazing fast performance
Created 1 year ago · Updated 4 days ago