ktransformers by kvcache-ai

Framework for LLM inference optimization experimentation

created 1 year ago
14,750 stars

Top 3.5% on sourcepulse

View on GitHub
Project Summary

KTransformers is a Python framework designed to accelerate Large Language Model (LLM) inference on resource-constrained local machines by injecting cutting-edge kernel optimizations and parallelism strategies. It targets researchers and power users seeking to experiment with and deploy LLMs efficiently, offering compatibility with Hugging Face Transformers, OpenAI/Ollama APIs, and a ChatGPT-like UI.

How It Works

KTransformers employs a template-based injection system to replace standard PyTorch modules with optimized variants. This approach allows advanced kernels (e.g., Marlin, Llamafile) and quantization techniques (e.g., 4-bit, FP8) to be integrated with minimal code changes. The framework emphasizes heterogeneous computing: GPU/CPU offloading and the ability to stack multiple optimizations, which is particularly beneficial for local deployments.
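For intuition, here is a minimal sketch of what an injection rule looks like, written as a Python string for readability. The match/replace structure mirrors the project's YAML templates, but the regex, class path, and kwargs below are illustrative assumptions rather than a shipped rule; see the official injection tutorial for real templates.

```python
# Sketch of a KTransformers injection rule (assumed shape, not a shipped template).
# Each entry matches torch.nn modules by a regex over their qualified names and
# names the optimized class to inject, plus constructor kwargs.
EXAMPLE_RULE = r"""
- match:
    name: '^model\.layers\..*\.mlp\.experts$'   # hypothetical pattern: MoE expert blocks
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # assumed class path
    kwargs:
      prefill_device: "cuda"    # keep the prefill pass on GPU
      generate_device: "cpu"    # offload decode-time expert compute to CPU kernels
"""
# At load time the framework walks the module tree, applies the first rule whose
# pattern matches a module's qualified name, and swaps in the optimized variant.
```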

Quick Start & Requirements

  • Installation: Follow the official Installation Guide.
  • Prerequisites: PyTorch, Hugging Face Transformers. Specific optimizations may require CUDA or ROCm.
  • Usage: Specify module replacements in a YAML-based injection template, then load models with optimize_and_load_gguf (see the sketch after this list).
  • Resources: Detailed tutorials and examples are available for specific models and optimizations.
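As a rough end-to-end sketch of that flow (the import paths, rule file, and model paths below are assumptions based on the project's published examples and may differ by version):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Assumed import locations -- check the installed ktransformers version.
from ktransformers.optimize.optimize import optimize_and_load_gguf
from ktransformers.util.utils import prefill_and_generate

model_path = "deepseek-ai/DeepSeek-V2-Lite-Chat"        # hypothetical HF model id
gguf_path = "./DeepSeek-V2-Lite-Chat-GGUF"              # directory holding GGUF weights
rule_path = "./optimize_rules/deepseek-v2-lite.yaml"    # hypothetical injection template

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Build the model skeleton on the meta device (no weights allocated), then let
# KTransformers inject optimized modules and stream in the quantized GGUF weights.
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, rule_path, gguf_path, config)

input_ids = tokenizer("Write a quicksort in Python.", return_tensors="pt").input_ids
output = prefill_and_generate(model, tokenizer, input_ids.cuda(), 256)  # 256 = max new tokens
```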

Highlighted Details

  • Achieves significant speedups (up to 27.79x prefill, 3.03x decode) compared to llama.cpp for models like DeepSeek-Coder-V2/V3.
  • Supports running large models (e.g., 671B DeepSeek-Coder-V3 Q4_K_M) on consumer hardware (e.g., 14GB VRAM, 382GB DRAM).
  • Offers OpenAI- and Ollama-compatible APIs for seamless integration with existing LLM frontends (a minimal client sketch follows this list).
  • Features experimental support for AMX-Int8/BF16, ROCm on AMD GPUs, and longer context windows.
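Because the server speaks the OpenAI protocol, any standard OpenAI client can be pointed at it. A minimal sketch, assuming a locally launched server; the base URL, port, and model name are placeholders that depend on your launch flags:

```python
from openai import OpenAI

# Point the stock OpenAI client at the local KTransformers server.
# Base URL and model name are placeholders -- match them to how you started the server.
client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="DeepSeek-V3",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize what KTransformers does."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```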

Maintenance & Community

Actively developed by contributors from Tsinghua University (MADSys group) and Approaching.AI. Discussions are encouraged via GitHub issues; a WeChat group is also available.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Some advanced optimizations, such as the AMX kernels, are currently available only in preview binary distributions and are not yet fully open-sourced. The project is evolving rapidly; AMX optimizations and selective expert activation are planned for open-source release in V0.3.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 16
  • Issues (30d): 28

Star History

991 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 5 more.

Liger-Kernel by linkedin

0.6% · 5k stars
Triton kernels for efficient LLM training
created 1 year ago
updated 1 day ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

0.6% · 11k stars
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago
updated 18 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago
updated 14 hours ago