ktransformers by kvcache-ai

Framework for LLM inference optimization experimentation

Created 1 year ago
15,067 stars

Top 3.3% on SourcePulse

View on GitHub
Project Summary

KTransformers is a Python framework designed to accelerate Large Language Model (LLM) inference on resource-constrained local machines by injecting cutting-edge kernel optimizations and parallelism strategies. It targets researchers and power users seeking to experiment with and deploy LLMs efficiently, offering compatibility with Hugging Face Transformers, OpenAI/Ollama APIs, and a ChatGPT-like UI.

How It Works

KTransformers employs a template-based injection system to replace standard PyTorch modules with optimized variants. This approach allows for flexible integration of advanced kernels (e.g., Marlin, Llamafile) and quantization techniques (e.g., 4-bit, FP8) with minimal code changes. The framework prioritizes heterogeneous computing, enabling GPU/CPU offloading and combining multiple optimizations for synergistic performance gains, particularly beneficial for local deployments.
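The core idea — matching modules by name and swapping in optimized replacements — can be illustrated with a minimal, self-contained sketch. This is not KTransformers' actual engine (which matches rules from a YAML template against a real PyTorch module tree); the `Module`/`Linear` classes and the rule table here are hypothetical stand-ins for the concept:

```python
# Hedged sketch of template-based module injection. The classes and rule
# table are illustrative stand-ins, not KTransformers' real implementation.
import re

class Module:
    """Stand-in for a framework module with named children."""
    def __init__(self, **children):
        self.children = children

    def named_modules(self, prefix=""):
        for name, child in self.children.items():
            path = f"{prefix}.{name}" if prefix else name
            yield path, child
            yield from child.named_modules(path)

class Linear(Module):
    def __init__(self):
        super().__init__()
        self.kind = "standard"

class MarlinLinear(Linear):
    """Pretend 4-bit optimized kernel replacement (hypothetical)."""
    def __init__(self):
        super().__init__()
        self.kind = "marlin-int4"

# Injection rules: regex over the module path -> replacement class,
# mirroring the spirit of a YAML rule ("match: name ..., replace: class ...").
RULES = [(re.compile(r".*\.mlp\.(gate|up|down)_proj$"), MarlinLinear)]

def inject(model):
    # Snapshot the tree first, then rewrite matching children in place.
    for path, module in list(model.named_modules()):
        for pattern, repl in RULES:
            if pattern.match(path) and isinstance(module, Linear):
                parent_path, _, child_name = path.rpartition(".")
                parent = model
                for part in parent_path.split("."):
                    if part:
                        parent = parent.children[part]
                parent.children[child_name] = repl()

model = Module(
    layer0=Module(
        mlp=Module(gate_proj=Linear(), up_proj=Linear()),
        attn=Module(q_proj=Linear()),
    ),
)
inject(model)
kinds = {p: m.kind for p, m in model.named_modules() if isinstance(m, Linear)}
```

After injection, only the MLP projections match the rule and get the "optimized" variant; the attention projection is left untouched — the same selectivity that lets KTransformers target, say, expert layers for CPU offload while keeping the rest on GPU.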

Quick Start & Requirements

  • Installation: Follow the official Installation Guide.
  • Prerequisites: PyTorch, Hugging Face Transformers. Specific optimizations may require CUDA or ROCm.
  • Usage: Specify module replacements in a YAML-based injection template, then load models with optimize_and_load_gguf.
  • Resources: Detailed tutorials and examples are available for specific models and optimizations.
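An injection template is a list of match/replace rules. The fragment below sketches the general shape; the exact field names and operator class paths should be checked against the project's own example templates, as they may differ:

```yaml
# Illustrative injection-rule shape (verify against the official templates).
- match:
    name: "^model\\.layers\\..*\\.mlp\\.experts$"   # regex over module paths
  replace:
    class: ktransformers.operators.experts.KTransformersExperts
    kwargs:
      generate_device: "cpu"   # offload expert computation to CPU
```

Rules like this let the same model checkpoint be re-targeted (GPU vs. CPU, different kernels, different quantizations) without touching model code.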

Highlighted Details

  • Achieves significant speedups (up to 27.79x prefill, 3.03x decode) compared to llama.cpp for models like DeepSeek-Coder-V2/V3.
  • Supports running large models (e.g., 671B DeepSeek-Coder-V3 Q4_K_M) on consumer hardware (e.g., 14GB VRAM, 382GB DRAM).
  • Offers OpenAI and Ollama compatible APIs for seamless integration with existing LLM frontends.
  • Features experimental support for AMX-Int8/BF16, ROCm on AMD GPUs, and longer context windows.
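Because the server speaks the standard OpenAI chat-completions protocol, any stock client code works against it. A minimal stdlib sketch, where the host, port, and model name are assumptions to be replaced with your own server's settings:

```python
# Hedged sketch: querying a KTransformers server via its OpenAI-compatible
# endpoint. URL and model name are hypothetical; use your server's values.
import json
import urllib.request

API_URL = "http://localhost:10002/v1/chat/completions"  # assumed address

def build_request(prompt, model="DeepSeek-Coder-V2"):
    """Build a standard OpenAI chat-completions HTTP request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

# To actually send the request (requires a running server):
# with urllib.request.urlopen(build_request("Write quicksort.")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same shape works with the official `openai` client by pointing its `base_url` at the local server.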

Maintenance & Community

Actively developed by contributors from Tsinghua University (MADSys group) and Approaching.AI. Discussions are encouraged via GitHub issues; a WeChat group is also available.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Some advanced optimizations, such as the AMX kernels, are currently distributed only as preview binaries and have not yet been fully open-sourced. The project is evolving rapidly; AMX optimizations and selective expert activation are slated for open-source release in V0.3.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 8
  • Issues (30d): 20

Star History

218 stars in the last 30 days

Explore Similar Projects

Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 3 more.

LitServe by Lightning-AI

Top 0.3% on SourcePulse · 4k stars
AI inference pipeline framework
Created 1 year ago · Updated 1 day ago