ktransformers by kvcache-ai

Framework for LLM inference optimization experimentation

created 1 year ago
14,750 stars

Top 3.5% on sourcepulse

View on GitHub
Project Summary

KTransformers is a Python framework designed to accelerate Large Language Model (LLM) inference on resource-constrained local machines by injecting cutting-edge kernel optimizations and parallelism strategies. It targets researchers and power users seeking to experiment with and deploy LLMs efficiently, offering compatibility with Hugging Face Transformers, OpenAI/Ollama APIs, and a ChatGPT-like UI.

How It Works

KTransformers employs a template-based injection system to replace standard PyTorch modules with optimized variants. This approach allows advanced kernels (e.g., Marlin, Llamafile) and quantization techniques (e.g., 4-bit, FP8) to be integrated with minimal code changes. The framework emphasizes heterogeneous computing: GPU/CPU offloading and the ability to stack multiple optimizations, which is particularly beneficial for local deployments.
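For intuition, here is a minimal sketch of what an injection rule looks like, written as a Python string for readability. The match/replace structure mirrors the project's YAML templates, but the regex, class path, and kwargs below are illustrative assumptions rather than a shipped rule; see the official injection tutorial for real templates.

```python
# Sketch of a KTransformers injection rule (assumed shape, not a shipped template).
# Each entry matches torch.nn modules by a regex over their qualified names and
# names the optimized class to inject, plus constructor kwargs.
EXAMPLE_RULE = r"""
- match:
    name: '^model\.layers\..*\.mlp\.experts$'   # hypothetical pattern: MoE expert blocks
  replace:
    class: ktransformers.operators.experts.KTransformersExperts  # assumed class path
    kwargs:
      prefill_device: "cuda"    # keep the prefill pass on GPU
      generate_device: "cpu"    # offload decode-time expert compute to CPU kernels
"""
# At load time the framework walks the module tree, applies the first rule whose
# pattern matches a module's qualified name, and swaps in the optimized variant.
```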

Quick Start & Requirements

  • Installation: Follow the official Installation Guide.
  • Prerequisites: PyTorch, Hugging Face Transformers. Specific optimizations may require CUDA or ROCm.
  • Usage: Specify module replacements in a YAML-based injection template, then load models with optimize_and_load_gguf (see the sketch after this list).
  • Resources: Detailed tutorials and examples are available for specific models and optimizations.
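As a rough end-to-end sketch of that flow (the import paths, rule file, and model paths below are assumptions based on the project's published examples and may differ by version):

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

# Assumed import locations -- check the installed ktransformers version.
from ktransformers.optimize.optimize import optimize_and_load_gguf
from ktransformers.util.utils import prefill_and_generate

model_path = "deepseek-ai/DeepSeek-V2-Lite-Chat"        # hypothetical HF model id
gguf_path = "./DeepSeek-V2-Lite-Chat-GGUF"              # directory holding GGUF weights
rule_path = "./optimize_rules/deepseek-v2-lite.yaml"    # hypothetical injection template

config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Build the model skeleton on the meta device (no weights allocated), then let
# KTransformers inject optimized modules and stream in the quantized GGUF weights.
with torch.device("meta"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
optimize_and_load_gguf(model, rule_path, gguf_path, config)

input_ids = tokenizer("Write a quicksort in Python.", return_tensors="pt").input_ids
output = prefill_and_generate(model, tokenizer, input_ids.cuda(), 256)  # 256 = max new tokens
```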

Highlighted Details

  • Achieves significant speedups (up to 27.79x prefill, 3.03x decode) compared to llama.cpp for models like DeepSeek-Coder-V2/V3.
  • Supports running large models (e.g., 671B DeepSeek-Coder-V3 Q4_K_M) on consumer hardware (e.g., 14GB VRAM, 382GB DRAM).
  • Offers OpenAI- and Ollama-compatible APIs for seamless integration with existing LLM frontends (a minimal client sketch follows this list).
  • Features experimental support for AMX-Int8/BF16, ROCm on AMD GPUs, and longer context windows.
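Because the server speaks the OpenAI protocol, any standard OpenAI client can be pointed at it. A minimal sketch, assuming a locally launched server; the base URL, port, and model name are placeholders that depend on your launch flags:

```python
from openai import OpenAI

# Point the stock OpenAI client at the local KTransformers server.
# Base URL and model name are placeholders -- match them to how you started the server.
client = OpenAI(base_url="http://localhost:10002/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="DeepSeek-V3",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize what KTransformers does."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```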

Maintenance & Community

Actively developed by contributors from Tsinghua University (MADSys group) and Approaching.AI. Discussions are encouraged via GitHub issues; a WeChat group is also available.

Licensing & Compatibility

The README does not explicitly state the license. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

Some advanced optimizations, such as the AMX kernels, are currently available only in preview binary distributions and are not yet fully open-sourced. The project is evolving rapidly; AMX optimizations and selective expert activation are planned for open-source release in V0.3.

Health Check

  • Last commit: 2 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 16
  • Issues (30d): 28

Star History

991 stars in the last 90 days

Explore Similar Projects

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jiayi Pan (Author of SWE-Gym; AI Researcher at UC Berkeley), and 5 more.

Liger-Kernel by linkedin

0.6% · 5k stars
Triton kernels for efficient LLM training
created 1 year ago
updated 1 day ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA

0.6% · 11k stars
LLM inference optimization SDK for NVIDIA GPUs
created 1 year ago
updated 18 hours ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org

0.4% · 84k stars
C/C++ library for local LLM inference
created 2 years ago
updated 14 hours ago