ik_llama.cpp by ikawrakow

`llama.cpp` fork for improved CPU/GPU performance

Created 1 year ago
1,179 stars

Top 32.9% on SourcePulse

Project Summary

This repository is a fork of `llama.cpp` focused on enhancing CPU and hybrid GPU/CPU inference performance for large language models. It targets users seeking optimized inference speeds, particularly through advanced quantization techniques and support for newer model architectures such as BitNet and DeepSeek. The primary benefit is significantly faster inference on consumer hardware through specialized optimizations.

How It Works

The project implements several novel techniques to boost performance. Key among these are "FlashMLA" (Multi-head Latent Attention combined with Flash Attention) for CPU and CUDA, fused operations for Mixture-of-Experts (MoE) models, and tensor overrides that give explicit control over weight placement (CPU vs. GPU). It also introduces state-of-the-art quantization types (e.g., IQ1_M, IQ2_XS) and row-interleaved quant packing, reducing memory bandwidth and compute requirements.
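
To make these concrete, here is a hedged sketch of how the features surface at launch time. The flag names below (`-mla`, `-fmoe`, `-rtr`, `-ot`) come from the fork's PRs and discussions and may differ between revisions, and the model filename is hypothetical; confirm against `llama-server --help` in your build.

```bash
# Hypothetical hybrid CPU/GPU launch for a DeepSeek-style MoE model.
#   -ngl 99               offload all repeating layers to the GPU
#   -fa                   enable Flash Attention
#   -mla 2                select an MLA mode (fork-specific flag)
#   -fmoe                 enable fused MoE operations (fork-specific flag)
#   -rtr                  repack weights into row-interleaved layout at load
#   -ot "ffn_.*_exps=CPU" tensor override: pin MoE expert weights to system RAM
./build/bin/llama-server -m deepseek-moe.gguf -c 16384 -ngl 99 \
  -fa -mla 2 -fmoe -rtr -ot "ffn_.*_exps=CPU"
```

This keeps the large expert tensors in system RAM while attention and shared layers run on the GPU, which is the usual tensor-override pattern for MoE models that do not fit in VRAM.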

Quick Start & Requirements

  • Install: Typically built from source using make or CMake (see the build sketch after this list).
  • Prerequisites: C++ compiler (GCC/Clang), CMake, Python (for some scripts), potentially CUDA/cuBLAS for GPU acceleration. Specific quantization types may have unique build requirements.
  • Resources: Performance gains are most pronounced on CPUs with AVX2/AVX512 support and GPUs with CUDA.
  • Links: The project Wiki hosts performance comparisons and specific guides for DeepSeek models.
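
A minimal build sketch, assuming CMake and a CUDA toolchain. The CUDA toggle has been renamed across llama.cpp generations (`LLAMA_CUBLAS` → `LLAMA_CUDA` → `GGML_CUDA`), so check the fork's CMakeLists.txt for the exact option name:

```bash
# Clone and build with CUDA enabled; drop -DGGML_CUDA=ON for a CPU-only build.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
```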

Highlighted Details

  • First-class BitNet support.
  • Optimized performance for DeepSeek models via MLA, FlashMLA, and fused MoE operations.
  • Introduction of new quantization types (IQ1_M, IQ2_XS, Q8_KV) and custom quantization mixes (a quantization sketch follows this list).
  • Android support via Termux.
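
Quantization follows the upstream workflow; a minimal sketch, assuming an FP16 GGUF as input and a `llama-quantize` binary (the tool name and the set of supported types vary by build vintage, and the fork's custom per-tensor mixes have their own options documented in its PRs):

```bash
# Hypothetical run: requantize an FP16 GGUF to the IQ2_XS low-bit type.
./build/bin/llama-quantize model-f16.gguf model-iq2_xs.gguf IQ2_XS
```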

Maintenance & Community

The project is actively developed, with frequent updates listed in the README. Contributions are welcomed via pull requests and issue submissions.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The permissive MIT license allows commercial use and integration into closed-source projects.

Limitations & Caveats

The README emphasizes that detailed information often lives in individual pull requests rather than a single comprehensive document, so users must browse PRs to fully understand specific features.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 44
  • Issues (30d): 39
  • Star History: 111 stars in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer

0% · 307 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago · Updated 2 years ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

0.4% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago · Updated 1 month ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.2% · 6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago · Updated 3 months ago