ik_llama.cpp by ikawrakow

`llama.cpp` fork for improved CPU/GPU performance

Created 1 year ago · 918 stars · Top 40.5% on sourcepulse

Project Summary

This repository is a fork of llama.cpp focused on enhancing CPU and hybrid GPU/CPU inference performance for large language models. It targets users seeking optimized inference speeds, particularly with advanced quantization techniques and support for newer model architectures like Bitnet and DeepSeek. The primary benefit is significantly faster inference on consumer hardware through specialized optimizations.

How It Works

The project implements several novel techniques to boost performance. Key among these are "FlashMLA" (MLA with Flash Attention) for CPU and CUDA, fused operations for Mixture-of-Experts (MoE) models, and tensor overrides allowing explicit control over weight placement (CPU vs. GPU). It also introduces state-of-the-art quantization types (e.g., IQ1_M, IQ2_XS) and row-interleaved quant packing, reducing memory bandwidth and compute requirements.
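
For a concrete sense of how these features are exposed, here is a hedged sketch of a typical hybrid GPU/CPU invocation. The binary name, flag names, and the tensor-override regex are assumptions based on the features described above and on upstream llama.cpp conventions; they may differ between releases, so treat this as an illustration rather than a reference.

```bash
# Assumed invocation of a server binary built from this fork; flag names and
# values may differ between versions -- check --help for your build.
./build/bin/llama-server \
  -m ./models/DeepSeek-V3-IQ2_XS.gguf \
  -mla 3 -fa \
  -fmoe \
  -ngl 99 \
  -ot "\.ffn_.*_exps\.=CPU"
# -m         : placeholder path to a quantized GGUF model
# -mla / -fa : MLA combined with flash attention ("FlashMLA")
# -fmoe      : fused Mixture-of-Experts operations
# -ngl       : number of layers to offload to the GPU
# -ot        : tensor override; here MoE expert weights stay in system RAM
# -rtr       : (optional, CPU path) repack weights into row-interleaved format at load time
```

The tensor-override pattern illustrates the usual split for large MoE models: attention and shared weights go to the GPU, while the bulky expert tensors stay in system RAM.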

Quick Start & Requirements

  • Install: typically built from source with make or CMake (see the build sketch after this list).
  • Prerequisites: C++ compiler (GCC/Clang), CMake, Python (for some scripts), potentially CUDA/cuBLAS for GPU acceleration. Specific quantization types may have unique build requirements.
  • Resources: Performance gains are most pronounced on CPUs with AVX2/AVX512 support and GPUs with CUDA.
  • Links: the project Wiki for performance comparisons, plus dedicated guides for running DeepSeek models.
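
A minimal build sketch, assuming the fork follows upstream llama.cpp's CMake workflow (the GGML_CUDA option name is an assumption and may vary by version):

```bash
# Assumed build steps, mirroring upstream llama.cpp CMake conventions.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp

# CPU-only build
cmake -B build
cmake --build build --config Release -j

# Optional: enable CUDA for GPU or hybrid GPU/CPU inference
# (option name assumed from upstream conventions; verify against the README)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```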

Highlighted Details

  • First-class Bitnet support.
  • Optimized performance for DeepSeek models via MLA, FlashMLA, and fused MoE operations.
  • Introduction of new quantization types (IQ1_M, IQ2_XS, Q8_KV) and custom quantization mixes (see the quantization sketch after this list).
  • Android support via Termux.
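
As an illustration of how the quantization types listed above would be applied, here is a hedged sketch using the llama.cpp-style quantization tool. The binary name, file names, and the exact set of types supported by any given build are assumptions; consult the tool's --help output for what your build actually offers.

```bash
# Assumed usage of the llama.cpp-style quantization tool shipped with the fork.
# File names are placeholders.

# Convert an f16 GGUF model to a low-bit quantization type
./build/bin/llama-quantize ./models/model-f16.gguf ./models/model-iq2_xs.gguf IQ2_XS

# Same, with an explicit thread count as the final positional argument
./build/bin/llama-quantize ./models/model-f16.gguf ./models/model-iq1_m.gguf IQ1_M 8
```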

Maintenance & Community

The project is actively developed with frequent updates listed in the README. Contributions are welcomed via pull requests and issue submissions.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The README emphasizes that much of the detailed documentation lives in individual pull requests rather than in a single comprehensive document, so users may need to browse PRs to fully understand specific features.

Health Check

  • Last commit: 6 days ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 46
  • Issues (30d): 34
  • Star history: 566 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems), Jaret Burkett (founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

High-performance 4-bit diffusion model inference engine
2.1% · 3k stars · created 8 months ago · updated 14 hours ago
Starred by Aravind Srinivas (cofounder of Perplexity), Stas Bekman (author of the Machine Learning Engineering Open Book; research engineer at Snowflake), and 12 more.

DeepSpeed by deepspeedai

Deep learning optimization library for distributed training and inference
0.2% · 40k stars · created 5 years ago · updated 1 day ago