ik_llama.cpp by ikawrakow

`llama.cpp` fork for improved CPU/GPU performance

Created 1 year ago
1,179 stars

Top 32.9% on SourcePulse

Project Summary

This repository is a fork of `llama.cpp` focused on enhancing CPU and hybrid GPU/CPU inference performance for large language models. It targets users seeking optimized inference speeds, particularly through advanced quantization techniques and support for newer model architectures such as BitNet and DeepSeek. The primary benefit is significantly faster inference on consumer hardware through specialized optimizations.

How It Works

The project implements several novel techniques to boost performance. Key among these are "FlashMLA" (Multi-head Latent Attention combined with Flash Attention) for CPU and CUDA, fused operations for Mixture-of-Experts (MoE) models, and tensor overrides that give explicit control over weight placement (CPU vs. GPU). It also introduces state-of-the-art quantization types (e.g., IQ1_M, IQ2_XS) and row-interleaved quant packing, reducing memory bandwidth and compute requirements.
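
To make these concrete, here is a hedged sketch of how the features surface at launch time. The flag names below (`-mla`, `-fmoe`, `-rtr`, `-ot`) come from the fork's PRs and discussions and may differ between revisions, and the model filename is hypothetical; confirm against `llama-server --help` in your build.

```bash
# Hypothetical hybrid CPU/GPU launch for a DeepSeek-style MoE model.
#   -ngl 99               offload all repeating layers to the GPU
#   -fa                   enable Flash Attention
#   -mla 2                select an MLA mode (fork-specific flag)
#   -fmoe                 enable fused MoE operations (fork-specific flag)
#   -rtr                  repack weights into row-interleaved layout at load
#   -ot "ffn_.*_exps=CPU" tensor override: pin MoE expert weights to system RAM
./build/bin/llama-server -m deepseek-moe.gguf -c 16384 -ngl 99 \
  -fa -mla 2 -fmoe -rtr -ot "ffn_.*_exps=CPU"
```

This keeps the large expert tensors in system RAM while attention and shared layers run on the GPU, which is the usual tensor-override pattern for MoE models that do not fit in VRAM.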

Quick Start & Requirements

  • Install: Typically built from source using make or CMake (see the build sketch after this list).
  • Prerequisites: C++ compiler (GCC/Clang), CMake, Python (for some scripts), potentially CUDA/cuBLAS for GPU acceleration. Specific quantization types may have unique build requirements.
  • Resources: Performance gains are most pronounced on CPUs with AVX2/AVX512 support and GPUs with CUDA.
  • Links: The project Wiki hosts performance comparisons and specific guides for DeepSeek models.
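
A minimal build sketch, assuming CMake and a CUDA toolchain. The CUDA toggle has been renamed across llama.cpp generations (`LLAMA_CUBLAS` → `LLAMA_CUDA` → `GGML_CUDA`), so check the fork's CMakeLists.txt for the exact option name:

```bash
# Clone and build with CUDA enabled; drop -DGGML_CUDA=ON for a CPU-only build.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
```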

Highlighted Details

  • First-class BitNet support.
  • Optimized performance for DeepSeek models via MLA, FlashMLA, and fused MoE operations.
  • Introduction of new quantization types (IQ1_M, IQ2_XS, Q8_KV) and custom quantization mixes (a quantization sketch follows this list).
  • Android support via Termux.
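
Quantization follows the upstream workflow; a minimal sketch, assuming an FP16 GGUF as input and a `llama-quantize` binary (the tool name and the set of supported types vary by build vintage, and the fork's custom per-tensor mixes have their own options documented in its PRs):

```bash
# Hypothetical run: requantize an FP16 GGUF to the IQ2_XS low-bit type.
./build/bin/llama-quantize model-f16.gguf model-iq2_xs.gguf IQ2_XS
```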

Maintenance & Community

The project is actively developed, with frequent updates listed in the README. Contributions are welcomed via pull requests and issue submissions.

Licensing & Compatibility

  • License: MIT.
  • Compatibility: The permissive MIT license allows commercial use and integration into closed-source projects.

Limitations & Caveats

The README emphasizes that detailed information often lives in individual pull requests rather than a single comprehensive document, so users must browse PRs to fully understand specific features.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 44
  • Issues (30d): 39
  • Star History: 111 stars in the last 30 days

Explore Similar Projects

Starred by Jeremy Howard (Cofounder of fast.ai), Sasha Rush (Research Scientist at Cursor; Professor at Cornell Tech), and 1 more.

GPTQ-triton by fpgaminer

0% · 307 stars
Triton kernel for GPTQ inference, improving context scaling
Created 2 years ago · Updated 2 years ago
Starred by Lysandre Debut (Chief Open-Source Officer at Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

AQLM by Vahe1994

0.4% · 1k stars
PyTorch code for LLM compression via Additive Quantization (AQLM)
Created 1 year ago · Updated 1 month ago
Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems").

airllm by lyogavin

0.1% · 6k stars
Inference optimization for LLMs on low-resource hardware
Created 2 years ago · Updated 2 weeks ago
Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Jeff Hammerbacher (Cofounder of Cloudera), and 4 more.

gemma_pytorch by google

0.2% · 6k stars
PyTorch implementation for Google's Gemma models
Created 1 year ago · Updated 3 months ago