`llama.cpp` fork for improved CPU/GPU performance
This repository is a fork of llama.cpp focused on enhancing CPU and hybrid GPU/CPU inference performance for large language models. It targets users seeking optimized inference speeds, particularly with advanced quantization techniques and support for newer model architectures such as Bitnet and DeepSeek. The primary benefit is significantly faster inference on consumer hardware through specialized optimizations.
How It Works
The project implements several novel techniques to boost performance. Key among these are "FlashMLA" (Multi-head Latent Attention combined with Flash Attention) for CPU and CUDA, fused operations for Mixture-of-Experts (MoE) models, and tensor overrides that give explicit control over where individual weight tensors are placed (CPU vs. GPU). It also introduces state-of-the-art quantization types (e.g., IQ1_M, IQ2_XS) and row-interleaved quant packing, reducing memory bandwidth and compute requirements.
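As a rough sketch of how these features surface in practice, the fork exposes them as command-line options on the usual llama.cpp binaries. The flag names below (`-mla`, `-fmoe`, `-rtr`, `-ot`/`--override-tensor`) are assumptions based on the fork's pull requests and may differ between versions, so verify them with `--help` before relying on them:

```sh
# Illustrative flags only; confirm exact names and values with ./llama-server --help.
#   -fa                     Flash Attention
#   -mla 2                  MLA (FlashMLA) attention mode; the integer selects the variant
#   -fmoe                   fused Mixture-of-Experts operations
#   -rtr                    repack weights into a row-interleaved layout at load time
#   -ngl 99                 offload as many layers as possible to the GPU
#   -ot "ffn_.*_exps=CPU"   tensor override: keep bulky MoE expert tensors in CPU RAM
./llama-server -m ./model.gguf -ngl 99 -fa -mla 2 -fmoe -rtr \
    -ot "ffn_.*_exps=CPU"
```

A common hybrid setup along these lines keeps attention and shared layers on the GPU while the regex-matched expert tensors stay in system RAM, which is what makes running very large MoE models on consumer hardware practical.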
Quick Start & Requirements
Build from source with `make` or CMake.
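A minimal build sketch, assuming the fork keeps upstream llama.cpp's CMake layout (the `GGML_CUDA` option name is an assumption and may differ in older checkouts):

```sh
# From the repository root. Drop -DGGML_CUDA=ON for a CPU-only build;
# the option name follows current upstream conventions and may vary by version.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```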
Highlighted Details
Maintenance & Community
The project is actively developed with frequent updates listed in the README. Contributions are welcomed via pull requests and issue submissions.
Licensing & Compatibility
Limitations & Caveats
The README emphasizes that detailed information often lives in individual pull requests rather than in a single comprehensive document, so users must browse the PRs to fully understand each feature.