C/C++ library for local LLM inference
Top 0.1% on sourcepulse
llama.cpp is a C/C++ library and toolset for efficient Large Language Model (LLM) inference, targeting a wide range of hardware from consumer CPUs to high-end GPUs. It enables local, on-device LLM execution with minimal dependencies and state-of-the-art performance, making advanced AI accessible to developers and researchers.
How It Works
The project leverages the ggml tensor library for its core operations, enabling efficient computation on various hardware backends. It supports extensive quantization (1.5-bit to 8-bit) to reduce memory footprint and accelerate inference. Key optimizations include ARM NEON, Accelerate, and Metal for Apple Silicon, AVX/AVX2/AVX512/AMX for x86, and custom CUDA/HIP kernels for NVIDIA/AMD GPUs. It also offers Vulkan and SYCL backends, plus CPU+GPU hybrid inference for models exceeding VRAM.
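As a concrete illustration of the CPU+GPU hybrid offload described above, here is a minimal sketch using the community llama-cpp-python bindings (one of the many language bindings noted under Maintenance & Community). The model path, layer count, and context size are illustrative assumptions, not project defaults.

```python
# Sketch: hybrid CPU+GPU inference with a quantized GGUF model via the
# community llama-cpp-python bindings. Model path and parameter values
# are illustrative assumptions; adjust to your hardware and model.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # 4-bit quantized GGUF (example path)
    n_gpu_layers=20,  # offload this many layers to the GPU; the rest run on the CPU
    n_ctx=4096,       # context window size
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Raising or lowering n_gpu_layers is the usual knob for fitting a model that exceeds available VRAM: offload as many layers as fit on the GPU and leave the remainder on the CPU.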
Quick Start & Requirements
Highlighted Details
- OpenAI-compatible HTTP server (llama-server) for easy integration (see the request sketch after this list).
- Command-line tools (llama-cli, llama-perplexity, llama-bench) for direct interaction and performance analysis.
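Because the server exposes an OpenAI-compatible API, any standard HTTP client can talk to it. The sketch below assumes llama-server is already running locally on its default port 8080 with a model loaded (e.g. llama-server -m model.gguf); the prompt and token limit are arbitrary.

```python
# Sketch: querying a locally running llama-server through its
# OpenAI-compatible chat completions endpoint (default port 8080 assumed).
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])
```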
Maintenance & Community
The project is actively maintained with a large and vibrant community. Notable contributions and integrations include bindings for numerous languages and frameworks, as well as downstream tools and UIs such as LM Studio and LocalAI.
Licensing & Compatibility
The project is primarily licensed under the MIT License, allowing for broad commercial and closed-source use. Some associated tools or UIs might have different licenses (e.g., AGPL, proprietary).
Limitations & Caveats
While highly optimized, performance can vary significantly based on hardware, model size, and quantization level. Some advanced features or newer model architectures might require specific build flags or recent commits. The project is under continuous development, and breaking API changes can occur.