llama.cpp by ggml-org

C/C++ library for local LLM inference

created 2 years ago · 83,689 stars · Top 0.1% on sourcepulse

Project Summary

llama.cpp is a C/C++ library and toolset for efficient Large Language Model (LLM) inference, targeting a wide range of hardware from consumer CPUs to high-end GPUs. It enables local, on-device LLM execution with minimal dependencies and state-of-the-art performance, making advanced AI accessible to developers and researchers.

How It Works

The project leverages the ggml tensor library for its core operations, enabling efficient computation on various hardware backends. It supports extensive quantization (1.5-bit to 8-bit) to reduce memory footprint and accelerate inference. Key optimizations include ARM NEON, Accelerate, and Metal for Apple Silicon, AVX/AVX2/AVX512/AMX for x86, and custom CUDA/HIP kernels for NVIDIA/AMD GPUs. It also offers Vulkan and SYCL backends, plus CPU+GPU hybrid inference for models exceeding VRAM.
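
As a hedged illustration of the quantization workflow, the C++ sketch below re-quantizes an f16 GGUF file to 4-bit via the llama_model_quantize entry point in llama.h (the llama-quantize CLI wraps the same call); the file names are placeholders, and struct/enum names may shift between releases.

    // Sketch: re-quantize a GGUF model to 4-bit (Q4_K_M) via the llama.h C API.
    // File names are placeholders; names follow recent llama.cpp releases.
    #include "llama.h"
    #include <cstdio>

    int main() {
        llama_model_quantize_params qparams = llama_model_quantize_default_params();
        qparams.ftype   = LLAMA_FTYPE_MOSTLY_Q4_K_M; // target 4-bit K-quant format
        qparams.nthread = 8;                         // worker threads for quantization

        // reads the f16 input and writes a much smaller 4-bit GGUF
        if (llama_model_quantize("model-f16.gguf", "model-q4_k_m.gguf", &qparams) != 0) {
            fprintf(stderr, "quantization failed\n");
            return 1;
        }
        return 0;
    }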

Quick Start & Requirements

  • Install: Build from source (CMake) or use pre-built binaries from releases. Docker images are also available.
  • Prerequisites: C++ compiler (GCC, Clang), CMake. Optional: CUDA, ROCm, Metal, Vulkan SDK, OpenCL, SYCL, BLAS libraries depending on desired backend.
  • Models: Requires models in GGUF format; conversion scripts (e.g., convert_hf_to_gguf.py for Hugging Face checkpoints) are provided. A minimal model-loading sketch follows this list.
  • Docs: llama.cpp Documentation
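
As a first program against the library, here is a minimal C++ sketch (assuming the llama.h C API of recent releases; the model path and layer count are placeholders) that loads a GGUF model and creates an inference context. The n_gpu_layers field drives the CPU+GPU hybrid mode described above: offloaded layers run on the GPU, the rest stay on the CPU.

    // Sketch: load a GGUF model with partial GPU offload and create a context.
    // API names follow recent llama.h; some have deprecated aliases in newer trees.
    #include "llama.h"
    #include <cstdio>

    int main(int argc, char ** argv) {
        if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

        llama_backend_init(); // initialize ggml backends once per process

        llama_model_params mparams = llama_model_default_params();
        mparams.n_gpu_layers = 20; // offload 20 layers to the GPU, keep the rest on CPU

        llama_model * model = llama_load_model_from_file(argv[1], mparams);
        if (!model) { fprintf(stderr, "failed to load model\n"); return 1; }

        llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx = 4096; // context window in tokens

        llama_context * ctx = llama_new_context_with_model(model, cparams);
        // ... tokenize the prompt, call llama_decode(), sample tokens ...

        llama_free(ctx);
        llama_free_model(model);
        llama_backend_free();
        return 0;
    }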

Highlighted Details

  • Supports over 100 LLM architectures, including LLaMA, Mistral, Mixtral, DBRX, Falcon, Gemma, Mamba, and many more.
  • Offers an OpenAI-compatible HTTP server (llama-server) for easy integration; see the request example after this list.
  • Provides command-line tools (llama-cli, llama-perplexity, llama-bench) for direct interaction and performance analysis.
  • Extensive community bindings available for Python, Go, Node.js, Rust, C#, Swift, Java, and more.
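
To make the OpenAI-compatible surface concrete, the hedged C++ sketch below POSTs a chat-completion request to a locally running llama-server using libcurl; the port (8080 by default), endpoint path, and JSON body are assumptions based on the server's OpenAI-style API.

    // Sketch: query llama-server's OpenAI-compatible chat endpoint with libcurl.
    // Assumes a server started e.g. as: llama-server -m model.gguf (port 8080).
    #include <curl/curl.h>
    #include <cstdio>

    int main() {
        curl_global_init(CURL_GLOBAL_DEFAULT);
        CURL * curl = curl_easy_init();
        if (!curl) return 1;

        const char * body =
            "{\"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}";

        struct curl_slist * hdrs = nullptr;
        hdrs = curl_slist_append(hdrs, "Content-Type: application/json");

        curl_easy_setopt(curl, CURLOPT_URL, "http://localhost:8080/v1/chat/completions");
        curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, body);

        // libcurl's default write callback prints the JSON response to stdout
        CURLcode res = curl_easy_perform(curl);
        if (res != CURLE_OK) {
            fprintf(stderr, "request failed: %s\n", curl_easy_strerror(res));
        }

        curl_slist_free_all(hdrs);
        curl_easy_cleanup(curl);
        curl_global_cleanup();
        return 0;
    }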

Maintenance & Community

The project is actively maintained with a large and vibrant community. Notable contributions and integrations include bindings for numerous languages and frameworks, as well as tools built on top of it such as LM Studio and LocalAI.

Licensing & Compatibility

The project is primarily licensed under the MIT License, allowing for broad commercial and closed-source use. Some associated tools or UIs might have different licenses (e.g., AGPL, proprietary).

Limitations & Caveats

While highly optimized, performance can vary significantly based on hardware, model size, and quantization level. Some advanced features or newer model architectures might require specific build flags or recent commits. The project is under continuous development, and breaking API changes can occur.

Health Check

  • Last commit: 10 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 312
  • Issues (30d): 185
  • Star History: 5,501 stars in the last 90 days

Explore Similar Projects

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai
Framework for LLM inference optimization experimentation
Top 0.4% · 15k stars · created 1 year ago · updated 2 days ago

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Omar Sanseviero (DevRel at Google DeepMind), and 5 more.

TensorRT-LLM by NVIDIA
LLM inference optimization SDK for NVIDIA GPUs
Top 0.6% · 11k stars · created 1 year ago · updated 14 hours ago