C/C++ library for local LLM inference
Top 0.1% on sourcepulse
llama.cpp is a C/C++ library and toolset for efficient Large Language Model (LLM) inference, targeting a wide range of hardware from consumer CPUs to high-end GPUs. It enables local, on-device LLM execution with minimal dependencies and state-of-the-art performance, making advanced AI accessible to developers and researchers.
How It Works
The project leverages the ggml tensor library for its core operations, enabling efficient computation on various hardware backends. It supports extensive quantization (1.5-bit to 8-bit) to reduce memory footprint and accelerate inference. Key optimizations include ARM NEON, Accelerate, and Metal for Apple Silicon, AVX/AVX2/AVX512/AMX for x86, and custom CUDA/HIP kernels for NVIDIA/AMD GPUs. It also offers Vulkan and SYCL backends, plus CPU+GPU hybrid inference for models exceeding VRAM.
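As a concrete illustration of the CPU+GPU hybrid offload described above, here is a minimal sketch using the community llama-cpp-python bindings (one of the many language bindings noted under Maintenance & Community). The model path, layer count, and context size are illustrative assumptions, not project defaults.

```python
# Sketch: hybrid CPU+GPU inference with a quantized GGUF model via the
# community llama-cpp-python bindings. Model path and parameter values
# are illustrative assumptions; adjust to your hardware and model.
from llama_cpp import Llama

llm = Llama(
    model_path="models/llama-3-8b-instruct.Q4_K_M.gguf",  # 4-bit quantized GGUF (example path)
    n_gpu_layers=20,  # offload this many layers to the GPU; the rest run on the CPU
    n_ctx=4096,       # context window size
)

out = llm("Explain quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```

Raising or lowering n_gpu_layers is the usual knob for fitting a model that exceeds available VRAM: offload as many layers as fit on the GPU and leave the remainder on the CPU.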
Quick Start & Requirements
Highlighted Details
- OpenAI-compatible HTTP server (llama-server) for easy integration (see the request sketch after this list).
- Command-line tools (llama-cli, llama-perplexity, llama-bench) for direct interaction and performance analysis.
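Because the server exposes an OpenAI-compatible API, any standard HTTP client can talk to it. The sketch below assumes llama-server is already running locally on its default port 8080 with a model loaded (e.g. llama-server -m model.gguf); the prompt and token limit are arbitrary.

```python
# Sketch: querying a locally running llama-server through its
# OpenAI-compatible chat completions endpoint (default port 8080 assumed).
import json
import urllib.request

payload = {
    "messages": [{"role": "user", "content": "Say hello in five words."}],
    "max_tokens": 32,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])
```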
Maintenance & Community
The project is actively maintained with a large and vibrant community. Notable contributions and integrations include bindings for numerous languages and frameworks, as well as downstream tools and UIs such as LM Studio and LocalAI.
Licensing & Compatibility
The project is primarily licensed under the MIT License, allowing for broad commercial and closed-source use. Some associated tools or UIs might have different licenses (e.g., AGPL, proprietary).
Limitations & Caveats
While highly optimized, performance can vary significantly based on hardware, model size, and quantization level. Some advanced features or newer model architectures might require specific build flags or recent commits. The project is under continuous development, and breaking API changes can occur.