prima.cpp by Lizonghang

Distributed llama.cpp implementation for low-resource LLM inference

created 9 months ago
993 stars

Top 38.1% on sourcepulse

View on GitHub
1 Expert Loves This Project
Project Summary

prima.cpp enables running large language models (LLMs) like 70B-parameter models on low-resource home clusters, including laptops, desktops, and mobile devices, with or without GPUs. It addresses memory constraints and performance limitations, offering a solution for private, local LLM inference.

How It Works

prima.cpp builds on llama.cpp and introduces a distributed, heterogeneity-aware approach. It uses mmap to lazily load model weights, reducing memory pressure. Key innovations include piped-ring parallelism with prefetching, which overlaps disk I/O with computation, and an intelligent scheduler that assigns model layers to devices based on each device's CPU, GPU, RAM, and disk speed. Together these allow efficient use of diverse hardware in a cluster.

Quick Start & Requirements

  • Install: Build from source using make. CUDA support requires GGML_CUDA=1.
  • Prerequisites: GCC >= 9.4.0, Make >= 4.2.1, CMake >= 3.16.3, fio >= 3.16, ZMQ >= 4.3.2, HiGHS >= 1.9.0. CUDA is optional for GPU acceleration.
  • Setup: Requires compiling the project and downloading GGUF model files.
  • Docs: llama.cpp (as prima.cpp builds upon it).

Highlighted Details

  • Claims up to a 15x speedup over llama.cpp on some models.
  • Supports heterogeneous clusters with devices running macOS, Linux, Android, and HarmonyOS.
  • Offers GPU and CPU offloading, with plans to support Vulkan and AMD GPUs.
  • Supports various quantization formats (Q4_K, Q6_K, Q8_0, IQ1) for Llama, Qwen, and DeepSeek models.

Maintenance & Community

The project is actively developed by Lizonghang and contributors. Links to community resources like Discord/Slack are not explicitly provided in the README.

Licensing & Compatibility

The project builds primarily on llama.cpp, which is released under the permissive MIT license. However, the README does not explicitly state prima.cpp's own license, so suitability for commercial use depends on the final license.

Limitations & Caveats

Windows support is not yet available. GPU acceleration is currently limited to CUDA; Vulkan and AMD GPU support are planned. The README notes that the initial layer split can be suboptimal, with optimizations planned for future updates.

Health Check

Last commit: 1 week ago
Responsiveness: 1 day
Pull Requests (30d): 1
Issues (30d): 1
Star History: 191 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Jaret Burkett (Founder of Ostris), and 1 more.

nunchaku by nunchaku-tech

2.1%
3k
High-performance 4-bit diffusion model inference engine
created 8 months ago
updated 14 hours ago
Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Ying Sheng (Author of SGLang).

fastllm by ztxz16

0.4%
4k
High-performance C++ LLM inference library
created 2 years ago
updated 2 weeks ago