prima.cpp by Lizonghang

Distributed llama.cpp implementation for low-resource LLM inference

Created 11 months ago
997 stars

Top 37.3% on SourcePulse

Project Summary

prima.cpp enables running large language models (LLMs) like 70B-parameter models on low-resource home clusters, including laptops, desktops, and mobile devices, with or without GPUs. It addresses memory constraints and performance limitations, offering a solution for private, local LLM inference.

How It Works

prima.cpp leverages llama.cpp's foundation and introduces a distributed, heterogeneity-aware approach. It employs mmap for lazy loading of model weights, reducing memory pressure. Key innovations include piped-ring parallelism with prefetching to overlap disk I/O with computation and an intelligent scheduler that distributes model layers across devices based on their CPU, GPU, RAM, and disk speed. This allows for efficient utilization of diverse hardware in a cluster.

Quick Start & Requirements

  • Install: Build from source using make. CUDA support requires GGML_CUDA=1.
  • Prerequisites: GCC >= 9.4.0, Make >= 4.2.1, CMake >= 3.16.3, fio >= 3.16, ZMQ >= 4.3.2, HiGHS >= 1.9.0. CUDA is optional for GPU acceleration.
  • Setup: Requires compiling the project and downloading GGUF model files.
  • Docs: llama.cpp (as prima.cpp builds upon it).
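The steps above can be sketched as a shell recipe. The repository URL and flag names follow llama.cpp conventions (`GGML_CUDA=1`) as described in the bullets, and may differ slightly in your checkout:

```shell
# Hedged sketch of the build steps described above.
git clone https://github.com/Lizonghang/prima.cpp.git
cd prima.cpp
make -j             # CPU-only build
# make GGML_CUDA=1  # enable CUDA GPU acceleration instead
```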

Highlighted Details

  • Claims 15x speedup over llama.cpp for some models.
  • Supports heterogeneous clusters with devices running macOS, Linux, Android, and HarmonyOS.
  • Offers GPU and CPU offloading, with plans to support Vulkan and AMD GPUs.
  • Supports various quantization formats (Q4_K, Q6_K, Q8_0, IQ1) for Llama, Qwen, and DeepSeek models.

Maintenance & Community

The project is actively developed by Lizonghang and contributors. No community channels (e.g., Discord or Slack) are linked in the README.

Licensing & Compatibility

The project is based on llama.cpp, which is released under the permissive MIT license. However, prima.cpp's own license is not explicitly stated in the README, so suitability for commercial use depends on the license it ultimately declares.

Limitations & Caveats

Windows support is not yet available. Currently, only CUDA-based GPUs are supported, excluding Vulkan and AMD GPUs. The README notes that initial layer splitting can be less efficient, with plans to optimize this in future updates.

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 0 stars in the last 30 days

Explore Similar Projects

airllm by lyogavin

  • Inference optimization for LLMs on low-resource hardware
  • 6k stars · Top 0.1% on SourcePulse · Created 2 years ago · Updated 2 weeks ago
  • Starred by Wing Lian (Founder of Axolotl AI) and Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems")

MiniCPM by OpenBMB

  • Ultra-efficient LLMs for end devices, achieving 5x+ speedup
  • 8k stars · Top 0.4% on SourcePulse · Created 1 year ago · Updated 1 week ago
  • Starred by Lianmin Zheng (Coauthor of SGLang, vLLM), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 1 more