Distributed llama.cpp implementation for low-resource LLM inference
prima.cpp enables running large language models (LLMs), including 70B-parameter models, on low-resource home clusters made up of laptops, desktops, and mobile devices, with or without GPUs. It addresses memory constraints and performance limitations, offering a solution for private, local LLM inference.
How It Works
prima.cpp leverages llama.cpp's foundation and introduces a distributed, heterogeneity-aware approach. It employs mmap for lazy loading of model weights, reducing memory pressure. Key innovations include piped-ring parallelism with prefetching to overlap disk I/O with computation and an intelligent scheduler that distributes model layers across devices based on their CPU, GPU, RAM, and disk speed. This allows for efficient utilization of diverse hardware in a cluster.
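To make the scheduling idea concrete, the sketch below assigns contiguous slices of layers to devices in proportion to a combined capability score, so faster devices host more layers. The `Device` fields, the scoring formula, and the proportional split are illustrative assumptions only, not prima.cpp's actual scheduler.

```cpp
// Hypothetical sketch of heterogeneity-aware layer assignment.
// The scoring formula and weights are assumptions for illustration.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct Device {
    std::string name;
    double compute_score;   // relative CPU/GPU throughput
    double mem_gb;          // available RAM/VRAM
    double disk_mbps;       // sequential read speed (matters when weights are mmap'd)
};

// Give each device a contiguous slice of layers proportional to its score.
std::vector<int> assign_layers(const std::vector<Device>& devices, int n_layers) {
    std::vector<double> scores;
    double total = 0.0;
    for (const auto& d : devices) {
        // Assumed heuristic: compute weighted by memory headroom, small disk bonus.
        double s = d.compute_score * std::min(1.0, d.mem_gb / 8.0)
                 + 0.1 * d.disk_mbps / 1000.0;
        scores.push_back(s);
        total += s;
    }
    std::vector<int> layers(devices.size(), 0);
    int assigned = 0;
    for (size_t i = 0; i < devices.size(); ++i) {
        layers[i] = static_cast<int>(n_layers * scores[i] / total);
        assigned += layers[i];
    }
    layers.back() += n_layers - assigned;   // hand any remainder to the last device
    return layers;
}

int main() {
    std::vector<Device> cluster = {
        {"laptop-gpu",  8.0, 16.0, 2000.0},
        {"desktop-cpu", 4.0, 32.0,  500.0},
        {"phone",       1.0,  6.0,  800.0},
    };
    auto layers = assign_layers(cluster, 80);   // e.g. an 80-layer 70B model
    for (size_t i = 0; i < cluster.size(); ++i)
        std::printf("%s -> %d layers\n", cluster[i].name.c_str(), layers[i]);
    return 0;
}
```

In a real cluster the score would come from profiling each device rather than hard-coded numbers, and the assignment would also account for the piped-ring schedule and prefetching described above.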
Quick Start & Requirements
Build from source with `make`. CUDA support requires building with `GGML_CUDA=1`.
Highlighted Details
Maintenance & Community
The project is actively developed by Lizonghang and contributors. Links to community resources like Discord/Slack are not explicitly provided in the README.
Licensing & Compatibility
The project is primarily based on llama.cpp, which is released under the permissive MIT license. However, the license for prima.cpp itself is not explicitly stated in the README, so suitability for commercial use depends on the final license.
Limitations & Caveats
Windows support is not yet available. GPU support is currently limited to CUDA; Vulkan and AMD GPUs are not supported. The README notes that the initial layer splitting can be less efficient, with plans to optimize it in future updates.