Distributed llama.cpp implementation for low-resource LLM inference
prima.cpp enables running large language models (LLMs), including 70B-parameter models, on low-resource home clusters made up of laptops, desktops, and mobile devices, with or without GPUs. It addresses memory constraints and performance limitations, offering a solution for private, local LLM inference.
How It Works
prima.cpp leverages llama.cpp's foundation and introduces a distributed, heterogeneity-aware approach. It employs mmap for lazy loading of model weights, reducing memory pressure. Key innovations include piped-ring parallelism with prefetching to overlap disk I/O with computation and an intelligent scheduler that distributes model layers across devices based on their CPU, GPU, RAM, and disk speed. This allows for efficient utilization of diverse hardware in a cluster.
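To make the scheduling idea concrete, the sketch below assigns contiguous slices of layers to devices in proportion to a combined capability score, so faster devices host more layers. The `Device` fields, the scoring formula, and the proportional split are illustrative assumptions only, not prima.cpp's actual scheduler.

```cpp
// Hypothetical sketch of heterogeneity-aware layer assignment.
// The scoring formula and weights are assumptions for illustration.
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct Device {
    std::string name;
    double compute_score;   // relative CPU/GPU throughput
    double mem_gb;          // available RAM/VRAM
    double disk_mbps;       // sequential read speed (matters when weights are mmap'd)
};

// Give each device a contiguous slice of layers proportional to its score.
std::vector<int> assign_layers(const std::vector<Device>& devices, int n_layers) {
    std::vector<double> scores;
    double total = 0.0;
    for (const auto& d : devices) {
        // Assumed heuristic: compute weighted by memory headroom, small disk bonus.
        double s = d.compute_score * std::min(1.0, d.mem_gb / 8.0)
                 + 0.1 * d.disk_mbps / 1000.0;
        scores.push_back(s);
        total += s;
    }
    std::vector<int> layers(devices.size(), 0);
    int assigned = 0;
    for (size_t i = 0; i < devices.size(); ++i) {
        layers[i] = static_cast<int>(n_layers * scores[i] / total);
        assigned += layers[i];
    }
    layers.back() += n_layers - assigned;   // hand any remainder to the last device
    return layers;
}

int main() {
    std::vector<Device> cluster = {
        {"laptop-gpu",  8.0, 16.0, 2000.0},
        {"desktop-cpu", 4.0, 32.0,  500.0},
        {"phone",       1.0,  6.0,  800.0},
    };
    auto layers = assign_layers(cluster, 80);   // e.g. an 80-layer 70B model
    for (size_t i = 0; i < cluster.size(); ++i)
        std::printf("%s -> %d layers\n", cluster[i].name.c_str(), layers[i]);
    return 0;
}
```

In a real cluster the score would come from profiling each device rather than hard-coded numbers, and the assignment would also account for the piped-ring schedule and prefetching described above.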
Quick Start & Requirements
Build from source with `make`. CUDA support requires building with `GGML_CUDA=1`.
Highlighted Details
Maintenance & Community
The project is actively developed by Lizonghang and contributors. Links to community resources like Discord/Slack are not explicitly provided in the README.
Licensing & Compatibility
The project is primarily based on llama.cpp, which is released under the permissive MIT license. However, the license for prima.cpp itself is not explicitly stated in the README, so suitability for commercial use depends on the final license.
Limitations & Caveats
Windows support is not yet available. GPU support is currently limited to CUDA; Vulkan and AMD GPUs are not supported. The README notes that the initial layer splitting can be less efficient, with plans to optimize it in future updates.