Lizonghang
Distributed llama.cpp implementation for low-resource LLM inference
Top 36.9% on SourcePulse
prima.cpp enables running large language models (LLMs) like 70B-parameter models on low-resource home clusters, including laptops, desktops, and mobile devices, with or without GPUs. It addresses memory constraints and performance limitations, offering a solution for private, local LLM inference.
How It Works
prima.cpp leverages llama.cpp's foundation and introduces a distributed, heterogeneity-aware approach. It employs mmap for lazy loading of model weights, reducing memory pressure. Key innovations include piped-ring parallelism with prefetching to overlap disk I/O with computation and an intelligent scheduler that distributes model layers across devices based on their CPU, GPU, RAM, and disk speed. This allows for efficient utilization of diverse hardware in a cluster.
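To make the scheduling idea concrete, the sketch below splits a model's layers across devices in proportion to a single per-device score. This is not prima.cpp's scheduler: the `Device` class, the `score` field (a made-up stand-in for a combined CPU, GPU, RAM, and disk-speed estimate), the device names, and the layer count of 80 are all assumptions chosen for illustration, and the allocation rule is a plain largest-remainder split.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    score: float  # hypothetical combined speed score (CPU, GPU, RAM, disk)


def split_layers(devices: list[Device], n_layers: int) -> dict[str, int]:
    """Assign each device a layer count proportional to its score."""
    total = sum(d.score for d in devices)
    if total <= 0:
        raise ValueError("device scores must sum to a positive value")
    # Ideal fractional share of layers for each device.
    ideal = [n_layers * d.score / total for d in devices]
    # Start from the floor of each share.
    counts = [int(x) for x in ideal]
    # Give the leftover layers to the largest fractional remainders.
    leftover = n_layers - sum(counts)
    order = sorted(
        range(len(devices)),
        key=lambda i: ideal[i] - counts[i],
        reverse=True,
    )
    for i in order[:leftover]:
        counts[i] += 1
    return {d.name: c for d, c in zip(devices, counts)}


# Illustrative cluster; names and scores are invented for this example.
cluster = [Device("desktop-gpu", 6.0), Device("laptop", 3.0), Device("phone", 1.0)]
plan = split_layers(cluster, 80)
print(plan)  # {'desktop-gpu': 48, 'laptop': 24, 'phone': 8}
```

A real heterogeneity-aware scheduler would likely also account for memory limits and communication cost between devices; the proportional split only shows why faster devices end up holding more layers.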
Quick Start & Requirements
Build with make. CUDA support requires GGML_CUDA=1.
Highlighted Details
Maintenance & Community
The project is actively developed by Lizonghang and contributors. Links to community resources like Discord/Slack are not explicitly provided in the README.
Licensing & Compatibility
The project is based on llama.cpp, which is released under the permissive MIT license. However, the README does not explicitly state a license for prima.cpp itself, so suitability for commercial use depends on the license the project ultimately adopts.
Limitations & Caveats
Windows support is not yet available. GPU support is currently limited to CUDA; Vulkan and AMD GPUs are not supported. The README notes that initial layer splitting can be less efficient, with plans to optimize this in future updates.