lucienhuangfu/eLLM
LLM inference on CPUs faster than GPUs for long contexts
Top 82.6% on SourcePulse
Summary
eLLM is an LLM inference framework that enables CPU servers (e.g., Intel Xeon, AMD EPYC) to outperform GPUs, particularly on long-context tasks. It targets developers and researchers seeking cost-effective, low-latency LLM deployment without GPU hardware, aiming to democratize AI by leveraging existing CPU infrastructure.
How It Works
eLLM exploits CPU strengths (large memory, caches) via a 'trade storage for computation' philosophy. It uses an elastic static computation graph supporting variable input lengths without recompilation. A static-shape, non-paged KV cache with preallocated tensors minimizes overhead and cache misses. Massive-dimensional tensors enable single-pass Prefill for ultra-long contexts. Inference employs head-by-head attention, optimizing CPU cache residency by processing one attention head at a time to reduce repeated memory loads.
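The two cache-oriented ideas, a preallocated static-shape KV cache and head-by-head attention, can be illustrated with a minimal NumPy sketch. The names and shapes below (StaticKVCache, MAX_CONTEXT, attend) are hypothetical and do not reflect eLLM's actual code; they only show the general technique.

```python
import numpy as np

# Illustrative constants; eLLM's real limits and head configuration may differ.
MAX_CONTEXT = 131_072   # preallocated upper bound on context length
NUM_HEADS = 32
HEAD_DIM = 128

class StaticKVCache:
    """Static-shape, non-paged KV cache: K/V tensors are preallocated once
    at a fixed shape, so decoding never reallocates or pages memory
    (the 'trade storage for computation' idea)."""
    def __init__(self):
        # [heads, max_context, head_dim]; filled in place as tokens arrive
        self.k = np.zeros((NUM_HEADS, MAX_CONTEXT, HEAD_DIM), dtype=np.float32)
        self.v = np.zeros((NUM_HEADS, MAX_CONTEXT, HEAD_DIM), dtype=np.float32)
        self.length = 0  # number of valid cached positions

    def append(self, k_new, v_new):
        # k_new, v_new: [heads, head_dim] projections for one new token
        self.k[:, self.length, :] = k_new
        self.v[:, self.length, :] = v_new
        self.length += 1

def attend(query, cache):
    """Head-by-head attention: process one head at a time so that head's
    K/V slice stays resident in the CPU cache instead of streaming every
    head's data through memory repeatedly."""
    # query: [heads, head_dim] for the current token
    out = np.empty_like(query)
    n = cache.length
    for h in range(NUM_HEADS):                       # one head at a time
        k_h = cache.k[h, :n, :]                      # [n, head_dim]
        v_h = cache.v[h, :n, :]
        scores = k_h @ query[h] / np.sqrt(HEAD_DIM)  # [n]
        scores -= scores.max()                       # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum()
        out[h] = weights @ v_h                       # [head_dim]
    return out
```

In this sketch the static allocation removes per-step bookkeeping that a paged cache would need, and the per-head loop keeps the working set small enough to fit in CPU caches, which is the cache-residency benefit the description above refers to.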
Quick Start & Requirements
Alpha status (v0.1.0-alpha.1) with a minimum viable prototype. Installation details beyond cloning the repo are sparse.
Highlighted Details
Maintenance & Community
Early alpha stage (v0.1.0-alpha.1, April 2026), open-sourced Dec 2025. Welcomes trainees and industry collaboration. Contact: lucienhuangfu@outlook.com. No specific community channels or roadmap links provided.
Licensing & Compatibility
Licensed under the permissive Apache 2.0 License, allowing commercial use and integration into closed-source projects.
Limitations & Caveats
Alpha release status. Model outputs are not fully consistent; experiments use random parameters rather than real weights. Key long-context experiments are ongoing (results expected mid-2026). The early paper may not reflect the latest details. Short-context inference performance lags behind GPUs.