eLLM by lucienhuangfu

LLM inference on CPUs faster than GPUs for long contexts

Created 10 months ago
332 stars

Top 82.6% on SourcePulse

Summary

eLLM is an LLM inference framework that enables CPU servers (Intel Xeon / AMD EPYC) to outperform GPUs, particularly on long-context tasks. It targets developers and researchers seeking cost-effective, low-latency LLM deployment without GPU hardware, aiming to democratize AI by leveraging existing CPU infrastructure.

How It Works

eLLM exploits CPU strengths (large memory capacity and caches) via a 'trade storage for computation' philosophy. An elastic static computation graph supports variable input lengths without recompilation. A static-shape, non-paged KV cache with preallocated tensors minimizes allocation overhead and cache misses. Massive-dimensional tensors enable single-pass prefill for ultra-long contexts. Inference uses head-by-head attention, processing one attention head at a time so that each head's working set stays resident in the CPU cache instead of being repeatedly reloaded from memory.
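The project describes these mechanisms in prose only; a minimal NumPy sketch of the last two ideas follows. All shapes, names, and the single-token decode step are illustrative assumptions, not eLLM's actual implementation.

```python
import numpy as np

# Hypothetical shapes; eLLM's real configuration is not documented.
NUM_HEADS, HEAD_DIM, MAX_SEQ = 8, 64, 4096

# Static-shape, non-paged KV cache: preallocated once at the maximum
# context length, so decoding never pages or allocates on the fly.
k_cache = np.zeros((NUM_HEADS, MAX_SEQ, HEAD_DIM), dtype=np.float32)
v_cache = np.zeros((NUM_HEADS, MAX_SEQ, HEAD_DIM), dtype=np.float32)

def attend(q, seq_len):
    """Head-by-head attention for one decode token.

    q: (NUM_HEADS, HEAD_DIM) query for the current position.
    Looping over heads keeps each head's contiguous K/V slice resident
    in the CPU cache rather than streaming all heads' data at once.
    """
    out = np.empty((NUM_HEADS, HEAD_DIM), dtype=np.float32)
    scale = 1.0 / np.sqrt(HEAD_DIM)
    for h in range(NUM_HEADS):          # one head at a time
        k = k_cache[h, :seq_len]        # (seq_len, HEAD_DIM), contiguous
        v = v_cache[h, :seq_len]
        scores = (k @ q[h]) * scale     # (seq_len,)
        scores -= scores.max()          # numerically stable softmax
        w = np.exp(scores)
        w /= w.sum()
        out[h] = w @ v                  # (HEAD_DIM,)
    return out
```

The preallocated cache trades memory for predictability: there are no paging tables and no runtime allocation, at the cost of reserving the full maximum context up front.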

Quick Start & Requirements

The project is in alpha (v0.1.0-alpha.1) and ships a minimum viable prototype; installation documentation beyond cloning the repository is sparse.

Highlighted Details

  • Pure CPU inference on server-grade hardware (Xeon/EPYC).
  • vLLM API compatible for seamless integration (see the sketch after this list).
  • Outperforms multi-GPU systems on long-context Prefill-dominated workloads.
  • Achieves lower time to first token (TTFT) and higher queries per second (QPS) for agents, code copilots, and RAG.
  • Supports near-unbounded context windows with significantly lower hardware and per-user inference costs compared to GPU deployments.
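
Because the project advertises vLLM API compatibility, usage would presumably mirror vLLM's offline-inference interface. A hedged sketch follows, assuming eLLM exposes the same LLM/SamplingParams entry points; the `ellm` import path and the model name are guesses, not documented by the project.

```python
# Assumes eLLM mirrors vLLM's offline inference API; the `ellm` module
# name and the model id below are hypothetical, not from the README.
from ellm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

# Long-context prefill is the workload eLLM targets: one large prompt.
outputs = llm.generate(["<a very long RAG context>\n\nQuestion: ..."], params)
for out in outputs:
    print(out.outputs[0].text)
```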

Maintenance & Community

Early alpha stage (v0.1.0-alpha.1, April 2026), open-sourced December 2025. The project welcomes trainee contributors and industry collaboration. Contact: lucienhuangfu@outlook.com. No community channels or roadmap links are provided.

Licensing & Compatibility

Licensed under the permissive Apache 2.0 License, allowing commercial use and integration into closed-source projects.

Limitations & Caveats

Alpha release status. Model outputs are not yet fully consistent, and published experiments use random parameters rather than real weights. Key long-context experiments are ongoing, with results expected mid-2026. The early paper may not reflect the latest implementation details, and short-context inference performance still lags GPUs.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 0
  • Star history: 327 stars in the last 30 days
