eLLM by lucienhuangfu

LLM inference on CPUs faster than GPUs for long contexts

Created 10 months ago
332 stars

Top 82.6% on SourcePulse

Summary

eLLM is an LLM inference framework that enables CPU servers (Intel Xeon / AMD EPYC) to outperform GPUs, particularly on long-context tasks. It targets developers and researchers seeking cost-effective, low-latency LLM deployment without GPU hardware, aiming to democratize AI by leveraging existing CPU infrastructure.

How It Works

eLLM exploits CPU strengths (large memory capacity and caches) via a 'trade storage for computation' philosophy. An elastic static computation graph supports variable input lengths without recompilation. A static-shape, non-paged KV cache with preallocated tensors minimizes allocation overhead and cache misses. Massive-dimensional tensors enable single-pass prefill for ultra-long contexts. Inference uses head-by-head attention, processing one attention head at a time so that each head's working set stays resident in the CPU cache instead of being repeatedly reloaded from memory.
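The project describes these mechanisms in prose only; a minimal NumPy sketch of the last two ideas follows. All shapes, names, and the single-token decode step are illustrative assumptions, not eLLM's actual implementation.

```python
import numpy as np

# Hypothetical shapes; eLLM's real configuration is not documented.
NUM_HEADS, HEAD_DIM, MAX_SEQ = 8, 64, 4096

# Static-shape, non-paged KV cache: preallocated once at the maximum
# context length, so decoding never pages or allocates on the fly.
k_cache = np.zeros((NUM_HEADS, MAX_SEQ, HEAD_DIM), dtype=np.float32)
v_cache = np.zeros((NUM_HEADS, MAX_SEQ, HEAD_DIM), dtype=np.float32)

def attend(q, seq_len):
    """Head-by-head attention for one decode token.

    q: (NUM_HEADS, HEAD_DIM) query for the current position.
    Looping over heads keeps each head's contiguous K/V slice resident
    in the CPU cache rather than streaming all heads' data at once.
    """
    out = np.empty((NUM_HEADS, HEAD_DIM), dtype=np.float32)
    scale = 1.0 / np.sqrt(HEAD_DIM)
    for h in range(NUM_HEADS):          # one head at a time
        k = k_cache[h, :seq_len]        # (seq_len, HEAD_DIM), contiguous
        v = v_cache[h, :seq_len]
        scores = (k @ q[h]) * scale     # (seq_len,)
        scores -= scores.max()          # numerically stable softmax
        w = np.exp(scores)
        w /= w.sum()
        out[h] = w @ v                  # (HEAD_DIM,)
    return out
```

The preallocated cache trades memory for predictability: there are no paging tables and no runtime allocation, at the cost of reserving the full maximum context up front.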

Quick Start & Requirements

The project is in alpha (v0.1.0-alpha.1) and ships a minimum viable prototype; installation documentation beyond cloning the repository is sparse.

Highlighted Details

  • Pure CPU inference on server-grade hardware (Xeon/EPYC).
  • vLLM API compatible for seamless integration (see the sketch after this list).
  • Outperforms multi-GPU systems on long-context Prefill-dominated workloads.
  • Achieves lower time to first token (TTFT) and higher queries per second (QPS) for agents, code copilots, and RAG.
  • Supports near-unbounded context windows with significantly lower hardware and per-user inference costs compared to GPU deployments.
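
Because the project advertises vLLM API compatibility, usage would presumably mirror vLLM's offline-inference interface. A hedged sketch follows, assuming eLLM exposes the same LLM/SamplingParams entry points; the `ellm` import path and the model name are guesses, not documented by the project.

```python
# Assumes eLLM mirrors vLLM's offline inference API; the `ellm` module
# name and the model id below are hypothetical, not from the README.
from ellm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=256)

# Long-context prefill is the workload eLLM targets: one large prompt.
outputs = llm.generate(["<a very long RAG context>\n\nQuestion: ..."], params)
for out in outputs:
    print(out.outputs[0].text)
```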

Maintenance & Community

Early alpha stage (v0.1.0-alpha.1, April 2026), open-sourced December 2025. The project welcomes trainee contributors and industry collaboration. Contact: lucienhuangfu@outlook.com. No community channels or roadmap links are provided.

Licensing & Compatibility

Licensed under the permissive Apache 2.0 License, allowing commercial use and integration into closed-source projects.

Limitations & Caveats

Alpha release status. Model outputs are not yet fully consistent, and published experiments use random parameters rather than real weights. Key long-context experiments are ongoing, with results expected mid-2026. The early paper may not reflect the latest implementation details, and short-context inference performance still lags GPUs.

Health Check

  • Last commit: 3 days ago
  • Responsiveness: Inactive
  • Pull requests (30d): 2
  • Issues (30d): 0
  • Star history: 327 stars in the last 30 days
