LLM inference engine for serving HuggingFace models at scale
Aphrodite Engine is a high-performance inference engine designed for large-scale deployment of HuggingFace-compatible Large Language Models. It targets developers and platforms requiring efficient, concurrent LLM serving, offering significant throughput improvements and broad quantization support.
How It Works
Aphrodite Engine builds on vLLM's PagedAttention for efficient KV-cache management and continuous batching, sustaining high throughput across many concurrent users. It incorporates optimized CUDA kernels and supports a wide array of quantization formats (AQLM, AWQ, Bitsandbytes, GGUF, GPTQ, etc.) as well as distributed inference techniques such as tensor and pipeline parallelism. Together these yield substantial memory savings and faster inference.
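The paged KV cache idea can be illustrated with a toy block table. The sketch below is illustrative only: the block size, class, and allocation policy are simplified stand-ins, not Aphrodite's actual data structures. The point is that each sequence's cache grows one small block at a time instead of reserving a large contiguous region up front, which is what enables dense packing of many concurrent sequences.

```python
# Toy sketch of paged KV-cache bookkeeping (illustrative only; the block
# size and structures are simplified, not Aphrodite's real allocator).

BLOCK_SIZE = 4  # tokens per physical block (real engines often use 16)

class PagedKVCache:
    """Maps each sequence's logical token positions to physical cache blocks."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one new token; allocate a block only on a boundary."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real engine would preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):  # a 6-token sequence needs ceil(6/4) = 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))  # → 2
```

Because blocks are returned to a shared pool as soon as a sequence finishes, memory freed by one request is immediately available to others, which is the mechanism behind continuous batching's high utilization.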
Quick Start & Requirements
Install from the project's wheel index:

pip install -U aphrodite-engine --extra-index-url https://downloads.pygmalion.chat/whl

Launch a model server:

aphrodite run <model_name>
(e.g., aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct)

GPU memory usage can be tuned with --gpu-memory-utilization or restricted to a single session with --single-user-mode.
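Once running, the server exposes an OpenAI-compatible HTTP API. The sketch below is a minimal client under stated assumptions: the port (2242) is Aphrodite's documented default but worth verifying in your server's startup log, and the model name is just the example above.

```python
# Minimal client sketch for the OpenAI-compatible endpoint exposed by
# `aphrodite run`. The port (2242) and model name are assumptions; check
# your server's startup log for the actual address.
import json
from urllib import request

API_URL = "http://localhost:2242/v1/completions"

def build_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Assemble a standard OpenAI-style completions request body."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(model: str, prompt: str) -> str:
    """Send the request to a running server (requires the server to be up)."""
    req = request.Request(
        API_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

payload = build_payload("meta-llama/Meta-Llama-3.1-8B-Instruct", "Hello")
print(sorted(payload))  # → ['max_tokens', 'model', 'prompt']
```

Any OpenAI-compatible client library should also work by pointing its base URL at the local server.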
Highlighted Details
Maintenance & Community
Developed collaboratively by PygmalionAI and Ruliad AI. Sponsors include Arc Compute, Prime Intellect, PygmalionAI, and Ruliad AI. Contributions are welcome.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Windows installation requires building from source. The project is not associated with any cryptocurrencies, as noted by the developers.