LLM inference engine for serving HuggingFace models at scale
Aphrodite Engine is a high-performance inference engine designed for large-scale deployment of HuggingFace-compatible Large Language Models. It targets developers and platforms requiring efficient, concurrent LLM serving, offering significant throughput improvements and broad quantization support.
How It Works
Aphrodite Engine builds on vLLM's PagedAttention for efficient KV-cache management and continuous batching, sustaining high throughput across many concurrent users. It incorporates optimized CUDA kernels and supports a wide array of quantization formats (AQLM, AWQ, Bitsandbytes, GGUF, GPTQ, etc.) as well as distributed inference techniques such as tensor and pipeline parallelism. Together these yield substantial memory savings and faster inference.
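The paged KV cache idea can be illustrated with a toy block table. The sketch below is illustrative only: the block size, class, and allocation policy are simplified stand-ins, not Aphrodite's actual data structures. The point is that each sequence's cache grows one small block at a time instead of reserving a large contiguous region up front, which is what enables dense packing of many concurrent sequences.

```python
# Toy sketch of paged KV-cache bookkeeping (illustrative only; the block
# size and structures are simplified, not Aphrodite's real allocator).

BLOCK_SIZE = 4  # tokens per physical block (real engines often use 16)

class PagedKVCache:
    """Maps each sequence's logical token positions to physical cache blocks."""

    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.seq_lens = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve space for one new token; allocate a block only on a boundary."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % BLOCK_SIZE == 0:  # current block full (or first token)
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a real engine would preempt")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return a finished sequence's blocks to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

cache = PagedKVCache(num_blocks=8)
for _ in range(6):  # a 6-token sequence needs ceil(6/4) = 2 blocks
    cache.append_token(seq_id=0)
print(len(cache.block_tables[0]))  # → 2
```

Because blocks are returned to a shared pool as soon as a sequence finishes, memory freed by one request is immediately available to others, which is the mechanism behind continuous batching's high utilization.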
Quick Start & Requirements
Install from the project's wheel index:

pip install -U aphrodite-engine --extra-index-url https://downloads.pygmalion.chat/whl

Launch a model server:

aphrodite run <model_name>
(e.g., aphrodite run meta-llama/Meta-Llama-3.1-8B-Instruct)

GPU memory usage can be tuned with --gpu-memory-utilization or restricted to a single session with --single-user-mode.
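Once running, the server exposes an OpenAI-compatible HTTP API. The sketch below is a minimal client under stated assumptions: the port (2242) is Aphrodite's documented default but worth verifying in your server's startup log, and the model name is just the example above.

```python
# Minimal client sketch for the OpenAI-compatible endpoint exposed by
# `aphrodite run`. The port (2242) and model name are assumptions; check
# your server's startup log for the actual address.
import json
from urllib import request

API_URL = "http://localhost:2242/v1/completions"

def build_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    """Assemble a standard OpenAI-style completions request body."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}

def complete(model: str, prompt: str) -> str:
    """Send the request to a running server (requires the server to be up)."""
    req = request.Request(
        API_URL,
        data=json.dumps(build_payload(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

payload = build_payload("meta-llama/Meta-Llama-3.1-8B-Instruct", "Hello")
print(sorted(payload))  # → ['max_tokens', 'model', 'prompt']
```

Any OpenAI-compatible client library should also work by pointing its base URL at the local server.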
Highlighted Details
Maintenance & Community
Developed collaboratively by PygmalionAI and Ruliad AI. Sponsors include Arc Compute, Prime Intellect, PygmalionAI, and Ruliad AI. Contributions are welcome.
Licensing & Compatibility
The README does not explicitly state a license. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
Windows installation requires building from source. The project is not associated with any cryptocurrencies, as noted by the developers.