This repository provides a benchmark for Large Language Model (LLM) inference frameworks, targeting developers and researchers evaluating performance and features. It offers a comparative analysis of various serving backends and their capabilities, aiding in the selection of optimal inference solutions.
How It Works
The benchmark evaluates LLM inference frameworks based on their ability to serve models, support for different backends, quantization methods, batching, and distributed inference. It presents detailed performance metrics like Tokens Per Second (TPS), Queries Per Second (QPS), and First Token Latency (FTL) under various configurations, including different batch sizes and quantization levels (e.g., 8-bit, 4-bit AWQ, GPTQ, GGUF).
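These metrics can be reproduced with a simple client-side harness. Below is a minimal sketch, assuming an OpenAI-compatible streaming completions endpoint on localhost:8000 (as exposed by vLLM, for example); the endpoint URL, model id, and the one-token-per-chunk approximation are illustrative assumptions, not details from the README.

```python
import time
import requests


def measure_single_request(prompt: str, max_tokens: int = 200,
                           url: str = "http://localhost:8000/v1/completions",
                           model: str = "01-ai/Yi-6B-Chat"):
    """Stream one completion and report FTL, TPS, and single-request QPS."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    first_token_latency = None
    n_chunks = 0

    with requests.post(url, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            if first_token_latency is None:
                # First Token Latency: time until the first streamed chunk
                first_token_latency = time.perf_counter() - start
            n_chunks += 1  # approximation: roughly one token per chunk

    elapsed = time.perf_counter() - start
    return {"ftl_s": first_token_latency,
            "tps": n_chunks / elapsed,   # decode throughput
            "qps": 1.0 / elapsed}        # completed requests per second


if __name__ == "__main__":
    # Stand-in prompt; the benchmark uses prompts of roughly 512 tokens.
    print(measure_single_request("Explain GPU memory bandwidth. " * 32))
```

Under concurrent load, aggregate QPS would instead be computed from the number of completed requests divided by total wall-clock time.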
Quick Start & Requirements
- Hardware: 1x NVIDIA RTX 4090 (24 GB) GPU, Intel Core i9-13900K CPU, 96 GB RAM.
- Software: WSL2 on Windows 11, Ubuntu 22.04 guest OS, NVIDIA Driver 536.67, CUDA 12.2, PyTorch 2.1.1.
- Models: Benchmarks cover 01-ai/Yi-6B-Chat in BFloat16, 8-bit, 4-bit AWQ, and GGUF formats.
- Data: Prompt length of 512 tokens, maximum of 200 generated tokens.
- Setup: The README provides no explicit setup commands; benchmarking implies starting each framework's inference server and running client-side measurement scripts against it (a minimal sketch follows this list).
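As one illustration of such a run, the sketch below uses vLLM's offline Python API with the benchmark's data settings (roughly a 512-token prompt, 200 generated tokens). The dtype, batch size, and prompt text are illustrative assumptions, not values from the README.

```python
from vllm import LLM, SamplingParams

# Illustrative assumptions: bfloat16 weights and a batch of 8 identical
# requests; neither comes from the README.
llm = LLM(model="01-ai/Yi-6B-Chat", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=200)

# Stand-in prompt; the real benchmark uses prompts of ~512 tokens, and the
# actual token count of this string depends on the tokenizer.
prompt = "Describe the trade-offs between throughput and latency. " * 48

outputs = llm.generate([prompt] * 8, params)  # batched generation
for out in outputs:
    print(len(out.outputs[0].token_ids), "tokens generated")
```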
Highlighted Details
- Performance Leadership: vLLM, lmdeploy, and TGI demonstrate superior performance (TPS/QPS) in non-quantized and quantized benchmarks, particularly with 4-bit quantization.
- Quantization Impact: 4-bit quantization significantly boosts throughput and reduces latency across the tested backends (a GGUF example follows this list).
- Framework Features: Frameworks like Xinference and FastChat offer broader feature sets including WebUIs and multi-model support, though often with lower raw performance compared to specialized backends.
- Backend Variety: Supports a wide range of backends including Transformers, vLLM, ExLlamaV2, TensorRT, Candle, CTranslate2, TGI, llama.cpp, lmdeploy, and DeepSpeed-FastGen.
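As an illustration of the 4-bit quantization path, the sketch below loads a 4-bit GGUF conversion of Yi-6B-Chat through the llama-cpp-python bindings for the llama.cpp backend. The file path and quantization variant are placeholder assumptions; a suitable GGUF file would need to be downloaded or produced with llama.cpp's conversion tools.

```python
from llama_cpp import Llama

# Placeholder path to a hypothetical 4-bit (Q4_K_M) GGUF conversion of
# 01-ai/Yi-6B-Chat.
llm = Llama(model_path="./yi-6b-chat.Q4_K_M.gguf",
            n_gpu_layers=-1,  # offload all layers to the GPU
            n_ctx=1024)

out = llm("Explain what 4-bit quantization trades away.", max_tokens=200)
print(out["choices"][0]["text"])
```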
Maintenance & Community
No specific information on maintainers, community channels, or roadmap is present in the provided README.
Licensing & Compatibility
The repository's license is not specified in the provided README.
Limitations & Caveats
- Benchmarks are specific to the hardware and software configuration detailed; results may vary significantly on different setups.
- Some backends (e.g., TensorRT, CTranslate2) had issues or were not benchmarked for certain configurations due to errors or compatibility problems.
- TGI does not natively support chat mode, so the chat prompt template must be applied manually (sketched below).
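For that last caveat, here is a minimal sketch of applying a chat template by hand before calling TGI's /generate endpoint. It assumes a TGI server on localhost:8080 and a ChatML-style template as commonly used by Yi chat models; verify the exact template against the model card before benchmarking.

```python
import requests


def chatml_prompt(user_message: str) -> str:
    """Apply a ChatML-style chat template by hand (assumed format)."""
    return (f"<|im_start|>user\n{user_message}<|im_end|>\n"
            f"<|im_start|>assistant\n")


resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": chatml_prompt("Summarize the benchmark setup."),
        "parameters": {"max_new_tokens": 200, "stop": ["<|im_end|>"]},
    },
    timeout=300,
)
print(resp.json()["generated_text"])
```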