This repository provides a benchmark for Large Language Model (LLM) inference frameworks, targeting developers and researchers evaluating performance and features. It offers a comparative analysis of various serving backends and their capabilities, aiding in the selection of optimal inference solutions.
How It Works
The benchmark evaluates LLM inference frameworks based on their ability to serve models, support for different backends, quantization methods, batching, and distributed inference. It presents detailed performance metrics like Tokens Per Second (TPS), Queries Per Second (QPS), and First Token Latency (FTL) under various configurations, including different batch sizes and quantization levels (e.g., 8-bit, 4-bit AWQ, GPTQ, GGUF).
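These metrics can be reproduced with a simple client-side harness. Below is a minimal sketch, assuming an OpenAI-compatible streaming completions endpoint on localhost:8000 (as exposed by vLLM, for example); the endpoint URL, model id, and the one-token-per-chunk approximation are illustrative assumptions, not details from the README.

```python
import time
import requests


def measure_single_request(prompt: str, max_tokens: int = 200,
                           url: str = "http://localhost:8000/v1/completions",
                           model: str = "01-ai/Yi-6B-Chat"):
    """Stream one completion and report FTL, TPS, and single-request QPS."""
    payload = {"model": model, "prompt": prompt,
               "max_tokens": max_tokens, "stream": True}
    start = time.perf_counter()
    first_token_latency = None
    n_chunks = 0

    with requests.post(url, json=payload, stream=True, timeout=300) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line or not line.startswith(b"data: "):
                continue
            data = line[len(b"data: "):]
            if data == b"[DONE]":
                break
            if first_token_latency is None:
                # First Token Latency: time until the first streamed chunk
                first_token_latency = time.perf_counter() - start
            n_chunks += 1  # approximation: roughly one token per chunk

    elapsed = time.perf_counter() - start
    return {"ftl_s": first_token_latency,
            "tps": n_chunks / elapsed,   # decode throughput
            "qps": 1.0 / elapsed}        # completed requests per second


if __name__ == "__main__":
    # Stand-in prompt; the benchmark uses prompts of roughly 512 tokens.
    print(measure_single_request("Explain GPU memory bandwidth. " * 32))
```

Under concurrent load, aggregate QPS would instead be computed from the number of completed requests divided by total wall-clock time.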
Quick Start & Requirements
- Hardware: 1x NVIDIA RTX 4090 (24 GB) GPU, Intel Core i9-13900K CPU, 96 GB RAM.
- Software: WSL2 on Windows 11, Ubuntu 22.04 guest OS, NVIDIA Driver 536.67, CUDA 12.2, PyTorch 2.1.1.
- Models: Benchmarks cover 01-ai/Yi-6B-Chat in BFloat16, 8-bit, 4-bit AWQ, and GGUF formats.
- Data: Prompt length of 512 tokens, maximum of 200 generated tokens.
- Setup: The README provides no explicit setup commands; benchmarking implies starting each framework's inference server and running client-side measurement scripts against it (a minimal sketch follows this list).
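As one illustration of such a run, the sketch below uses vLLM's offline Python API with the benchmark's data settings (roughly a 512-token prompt, 200 generated tokens). The dtype, batch size, and prompt text are illustrative assumptions, not values from the README.

```python
from vllm import LLM, SamplingParams

# Illustrative assumptions: bfloat16 weights and a batch of 8 identical
# requests; neither comes from the README.
llm = LLM(model="01-ai/Yi-6B-Chat", dtype="bfloat16")
params = SamplingParams(temperature=0.0, max_tokens=200)

# Stand-in prompt; the real benchmark uses prompts of ~512 tokens, and the
# actual token count of this string depends on the tokenizer.
prompt = "Describe the trade-offs between throughput and latency. " * 48

outputs = llm.generate([prompt] * 8, params)  # batched generation
for out in outputs:
    print(len(out.outputs[0].token_ids), "tokens generated")
```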
Highlighted Details
- Performance Leadership: vLLM, lmdeploy, and TGI demonstrate superior performance (TPS/QPS) in non-quantized and quantized benchmarks, particularly with 4-bit quantization.
- Quantization Impact: 4-bit quantization significantly boosts throughput and reduces latency across the tested backends (a GGUF example follows this list).
- Framework Features: Frameworks like Xinference and FastChat offer broader feature sets including WebUIs and multi-model support, though often with lower raw performance compared to specialized backends.
- Backend Variety: Supports a wide range of backends including Transformers, vLLM, ExLlamaV2, TensorRT, Candle, CTranslate2, TGI, llama.cpp, lmdeploy, and DeepSpeed-FastGen.
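As an illustration of the 4-bit quantization path, the sketch below loads a 4-bit GGUF conversion of Yi-6B-Chat through the llama-cpp-python bindings for the llama.cpp backend. The file path and quantization variant are placeholder assumptions; a suitable GGUF file would need to be downloaded or produced with llama.cpp's conversion tools.

```python
from llama_cpp import Llama

# Placeholder path to a hypothetical 4-bit (Q4_K_M) GGUF conversion of
# 01-ai/Yi-6B-Chat.
llm = Llama(model_path="./yi-6b-chat.Q4_K_M.gguf",
            n_gpu_layers=-1,  # offload all layers to the GPU
            n_ctx=1024)

out = llm("Explain what 4-bit quantization trades away.", max_tokens=200)
print(out["choices"][0]["text"])
```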
Maintenance & Community
No specific information on maintainers, community channels, or roadmap is present in the provided README.
Licensing & Compatibility
The repository's license is not specified in the provided README.
Limitations & Caveats
- Benchmarks are specific to the hardware and software configuration detailed; results may vary significantly on different setups.
- Some backends (e.g., TensorRT, CTranslate2) had issues or were not benchmarked for certain configurations due to errors or compatibility problems.
- TGI does not natively support chat mode, so the chat prompt template must be applied manually (sketched below).
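For that last caveat, here is a minimal sketch of applying a chat template by hand before calling TGI's /generate endpoint. It assumes a TGI server on localhost:8080 and a ChatML-style template as commonly used by Yi chat models; verify the exact template against the model card before benchmarking.

```python
import requests


def chatml_prompt(user_message: str) -> str:
    """Apply a ChatML-style chat template by hand (assumed format)."""
    return (f"<|im_start|>user\n{user_message}<|im_end|>\n"
            f"<|im_start|>assistant\n")


resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": chatml_prompt("Summarize the benchmark setup."),
        "parameters": {"max_new_tokens": 200, "stop": ["<|im_end|>"]},
    },
    timeout=300,
)
print(resp.json()["generated_text"])
```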