LLM inference engine for optimized performance across diverse hardware
DashInfer is a C++ LLM inference engine designed for high-performance serving across diverse hardware, including CUDA GPUs, x86, and ARMv9 CPUs. It targets developers and researchers needing efficient, production-ready LLM deployment with features like continuous batching, paged attention, and quantization, aiming to match or exceed the performance of existing solutions like vLLM.
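Before the implementation details, a generic sketch may help make "paged attention" concrete: the KV cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping its logical token positions to blocks, so memory is allocated as decoding proceeds rather than reserved up front. The class, block size, and method names below are illustrative assumptions, not DashInfer's SpanAttention implementation.

```python
from typing import Dict, List

BLOCK_SIZE = 16  # tokens per physical cache block (illustrative choice, not DashInfer's)

class PagedKVCache:
    """Toy block-table bookkeeping behind the paged-attention idea."""

    def __init__(self, num_blocks: int):
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[int, List[int]] = {}  # seq_id -> physical block ids

    def ensure_capacity(self, seq_id: int, num_tokens: int) -> None:
        # Allocate a new physical block each time the sequence crosses a
        # block boundary; memory grows with the actual decoded length.
        table = self.block_tables.setdefault(seq_id, [])
        while len(table) * BLOCK_SIZE < num_tokens:
            if not self.free_blocks:
                raise MemoryError("KV cache exhausted; a scheduler would preempt or swap")
            table.append(self.free_blocks.pop(0))

    def release(self, seq_id: int) -> None:
        # Finished sequences return their blocks to the shared free pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=4)
cache.ensure_capacity(seq_id=0, num_tokens=20)  # 20 tokens -> 2 blocks of 16
print(cache.block_tables[0])                    # [0, 1]
cache.release(0)
print(cache.free_blocks)                        # [2, 3, 0, 1]
```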
How It Works
DashInfer is built on a lightweight C++ runtime with minimal dependencies and static linking for broad compatibility. Its core innovations include SpanAttention, a custom paged-attention mechanism, and InstantQuant (IQ), a weight-only quantization scheme that requires no fine-tuning. Combined with optimized GEMM/GEMV kernels and support for techniques such as prefix caching and guided decoding, these enable high-throughput, low-latency inference.
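As a rough illustration of the weight-only quantization idea behind InstantQuant, the NumPy sketch below quantizes a weight matrix per output channel to int8 with no fine-tuning, then dequantizes on the fly during a matrix-vector product. The symmetric per-channel scheme here is an assumption made for illustration, not DashInfer's actual kernel.

```python
import numpy as np

def quantize_weights_int8(w: np.ndarray):
    """Per-output-channel symmetric int8 quantization of an (out, in) matrix."""
    # One scale per output channel so the channel's max magnitude maps to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0.0, 1.0, scales)  # guard all-zero channels
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales.astype(np.float32)

def weight_only_matvec(q: np.ndarray, scales: np.ndarray, x: np.ndarray) -> np.ndarray:
    """Weight-only GEMV: weights are dequantized on the fly, activations stay float."""
    return (q.astype(np.float32) * scales) @ x

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)
q, s = quantize_weights_int8(w)
print("max abs error:", np.abs(w @ x - weight_only_matvec(q, s, x)).max())
```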
Quick Start & Requirements
pip install dashinfer
Usage examples are provided in the repository's examples directory.
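A minimal usage sketch might look like the following; every API name in it is a hypothetical placeholder rather than DashInfer's documented interface, so treat the examples directory as the source of truth for real entry points.

```python
# Hypothetical quick-start sketch. The import path, class, and argument names
# below (LLMEngine, model_path, max_new_tokens) are placeholders, NOT
# DashInfer's verified API; consult the examples directory for the project's
# actual entry points and model-loading flow.
from dashinfer import LLMEngine  # hypothetical class name

engine = LLMEngine(model_path="Qwen/Qwen2-7B-Instruct", device="cuda")  # hypothetical signature
text = engine.generate("Summarize paged attention in one sentence.", max_new_tokens=64)
print(text)
```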
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is under active development; some planned features, such as ROCm support and advanced MoE operators, are still pending. Accuracy of custom quantization strategies may require validation on business-specific workloads.