modelscope/dash-infer: LLM inference engine for optimized performance across diverse hardware
DashInfer is a C++ LLM inference engine designed for high-performance serving across diverse hardware, including CUDA GPUs, x86, and ARMv9 CPUs. It targets developers and researchers needing efficient, production-ready LLM deployment with features like continuous batching, paged attention, and quantization, aiming to match or exceed the performance of existing solutions like vLLM.
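As a rough illustration of the continuous-batching idea mentioned above, the Python sketch below admits waiting requests into free batch slots on every decode step instead of draining the whole batch first. This is not DashInfer's scheduler; the Request type, slot count, and token counts are invented for the example.

```python
# Illustrative continuous-batching loop (not DashInfer's scheduler).
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int          # request id
    tokens_left: int  # decode steps remaining for this sequence

def serve(pending: deque, max_batch: int) -> None:
    running = []
    step = 0
    while pending or running:
        # Admit waiting requests into any free batch slots immediately,
        # rather than waiting for the whole batch to finish.
        while pending and len(running) < max_batch:
            running.append(pending.popleft())
        # One decode step advances every running sequence by one token.
        for req in running:
            req.tokens_left -= 1
        step += 1
        for req in [r for r in running if r.tokens_left == 0]:
            print(f"step {step}: request {req.rid} finished")
        running = [r for r in running if r.tokens_left > 0]

serve(deque([Request(0, 3), Request(1, 1), Request(2, 5)]), max_batch=2)
```

Because request 1 finishes after one step, request 2 is admitted on the next step instead of waiting for requests 0 and 1 to both complete, which is what keeps GPU utilization high under mixed-length workloads.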
How It Works
DashInfer leverages a lightweight C++ runtime with minimal dependencies and static linking for broad compatibility. Its core innovations include SpanAttention, a custom paged attention mechanism, and InstantQuant (IQ) for weight-only quantization without fine-tuning. These, combined with optimized GEMM/GEMV kernels and support for techniques like prefix caching and guided decoding, enable high-throughput, low-latency inference.
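The following is a minimal sketch of weight-only quantization in the spirit of InstantQuant, assuming the simplest per-output-channel max-abs scale rule; DashInfer's actual IQ calibration and kernels may differ. Weights are stored as int8 with one float scale per row and rescaled inside the GEMV, while activations stay in floating point and no fine-tuning is needed.

```python
# Sketch of weight-only int8 quantization (InstantQuant-like; the
# max-abs scale rule is an assumption, not DashInfer's implementation).
import numpy as np

def quantize_weights(w: np.ndarray):
    """Map float weights [out, in] to int8 plus a per-row float scale."""
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-8)  # guard against all-zero rows
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def gemv_weight_only(q: np.ndarray, scales: np.ndarray, x: np.ndarray):
    """y = W @ x with W reconstructed on the fly as q * scales;
    activations x remain in float, as in weight-only schemes."""
    return (q.astype(np.float32) * scales) @ x

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 8)).astype(np.float32)
x = rng.standard_normal(8).astype(np.float32)
q, s = quantize_weights(w)
print(np.max(np.abs(w @ x - gemv_weight_only(q, s, x))))  # small error
```

Storing weights at 8 bits roughly halves memory traffic versus fp16 while the accumulation stays in float, which is why weight-only schemes can speed up memory-bound decode without retraining.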
Quick Start & Requirements
Install with pip install dashinfer; runnable example scripts live in the examples directory.
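A quick smoke test after installing; a minimal sketch that assumes only the package name dashinfer from the quick start above, using the standard library so it works whether or not the package exposes a __version__ attribute:

```python
# Verify the dashinfer wheel installed correctly (package name taken
# from the quick start; everything else is standard library).
from importlib.metadata import version

print(version("dashinfer"))  # prints the installed version string
```

Highlighted Details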
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The project is under active development, with some planned features like ROCm support and advanced MoE operators still pending. Accuracy for custom quantization strategies may require business-specific testing.