dash-infer by modelscope

LLM inference engine for optimized performance across diverse hardware

created 1 year ago
262 stars

Top 97.8% on sourcepulse

View on GitHub
Project Summary

DashInfer is a C++ LLM inference engine designed for high-performance serving across diverse hardware, including CUDA GPUs, x86, and ARMv9 CPUs. It targets developers and researchers needing efficient, production-ready LLM deployment with features like continuous batching, paged attention, and quantization, aiming to match or exceed the performance of existing solutions like vLLM.

How It Works

DashInfer is built on a lightweight C++ runtime with minimal dependencies and static linking for broad compatibility. Its core innovations are SpanAttention, a custom paged attention mechanism, and InstantQuant (IQ), a weight-only quantization scheme that requires no fine-tuning. Combined with optimized GEMM/GEMV kernels and support for techniques like prefix caching and guided decoding, these enable high-throughput, low-latency inference.
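To make the weight-only idea behind InstantQuant concrete, here is a minimal NumPy sketch of per-output-channel symmetric INT8 quantization: one scale per channel, derived from the weights alone, with no calibration data or fine-tuning. This illustrates the general technique rather than DashInfer's actual kernels (which fuse dequantization into GEMM/GEMV); the function names are hypothetical.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Weight-only, per-output-channel symmetric INT8 quantization (illustrative).

    InstantQuant-style schemes pick scales from weight statistics at model
    load time, so no calibration dataset or fine-tuning is needed.
    """
    # One scale per output channel (row), mapping the row's max |w| to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-8)  # guard against all-zero rows
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # At inference time the INT8 weights are rescaled back to float;
    # production engines fuse this step into the GEMM/GEMV kernel.
    return q.astype(np.float32) * scales

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(dequantize(q, s) - w).max())  # small reconstruction error
```

Because only the weights are quantized while activations stay in floating point, this cuts weight memory and bandwidth (by 2x for INT8, 4x for INT4) without retraining, which is what makes a fine-tuning-free scheme practical.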

Quick Start & Requirements

  • Install: pip install dashinfer
  • Prerequisites: CUDA 11.4-12.4 for GPU support; AVX2 for x86 CPUs; SVE for ARMv9 CPUs.
  • Documentation: dashinfer.readthedocs.io
  • Examples: Available in the repository's examples directory; a hypothetical usage sketch follows this list.
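For orientation, the sketch below shows what a minimal Python workflow might look like. The class, constructor arguments, and method names are assumptions, not DashInfer's verified API; consult the documentation and the examples directory for the real interface.

```python
# Hypothetical sketch -- Engine, its arguments, and generate() are assumed
# names, not DashInfer's verified Python API. See the docs at
# dashinfer.readthedocs.io and the repository's examples directory.
from dashinfer import Engine  # assumed entry point

engine = Engine(
    model="qwen/Qwen2-7B-Instruct",  # assumed model identifier
    device="cuda:0",                 # or an x86/ARMv9 CPU target
)

# With continuous batching, requests are scheduled independently and
# interleaved across decode iterations by the runtime.
for chunk in engine.generate("What is paged attention?", max_tokens=128):
    print(chunk, end="", flush=True)
```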

Highlighted Details

  • Supports mainstream LLMs (Qwen, LLaMA, ChatGLM) and multimodal models (Qwen-VL).
  • Offers OpenAI-compatible API server integration via FastChat (see the request example after this list).
  • Provides both C++ and Python interfaces for easy integration.
  • Features include prefix caching (GPU/CPU swap), guided decoding, and support for FP8, INT8, and INT4 quantization.
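Because the FastChat integration exposes an OpenAI-compatible endpoint, any standard OpenAI client can talk to the server. A minimal sketch, assuming the server is running locally on port 8000 and serves a model registered under the name "qwen" (both are assumptions):

```python
from openai import OpenAI

# Assumptions: a FastChat OpenAI-compatible server is listening on
# localhost:8000 and serves a model registered as "qwen".
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Summarize paged attention."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```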

Maintenance & Community

  • Actively developed with recent releases (v2.0 in Dec 2024).
  • Roadmap includes ROCm support and further MoE optimizations.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is under active development, and some planned features, such as ROCm support and advanced MoE operators, are still pending. Accuracy under custom quantization strategies may require validation against business-specific workloads.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 3
  • Star history: 20 stars in the last 90 days

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

Explore Similar Projects

ktransformers by kvcache-ai (15k stars, top 0.4%)
Framework for LLM inference optimization experimentation
Created 1 year ago; updated 2 days ago.
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project (54k stars, top 1.0%)
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago; updated 14 hours ago.
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org (84k stars, top 0.4%)
C/C++ library for local LLM inference
Created 2 years ago; updated 14 hours ago.