dash-infer by modelscope

LLM inference engine for optimized performance across diverse hardware

Created 1 year ago
264 stars

Top 96.8% on SourcePulse

Project Summary

DashInfer is a C++ LLM inference engine designed for high-performance serving across diverse hardware, including CUDA GPUs, x86, and ARMv9 CPUs. It targets developers and researchers needing efficient, production-ready LLM deployment with features like continuous batching, paged attention, and quantization, aiming to match or exceed the performance of existing solutions like vLLM.

How It Works

DashInfer leverages a lightweight C++ runtime with minimal dependencies and static linking for broad compatibility. Its core innovations include SpanAttention, a custom paged-attention mechanism, and InstantQuant (IQ), a weight-only quantization scheme that requires no fine-tuning. Combined with optimized GEMM/GEMV kernels and techniques such as prefix caching and guided decoding, these enable high-throughput, low-latency inference.
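
To make the weight-only quantization idea concrete, here is a minimal NumPy sketch: weights are reduced to INT8 with a per-output-channel scale derived directly from the weight values (no fine-tuning), and dequantized inside the matrix-vector product while activations stay in floating point. This illustrates the general technique only; it is not DashInfer's actual IQ kernel, which fuses dequantization into optimized GEMV code.

```python
import numpy as np

def quantize_weights(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a weight matrix
    w of shape (out_features, in_features). Scales come straight from the
    weights themselves, so no calibration or fine-tuning is needed."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def weight_only_gemv(q: np.ndarray, scale: np.ndarray, x: np.ndarray):
    """GEMV with weight-only quantization: weights are dequantized on the
    fly while the activation vector x remains FP32."""
    return (q.astype(np.float32) * scale) @ x

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 16)).astype(np.float32)
x = rng.standard_normal(16).astype(np.float32)
q, s = quantize_weights(w)
print(np.max(np.abs(weight_only_gemv(q, s, x) - w @ x)))  # small quantization error
```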

Quick Start & Requirements

  • Install: pip install dashinfer
  • Prerequisites: CUDA 11.4-12.4 for GPU support; AVX2 for x86 CPUs; SVE for ARMv9 CPUs.
  • Documentation: dashinfer.readthedocs.io
  • Examples: Available in the repository's examples directory; a minimal client sketch follows this list.
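
DashInfer can also sit behind an OpenAI-compatible API server via FastChat (see Highlighted Details below). A minimal client sketch, assuming such a server is already running locally; the base URL, port, and model id here are assumptions, not defaults shipped by DashInfer:

```python
from openai import OpenAI

# Assumes a FastChat OpenAI-compatible server backed by a DashInfer worker
# is already running locally; the URL and model id are placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen-7b-chat",  # hypothetical model id registered with the server
    messages=[{"role": "user", "content": "Explain paged attention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```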

Highlighted Details

  • Supports mainstream LLMs (Qwen, LLaMA, ChatGLM) and multimodal models (Qwen-VL).
  • Offers OpenAI-compatible API server integration via FastChat.
  • Provides both C++ and Python interfaces for easy integration.
  • Features include prefix caching (GPU/CPU swap), guided decoding, and support for FP8, INT8, and INT4 quantization (prefix caching is sketched below).
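
To illustrate what prefix caching buys, here is a toy Python sketch: KV cache blocks computed for a shared, block-aligned prompt prefix are looked up instead of recomputed, so requests that share a long system prompt skip most of the prefill. The block size and data structures are illustrative only; DashInfer's real implementation manages actual KV tensors and can swap them between GPU and CPU memory.

```python
from typing import Dict, List, Tuple

BLOCK = 4  # tokens per KV block; illustrative, not DashInfer's actual size

class PrefixCache:
    """Toy prefix cache: maps a block-aligned token prefix to the id of a
    previously computed KV block, so shared prefixes skip prefill."""
    def __init__(self):
        self.blocks: Dict[Tuple[int, ...], int] = {}
        self.next_id = 0

    def match(self, tokens: List[int]) -> Tuple[List[int], int]:
        """Return (cached block ids, number of prompt tokens covered)."""
        hit, covered = [], 0
        for i in range(0, len(tokens) - BLOCK + 1, BLOCK):
            key = tuple(tokens[: i + BLOCK])  # key is the full prefix so far
            if key in self.blocks:
                hit.append(self.blocks[key])
                covered = i + BLOCK
            else:
                break
        return hit, covered

    def insert(self, tokens: List[int]) -> None:
        """Register KV blocks for every block-aligned prefix of tokens."""
        for i in range(0, len(tokens) - BLOCK + 1, BLOCK):
            key = tuple(tokens[: i + BLOCK])
            if key not in self.blocks:
                self.blocks[key] = self.next_id
                self.next_id += 1

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5, 6, 7, 8])           # first request fills the cache
ids, n = cache.match([1, 2, 3, 4, 5, 6, 7, 99])  # second request shares one block
print(ids, n)  # -> [0] 4: the first 4 tokens reuse a cached KV block
```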

Maintenance & Community

  • Actively developed with recent releases (v2.0 in Dec 2024).
  • Roadmap includes ROCm support and further MoE optimizations.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is under active development; some planned features, such as ROCm support and advanced MoE operators, are still pending. Accuracy with custom quantization strategies may require validation against business-specific workloads.

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 2 stars in the last 30 days
