dash-infer by modelscope

LLM inference engine for optimized performance across diverse hardware

created 1 year ago
262 stars

Top 97.8% on sourcepulse

View on GitHub
Project Summary

DashInfer is a C++ LLM inference engine designed for high-performance serving across diverse hardware, including CUDA GPUs, x86, and ARMv9 CPUs. It targets developers and researchers needing efficient, production-ready LLM deployment with features like continuous batching, paged attention, and quantization, aiming to match or exceed the performance of existing solutions like vLLM.

How It Works

DashInfer is built on a lightweight C++ runtime with minimal dependencies and static linking for broad compatibility. Its core innovations are SpanAttention, a custom paged attention mechanism, and InstantQuant (IQ), a weight-only quantization scheme that requires no fine-tuning. Combined with optimized GEMM/GEMV kernels and support for techniques like prefix caching and guided decoding, these enable high-throughput, low-latency inference.
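To make the weight-only idea behind InstantQuant concrete, here is a minimal NumPy sketch of per-output-channel symmetric INT8 quantization: one scale per channel, derived from the weights alone, with no calibration data or fine-tuning. This illustrates the general technique rather than DashInfer's actual kernels (which fuse dequantization into GEMM/GEMV); the function names are hypothetical.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Weight-only, per-output-channel symmetric INT8 quantization (illustrative).

    InstantQuant-style schemes pick scales from weight statistics at model
    load time, so no calibration dataset or fine-tuning is needed.
    """
    # One scale per output channel (row), mapping the row's max |w| to 127.
    scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-8)  # guard against all-zero rows
    q = np.clip(np.round(w / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # At inference time the INT8 weights are rescaled back to float;
    # production engines fuse this step into the GEMM/GEMV kernel.
    return q.astype(np.float32) * scales

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(dequantize(q, s) - w).max())  # small reconstruction error
```

Because only the weights are quantized while activations stay in floating point, this cuts weight memory and bandwidth (by 2x for INT8, 4x for INT4) without retraining, which is what makes a fine-tuning-free scheme practical.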

Quick Start & Requirements

  • Install: pip install dashinfer
  • Prerequisites: CUDA 11.4-12.4 for GPU support; AVX2 for x86 CPUs; SVE for ARMv9 CPUs.
  • Documentation: dashinfer.readthedocs.io
  • Examples: Available in the repository's examples directory; a hypothetical usage sketch follows this list.
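For orientation, the sketch below shows what a minimal Python workflow might look like. The class, constructor arguments, and method names are assumptions, not DashInfer's verified API; consult the documentation and the examples directory for the real interface.

```python
# Hypothetical sketch -- Engine, its arguments, and generate() are assumed
# names, not DashInfer's verified Python API. See the docs at
# dashinfer.readthedocs.io and the repository's examples directory.
from dashinfer import Engine  # assumed entry point

engine = Engine(
    model="qwen/Qwen2-7B-Instruct",  # assumed model identifier
    device="cuda:0",                 # or an x86/ARMv9 CPU target
)

# With continuous batching, requests are scheduled independently and
# interleaved across decode iterations by the runtime.
for chunk in engine.generate("What is paged attention?", max_tokens=128):
    print(chunk, end="", flush=True)
```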

Highlighted Details

  • Supports mainstream LLMs (Qwen, LLaMA, ChatGLM) and multimodal models (Qwen-VL).
  • Offers OpenAI-compatible API server integration via FastChat (see the request example after this list).
  • Provides both C++ and Python interfaces for easy integration.
  • Features include prefix caching (GPU/CPU swap), guided decoding, and support for FP8, INT8, and INT4 quantization.
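Because the FastChat integration exposes an OpenAI-compatible endpoint, any standard OpenAI client can talk to the server. A minimal sketch, assuming the server is running locally on port 8000 and serves a model registered under the name "qwen" (both are assumptions):

```python
from openai import OpenAI

# Assumptions: a FastChat OpenAI-compatible server is listening on
# localhost:8000 and serves a model registered as "qwen".
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="qwen",
    messages=[{"role": "user", "content": "Summarize paged attention."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```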

Maintenance & Community

  • Actively developed with recent releases (v2.0 in Dec 2024).
  • Roadmap includes ROCm support and further MoE optimizations.

Licensing & Compatibility

  • Licensed under Apache 2.0.
  • Permissive license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is under active development, and some planned features, such as ROCm support and advanced MoE operators, are still pending. Accuracy under custom quantization strategies may require validation against business-specific workloads.

Health Check

  • Last commit: 4 days ago
  • Responsiveness: 1 day
  • Pull requests (30d): 2
  • Issues (30d): 3
  • Star history: 20 stars in the last 90 days

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

Explore Similar Projects

ktransformers by kvcache-ai (15k stars, top 0.4%)
Framework for LLM inference optimization experimentation
Created 1 year ago; updated 2 days ago.
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Tobi Lutke (Cofounder of Shopify), and 27 more.

vllm by vllm-project (54k stars, top 1.0%)
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago; updated 14 hours ago.
Starred by Andrej Karpathy (Founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org (84k stars, top 0.4%)
C/C++ library for local LLM inference
Created 2 years ago; updated 14 hours ago.