mistral.rs by EricLBuehler

LLM inference engine for blazing-fast performance

Created 1 year ago
6,189 stars

Top 8.3% on SourcePulse

Project Summary

This project provides a high-performance inference engine for Large Language Models (LLMs) written in Rust, targeting developers and researchers needing efficient LLM deployment. It offers an OpenAI-compatible HTTP server and a Python API, enabling seamless integration into existing applications and workflows with a focus on speed and broad model compatibility.
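
Because the server exposes an OpenAI-compatible API, any standard client can talk to it. Below is a minimal sketch using the official openai Python package, assuming a mistralrs-server instance is already running locally; the base URL, API key, and model name are placeholders, not values prescribed by the project.

    from openai import OpenAI

    # Point the standard OpenAI client at a locally running mistral.rs server.
    # The base URL/port is an assumption; match the port you started the server on.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    response = client.chat.completions.create(
        model="mistral",  # placeholder; use the name of the model you loaded
        messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
        max_tokens=128,
    )
    print(response.choices[0].message.content)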

How It Works

The engine leverages Rust for performance and memory safety and integrates several acceleration techniques. It supports multiple quantization methods (GGML, GPTQ, AFQ, HQQ, FP8, BNB) and hardware backends including NVIDIA GPUs (CUDA, FlashAttention, cuDNN), Apple Silicon (Metal, Accelerate), and optimized CPU inference (MKL, AVX). Key features such as PagedAttention, continuous batching, and speculative decoding (sketched below) underpin its performance claims.
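
To make the speculative-decoding idea concrete, here is a deliberately toy Python sketch of the draft-then-verify loop. It is illustrative only and does not reflect mistral.rs internals; real implementations verify all draft tokens in a single target-model forward pass and use acceptance sampling over the two models' distributions.

    import random

    VOCAB = ["a", "b", "c", "d"]

    def draft_propose(prefix, k):
        # Hypothetical cheap draft model: proposes k candidate tokens.
        return [random.choice(VOCAB) for _ in range(k)]

    def target_accepts(prefix, token):
        # Stand-in for checking the token against the full model's
        # distribution (done in one batched forward pass in practice).
        return random.random() < 0.7

    def speculative_step(prefix, k=4):
        proposed = draft_propose(prefix, k)
        accepted = []
        for tok in proposed:
            if target_accepts(prefix + accepted, tok):
                accepted.append(tok)
            else:
                # First rejection: fall back to a token from the target model.
                accepted.append(random.choice(VOCAB))
                break
        return prefix + accepted

    # Each step can emit several tokens per full-model verification, which is
    # where the speedup over token-by-token decoding comes from.
    print(speculative_step(["h", "i"]))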

Quick Start & Requirements

  • Install: cargo install --path mistralrs-server --features cuda (or other features such as metal, mkl). A Python package is available via PyPI; see the sketch after this list.
  • Prerequisites: Rust toolchain, OpenSSL, pkg-config (Linux). Optional: Hugging Face CLI for gated models.
  • Resources: Build times vary with the enabled features; Metal is the highlighted backend for Apple Silicon.
  • Docs & community: Rust documentation, Python documentation, Discord, Matrix
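
For the PyPI route, the project's Python API centers on a Runner. The sketch below follows the shape of the documented API for loading a GGUF-quantized model; exact class and field names may differ across releases, and the model IDs and filename are illustrative, not requirements.

    from mistralrs import ChatCompletionRequest, Runner, Which

    # Load a GGUF-quantized model (repo IDs and filename are examples only).
    runner = Runner(
        which=Which.GGUF(
            tok_model_id="mistralai/Mistral-7B-Instruct-v0.1",
            quantized_model_id="TheBloke/Mistral-7B-Instruct-v0.1-GGUF",
            quantized_filename="mistral-7b-instruct-v0.1.Q4_K_M.gguf",
        )
    )

    res = runner.send_chat_completion_request(
        ChatCompletionRequest(
            model="mistral",
            messages=[{"role": "user", "content": "Why is Rust memory safe?"}],
            max_tokens=256,
            temperature=0.1,
        )
    )
    print(res.choices[0].message.content)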

Highlighted Details

  • Supports a wide array of LLM architectures and quantization formats.
  • Offers advanced features like LoRA/X-LoRA, AnyMoE, and integrated web search.
  • Includes multimodal (vision) and diffusion model support.
  • Features automatic device mapping and tensor parallelism.

Maintenance & Community

Actively maintained with contributions welcome. Community channels include Discord and Matrix.

Licensing & Compatibility

The project is licensed under Apache 2.0, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

While coverage is extensive, support for specific model architectures and quantization methods is still evolving. Users may need to compile with hardware-specific feature flags to get full acceleration.

Health Check

  • Last commit: 10 hours ago
  • Responsiveness: 1 day
  • Pull requests (30d): 28
  • Issues (30d): 20
  • Star history: 65 stars in the last 30 days

Explore Similar Projects

Starred by Lianmin Zheng (coauthor of SGLang, vLLM), Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), and 1 more.

MiniCPM by OpenBMB

Top 0.1% · 8k stars
Ultra-efficient LLMs for end devices, achieving 5x+ speedup
Created 1 year ago · Updated 3 weeks ago
Starred by Andrej Karpathy (founder of Eureka Labs; formerly at Tesla, OpenAI; author of CS 231n), Clement Delangue (cofounder of Hugging Face), and 60 more.

vllm by vllm-project

Top 1.1% · 62k stars
LLM serving engine for high-throughput, memory-efficient inference
Created 2 years ago · Updated 5 hours ago