LLM inference engine for blazing-fast performance
This project provides a high-performance inference engine for Large Language Models (LLMs) written in Rust, targeting developers and researchers needing efficient LLM deployment. It offers an OpenAI-compatible HTTP server and a Python API, enabling seamless integration into existing applications and workflows with a focus on speed and broad model compatibility.
How It Works
The engine is written in Rust for performance and memory safety and integrates a range of acceleration techniques. It supports multiple quantization methods (GGML, GPTQ, AFQ, HQQ, FP8, BNB) and hardware backends, including NVIDIA GPUs (CUDA, FlashAttention, cuDNN), Apple Silicon (Metal, Accelerate), and optimized CPU inference (MKL, AVX). Features such as PagedAttention, continuous batching, and speculative decoding underpin its performance.
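To make one of these techniques concrete, here is a toy, greedy-only sketch of speculative decoding in Python. The `draft_next` and `target_next` functions are hypothetical stand-ins for a small draft model and the large target model, and the token-by-token verification loop is for clarity only; this is not the engine's implementation.

```python
# Toy sketch of greedy speculative decoding (illustrative only, not the engine's code).
# `draft_next` and `target_next` are hypothetical callables that map a token
# sequence to that model's greedy next-token prediction.

def speculative_step(tokens, draft_next, target_next, k=4):
    """Propose k tokens with the cheap draft model, keep the longest prefix the
    target model agrees with, and always emit at least one target-approved token."""
    # 1. Draft model speculates k tokens autoregressively.
    proposal, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2. Target model verifies the proposals. (A real engine scores all k
    #    positions in a single batched forward pass; the loop is for clarity.)
    accepted, ctx = [], list(tokens)
    for t in proposal:
        expected = target_next(ctx)
        if expected != t:
            accepted.append(expected)  # target overrides the first mismatch
            return list(tokens) + accepted
        accepted.append(t)
        ctx.append(t)

    # 3. Every proposal was accepted: the target contributes one bonus token.
    accepted.append(target_next(ctx))
    return list(tokens) + accepted
```

The payoff is that the expensive target model runs once per chunk of up to k accepted tokens rather than once per generated token, while the output matches what greedy decoding with the target model alone would produce.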
Quick Start & Requirements
Install via `cargo install --path mistralrs-server --features cuda` (or other features like `metal` or `mkl`). A Python package is available via PyPI.
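Once the server is running, it can be reached from any OpenAI-compatible client. Below is a minimal sketch using the official `openai` Python package; the port and model identifier are assumptions and depend on how mistralrs-server was launched.

```python
# Minimal sketch: querying the OpenAI-compatible endpoint with the `openai` client.
# The base URL, port, and model name below are assumptions, not project defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # assumed host/port of the local server
    api_key="not-needed",                 # local servers typically ignore the key
)

response = client.chat.completions.create(
    model="mistral-7b-instruct",          # placeholder model identifier
    messages=[{"role": "user", "content": "Explain PagedAttention in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```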
Highlighted Details
Maintenance & Community
Actively maintained with contributions welcome. Community channels include Discord and Matrix.
Licensing & Compatibility
The project is licensed under Apache 2.0, allowing for commercial use and integration with closed-source applications.
Limitations & Caveats
Although coverage is extensive, support for specific model architectures or quantization methods may still evolve. Users may need to compile with specific feature flags to enable optimal hardware acceleration.