mistral.rs by EricLBuehler

LLM inference engine for blazing-fast performance

created 1 year ago
5,957 stars

Top 8.8% on sourcepulse

View on GitHub
Project Summary

This project provides a high-performance inference engine for Large Language Models (LLMs) written in Rust, targeting developers and researchers needing efficient LLM deployment. It offers an OpenAI-compatible HTTP server and a Python API, enabling seamless integration into existing applications and workflows with a focus on speed and broad model compatibility.
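Because the server speaks the OpenAI protocol, any standard OpenAI client can drive it. Below is a minimal sketch using the official openai Python package; the port and model id are placeholders that depend on how the server was launched:

    from openai import OpenAI

    # Point the standard OpenAI client at a locally running mistral.rs server.
    # The port and model id below are placeholders: use the values you passed
    # when starting mistralrs-server.
    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    resp = client.chat.completions.create(
        model="mistral",  # placeholder model id
        messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)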

How It Works

The engine leverages Rust for performance and memory safety and integrates several acceleration techniques. It supports multiple quantization methods (GGML, GPTQ, AFQ, HQQ, FP8, BNB) and hardware backends, including NVIDIA GPUs (CUDA, FlashAttention, cuDNN), Apple Silicon (Metal, Accelerate), and optimized CPU inference (MKL, AVX). Features such as PagedAttention, continuous batching, and speculative decoding underpin the project's "blazingly fast" performance claim.
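Speculative decoding is the least self-explanatory of these. The toy sketch below illustrates the general technique, not mistral.rs internals: a cheap draft model proposes k tokens, the target model verifies them in one batched pass, and each proposal is accepted with probability min(1, p_target/p_draft). The hash-based "models" and tiny vocabulary are stand-ins:

    import numpy as np

    VOCAB = 16  # toy vocabulary size

    def _toy_dist(ctx, seed):
        # Deterministic stand-in for a language model: hash the context
        # into a softmax distribution over the vocabulary.
        g = np.random.default_rng((hash(tuple(ctx)) ^ seed) % (2**32))
        logits = g.standard_normal(VOCAB)
        e = np.exp(logits - logits.max())
        return e / e.sum()

    def draft(ctx):
        return _toy_dist(ctx, seed=1)   # small, fast model

    def target(ctx):
        return _toy_dist(ctx, seed=2)   # large, accurate model

    def speculative_step(ctx, k=4, rng=None):
        rng = rng or np.random.default_rng(0)
        # 1) The draft model proposes k tokens autoregressively.
        c, proposed, q = list(ctx), [], []
        for _ in range(k):
            d = draft(c)
            t = int(rng.choice(VOCAB, p=d))
            proposed.append(t); q.append(d); c.append(t)
        # 2) The target model scores every prefix; a real engine does this
        #    in a single batched forward pass, which is where the speedup lives.
        p = [target(list(ctx) + proposed[:i]) for i in range(k + 1)]
        out = []
        # 3) Accept each proposal with probability min(1, p/q); on the first
        #    rejection, resample from the residual distribution max(p - q, 0).
        for i, t in enumerate(proposed):
            if rng.random() < min(1.0, p[i][t] / q[i][t]):
                out.append(t)
            else:
                residual = np.maximum(p[i] - q[i], 0.0)
                out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
                return out
        # All k accepted: take one bonus token from the target's next-step dist.
        out.append(int(rng.choice(VOCAB, p=p[k])))
        return out

    print(speculative_step([1, 2, 3]))  # up to k+1 tokens per target pass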

Quick Start & Requirements

  • Install: cargo install --path mistralrs-server --features cuda (or other feature flags such as metal or mkl). A Python package is also available on PyPI; see the sketch after this list.
  • Prerequisites: Rust toolchain, OpenSSL, pkg-config (Linux). Optional: Hugging Face CLI for gated models.
  • Resources: Build times vary based on features. Metal support is highlighted for Apple Silicon.
  • Docs: Rust Documentation, Python Documentation, Discord, Matrix
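
The PyPI package also exposes an in-process API. The following is a rough sketch modeled on the project's published Python examples; the names Runner, Which.Plain, ChatCompletionRequest, and send_chat_completion_request come from those examples but may differ between releases, and the model id is illustrative:

    from mistralrs import Runner, Which, ChatCompletionRequest

    # Load a model in-process; Which selects the model source and format.
    # The model id is illustrative: substitute any supported checkpoint.
    runner = Runner(which=Which.Plain(model_id="mistralai/Mistral-7B-Instruct-v0.3"))

    res = runner.send_chat_completion_request(
        ChatCompletionRequest(
            model="mistral",
            messages=[{"role": "user", "content": "Say hello."}],
            max_tokens=64,
        )
    )
    print(res.choices[0].message.content)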

Highlighted Details

  • Supports a wide array of LLM architectures and quantization formats.
  • Offers advanced features like LoRA/X-LoRA, AnyMoE, and integrated web search.
  • Includes multimodal (vision) and diffusion model support; a vision request sketch follows this list.
  • Features automatic device mapping and tensor parallelism.
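
Assuming the server follows the standard OpenAI image-message format (consistent with its OpenAI compatibility, though not confirmed by this summary), a vision request could look like this; the model id and image URL are placeholders:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

    # Multi-part message: text plus an image_url part, the standard
    # OpenAI shape for vision-capable chat models.
    resp = client.chat.completions.create(
        model="vision-model",  # placeholder: whichever vision model is loaded
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.png"}},
            ],
        }],
        max_tokens=128,
    )
    print(resp.choices[0].message.content)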

Maintenance & Community

Actively maintained with contributions welcome. Community channels include Discord and Matrix.

Licensing & Compatibility

The project is licensed under Apache 2.0, allowing for commercial use and integration with closed-source applications.

Limitations & Caveats

Coverage is broad but still evolving: support for specific model architectures or quantization methods may change over time, and users may need to compile with specific feature flags (e.g., cuda or metal) for optimal hardware acceleration.

Health Check

  • Last commit: 16 hours ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 38
  • Issues (30d): 53
  • Star History: 453 stars in the last 90 days

Explore Similar Projects

Starred by Jeff Hammerbacher (Cofounder of Cloudera) and Stas Bekman (Author of Machine Learning Engineering Open Book; Research Engineer at Snowflake).

InternEvo by InternLM
Lightweight training framework for model pre-training
1.0% · 402 stars · created 1 year ago · updated 1 week ago

Starred by Patrick von Platen (Core Contributor to Hugging Face Transformers and Diffusers), Michael Han (Cofounder of Unsloth), and 1 more.

ktransformers by kvcache-ai
Framework for LLM inference optimization experimentation
0.4% · 15k stars · created 1 year ago · updated 2 days ago

Starred by Andrej Karpathy (Founder of Eureka Labs; Formerly at Tesla, OpenAI; Author of CS 231n), Nat Friedman (Former CEO of GitHub), and 32 more.

llama.cpp by ggml-org
C/C++ library for local LLM inference
0.4% · 84k stars · created 2 years ago · updated 10 hours ago