inferrs by ericcurtin

LLM inference server optimized for resource efficiency and broad API compatibility

Created 3 weeks ago

369 stars

Top 76.6% on SourcePulse

Project Summary

inferrs is a high-performance LLM inference server written in Rust, aimed at developers who need a flexible, lightweight, and fast way to serve large language models. It provides a rich feature set, including broad API compatibility and efficient resource usage, making it well suited to deployments where memory and binary footprint are critical considerations.

How It Works

inferrs is implemented in Rust and ships as a single, lightweight binary. For KV cache management it employs TurboQuant together with per-context allocation, distinguishing it from solutions that can consume significant GPU memory. The architecture pairs an axum-based HTTP server, which streams responses via Server-Sent Events (SSE), with a backend engine composed of a scheduler, transformer, KV cache, and sampler. This design prioritizes memory efficiency and performance.
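
Because responses are delivered over SSE, a client sees a stream of `data:` frames rather than a single JSON body. A hypothetical session against a locally running server is sketched below; the host, port, and model name are assumptions, and the payload follows the OpenAI-compatible schema the project advertises:

```shell
# Stream tokens from a local inferrs server (listening address is assumed).
# -N disables curl's output buffering so SSE frames print as they arrive.
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "stream": true,
        "messages": [{"role": "user", "content": "Hello"}]
      }'
# In the OpenAI streaming dialect, each SSE frame looks like:
#   data: {"choices":[{"delta":{"content":"Hi"}}], ...}
# and the stream is terminated by:
#   data: [DONE]
```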

Quick Start & Requirements

Installation is available via package managers: brew install inferrs on macOS/Linux, or scoop install inferrs on Windows after adding the ericcurtin/scoop-inferrs bucket. Models can be served with commands like inferrs run <model_path> or inferrs serve <model_path>. The project supports a wide array of hardware backends, including CUDA, ROCm, Metal, Hexagon, OpenVINO, MUSA, CANN, Vulkan, and CPU.
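
Collected into one snippet, the documented commands look like the following; the Scoop bucket URL is an assumption inferred from the stated bucket name, and `<model_path>` is a placeholder for a local model file or reference:

```shell
# Install via Homebrew (macOS/Linux)
brew install inferrs

# Or via Scoop (Windows), after adding the maintainer's bucket
# (repository URL assumed from the bucket name ericcurtin/scoop-inferrs)
scoop bucket add scoop-inferrs https://github.com/ericcurtin/scoop-inferrs
scoop install inferrs

# Serve a model (two documented entry points)
inferrs run <model_path>
inferrs serve <model_path>
```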

Highlighted Details

  • Offers OpenAI, Anthropic, and Ollama compatible APIs for seamless integration.
  • Supports a broad range of hardware acceleration backends: CUDA, ROCm, Metal, Hexagon, OpenVINO, MUSA, CANN, Vulkan, and CPU.
  • Features TurboQuant for efficient KV cache management, aiming for lower memory overhead.
  • Delivers a single binary deployment, simplifying setup and reducing dependency management.

Maintenance & Community

The provided README does not contain information regarding maintainers, community channels (like Discord or Slack), sponsorships, or roadmap details.

Licensing & Compatibility

The README does not explicitly state the project's license. Therefore, compatibility for commercial use or closed-source linking cannot be determined from the provided text.

Limitations & Caveats

The README focuses on the project's strengths and does not detail specific limitations, known bugs, or unsupported platforms. The absence of explicit licensing information presents a significant caveat for potential adopters.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 143
  • Issues (30d): 14
  • Star History: 372 stars in the last 21 days

Explore Similar Projects

Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

0.9% · 8k stars
LLM serving engine extension for reduced TTFT (time to first token) and increased throughput
Created 1 year ago · Updated 1 day ago