inferrs by ericcurtin

LLM inference server optimized for resource efficiency and broad API compatibility

Created 3 weeks ago

369 stars

Top 76.6% on SourcePulse

Project Summary

inferrs is a high-performance LLM inference server written in Rust, aimed at developers who need a flexible, lightweight, and fast way to serve large language models. It provides a rich feature set, including broad API compatibility and efficient resource usage, making it well suited to deployments where memory and binary footprint are critical considerations.

How It Works

inferrs is implemented in Rust and ships as a single, lightweight binary. For KV cache management it employs TurboQuant together with per-context allocation, distinguishing it from solutions that can consume significant GPU memory. The architecture pairs an axum-based HTTP server, which streams responses via Server-Sent Events (SSE), with a backend engine composed of a scheduler, transformer, KV cache, and sampler. This design prioritizes memory efficiency and performance.
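
Because responses are delivered over SSE, a client sees a stream of `data:` frames rather than a single JSON body. A hypothetical session against a locally running server is sketched below; the host, port, and model name are assumptions, and the payload follows the OpenAI-compatible schema the project advertises:

```shell
# Stream tokens from a local inferrs server (listening address is assumed).
# -N disables curl's output buffering so SSE frames print as they arrive.
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "my-model",
        "stream": true,
        "messages": [{"role": "user", "content": "Hello"}]
      }'
# In the OpenAI streaming dialect, each SSE frame looks like:
#   data: {"choices":[{"delta":{"content":"Hi"}}], ...}
# and the stream is terminated by:
#   data: [DONE]
```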

Quick Start & Requirements

Installation is available via package managers: brew install inferrs on macOS/Linux, or scoop install inferrs on Windows after adding the ericcurtin/scoop-inferrs bucket. Models can be served with commands like inferrs run <model_path> or inferrs serve <model_path>. The project supports a wide array of hardware backends, including CUDA, ROCm, Metal, Hexagon, OpenVINO, MUSA, CANN, Vulkan, and CPU.
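
Collected into one snippet, the documented commands look like the following; the Scoop bucket URL is an assumption inferred from the stated bucket name, and `<model_path>` is a placeholder for a local model file or reference:

```shell
# Install via Homebrew (macOS/Linux)
brew install inferrs

# Or via Scoop (Windows), after adding the maintainer's bucket
# (repository URL assumed from the bucket name ericcurtin/scoop-inferrs)
scoop bucket add scoop-inferrs https://github.com/ericcurtin/scoop-inferrs
scoop install inferrs

# Serve a model (two documented entry points)
inferrs run <model_path>
inferrs serve <model_path>
```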

Highlighted Details

  • Offers OpenAI, Anthropic, and Ollama compatible APIs for seamless integration.
  • Supports a broad range of hardware acceleration backends: CUDA, ROCm, Metal, Hexagon, OpenVINO, MUSA, CANN, Vulkan, and CPU.
  • Features TurboQuant for efficient KV cache management, aiming for lower memory overhead.
  • Delivers a single binary deployment, simplifying setup and reducing dependency management.

Maintenance & Community

The provided README does not contain information regarding maintainers, community channels (like Discord or Slack), sponsorships, or roadmap details.

Licensing & Compatibility

The README does not explicitly state the project's license. Therefore, compatibility for commercial use or closed-source linking cannot be determined from the provided text.

Limitations & Caveats

The README focuses on the project's strengths and does not detail specific limitations, known bugs, or unsupported platforms. The absence of explicit licensing information presents a significant caveat for potential adopters.

Health Check

  • Last Commit: 1 day ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 143
  • Issues (30d): 14
  • Star History: 372 stars in the last 21 days

Explore Similar Projects

Starred by Taranjeet Singh (Cofounder of Mem0), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 4 more.

LMCache by LMCache

0.9% · 8k stars
LLM serving engine extension for reduced TTFT (time to first token) and increased throughput
Created 1 year ago · Updated 1 day ago