InferenceMAX by InferenceMAX

Real-time LLM inference performance benchmarking

Created 3 months ago
313 stars

Top 86.0% on SourcePulse

View on GitHub
Project Summary

InferenceMAX addresses a problem endemic to the fast-evolving LLM inference software landscape: benchmarks go stale quickly. It provides an open-source, automated benchmarking suite that re-evaluates popular inference frameworks and models nightly, giving engineers and researchers a near real-time view of performance and enabling better-informed adoption decisions for AI software stacks.

How It Works

The project employs an automated benchmark suite that runs nightly, capturing the incremental performance gains that daily software changes deliver. It focuses on key inference frameworks such as SGLang, vLLM, and TensorRT-LLM, tracking their performance across different hardware. By measuring the effects of kernel-level optimizations, distributed strategies, and scheduling innovations as they land, the suite provides a dynamic performance picture rather than static, point-in-time measurements.
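The nightly sweep described above can be sketched as a loop over (framework, hardware) pairs. This is a minimal illustration only: `run_benchmark` is a hypothetical placeholder, and the framework and hardware lists are taken from names mentioned in this summary, not from InferenceMAX's actual harness or configuration.

```python
# Illustrative sketch of a nightly re-benchmarking sweep.
# run_benchmark is a hypothetical stub, NOT InferenceMAX's real harness.
import datetime
import itertools

FRAMEWORKS = ["sglang", "vllm", "tensorrt-llm"]  # stacks named in the summary
HARDWARE = ["mi355x", "gb200-nvl72", "b200"]     # GPUs named in the summary

def run_benchmark(framework: str, hardware: str) -> dict:
    # Placeholder: a real harness would launch the serving stack on the
    # target GPU and measure token throughput. Here we return a stub record.
    return {"framework": framework, "hardware": hardware, "tok_per_s": 0.0}

def nightly_sweep() -> list[dict]:
    """Run every framework on every hardware target, date-stamping each result."""
    stamp = datetime.date.today().isoformat()
    results = []
    for fw, hw in itertools.product(FRAMEWORKS, HARDWARE):
        record = run_benchmark(fw, hw)
        record["date"] = stamp
        results.append(record)
    return results

results = nightly_sweep()
print(len(results))  # 9 (3 frameworks x 3 hardware targets)
```

Appending each night's date-stamped records to a results store is what would let a dashboard plot performance as a time series rather than a single snapshot.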

Quick Start & Requirements

A live dashboard is publicly available at https://inferencemax.ai/. The README implies that significant hardware resources (AMD and NVIDIA GPUs) back the benchmarks, but it does not detail installation or execution commands for the benchmark suite itself.

Highlighted Details

  • Continuous Re-benchmarking: Runs benchmarks nightly to track real-time performance evolution of inference frameworks.
  • Live Performance Dashboard: Publicly accessible at https://inferencemax.ai/ for up-to-date insights.
  • Key Metrics: Monitors token throughput, performance per dollar, and tokens per megawatt.
  • Framework Support: Benchmarks popular stacks including SGLang, vLLM, TensorRT-LLM, CUDA, and ROCm.
  • Hardware Agnostic (in principle): Acknowledges support and contributions from AMD (MI355X, CDNA3) and NVIDIA (GB200 NVL72, B200) GPUs.
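The normalized metrics in the list above (throughput, tokens per dollar, tokens per megawatt) can be computed from a few raw measurements. The sketch below is an assumption about how such metrics might be derived; the field names, cost model, and `BenchmarkRun` type are illustrative, not InferenceMAX's actual schema.

```python
# Hypothetical derivation of normalized benchmark metrics.
# All names and the simple cost model are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class BenchmarkRun:
    total_tokens: int         # tokens generated during the run
    duration_s: float         # wall-clock duration in seconds
    gpu_cost_per_hour: float  # assumed USD rental price per GPU-hour
    num_gpus: int
    power_draw_w: float       # average total power draw in watts

def throughput_tok_s(run: BenchmarkRun) -> float:
    """Raw token throughput."""
    return run.total_tokens / run.duration_s

def tokens_per_dollar(run: BenchmarkRun) -> float:
    """Performance per dollar under a flat GPU-hour rental cost model."""
    cost_usd = run.gpu_cost_per_hour * run.num_gpus * (run.duration_s / 3600)
    return run.total_tokens / cost_usd

def tokens_per_megawatt_s(run: BenchmarkRun) -> float:
    """Tokens per second, normalized per megawatt of average power draw."""
    return throughput_tok_s(run) / (run.power_draw_w / 1e6)

run = BenchmarkRun(total_tokens=1_000_000, duration_s=100.0,
                   gpu_cost_per_hour=2.0, num_gpus=8, power_draw_w=5600.0)
print(throughput_tok_s(run))  # 10000.0 tokens/s
```

Normalizing by cost and power is what makes runs on very different hardware (e.g. MI355X vs. GB200 NVL72) comparable on one dashboard axis.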

Maintenance & Community

The project receives substantial support from hardware vendors (AMD, NVIDIA) and AI software teams (SGLang, vLLM, TensorRT-LLM). Compute resources are provided by partners like Crusoe, CoreWeave, and Oracle. A job posting indicates active development and industry involvement.

Licensing & Compatibility

Licensed under Apache 2.0, which is permissive for commercial use and integration into closed-source projects.

Limitations & Caveats

The provided README does not detail specific limitations, unsupported platforms, or known bugs. It focuses on the project's objective of providing continuous, transparent performance measurement.

Health Check
Last Commit

8 hours ago

Responsiveness

Inactive

Pull Requests (30d)
33
Issues (30d)
43
Star History
315 stars in the last 30 days
