foundation-model-benchmarking-tool by aws-samples

Benchmark any foundation model on AWS generative AI services

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

Project Summary

FMBench benchmarks foundation models (FMs) across AWS generative AI services, helping users determine the best price-performance trade-off and evaluate model accuracy for their workloads. It is aimed at engineers and researchers who need to evaluate and select FMs for deployment on AWS platforms such as SageMaker, Bedrock, EKS, or EC2. Its primary benefit is a standardized, flexible method for performance and accuracy testing that simplifies decision-making for generative AI deployments.

How It Works

FMBench operates as a Python package that can be run on any AWS platform with Python support. It utilizes configuration files to define the FM, deployment strategy (including instance types and inference containers like DeepSpeed, TensorRT, and HuggingFace TGI), and benchmarking tests. The tool supports benchmarking against models deployed directly via FMBench or through a "Bring your own endpoint" mode. It measures both performance (latency, transactions per minute) and model accuracy using a panel of LLM evaluators.
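
As a rough illustration of this config-driven workflow, the sketch below shows the kinds of fields a benchmark configuration ties together. The field names and values are illustrative assumptions, not the exact FMBench schema; consult the FMBench documentation for real sample configs.

    # Illustrative sketch only: not the exact FMBench config schema
    aws:
      region: us-east-1                  # assumed AWS region
      bucket: my-fmbench-results         # S3 bucket for metrics and reports (assumed name)
    experiments:
      - model_id: meta-llama/Meta-Llama-3-8B-Instruct   # FM to benchmark (example)
        instance_type: ml.g5.2xlarge                    # deployment instance type
        inference_container: huggingface-tgi            # e.g. DeepSpeed, TensorRT, TGI
        tensor_parallel_degree: 1                       # degree of tensor parallelism
        concurrency_levels: [1, 2, 4]                   # simultaneous requests per run
    report:
      latency_percentiles: [p50, p95, p99]              # metrics summarized in the report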

Quick Start & Requirements

  • Installation: Install with uv: uv pip install -U fmbench (see the quick-start sketch after this list).
  • Prerequisites: Requires an AWS environment (EC2, SageMaker, or CloudShell) for accurate latency measurements. Python 3.12 is recommended. An S3 bucket is used for storing metrics and reports.
  • Setup: A CloudFormation template is provided that sets up an Amazon SageMaker notebook with the repository cloned and the necessary S3 buckets created; setup takes approximately 5 minutes.
  • Documentation: FMBench website
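
A minimal quick-start sketch, assuming an AWS environment with Python 3.12; the config file name below is illustrative, and the exact CLI flags should be verified against the FMBench documentation.

    # Install the package (Python 3.12 recommended)
    uv pip install -U fmbench

    # Run a benchmark against a config file and capture the log
    fmbench --config-file ./config-llama3-8b-g5-quick.yml > fmbench.log 2>&1

Metrics and the generated report are written to the configured S3 bucket.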

Highlighted Details

  • Benchmarks any FM (open-source, third-party, proprietary) on various AWS services (SageMaker, Bedrock, EKS, EC2) and inference stacks.
  • Supports flexible configuration of instance types (e.g., g5, p4d, p5, Inf2), inference containers, and parameters such as tensor parallelism and rolling batch (see the sketch after this list).
  • Includes model evaluation for accuracy using a "Panel of LLM Evaluators" (PoLL).
  • Offers an optional fmbench-orchestrator for automating benchmarking across multiple EC2 instances.
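
As a sketch of what those knobs look like in practice, parameters such as tensor parallelism and rolling batch are typically surfaced through the inference container's configuration, for example in a DJL/LMI-style serving.properties file. The values below are illustrative and not FMBench-specific syntax.

    # Illustrative serving.properties-style settings (not the exact FMBench syntax)
    option.model_id=meta-llama/Meta-Llama-3-8B-Instruct
    option.tensor_parallel_degree=4      # shard the model across 4 GPUs
    option.rolling_batch=vllm            # continuous (rolling) batching backend
    option.max_rolling_batch_size=32     # cap on requests batched together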

Maintenance & Community

  • Recent releases added support for new models and inference engines (e.g., SGLang, vLLM).
  • Community support available via Discord.
  • Issues and feature requests can be tracked on GitHub.

Licensing & Compatibility

  • Licensed under the MIT-0 License.
  • Compatible with AWS services and various open-source inference frameworks.

Limitations & Caveats

  • Accurate performance benchmarking requires running the tool within an AWS environment to avoid internet latency impact.
  • Model-specific tokenizers are strongly recommended over the default approximation for accurate token-throughput measurements.
Health Check

  • Last commit: 5 months ago
  • Responsiveness: inactive
  • Pull requests (last 30 days): 0
  • Issues (last 30 days): 0
  • Star history: 3 stars in the last 30 days

Explore Similar Projects

  • LitServe by Lightning-AI: AI inference pipeline framework. About 4k stars; created 1 year ago, updated 2 days ago. Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 3 more.
  • lighteval by huggingface: LLM evaluation toolkit for multiple backends. About 2k stars; created 1 year ago, updated 1 day ago. Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (cofounder of Lightning AI), and 7 more.
  • training by mlcommons: Reference implementations for MLPerf training benchmarks. About 2k stars; created 7 years ago, updated 1 week ago. Starred by Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), Sebastián Ramírez (author of FastAPI, Typer, SQLModel, and Asyncer), and 1 more.