foundation-model-benchmarking-tool by aws-samples

Benchmark any foundation model on AWS generative AI services

Created 1 year ago
252 stars

Top 99.6% on SourcePulse

Project Summary

FMBench benchmarks foundation models (FMs) across AWS generative AI services, helping users determine the best price-performance trade-off and evaluate model accuracy for their workloads. It is aimed at engineers and researchers who need to evaluate and select FMs for deployment on AWS platforms such as SageMaker, Bedrock, EKS, or EC2. Its primary benefit is a standardized, flexible method for performance and accuracy testing that simplifies decision-making for generative AI deployments.

How It Works

FMBench operates as a Python package that can be run on any AWS platform with Python support. It utilizes configuration files to define the FM, deployment strategy (including instance types and inference containers like DeepSpeed, TensorRT, and HuggingFace TGI), and benchmarking tests. The tool supports benchmarking against models deployed directly via FMBench or through a "Bring your own endpoint" mode. It measures both performance (latency, transactions per minute) and model accuracy using a panel of LLM evaluators.
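
As a rough illustration of this config-driven workflow, the sketch below shows the kinds of fields a benchmark configuration ties together. The field names and values are illustrative assumptions, not the exact FMBench schema; consult the FMBench documentation for real sample configs.

    # Illustrative sketch only: not the exact FMBench config schema
    aws:
      region: us-east-1                  # assumed AWS region
      bucket: my-fmbench-results         # S3 bucket for metrics and reports (assumed name)
    experiments:
      - model_id: meta-llama/Meta-Llama-3-8B-Instruct   # FM to benchmark (example)
        instance_type: ml.g5.2xlarge                    # deployment instance type
        inference_container: huggingface-tgi            # e.g. DeepSpeed, TensorRT, TGI
        tensor_parallel_degree: 1                       # degree of tensor parallelism
        concurrency_levels: [1, 2, 4]                   # simultaneous requests per run
    report:
      latency_percentiles: [p50, p95, p99]              # metrics summarized in the report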

Quick Start & Requirements

  • Installation: Install with uv: uv pip install -U fmbench (see the quick-start sketch after this list).
  • Prerequisites: Requires an AWS environment (EC2, SageMaker, or CloudShell) for accurate latency measurements. Python 3.12 is recommended. An S3 bucket is used for storing metrics and reports.
  • Setup: A CloudFormation template is provided that sets up an Amazon SageMaker notebook with the repository cloned and the necessary S3 buckets created; setup takes approximately 5 minutes.
  • Documentation: FMBench website
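
A minimal quick-start sketch, assuming an AWS environment with Python 3.12; the config file name below is illustrative, and the exact CLI flags should be verified against the FMBench documentation.

    # Install the package (Python 3.12 recommended)
    uv pip install -U fmbench

    # Run a benchmark against a config file and capture the log
    fmbench --config-file ./config-llama3-8b-g5-quick.yml > fmbench.log 2>&1

Metrics and the generated report are written to the configured S3 bucket.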

Highlighted Details

  • Benchmarks any FM (open-source, third-party, proprietary) on various AWS services (SageMaker, Bedrock, EKS, EC2) and inference stacks.
  • Supports flexible configuration of instance types (e.g., g5, p4d, p5, Inf2), inference containers, and parameters such as tensor parallelism and rolling batch (see the sketch after this list).
  • Includes model evaluation for accuracy using a "Panel of LLM Evaluators" (PoLL).
  • Offers an optional fmbench-orchestrator for automating benchmarking across multiple EC2 instances.
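
As a sketch of what those knobs look like in practice, parameters such as tensor parallelism and rolling batch are typically surfaced through the inference container's configuration, for example in a DJL/LMI-style serving.properties file. The values below are illustrative and not FMBench-specific syntax.

    # Illustrative serving.properties-style settings (not the exact FMBench syntax)
    option.model_id=meta-llama/Meta-Llama-3-8B-Instruct
    option.tensor_parallel_degree=4      # shard the model across 4 GPUs
    option.rolling_batch=vllm            # continuous (rolling) batching backend
    option.max_rolling_batch_size=32     # cap on requests batched together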

Maintenance & Community

  • Recent releases added support for new models and inference engines (e.g., SGLang, vLLM).
  • Community support available via Discord.
  • Issues and feature requests can be tracked on GitHub.

Licensing & Compatibility

  • Licensed under the MIT-0 License.
  • Compatible with AWS services and various open-source inference frameworks.

Limitations & Caveats

  • Accurate performance benchmarking requires running the tool within an AWS environment to avoid internet latency impact.
  • Model-specific tokenizers are strongly recommended over the default approximation for accurate token-throughput measurements.
Health Check

  • Last commit: 5 months ago
  • Responsiveness: inactive
  • Pull requests (last 30 days): 0
  • Issues (last 30 days): 0
  • Star history: 3 stars in the last 30 days

Explore Similar Projects

  • LitServe by Lightning-AI: AI inference pipeline framework. About 4k stars; created 1 year ago, updated 2 days ago. Starred by Chip Huyen (author of "AI Engineering" and "Designing Machine Learning Systems"), Luis Capelo (cofounder of Lightning AI), and 3 more.
  • lighteval by huggingface: LLM evaluation toolkit for multiple backends. About 2k stars; created 1 year ago, updated 1 day ago. Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (cofounder of Lightning AI), and 7 more.
  • training by mlcommons: Reference implementations for MLPerf training benchmarks. About 2k stars; created 7 years ago, updated 1 week ago. Starred by Shizhe Diao (author of LMFlow; Research Scientist at NVIDIA), Sebastián Ramírez (author of FastAPI, Typer, SQLModel, and Asyncer), and 1 more.