fmeval by aws

Evaluate foundation models for various NLP tasks

Created 1 year ago · 258 stars · Top 98.6% on sourcepulse

Project Summary

fmeval is an open-source Python library designed for evaluating Large Language Models (LLMs) across various tasks like open-ended generation, summarization, question answering, and classification. It provides algorithms to assess LLMs for accuracy, toxicity, semantic robustness, and prompt stereotyping, enabling users to select the best LLM for their specific use cases.

How It Works

fmeval employs a modular approach using Transform and TransformPipeline objects. Transform encapsulates record-level data manipulation logic, allowing users to create custom evaluation metrics. TransformPipeline chains these Transform objects to define a sequence of operations, including prompt generation, model invocation via ModelRunner, and metric computation. This design facilitates extensibility and the creation of custom evaluation workflows.
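
A rough sketch of the pattern follows; the classes here are simplified stand-ins, not fmeval's exact signatures, and the model output is pre-filled rather than produced by a ModelRunner invocation step:

    from typing import Any, Dict, List

    class Transform:
        """Record-level manipulation: takes a record dict, returns it augmented."""

        def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
            raise NotImplementedError

    class GeneratePrompt(Transform):
        """Fills a prompt template from fields already present in the record."""

        def __init__(self, template: str) -> None:
            self.template = template

        def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
            record["prompt"] = self.template.format(**record)
            return record

    class ExactMatch(Transform):
        """Computes a toy accuracy metric from model output and target."""

        def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
            record["exact_match"] = float(
                record["model_output"].strip() == record["target_output"].strip()
            )
            return record

    class TransformPipeline:
        """Applies a sequence of Transforms to each record, in order."""

        def __init__(self, transforms: List[Transform]) -> None:
            self.transforms = transforms

        def execute_record(self, record: Dict[str, Any]) -> Dict[str, Any]:
            for transform in self.transforms:
                record = transform(record)
            return record

    pipeline = TransformPipeline([GeneratePrompt("Question: {question}"), ExactMatch()])
    record = {"question": "2 + 2?", "target_output": "4", "model_output": "4"}
    print(pipeline.execute_record(record))  # record now has "prompt" and "exact_match"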

Quick Start & Requirements

  • Install via pip: pip install fmeval
  • Requires Python 3.10.
  • Built-in support for Amazon SageMaker Endpoints and JumpStart models; custom ModelRunner implementations are supported (see the sketch after this list).
  • Examples and developer guide are available in the repository.
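
A custom runner implements a single predict method mapping a prompt to a (text output, optional log-probability) pair. A minimal sketch, based on the ModelRunner interface described in the fmeval docs (verify the import path and signature against the current developer guide):

    from typing import Optional, Tuple

    from fmeval.model_runners.model_runner import ModelRunner

    class EchoModelRunner(ModelRunner):
        """Toy runner: replace the body of predict() with a real model call."""

        def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
            output = f"echo: {prompt}"  # stand-in for invoking an actual endpoint
            log_probability = None      # optional; not all evaluations use it
            return output, log_probability

An instance of such a runner is what gets passed to an evaluation algorithm's evaluate(model=...) call.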

Highlighted Details

  • Evaluates LLMs for Accuracy, Toxicity, Semantic Robustness, and Prompt Stereotyping.
  • Supports custom datasets via DataConfig (see the example after this list).
  • Includes built-in ModelRunner implementations for AWS services.
  • Extensible architecture for custom evaluation algorithms and metrics.
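
For instance, a custom JSON Lines dataset can be described roughly as follows; the field names mirror the examples in the fmeval README, while the dataset name, path, and column names are hypothetical:

    from fmeval.constants import MIME_TYPE_JSONLINES
    from fmeval.data_loaders.data_config import DataConfig

    config = DataConfig(
        dataset_name="my_custom_dataset",       # hypothetical dataset name
        dataset_uri="my_custom_dataset.jsonl",  # local path or S3 URI
        dataset_mime_type=MIME_TYPE_JSONLINES,
        model_input_location="question",        # JSON field holding the model input
        target_output_location="answer",        # JSON field holding the reference
    )

The resulting config is then passed to an evaluation algorithm, e.g. via evaluate(dataset_config=config, ...).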

Maintenance & Community

  • Developed by AWS.
  • Contribution guidelines are available in the repository's CONTRIBUTING file.

Licensing & Compatibility

  • Licensed under the Apache-2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Users on Windows may encounter OSError: [Errno 0] AssignProcessToJobObject() due to Ray integration; installing Python from the official website is recommended.
  • Mac users might need to manually install or configure Rust for certain build steps.
  • Out-of-memory errors can occur with memory-intensive evaluations; lowering the PARALLELIZATION_FACTOR environment variable can help (see the snippet after this list).
  • Telemetry for AWS-hosted LLMs is enabled by default but can be disabled.
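
For example, parallelism can be dialed down before running an evaluation; per the repository's troubleshooting notes, PARALLELIZATION_FACTOR is read from the environment (the value 1 here is illustrative):

    import os

    # Lower fmeval's parallelization factor to trade throughput for memory.
    os.environ["PARALLELIZATION_FACTOR"] = "1"
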
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 13 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jerry Liu (cofounder of LlamaIndex).

deepeval by confident-ai

  • LLM evaluation framework for unit testing LLM outputs
  • Top 2.0% on sourcepulse; 10k stars
  • Created 2 years ago; updated 15 hours ago