fmeval by aws

Evaluate foundation models for various NLP tasks

Created 2 years ago
264 stars

Top 96.8% on SourcePulse

View on GitHub
Project Summary

fmeval is an open-source Python library designed for evaluating Large Language Models (LLMs) across various tasks like open-ended generation, summarization, question answering, and classification. It provides algorithms to assess LLMs for accuracy, toxicity, semantic robustness, and prompt stereotyping, enabling users to select the best LLM for their specific use cases.

How It Works

fmeval employs a modular approach using Transform and TransformPipeline objects. Transform encapsulates record-level data manipulation logic, allowing users to create custom evaluation metrics. TransformPipeline chains these Transform objects to define a sequence of operations, including prompt generation, model invocation via ModelRunner, and metric computation. This design facilitates extensibility and the creation of custom evaluation workflows.
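
As an illustration, a minimal custom metric following this pattern might look like the sketch below. The ResponseLength transform is hypothetical, and the constructor and key-registration calls are assumptions based on the developer guide, so verify the exact signatures against the installed version.

    from typing import Any, Dict

    from fmeval.transforms.transform import Transform
    from fmeval.transforms.transform_pipeline import TransformPipeline

    class ResponseLength(Transform):
        """Hypothetical record-level metric: character length of the model output."""

        def __init__(self, input_key: str, output_key: str):
            super().__init__(input_key, output_key)
            # Key registration lets the pipeline validate record schemas;
            # method name assumed from the developer guide.
            self.register_input_output_keys([input_key], [output_key])
            self.input_key = input_key
            self.output_key = output_key

        def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
            # Record-level logic: read one key, write one new key.
            record[self.output_key] = len(record[self.input_key])
            return record

    # Chain transforms into a pipeline; a full evaluation would also include
    # prompt-generation and model-invocation (ModelRunner) transforms.
    pipeline = TransformPipeline([ResponseLength("model_output", "response_length")])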

Quick Start & Requirements

  • Install via pip: pip install fmeval
  • Requires Python 3.10.
  • Built-in support for Amazon SageMaker Endpoints and JumpStart models; custom ModelRunner implementations are also supported (see the example after this list).
  • Examples and developer guide are available in the repository.
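
A typical built-in evaluation against an already-deployed JumpStart endpoint looks roughly like the sketch below; the endpoint name and model id are placeholders, so substitute your own.

    from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig
    from fmeval.model_runners.sm_jumpstart_model_runner import JumpStartModelRunner

    # Placeholders: point these at a JumpStart endpoint you have deployed.
    model_runner = JumpStartModelRunner(
        endpoint_name="my-jumpstart-endpoint",
        model_id="huggingface-llm-falcon-7b-instruct-bf16",
    )

    eval_algo = Toxicity(ToxicityConfig())
    # Runs the evaluation on the algorithm's built-in datasets; save=True
    # also writes per-record results to disk.
    eval_output = eval_algo.evaluate(model=model_runner, save=True)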

Highlighted Details

  • Evaluates LLMs for Accuracy, Toxicity, Semantic Robustness, and Prompt Stereotyping.
  • Supports custom datasets via DataConfig (see the sketch after this list).
  • Includes built-in ModelRunner implementations for AWS services.
  • Extensible architecture for custom evaluation algorithms and metrics.
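
For instance, a custom JSON Lines dataset can be described with a DataConfig along these lines (the file path and field names are placeholders):

    from fmeval.data_loaders.data_config import DataConfig
    from fmeval.constants import MIME_TYPE_JSONLINES

    # Placeholders: point dataset_uri at your own file, and map the JSON
    # fields that hold the model input and the reference answer.
    config = DataConfig(
        dataset_name="custom_qa_dataset",
        dataset_uri="./custom_qa_dataset.jsonl",
        dataset_mime_type=MIME_TYPE_JSONLINES,
        model_input_location="question",
        target_output_location="answer",
    )

    # The config is then passed to an evaluation via the dataset_config
    # argument, e.g. eval_algo.evaluate(model=model_runner,
    # dataset_config=config, prompt_template="$model_input", save=True).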

Maintenance & Community

  • Developed by AWS.
  • Contribution guidelines are available in the repository's CONTRIBUTING.md.

Licensing & Compatibility

  • Licensed under the Apache-2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Users on Windows may encounter OSError: [Errno 0] AssignProcessToJobObject() due to Ray integration; installing Python from the official website is recommended.
  • Mac users might need to manually install or configure Rust for certain build steps.
  • Out-of-memory errors can occur with memory-intensive evaluations; the PARALLELIZATION_FACTOR environment variable can be adjusted to reduce memory pressure (see the snippet after this list).
  • Telemetry for AWS-hosted LLMs is enabled by default but can be disabled (also shown in the snippet below).
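
Both knobs are environment variables and should be set before an evaluation starts. A minimal sketch in Python (the DISABLE_FMEVAL_TELEMETRY name is assumed from the project README, so verify it against the current docs):

    import os

    # Lower Ray parallelism to reduce peak memory usage during evaluations.
    os.environ["PARALLELIZATION_FACTOR"] = "1"

    # Opt out of telemetry for AWS-hosted LLMs; variable name assumed from
    # the README.
    os.environ["DISABLE_FMEVAL_TELEMETRY"] = "true"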

Health Check

  • Last Commit: 1 month ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 1
  • Issues (30d): 0

Star History

  • 2 stars in the last 30 days

Explore Similar Projects

Starred by Morgan Funtowicz (Head of ML Optimizations at Hugging Face), Luis Capelo (Cofounder of Lightning AI), and 7 more.

lighteval by huggingface

  • Top 2.6% on SourcePulse · 2k stars
  • LLM evaluation toolkit for multiple backends
  • Created 1 year ago · Updated 1 day ago