fmeval by aws

Evaluate foundation models for various NLP tasks

Created 1 year ago · 258 stars · Top 98.6% on sourcepulse

Project Summary

fmeval is an open-source Python library designed for evaluating Large Language Models (LLMs) across various tasks like open-ended generation, summarization, question answering, and classification. It provides algorithms to assess LLMs for accuracy, toxicity, semantic robustness, and prompt stereotyping, enabling users to select the best LLM for their specific use cases.

How It Works

fmeval employs a modular approach using Transform and TransformPipeline objects. Transform encapsulates record-level data manipulation logic, allowing users to create custom evaluation metrics. TransformPipeline chains these Transform objects to define a sequence of operations, including prompt generation, model invocation via ModelRunner, and metric computation. This design facilitates extensibility and the creation of custom evaluation workflows.
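
A rough sketch of the pattern follows; the classes here are simplified stand-ins, not fmeval's exact signatures, and the model output is pre-filled rather than produced by a ModelRunner invocation step:

    from typing import Any, Dict, List

    class Transform:
        """Record-level manipulation: takes a record dict, returns it augmented."""

        def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
            raise NotImplementedError

    class GeneratePrompt(Transform):
        """Fills a prompt template from fields already present in the record."""

        def __init__(self, template: str) -> None:
            self.template = template

        def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
            record["prompt"] = self.template.format(**record)
            return record

    class ExactMatch(Transform):
        """Computes a toy accuracy metric from model output and target."""

        def __call__(self, record: Dict[str, Any]) -> Dict[str, Any]:
            record["exact_match"] = float(
                record["model_output"].strip() == record["target_output"].strip()
            )
            return record

    class TransformPipeline:
        """Applies a sequence of Transforms to each record, in order."""

        def __init__(self, transforms: List[Transform]) -> None:
            self.transforms = transforms

        def execute_record(self, record: Dict[str, Any]) -> Dict[str, Any]:
            for transform in self.transforms:
                record = transform(record)
            return record

    pipeline = TransformPipeline([GeneratePrompt("Question: {question}"), ExactMatch()])
    record = {"question": "2 + 2?", "target_output": "4", "model_output": "4"}
    print(pipeline.execute_record(record))  # record now has "prompt" and "exact_match"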

Quick Start & Requirements

  • Install via pip: pip install fmeval
  • Requires Python 3.10.
  • Built-in support for Amazon SageMaker Endpoints and JumpStart models; custom ModelRunner implementations are supported (see the sketch after this list).
  • Examples and developer guide are available in the repository.
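
A custom runner implements a single predict method mapping a prompt to a (text output, optional log-probability) pair. A minimal sketch, based on the ModelRunner interface described in the fmeval docs (verify the import path and signature against the current developer guide):

    from typing import Optional, Tuple

    from fmeval.model_runners.model_runner import ModelRunner

    class EchoModelRunner(ModelRunner):
        """Toy runner: replace the body of predict() with a real model call."""

        def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
            output = f"echo: {prompt}"  # stand-in for invoking an actual endpoint
            log_probability = None      # optional; not all evaluations use it
            return output, log_probability

An instance of such a runner is what gets passed to an evaluation algorithm's evaluate(model=...) call.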

Highlighted Details

  • Evaluates LLMs for Accuracy, Toxicity, Semantic Robustness, and Prompt Stereotyping.
  • Supports custom datasets via DataConfig (see the example after this list).
  • Includes built-in ModelRunner implementations for AWS services.
  • Extensible architecture for custom evaluation algorithms and metrics.
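
For instance, a custom JSON Lines dataset can be described roughly as follows; the field names mirror the examples in the fmeval README, while the dataset name, path, and column names are hypothetical:

    from fmeval.constants import MIME_TYPE_JSONLINES
    from fmeval.data_loaders.data_config import DataConfig

    config = DataConfig(
        dataset_name="my_custom_dataset",       # hypothetical dataset name
        dataset_uri="my_custom_dataset.jsonl",  # local path or S3 URI
        dataset_mime_type=MIME_TYPE_JSONLINES,
        model_input_location="question",        # JSON field holding the model input
        target_output_location="answer",        # JSON field holding the reference
    )

The resulting config is then passed to an evaluation algorithm, e.g. via evaluate(dataset_config=config, ...).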

Maintenance & Community

  • Developed by AWS.
  • Contribution guidelines are available in the repository's CONTRIBUTING file.

Licensing & Compatibility

  • Licensed under the Apache-2.0 License.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

  • Users on Windows may encounter OSError: [Errno 0] AssignProcessToJobObject() due to Ray integration; installing Python from the official website is recommended.
  • Mac users might need to manually install or configure Rust for certain build steps.
  • Out-of-memory errors can occur with memory-intensive evaluations; lowering the PARALLELIZATION_FACTOR environment variable can help (see the snippet after this list).
  • Telemetry for AWS-hosted LLMs is enabled by default but can be disabled.
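
For example, parallelism can be dialed down before running an evaluation; per the repository's troubleshooting notes, PARALLELIZATION_FACTOR is read from the environment (the value 1 here is illustrative):

    import os

    # Lower fmeval's parallelization factor to trade throughput for memory.
    os.environ["PARALLELIZATION_FACTOR"] = "1"
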
Health Check

  • Last commit: 2 weeks ago
  • Responsiveness: 1 week
  • Pull Requests (30d): 0
  • Issues (30d): 0

Star History

  • 13 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (author of AI Engineering and Designing Machine Learning Systems) and Jerry Liu (cofounder of LlamaIndex).

deepeval by confident-ai

  • LLM evaluation framework for unit testing LLM outputs
  • Top 2.0% on sourcepulse; 10k stars
  • Created 2 years ago; updated 15 hours ago