GPT-Fathom by GPT-Fathom

LLM evaluation suite for open/closed-source models, reproducible research

Created 2 years ago
347 stars

Top 80.0% on SourcePulse

View on GitHub
1 Expert Loves This Project
Project Summary

GPT-Fathom is an open-source LLM evaluation suite designed for systematic, rigorous, and reproducible benchmarking of both open-source and closed-source large language models. It addresses the limitations of existing leaderboards by ensuring consistent evaluation settings and prompts, providing a standard gauge to track LLM capabilities and evolutionary paths, particularly focusing on the progression from GPT-3 to GPT-4.

How It Works

GPT-Fathom is built on OpenAI Evals and uses black-box evaluation for all tasks: the LLM generates a free-form response to each prompt, and that response is parsed to compute the evaluation metric. This method is chosen because per-token likelihoods are often unavailable for closed-source models, and because it also exercises a model's instruction-following ability. The suite evaluates 20+ curated benchmarks across 7 capability categories, supporting models from the OpenAI API, Azure OpenAI Service, and the LLaMA and Llama 2 families.
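As a rough illustration of the black-box approach (not GPT-Fathom's actual harness), the sketch below scores a single GSM8K-style item: it prompts a model through the OpenAI chat completions API, extracts the final number from the free-form answer, and compares it against the reference. It assumes curl and jq are installed and OPENAI_API_KEY is set.

```bash
# Illustrative black-box evaluation of one GSM8K-style item.
# The model only sees the prompt; we parse its free-form answer.
PROMPT="Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? Answer with a number."
REFERENCE="72"

RESPONSE=$(curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg p "$PROMPT" \
        '{model: "gpt-3.5-turbo", messages: [{role: "user", content: $p}], temperature: 0}')" \
  | jq -r '.choices[0].message.content')

# Parse the free-form completion: take the last number in the response.
PREDICTION=$(echo "$RESPONSE" | grep -oE '[0-9]+' | tail -n 1)

if [ "$PREDICTION" = "$REFERENCE" ]; then
  echo "correct"
else
  echo "incorrect (got: $PREDICTION)"
fi
```

GPT-Fathom's harness applies the same generate-then-parse loop at scale, with curated few-shot templates and aligned sampling settings per benchmark.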

Quick Start & Requirements

  • Install: Clone the repository and install with pip install -e . after creating a Python 3.9+ environment.
  • Prerequisites: OpenAI API key (as OPENAI_API_KEY environment variable) for OpenAI models. For LLaMA/Llama 2, HuggingFace configuration is needed, and evaluation on a single machine typically requires 8 A100 GPUs (80GB) with tensor parallelism.
  • Running Evaluations: Use oaieval for single evaluations (e.g., oaieval gpt-3.5-turbo gsm8k-8shotCoT) or oaievalset for sets of evaluations. Azure OpenAI Service requires the --azure_eval True flag; LLaMA/Llama 2 evaluations use --eval_in_batch True. A consolidated command sketch follows this list.
  • More Info: quick-evals.md, run-evals.md, custom-eval.md, completion-fns.md, example.md.
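The bullets above map onto a short setup-and-run sequence; a minimal sketch follows. The clone URL and the llama-13b model alias are assumptions for illustration; consult run-evals.md and completion-fns.md for the exact names registered in your install.

```bash
# Setup (clone URL is an assumption; see the repository page for the canonical one)
git clone https://github.com/GPT-Fathom/GPT-Fathom.git
cd GPT-Fathom
python -m venv .venv && source .venv/bin/activate   # Python 3.9+ environment
pip install -e .
export OPENAI_API_KEY="sk-..."                       # required for OpenAI models

# Single evaluation: gpt-3.5-turbo on GSM8K with 8-shot chain-of-thought
oaieval gpt-3.5-turbo gsm8k-8shotCoT

# Same evaluation against Azure OpenAI Service
oaieval gpt-3.5-turbo gsm8k-8shotCoT --azure_eval True

# LLaMA / Llama 2 runs are batched on local GPUs (model alias is illustrative)
oaieval llama-13b gsm8k-8shotCoT --eval_in_batch True
```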

Highlighted Details

  • Benchmarks 10+ leading open-source and closed-source LLMs, including OpenAI's legacy models.
  • Analyzes the impact of SFT/RLHF and pretraining with code data on LLM capabilities.
  • Investigates the "seesaw phenomenon" where model updates can improve some capabilities while degrading others.
  • Explores model sensitivity to prompt variations, number of shots, and sampling parameters.

Maintenance & Community

The project has been updated with evaluations of newer models over time, though recent activity is limited (see Health Check below). Links to the paper and Twitter are provided for community engagement.

Licensing & Compatibility

The repository is licensed under the Apache License 2.0.

Limitations & Caveats

Evaluation results may differ from officially reported scores due to GPT-Fathom's strict adherence to aligned settings for fair comparison. The project notes that some legacy OpenAI models are scheduled for deprecation.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days

Explore Similar Projects

Starred by Clement Delangue (Cofounder of Hugging Face), Chip Huyen (Author of "AI Engineering", "Designing Machine Learning Systems"), and 12 more.

evaluate by huggingface

Top 0.1% on SourcePulse · 2k stars
ML model evaluation library for standardized performance reporting
Created 3 years ago · Updated 1 month ago