GPT-Fathom by GPT-Fathom

LLM evaluation suite for open/closed-source models, reproducible research

Created 2 years ago
343 stars

Top 80.6% on SourcePulse

Project Summary

GPT-Fathom is an open-source LLM evaluation suite designed for systematic, rigorous, and reproducible benchmarking of both open-source and closed-source large language models. It addresses the limitations of existing leaderboards by holding evaluation settings and prompts consistent across models, which gives a standard gauge for tracking LLM capabilities and their evolutionary paths, with a particular focus on the progression from GPT-3 to GPT-4.

How It Works

GPT-Fathom is built on OpenAI Evals and uses a black-box evaluation method for all tasks: the LLM generates a free-form response to each prompt, and that response is parsed to compute the evaluation metric. This method was chosen for two reasons: per-token likelihoods are often unavailable for closed-source models, and free-form generation also exercises a model's instruction-following ability. The suite evaluates over 20 curated benchmarks across 7 capability categories. It supports models served through the OpenAI API and Azure OpenAI Service, as well as the LLaMA and Llama 2 families.
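The black-box approach described above can be sketched in a few lines. This is a minimal illustration, not GPT-Fathom's actual parser: the helper names (extract_final_number, is_correct, accuracy), the "take the last number in the response" rule, and the canned stand-in model are all assumptions made for the example.

```python
import re
from typing import Callable, List, Optional, Tuple

# Integers or decimals, optionally negative, with optional thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(response: str) -> Optional[str]:
    """Return the last number found in a free-form response, or None."""
    matches = NUMBER.findall(response)
    if not matches:
        return None
    # Drop thousands separators so "1,234" compares equal to "1234".
    return matches[-1].replace(",", "")


def is_correct(response: str, target: str) -> bool:
    """Compare the parsed answer with the reference answer numerically."""
    predicted = extract_final_number(response)
    return predicted is not None and float(predicted) == float(target)


def accuracy(model: Callable[[str], str], samples: List[Tuple[str, str]]) -> float:
    """Query the model as a black box and score only its text output."""
    correct = sum(is_correct(model(prompt), target) for prompt, target in samples)
    return correct / len(samples)


# Hypothetical stand-in for a real model call: returns canned text per prompt.
CANNED = {
    "Q1": "3 + 4 = 7. The answer is 7.",
    "Q2": "I am not sure how to solve this.",
}


def canned_model(prompt: str) -> str:
    return CANNED[prompt]


SAMPLES = [("Q1", "7"), ("Q2", "5")]
```

Calling accuracy(canned_model, SAMPLES) scores one of the two responses as correct. Because only the generated text is inspected, the same scoring code would apply to any model that returns a string, which is the property that makes the method usable for closed-source APIs.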

Quick Start & Requirements

  • Install: Create a Python 3.9+ environment, clone the repository, and install it with pip install -e .
  • Prerequisites: An OpenAI API key, set as the OPENAI_API_KEY environment variable, for OpenAI models. LLaMA and Llama 2 models need a Hugging Face configuration, and evaluating them on a single machine typically requires 8 A100 GPUs (80GB) with tensor parallelism.
  • Running Evaluations: Use oaieval for single evaluations (e.g., oaieval gpt-3.5-turbo gsm8k-8shotCoT) or oaievalset for sets of evaluations. Azure OpenAI Service requires the --azure_eval True flag. LLaMA/Llama 2 evaluations use --eval_in_batch True.
  • More Info: quick-evals.md, run-evals.md, custom-eval.md, completion-fns.md, example.md.
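The command-line options listed above can be assembled programmatically, for example when scripting many runs. The sketch below only builds the argument list and does not launch anything. The helper names build_eval_command and has_api_key are hypothetical; the command name, the flags, and the example model and eval names are the ones quoted in the list above.

```python
import os
import shlex
from typing import List, Mapping


def has_api_key(env: Mapping[str, str] = os.environ) -> bool:
    """Check that OPENAI_API_KEY is set and non-empty."""
    return bool(env.get("OPENAI_API_KEY"))


def build_eval_command(
    model: str,
    eval_name: str,
    azure: bool = False,
    batch: bool = False,
) -> List[str]:
    """Build an oaieval argument list for a single evaluation."""
    cmd = ["oaieval", model, eval_name]
    if azure:
        # Needed when the model is served through Azure OpenAI Service.
        cmd += ["--azure_eval", "True"]
    if batch:
        # Used for LLaMA / Llama 2 evaluations.
        cmd += ["--eval_in_batch", "True"]
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it.
    print(shlex.join(build_eval_command("gpt-3.5-turbo", "gsm8k-8shotCoT")))
```

The resulting list can be passed to subprocess.run once the environment is set up. Keeping the command as a list rather than a single string avoids shell-quoting problems with unusual model or eval names.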

Highlighted Details

  • Benchmarks 10+ leading open-source and closed-source LLMs, including OpenAI's legacy models.
  • Analyzes the impact of SFT/RLHF and pretraining with code data on LLM capabilities.
  • Investigates the "seesaw phenomenon" where model updates can improve some capabilities while degrading others.
  • Explores model sensitivity to prompt variations, number of shots, and sampling parameters.
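As a rough illustration of the seesaw phenomenon mentioned above, the sketch below compares per-benchmark scores for two versions of a model and reports which benchmarks improved and which degraded. It is not the authors' analysis procedure. The function names, the tolerance parameter, and the benchmark names and scores are placeholders invented for the example, not measured results.

```python
from typing import Dict, List, Tuple


def seesaw_report(
    old: Dict[str, float],
    new: Dict[str, float],
    tolerance: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Return (improved, degraded) benchmark names shared by both versions."""
    improved: List[str] = []
    degraded: List[str] = []
    for bench in sorted(old.keys() & new.keys()):
        delta = new[bench] - old[bench]
        if delta > tolerance:
            improved.append(bench)
        elif delta < -tolerance:
            degraded.append(bench)
    return improved, degraded


def shows_seesaw(old: Dict[str, float], new: Dict[str, float], tolerance: float = 0.0) -> bool:
    """True when an update both improves and degrades at least one benchmark."""
    improved, degraded = seesaw_report(old, new, tolerance)
    return bool(improved) and bool(degraded)


# Placeholder scores for illustration only; these are not real measurements.
OLD_SCORES = {"task_a": 50.0, "task_b": 60.0, "task_c": 70.0}
NEW_SCORES = {"task_a": 55.0, "task_b": 52.0, "task_c": 70.0}
```

With these placeholder numbers, task_a counts as improved, task_b as degraded, and task_c as unchanged, so shows_seesaw returns True. A nonzero tolerance is a simple way to ignore differences small enough to be noise from sampling parameters or prompt variation.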

Maintenance & Community

The project describes itself as actively updated with new model evaluations, though the Health Check below records the last commit as 2 years ago. Links to the paper and Twitter are provided for community engagement.

Licensing & Compatibility

The repository is licensed under the Apache License 2.0.

Limitations & Caveats

Evaluation results may differ from officially reported scores due to GPT-Fathom's strict adherence to aligned settings for fair comparison. The project notes that some legacy OpenAI models are scheduled for deprecation.

Health Check

  • Last Commit: 2 years ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 0 stars in the last 30 days
