GPT-Fathom by GPT-Fathom

LLM evaluation suite for open/closed-source models, reproducible research

Created 2 years ago

346 stars

Top 80.2% on SourcePulse

1 Expert Loves This Project

Ishaan Jaffer

Cofounder of LiteLLM

Project Summary

GPT-Fathom is an open-source LLM evaluation suite designed for systematic, rigorous, and reproducible benchmarking of both open-source and closed-source large language models. It addresses the limitations of existing leaderboards by ensuring consistent evaluation settings and prompts, providing a standard gauge to track LLM capabilities and evolutionary paths, particularly focusing on the progression from GPT-3 to GPT-4.

How It Works

GPT-Fathom is built on OpenAI Evals and uses black-box evaluation for all tasks: the LLM generates a free-form response to each prompt, and that response is parsed to compute evaluation metrics. This method is used because per-token likelihoods are often unavailable for closed-source models, and because free-form generation also reflects instruction-following ability. The suite evaluates over 20 curated benchmarks across 7 capability categories and supports models served through the OpenAI API and Azure OpenAI Service, as well as the LLaMA and Llama 2 families.
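The parse-then-score step described above can be illustrated with a minimal sketch. This is not GPT-Fathom's actual parser: the function names and the "take the last number in the response" rule are assumptions chosen for illustration, roughly in the spirit of a math word-problem benchmark.

```python
import re
from typing import List, Optional

# Matches integers and decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(response: str) -> Optional[str]:
    """Return the last number in a free-form response (commas removed), or None.

    Hypothetical parsing rule: assume the model states its final answer last.
    """
    matches = _NUMBER.findall(response)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def is_correct(response: str, reference: str) -> bool:
    """Compare the parsed answer with the reference numerically."""
    predicted = extract_final_number(response)
    return predicted is not None and float(predicted) == float(reference)


def accuracy(responses: List[str], references: List[str]) -> float:
    """Fraction of responses whose parsed answer matches the reference."""
    if len(responses) != len(references):
        raise ValueError("responses and references must have the same length")
    if not responses:
        return 0.0
    correct = sum(is_correct(r, ref) for r, ref in zip(responses, references))
    return correct / len(responses)
```

Because only the generated text is inspected, the same scoring code works for any model that returns a string, which is the property that makes black-box evaluation applicable to closed-source APIs.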

Quick Start & Requirements

Install: Create a Python 3.9+ environment, clone the repository, and run pip install -e . from the repository root.
Prerequisites: OpenAI API key (as OPENAI_API_KEY environment variable) for OpenAI models. For LLaMA/Llama 2, HuggingFace configuration is needed, and evaluation on a single machine typically requires 8 A100 GPUs (80GB) with tensor parallelism.
Running Evaluations: Use oaieval for single evaluations (e.g., oaieval gpt-3.5-turbo gsm8k-8shotCoT) or oaievalset for sets of evaluations. Azure OpenAI Service requires the --azure_eval True flag. LLaMA/Llama 2 evaluations use --eval_in_batch True.
More Info: quick-evals.md, run-evals.md, custom-eval.md, completion-fns.md, example.md.
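As a rough sketch, the command lines described above could be assembled from Python as follows. Only the oaieval command, the example arguments, the two flags, and the OPENAI_API_KEY variable come from the text above; the helper names build_oaieval_command and openai_key_is_set are hypothetical and not part of GPT-Fathom.

```python
import os
import shlex
from typing import List


def build_oaieval_command(
    model: str,
    eval_name: str,
    azure: bool = False,
    batch: bool = False,
) -> List[str]:
    """Assemble an oaieval argument list (hypothetical helper).

    azure adds the flag for Azure OpenAI Service;
    batch adds the flag used for LLaMA/Llama 2 evaluations.
    """
    command = ["oaieval", model, eval_name]
    if azure:
        command += ["--azure_eval", "True"]
    if batch:
        command += ["--eval_in_batch", "True"]
    return command


def openai_key_is_set() -> bool:
    """Check that OPENAI_API_KEY is present and non-empty in the environment."""
    return bool(os.environ.get("OPENAI_API_KEY"))


def as_shell_string(command: List[str]) -> str:
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(command)
```

For example, as_shell_string(build_oaieval_command("gpt-3.5-turbo", "gsm8k-8shotCoT")) yields the single-evaluation command quoted above, and the argument list can be passed to subprocess.run once the package is installed and the key check passes.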

Highlighted Details

Benchmarks 10+ leading open-source and closed-source LLMs, including OpenAI's legacy models.
Analyzes the impact of SFT/RLHF and pretraining with code data on LLM capabilities.
Investigates the "seesaw phenomenon" where model updates can improve some capabilities while degrading others.
Explores model sensitivity to prompt variations, number of shots, and sampling parameters.
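The "seesaw phenomenon" mentioned above can be made concrete with a small sketch that compares per-capability scores from two versions of a model. The function names, the score dictionaries, and the tolerance parameter are assumptions for illustration; this is not how GPT-Fathom itself performs the analysis.

```python
from typing import Dict, List, Tuple


def seesaw_report(
    old_scores: Dict[str, float],
    new_scores: Dict[str, float],
    tolerance: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Split shared capabilities into (improved, regressed) lists.

    A change counts only if its magnitude exceeds tolerance.
    Capabilities missing from either version are ignored.
    """
    improved: List[str] = []
    regressed: List[str] = []
    for name in sorted(old_scores.keys() & new_scores.keys()):
        delta = new_scores[name] - old_scores[name]
        if delta > tolerance:
            improved.append(name)
        elif delta < -tolerance:
            regressed.append(name)
    return improved, regressed


def shows_seesaw(
    old_scores: Dict[str, float],
    new_scores: Dict[str, float],
    tolerance: float = 0.0,
) -> bool:
    """True when an update both improves and degrades at least one capability."""
    improved, regressed = seesaw_report(old_scores, new_scores, tolerance)
    return bool(improved) and bool(regressed)
```

A non-zero tolerance is useful when small score differences fall within run-to-run noise, for instance from sampling parameters.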

Maintenance & Community

The project describes itself as actively updated with new model evaluations, although the Health Check data on this page shows the last commit was 1 year ago. Links to the paper and Twitter are provided for community engagement.

Licensing & Compatibility

The repository is licensed under the Apache License 2.0.

Limitations & Caveats

Evaluation results may differ from officially reported scores due to GPT-Fathom's strict adherence to aligned settings for fair comparison. The project notes that some legacy OpenAI models are scheduled for deprecation.

Health Check

Last Commit

1 year ago

Responsiveness

Inactive

Star History

0 stars in the last 30 days