LLM evaluation suite for open/closed-source models, reproducible research
GPT-Fathom is an open-source LLM evaluation suite designed for systematic, rigorous, and reproducible benchmarking of both open-source and closed-source large language models. It addresses the limitations of existing leaderboards by using consistent evaluation settings and prompts, which gives a standard gauge for tracking LLM capabilities and their evolutionary paths, with a particular focus on the progression from GPT-3 to GPT-4.
How It Works
GPT-Fathom is built upon OpenAI Evals and uses a black-box evaluation method for all tasks: the LLM generates a free-form response to each prompt, and that response is parsed to compute the evaluation metric. This method is chosen for two reasons: per-token likelihoods are often unavailable for closed-source models, and free-form generation also takes instruction-following capabilities into account. The suite evaluates over 20 curated benchmarks across 7 capability categories and supports models served through the OpenAI API and Azure OpenAI Service, as well as the LLaMA and Llama 2 families.
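The black-box scoring loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not GPT-Fathom's actual parser: the rule "take the last number in the response as the answer" and the helper names extract_answer and accuracy are hypothetical.

```python
import re
from typing import Optional, Sequence

# Assumed parsing rule (not taken from GPT-Fathom): the final number that
# appears in a free-form response is treated as the model's answer.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(response: str) -> Optional[float]:
    """Return the last number found in a response, or None if there is none."""
    # Drop thousands separators so "1,250" is read as 1250.
    matches = _NUMBER.findall(response.replace(",", ""))
    return float(matches[-1]) if matches else None


def accuracy(responses: Sequence[str], references: Sequence[float]) -> float:
    """Fraction of responses whose parsed answer equals the reference."""
    if len(responses) != len(references):
        raise ValueError("responses and references must have the same length")
    if not responses:
        return 0.0
    correct = sum(
        extract_answer(resp) == ref for resp, ref in zip(responses, references)
    )
    return correct / len(responses)
```

Only the generated text is needed here, which is why this style of scoring works for models that do not expose per-token likelihoods. A response that ignores the expected answer format simply fails to parse and is counted as wrong, so instruction following is reflected in the score.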
Quick Start & Requirements
Install from the repository root after creating a Python 3.9+ environment:
pip install -e .
An OpenAI API key (set through the OPENAI_API_KEY environment variable) is required for OpenAI models. For LLaMA/Llama 2, HuggingFace configuration is needed, and evaluation on a single machine typically requires 8 A100 GPUs (80GB) with tensor parallelism.
Use oaieval for single evaluations, for example:
oaieval gpt-3.5-turbo gsm8k-8shotCoT
Use oaievalset for sets of evaluations. Azure OpenAI Service requires the --azure_eval True flag. LLaMA/Llama 2 evaluations use --eval_in_batch True.
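For scripted runs, the command line can be assembled programmatically. The sketch below only builds the argument list from the options named in this section. The helper build_oaieval_command and its parameters are hypothetical, and whether the two flags can be combined is not stated here.

```python
import shlex
from typing import List


def build_oaieval_command(
    model: str, eval_name: str, azure: bool = False, batch: bool = False
) -> List[str]:
    """Assemble an oaieval argument list (hypothetical helper, not part of the suite)."""
    cmd = ["oaieval", model, eval_name]
    if azure:
        # Flag listed above for Azure OpenAI Service.
        cmd += ["--azure_eval", "True"]
    if batch:
        # Flag listed above for LLaMA/Llama 2 evaluations.
        cmd += ["--eval_in_batch", "True"]
    return cmd


if __name__ == "__main__":
    # Print the single-evaluation example from this section as a shell command.
    print(shlex.join(build_oaieval_command("gpt-3.5-turbo", "gsm8k-8shotCoT")))
```

The returned list can be passed to subprocess.run once the package is installed and the required credentials are configured.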
Highlighted Details
Maintenance & Community
The project is actively updated with new model evaluations. Links to the paper and Twitter are provided for community engagement.
Licensing & Compatibility
The repository is licensed under the Apache License 2.0.
Limitations & Caveats
Evaluation results may differ from officially reported scores due to GPT-Fathom's strict adherence to aligned settings for fair comparison. The project notes that some legacy OpenAI models are scheduled for deprecation.