chain-of-thought-hub by FranxYao

LLM benchmark for complex reasoning via chain-of-thought prompting

created 2 years ago
2,742 stars

Top 17.7% on sourcepulse

Project Summary

This repository provides a comprehensive benchmark suite and leaderboard for evaluating the complex reasoning capabilities of Large Language Models (LLMs) using chain-of-thought prompting. It targets LLM researchers and developers who want to compare model performance on challenging reasoning tasks spanning math, science, coding, knowledge, and long-context understanding.

How It Works

The hub curates a diverse set of datasets categorized into "Main" (stable, widely used benchmarks), "Experimental" (emerging tasks), and "Long-Context" (reasoning over extended text). The maintainers treat chain-of-thought prompting as a critical "system call" for future LLM applications and aim to differentiate models by their ability to handle complex reasoning rather than by general conversational fluency.
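
To make the prompting style concrete, here is a minimal sketch of a GSM8K-style few-shot chain-of-thought prompt. The exemplar question, its worked answer, and the `build_cot_prompt` helper are illustrative assumptions, not code taken from the repository.

```python
# Minimal illustration of few-shot chain-of-thought prompting (GSM8K style).
# The exemplar below is made up for illustration, not drawn from the benchmark.

COT_EXEMPLAR = (
    "Question: Tom has 3 boxes with 4 apples each. He eats 2 apples. "
    "How many apples are left?\n"
    "Answer: Tom starts with 3 * 4 = 12 apples. After eating 2, he has "
    "12 - 2 = 10 apples. The answer is 10.\n\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked exemplar so the model imitates step-by-step reasoning."""
    return COT_EXEMPLAR + f"Question: {question}\nAnswer:"

print(build_cot_prompt("A train travels 60 km per hour for 3 hours. How far does it go?"))
```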

Quick Start & Requirements

  • Installation: Navigate to a specific dataset directory (e.g., cd MMLU, cd gsm8k, cd BBH) and run the provided Python scripts or Jupyter notebooks; a minimal API-call sketch follows this list.
  • Prerequisites: Python, API keys for proprietary models (e.g., OpenAI, Anthropic), and potentially specific model checkpoints for open-source evaluations.
  • Resources: Running evaluations requires nontrivial compute, especially for larger models and datasets. Setup time is not documented; expect full benchmark runs to take hours.
  • Links: Paper, Blog, Twitter, List of datasets, Call for contribution.
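
The snippet below is a hedged sketch of the evaluation flow, not the repository's actual script: it sends one chain-of-thought prompt to a proprietary model through the official `openai` Python client. The model name is a placeholder, and an `OPENAI_API_KEY` environment variable is assumed.

```python
# Sketch of querying a proprietary model for a single benchmark question.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment;
# the model name is a placeholder, not one prescribed by the repository.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Send a chain-of-thought prompt and return the raw completion text."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # greedy decoding for more reproducible benchmark runs
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example (reusing build_cot_prompt from the earlier sketch):
# print(ask(build_cot_prompt("If 5 pens cost 15 dollars, how much do 8 pens cost?")))
```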

Highlighted Details

  • Comprehensive benchmarking across math (GSM8K, MATH), science (TheoremQA), symbolic reasoning (BBH), knowledge (MMLU, C-Eval), coding (HumanEval), factual reasoning (SummEdits), and long-context tasks (Qspr, QALT, BkSS).
  • Directly challenges claims of smaller models matching larger ones by focusing on complex reasoning, where differences are more pronounced.
  • Provides detailed leaderboards comparing numerous LLMs (GPT-4, Claude, LLaMA, Mistral, Gemini, etc.) across different task categories.
  • Includes evaluation scripts and methodology, encouraging community reproduction and contribution.
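
Evaluating chain-of-thought outputs typically means parsing a final answer out of the generated reasoning chain and comparing it with the gold label. As an illustrative sketch only, the snippet below extracts the number after "The answer is" (a common GSM8K-style convention) and computes exact-match accuracy; the extraction and scoring logic in the repository's own scripts may differ.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the number after 'The answer is', a common GSM8K-style convention."""
    match = re.search(r"[Tt]he answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", completion)
    return match.group(1).replace(",", "") if match else None

def exact_match_accuracy(completions: list[str], gold: list[str]) -> float:
    """Fraction of completions whose extracted answer equals the gold label."""
    hits = sum(extract_final_answer(c) == g for c, g in zip(completions, gold))
    return hits / len(gold)

# Tiny self-check with made-up completions and gold labels.
preds = ["... so 12 - 2 = 10. The answer is 10.", "I think the answer is 7."]
print(exact_match_accuracy(preds, ["10", "8"]))  # 0.5
```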

Maintenance & Community

  • Regular updates adding new models (e.g., Gemini, Yi, DeepSeek) and benchmark categories (e.g., Long Context).
  • Actively seeks community contributions for new tasks, models, and benchmark data.
  • Updates are announced via the maintainer's Twitter account.

Licensing & Compatibility

  • The repository does not clearly specify a license, and the included datasets and evaluation scripts may carry their own licenses. Users should verify compatibility before commercial use.

Limitations & Caveats

  • Model performance is highly sensitive to prompt wording, as is typical of LLMs; the maintainers are working to standardize prompts across evaluations.
  • Some benchmark results may not be strictly "few-shot" if models were trained on the evaluation data splits (e.g., GPT-4 on GSM8K).
  • The repository focuses on reasoning; other aspects like safety or conversational ability are not primary evaluation criteria.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 34 stars in the last 90 days
