chain-of-thought-hub by FranxYao

LLM benchmark for complex reasoning via chain-of-thought prompting

created 2 years ago
2,742 stars

Top 17.7% on sourcepulse

Project Summary

This repository provides a comprehensive benchmark suite and leaderboard for evaluating the complex reasoning capabilities of Large Language Models (LLMs) using chain-of-thought prompting. It targets LLM researchers and developers who want to compare model performance on challenging reasoning tasks spanning math, science, coding, knowledge, and long-context understanding.

How It Works

The hub curates a diverse set of datasets categorized into "Main" (stable, widely used benchmarks), "Experimental" (emerging tasks), and "Long-Context" (reasoning over extended text). The maintainers treat chain-of-thought prompting as a critical "system call" for future LLM applications and aim to differentiate models by their ability to handle complex reasoning rather than by general conversational fluency.
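
To make the prompting style concrete, here is a minimal sketch of a GSM8K-style few-shot chain-of-thought prompt. The exemplar question, its worked answer, and the `build_cot_prompt` helper are illustrative assumptions, not code taken from the repository.

```python
# Minimal illustration of few-shot chain-of-thought prompting (GSM8K style).
# The exemplar below is made up for illustration, not drawn from the benchmark.

COT_EXEMPLAR = (
    "Question: Tom has 3 boxes with 4 apples each. He eats 2 apples. "
    "How many apples are left?\n"
    "Answer: Tom starts with 3 * 4 = 12 apples. After eating 2, he has "
    "12 - 2 = 10 apples. The answer is 10.\n\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend a worked exemplar so the model imitates step-by-step reasoning."""
    return COT_EXEMPLAR + f"Question: {question}\nAnswer:"

print(build_cot_prompt("A train travels 60 km per hour for 3 hours. How far does it go?"))
```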

Quick Start & Requirements

  • Installation: Navigate to a specific dataset directory (e.g., cd MMLU, cd gsm8k, cd BBH) and run the provided Python scripts or Jupyter notebooks; a minimal API-call sketch follows this list.
  • Prerequisites: Python, API keys for proprietary models (e.g., OpenAI, Anthropic), and potentially specific model checkpoints for open-source evaluations.
  • Resources: Running evaluations requires nontrivial compute, especially for larger models and datasets. Setup time is not documented; expect full benchmark runs to take hours.
  • Links: Paper, Blog, Twitter, List of datasets, Call for contribution.
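
The snippet below is a hedged sketch of the evaluation flow, not the repository's actual script: it sends one chain-of-thought prompt to a proprietary model through the official `openai` Python client. The model name is a placeholder, and an `OPENAI_API_KEY` environment variable is assumed.

```python
# Sketch of querying a proprietary model for a single benchmark question.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment;
# the model name is a placeholder, not one prescribed by the repository.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Send a chain-of-thought prompt and return the raw completion text."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # greedy decoding for more reproducible benchmark runs
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example (reusing build_cot_prompt from the earlier sketch):
# print(ask(build_cot_prompt("If 5 pens cost 15 dollars, how much do 8 pens cost?")))
```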

Highlighted Details

  • Comprehensive benchmarking across math (GSM8K, MATH), science (TheoremQA), symbolic reasoning (BBH), knowledge (MMLU, C-Eval), coding (HumanEval), factual reasoning (SummEdits), and long-context tasks (Qspr, QALT, BkSS).
  • Directly challenges claims of smaller models matching larger ones by focusing on complex reasoning, where differences are more pronounced.
  • Provides detailed leaderboards comparing numerous LLMs (GPT-4, Claude, LLaMA, Mistral, Gemini, etc.) across different task categories.
  • Includes evaluation scripts and methodology, encouraging community reproduction and contribution.
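
Evaluating chain-of-thought outputs typically means parsing a final answer out of the generated reasoning chain and comparing it with the gold label. As an illustrative sketch only, the snippet below extracts the number after "The answer is" (a common GSM8K-style convention) and computes exact-match accuracy; the extraction and scoring logic in the repository's own scripts may differ.

```python
import re

def extract_final_answer(completion: str) -> str | None:
    """Pull the number after 'The answer is', a common GSM8K-style convention."""
    match = re.search(r"[Tt]he answer is\s*\$?(-?[\d,]+(?:\.\d+)?)", completion)
    return match.group(1).replace(",", "") if match else None

def exact_match_accuracy(completions: list[str], gold: list[str]) -> float:
    """Fraction of completions whose extracted answer equals the gold label."""
    hits = sum(extract_final_answer(c) == g for c, g in zip(completions, gold))
    return hits / len(gold)

# Tiny self-check with made-up completions and gold labels.
preds = ["... so 12 - 2 = 10. The answer is 10.", "I think the answer is 7."]
print(exact_match_accuracy(preds, ["10", "8"]))  # 0.5
```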

Maintenance & Community

  • Regular updates adding new models (e.g., Gemini, Yi, DeepSeek) and benchmark categories (e.g., Long Context).
  • Actively seeks community contributions for new tasks, models, and benchmark data.
  • Updates are announced via the maintainer's Twitter account.

Licensing & Compatibility

  • The repository does not clearly specify a license, and the included datasets and evaluation scripts may carry their own licenses. Users should verify compatibility before commercial use.

Limitations & Caveats

  • Model performance is highly sensitive to prompt wording, as is typical of LLMs; the maintainers are working to standardize prompts across evaluations.
  • Some benchmark results may not be strictly "few-shot" if models were trained on the evaluation data splits (e.g., GPT-4 on GSM8K).
  • The repository focuses on reasoning; other aspects like safety or conversational ability are not primary evaluation criteria.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 34 stars in the last 90 days
