ThoughtSource by OpenBioLink

Framework for chain-of-thought reasoning data and tools

Created 3 years ago
995 stars

Top 37.4% on SourcePulse

Project Summary

ThoughtSource provides a centralized, open resource for chain-of-thought (CoT) reasoning data and tools, aiming to foster trustworthy AI for scientific research and medical practice. It targets researchers and developers working with large language models (LLMs) to improve their reasoning capabilities.

How It Works

The framework standardizes CoT data using the Hugging Face 🤗 Datasets format, enabling access to diverse datasets like CommonsenseQA, StrategyQA, QED, WorldTree, and medical/math-specific QA sets. It supports both human-generated and AI-generated reasoning chains, offering post-processing for coherence. The library includes modules for data loading, CoT generation using various LLMs (OpenAI, Hugging Face Hub), and performance evaluation.
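To illustrate what standardizing CoT data can look like, here is a minimal sketch in plain Python. The `CotExample` record, its field names, the `from_multiple_choice` helper, and the raw item layout are all hypothetical and chosen for illustration; they are not ThoughtSource's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CotExample:
    """Hypothetical unified record; NOT ThoughtSource's actual schema."""
    dataset: str
    question: str
    answer: str
    choices: list[str] = field(default_factory=list)
    reasoning_steps: list[str] = field(default_factory=list)
    cot_author: Optional[str] = None  # "human" or a model name


def from_multiple_choice(dataset: str, raw: dict) -> CotExample:
    """Map a raw multiple-choice item (assumed layout) onto the unified record."""
    explanation = raw.get("explanation", "")
    # Naive sentence split: one reasoning step per sentence.
    steps = [s.strip() for s in explanation.split(".") if s.strip()]
    return CotExample(
        dataset=dataset,
        question=raw["stem"].strip(),
        answer=raw["options"][raw["label"]],
        choices=list(raw["options"]),
        reasoning_steps=steps,
        cot_author="human" if steps else None,
    )


# Toy item invented for this example.
raw_item = {
    "stem": "Which gas do plants absorb during photosynthesis? ",
    "options": ["Oxygen", "Carbon dioxide", "Nitrogen"],
    "label": 1,
    "explanation": "Photosynthesis uses carbon dioxide and water. "
                   "Oxygen is released as a by-product.",
}
example = from_multiple_choice("toy_science_qa", raw_item)
print(example.answer)
print(example.reasoning_steps)
```

Once every source dataset is mapped onto one record type like this, generation and evaluation code can be written once and reused across datasets, which is the main benefit of a shared format.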

Quick Start & Requirements

  • Install: Clone the repository, set up a Python virtual environment, then run pip install -e ./libs/cot[api].
  • Prerequisites: Python 3.x, pip, venv.
  • Resources: Datasets must be downloaded before use; generating CoTs with hosted LLMs (e.g., OpenAI) may incur API costs.
  • Docs: Tutorial notebook

Highlighted Details

  • Supports 15+ diverse datasets across general QA, scientific/medical QA, and math word problems.
  • Includes a web-based annotator for comparing and analyzing reasoning chains.
  • Offers functionality to generate CoTs using multiple LLMs and evaluate their performance.
  • Features pre-compiled collections like ThoughtSource_33 for efficient evaluation.
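Evaluating CoTs generated by several LLMs can be sketched as a per-model exact-match accuracy. This is a simplified stand-in, assuming each record carries a model name, the model's final answer, and the gold answer; the `normalize` and `accuracy_by_model` helpers, the record keys, and the model names are hypothetical and do not reflect the library's own evaluation code.

```python
import re
from collections import defaultdict


def normalize(text: str) -> str:
    """Drop punctuation, trim whitespace, and lowercase so 'Yes.' matches 'yes'."""
    return re.sub(r"[^\w\s]", "", text).strip().lower()


def accuracy_by_model(records: list[dict]) -> dict[str, float]:
    """Exact-match accuracy per model.

    Each record is assumed to have 'model', 'prediction', and 'gold' keys.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["model"]] += 1
        if normalize(record["prediction"]) == normalize(record["gold"]):
            correct[record["model"]] += 1
    return {model: correct[model] / total[model] for model in total}


# Toy predictions invented for this example.
records = [
    {"model": "model_a", "prediction": "Yes.", "gold": "yes"},
    {"model": "model_a", "prediction": "no", "gold": "yes"},
    {"model": "model_b", "prediction": " 42 ", "gold": "42"},
    {"model": "model_b", "prediction": "Yes", "gold": "yes"},
]
scores = accuracy_by_model(records)
print(scores)
```

Exact match after light normalization is a deliberately simple metric; free-text answers usually need a more careful answer-extraction step before comparison.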

Maintenance & Community

The project is developed by the Samwald research group. Updates are tracked in the versioning section. Community contributions and dataset suggestions are welcomed.

Licensing & Compatibility

Licenses vary by dataset, including MIT, CC BY-SA 3.0, Apache 2.0, CC BY 4.0, AI2 Mercury, and CC BY-NC 4.0. Some AI-generated data licenses are listed as "Unknown." Compatibility for commercial use depends on the specific dataset licenses.
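Because licensing is per dataset, a first-pass triage for commercial use might look like the sketch below. The license names come from the list above, but the dataset names and the dataset-to-license mapping are placeholders, not the project's real assignments. The sketch assumes that MIT, Apache 2.0, CC BY 4.0, and CC BY-SA 3.0 permit commercial use subject to their own conditions (attribution, share-alike), that CC BY-NC 4.0 does not, and that anything else, including AI2 Mercury and "Unknown", needs manual review.

```python
# Licenses that permit commercial use, subject to their own conditions.
COMMERCIAL_OK = {"MIT", "Apache 2.0", "CC BY 4.0", "CC BY-SA 3.0"}
# Licenses that explicitly exclude commercial use.
NON_COMMERCIAL = {"CC BY-NC 4.0"}


def commercial_status(license_name: str) -> str:
    """Classify a license as 'allowed', 'not allowed', or 'needs review'."""
    if license_name in COMMERCIAL_OK:
        return "allowed"
    if license_name in NON_COMMERCIAL:
        return "not allowed"
    # Custom or unknown licenses (e.g. "AI2 Mercury", "Unknown") are not guessed.
    return "needs review"


# Placeholder mapping: dataset names and assignments are hypothetical.
dataset_licenses = {
    "dataset_a": "MIT",
    "dataset_b": "CC BY-NC 4.0",
    "dataset_c": "AI2 Mercury",
    "dataset_d": "Unknown",
}

usable = [name for name, lic in dataset_licenses.items()
          if commercial_status(lic) == "allowed"]
print(usable)
```

Defaulting to "needs review" keeps the check conservative: a dataset is only treated as usable when its license is positively recognized.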

Limitations & Caveats

Some AI-generated reasoning chains have unknown licenses, so their reuse terms are unclear. The project is described as actively developed, with ongoing efforts to improve dataset quality and expand coverage, but the most recent commit is 9 months old.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: Inactive
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 8 stars in the last 30 days
