ThoughtSource by OpenBioLink

Framework for chain-of-thought reasoning data and tools

Created 3 years ago

1,014 stars

Top 36.6% on SourcePulse

4 Experts Love This Project

Vincent Weisser, Cofounder of Prime Intellect
Wing Lian, Founder of Axolotl AI
Elvis Saravia, Founder of DAIR.AI
Jeff Hammerbacher, Cofounder of Cloudera

Project Summary

ThoughtSource provides a centralized, open resource for chain-of-thought (CoT) reasoning data and tools, aiming to foster trustworthy AI for scientific research and medical practice. It targets researchers and developers working with large language models (LLMs) to improve their reasoning capabilities.

How It Works

The framework standardizes CoT data using the Hugging Face 🤗 Datasets format, enabling access to diverse datasets like CommonsenseQA, StrategyQA, QED, WorldTree, and medical/math-specific QA sets. It supports both human-generated and AI-generated reasoning chains, offering post-processing for coherence. The library includes modules for data loading, CoT generation using various LLMs (OpenAI, Hugging Face Hub), and performance evaluation.
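To make the idea of a standardized CoT record and an evaluation step concrete, here is a minimal Python sketch. It does not use the ThoughtSource library: the `CotRecord` schema, its field names, and the `accuracy` helper are hypothetical and only illustrate the general shape of the loading, generation, and evaluation workflow described above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CotRecord:
    """One QA item with an optional reasoning chain (hypothetical schema)."""
    question: str
    answer: str                                   # gold answer
    cot: List[str] = field(default_factory=list)  # reasoning steps, human- or AI-written
    prediction: Optional[str] = None              # answer extracted from a generated chain


def normalize(text: str) -> str:
    """Lowercase and trim so that 'No ' and 'no' compare equal."""
    return text.strip().lower()


def accuracy(records: List[CotRecord]) -> float:
    """Fraction of records with a prediction that matches the gold answer."""
    scored = [r for r in records if r.prediction is not None]
    if not scored:
        return 0.0
    correct = sum(normalize(r.prediction) == normalize(r.answer) for r in scored)
    return correct / len(scored)


records = [
    CotRecord("Can a penguin fly?", "no",
              ["Penguins are flightless birds."], "No"),
    CotRecord("Is 7 a prime number?", "yes",
              ["7 has no divisors other than 1 and 7."], "no"),
]
print(accuracy(records))  # 0.5
```

Keeping every dataset in one record shape is what lets a single evaluation function run unchanged across general, scientific, medical, and math QA sets.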

Quick Start & Requirements

Install: clone the repository, create and activate a Python virtual environment, then run pip install -e ./libs/cot[api] from the repository root.
Prerequisites: Python 3.x, pip, venv.
Resources: Datasets must be downloaded before use; generating CoTs with hosted LLMs (for example through the OpenAI API) may incur API costs.
Docs: Tutorial notebook

Highlighted Details

Supports 15+ diverse datasets across general QA, scientific/medical QA, and math word problems.
Includes a web-based annotator for comparing and analyzing reasoning chains.
Offers functionality to generate CoTs using multiple LLMs and evaluate their performance.
Features pre-compiled collections like ThoughtSource_33 for efficient evaluation.

Maintenance & Community

The project is developed by the Samwald research group. Updates are tracked in the versioning section of the repository README. Community contributions and dataset suggestions are welcome.

Licensing & Compatibility

Licenses vary by dataset, including MIT, CC BY-SA 3.0, Apache 2.0, CC BY 4.0, AI2 Mercury, and CC BY-NC 4.0. Some AI-generated data licenses are listed as "Unknown." Compatibility for commercial use depends on the specific dataset licenses.
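Because commercial compatibility is decided per dataset, it can help to sort datasets by license before building a collection. The sketch below is an assumption-laden illustration: the dataset names are placeholders, not a real mapping of ThoughtSource datasets to licenses. The groupings reflect only the general terms of each license (CC BY-NC 4.0 prohibits commercial use; anything not recognized is flagged for manual review).

```python
from typing import Dict, List

# Licenses that generally permit commercial use (attribution or
# share-alike conditions may still apply).
COMMERCIAL_OK = {"MIT", "Apache 2.0", "CC BY 4.0", "CC BY-SA 3.0"}
# Licenses that explicitly prohibit commercial use.
NON_COMMERCIAL = {"CC BY-NC 4.0"}


def classify(license_name: str) -> str:
    """Map a license name to a coarse commercial-use status."""
    if license_name in COMMERCIAL_OK:
        return "commercial-ok"
    if license_name in NON_COMMERCIAL:
        return "non-commercial"
    # Custom or missing terms (e.g. "AI2 Mercury", "Unknown") need a human check.
    return "needs-review"


def group_by_status(dataset_licenses: Dict[str, str]) -> Dict[str, List[str]]:
    """Group dataset names by the status of their license."""
    groups: Dict[str, List[str]] = {}
    for name, license_name in dataset_licenses.items():
        groups.setdefault(classify(license_name), []).append(name)
    return groups


# Placeholder names: NOT the actual license of any ThoughtSource dataset.
example = {
    "dataset_a": "MIT",
    "dataset_b": "CC BY-NC 4.0",
    "dataset_c": "AI2 Mercury",
    "dataset_d": "Unknown",
}
print(group_by_status(example))
```

Defaulting unrecognized licenses to "needs-review" rather than to either known group keeps datasets with unclear terms out of a commercial pipeline until someone has read the license.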

Limitations & Caveats

Some AI-generated reasoning chains have unknown licenses, which leaves their reuse terms unclear. The project describes ongoing efforts to improve dataset quality and expand coverage, but the Health Check below reports that the last commit was a year ago.

Health Check

Last Commit: 1 year ago
Responsiveness: Inactive

Star History

3 stars in the last 30 days