Open dataset for training reasoning models
This repository provides fully open data curation for reasoning models, targeting researchers and developers who want to train state-of-the-art small reasoning models. It offers curated datasets and trained models that are reported to outperform existing open models on math and code reasoning benchmarks.
How It Works
The project focuses on generating and curating high-quality reasoning datasets, such as OpenThoughts2-1M and OpenThoughts-114k. These datasets are created through systematic ablation studies on various question generation methodologies, sampling from the highest-performing approaches. This data-centric approach aims to improve the reasoning capabilities of language models.
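The selection step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's actual pipeline: the strategy names, scores, and question pools are all hypothetical, and the real ablations cover many more methodologies.

```python
# Sketch: rank question-generation strategies by ablation score, keep the top
# performers, and sample questions from their pools weighted by score.
# All names and numbers are illustrative, not the project's actual results.
import random

ablation_scores = {  # hypothetical benchmark accuracy per generation strategy
    "seed_rewrite": 0.62,
    "web_extraction": 0.55,
    "llm_synthesis": 0.71,
    "template_fill": 0.40,
}

def top_strategies(scores, k=2):
    """Keep the k highest-scoring generation methodologies."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def sample_questions(pools, strategies, n, seed=0):
    """Draw n questions from the winning strategies' pools,
    weighting each strategy by its ablation score."""
    rng = random.Random(seed)
    weights = [ablation_scores[s] for s in strategies]
    picks = rng.choices(strategies, weights=weights, k=n)
    return [rng.choice(pools[s]) for s in picks]

# Toy question pools, one per strategy.
pools = {s: [f"{s}_q{i}" for i in range(100)] for s in ablation_scores}
best = top_strategies(ablation_scores)   # ['llm_synthesis', 'seed_rewrite']
dataset = sample_questions(pools, best, n=5)
```

The key design point this mirrors is that curation is data-centric: low-performing generation methods are dropped entirely rather than diluted into the final mix.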
Quick Start & Requirements
Clone the repository and set up dependencies:

make install
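The curated datasets can also be pulled directly from the Hugging Face Hub with the `datasets` library. This is a minimal sketch under assumptions: the repo IDs below are inferred from the dataset names mentioned above and should be verified on the Hub before use.

```python
# Sketch: load an OpenThoughts dataset from the Hugging Face Hub.
# Repo IDs are assumed from the dataset names in this README; verify on the Hub.
DATASET_IDS = [
    "open-thoughts/OpenThoughts2-1M",
    "open-thoughts/OpenThoughts-114k",
]

def load_reasoning_dataset(repo_id: str, split: str = "train"):
    """Stream a dataset split so the full corpus is not downloaded at once."""
    from datasets import load_dataset  # lazy import: requires `pip install datasets`
    return load_dataset(repo_id, split=split, streaming=True)

# Usage (requires network access):
#   ds = load_reasoning_dataset(DATASET_IDS[0])
#   print(next(iter(ds)))  # inspect the fields of one example
```

Streaming mode is used here because the larger dataset contains on the order of a million examples, which is impractical to materialize locally just for inspection.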
Maintenance & Community
The project is a collaboration led by Bespoke Labs and the DataComp community, with contributions from researchers at multiple universities and institutions. It has a growing community with over 190 public models on Hugging Face trained using its datasets.
Licensing & Compatibility
The repository states it is "fully open-source" and lists model weights, datasets, and code as publicly available. However, specific license details are not stated in the README; while the emphasis on open access suggests permissive licensing, users should check the repository's LICENSE file before relying on it for commercial use or closed-source linking.
Limitations & Caveats
Training and evaluation code are listed as "coming soon," indicating that the full pipeline for reproducing or extending the results may not yet be available.