Open dataset for training reasoning models
This repository provides fully open data curation for reasoning models, targeting researchers and developers who want to train state-of-the-art small reasoning models. It offers curated datasets and trained models that are reported to outperform existing open models on math and code reasoning benchmarks.
How It Works
The project focuses on generating and curating high-quality reasoning datasets, such as OpenThoughts2-1M and OpenThoughts-114k. These datasets are created through systematic ablation studies on various question generation methodologies, sampling from the highest-performing approaches. This data-centric approach aims to improve the reasoning capabilities of language models.
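The selection step described above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's actual pipeline: the strategy names, scores, and question pools are all hypothetical, and the real ablations cover many more methodologies.

```python
# Sketch: rank question-generation strategies by ablation score, keep the top
# performers, and sample questions from their pools weighted by score.
# All names and numbers are illustrative, not the project's actual results.
import random

ablation_scores = {  # hypothetical benchmark accuracy per generation strategy
    "seed_rewrite": 0.62,
    "web_extraction": 0.55,
    "llm_synthesis": 0.71,
    "template_fill": 0.40,
}

def top_strategies(scores, k=2):
    """Keep the k highest-scoring generation methodologies."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def sample_questions(pools, strategies, n, seed=0):
    """Draw n questions from the winning strategies' pools,
    weighting each strategy by its ablation score."""
    rng = random.Random(seed)
    weights = [ablation_scores[s] for s in strategies]
    picks = rng.choices(strategies, weights=weights, k=n)
    return [rng.choice(pools[s]) for s in picks]

# Toy question pools, one per strategy.
pools = {s: [f"{s}_q{i}" for i in range(100)] for s in ablation_scores}
best = top_strategies(ablation_scores)   # ['llm_synthesis', 'seed_rewrite']
dataset = sample_questions(pools, best, n=5)
```

The key design point this mirrors is that curation is data-centric: low-performing generation methods are dropped entirely rather than diluted into the final mix.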
Quick Start & Requirements
Clone the repository and set up dependencies:

make install
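The curated datasets can also be pulled directly from the Hugging Face Hub with the `datasets` library. This is a minimal sketch under assumptions: the repo IDs below are inferred from the dataset names mentioned above and should be verified on the Hub before use.

```python
# Sketch: load an OpenThoughts dataset from the Hugging Face Hub.
# Repo IDs are assumed from the dataset names in this README; verify on the Hub.
DATASET_IDS = [
    "open-thoughts/OpenThoughts2-1M",
    "open-thoughts/OpenThoughts-114k",
]

def load_reasoning_dataset(repo_id: str, split: str = "train"):
    """Stream a dataset split so the full corpus is not downloaded at once."""
    from datasets import load_dataset  # lazy import: requires `pip install datasets`
    return load_dataset(repo_id, split=split, streaming=True)

# Usage (requires network access):
#   ds = load_reasoning_dataset(DATASET_IDS[0])
#   print(next(iter(ds)))  # inspect the fields of one example
```

Streaming mode is used here because the larger dataset contains on the order of a million examples, which is impractical to materialize locally just for inspection.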
Maintenance & Community
The project is a collaboration led by Bespoke Labs and the DataComp community, with contributions from researchers at multiple universities and institutions. It has a growing community with over 190 public models on Hugging Face trained using its datasets.
Licensing & Compatibility
The repository states it is "fully open-source" and lists model weights, datasets, and code as publicly available. However, specific license details are not stated in the README; while the emphasis on open access suggests permissive licensing, users should check the repository's LICENSE file before relying on it for commercial use or closed-source linking.
Limitations & Caveats
Training and evaluation code are listed as "coming soon," indicating that the full pipeline for reproducing or extending the results may not yet be available.