Raspberry by daveshap

Open-source dataset for finetuning LLMs with reasoning

created 10 months ago
417 stars

Top 71.3% on sourcepulse

Project Summary

Raspberry aims to create an open-source toy dataset for finetuning Large Language Models (LLMs) with enhanced reasoning abilities. Targeting researchers and developers focused on improving LLM reasoning, it offers a structured approach to generating complex queries and corresponding Chain-of-Thought (CoT) and self-critique data.

How It Works

The project synthesizes 500 distinct, complex user queries across various domains requiring math, coding, logic, and planning skills. These queries are then used to generate CoT and self-critique data via automated prompting strategies, leveraging LLMs' inherent reasoning capabilities. The generated samples undergo cleaning and rectification using rubrics and grading techniques to ensure coherence and suitability for single-shot reasoning datasets.
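The three-stage pipeline described above (synthesize queries, generate CoT plus self-critique, then grade against a rubric) can be sketched roughly as follows. This is a hypothetical illustration, not code from the repository: `call_llm` is a placeholder for any CoT-capable model API (e.g. Claude), and the domain list and pass/fail rubric check are assumptions.

```python
# Hypothetical sketch of the Raspberry-style pipeline:
# 1) synthesize complex queries across domains,
# 2) generate chain-of-thought and self-critique per query,
# 3) grade each sample against a rubric, keeping only passing ones.

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real API call to a CoT-capable model.
    return f"[model response to: {prompt[:40]}...]"

DOMAINS = ["math", "coding", "logic", "planning"]  # assumed domain split

def synthesize_queries(n_per_domain: int) -> list[dict]:
    """Stage 1: generate distinct, complex user queries per domain."""
    queries = []
    for domain in DOMAINS:
        for i in range(n_per_domain):
            q = call_llm(f"Write one complex {domain} question (variant {i}).")
            queries.append({"domain": domain, "query": q})
    return queries

def generate_sample(query: dict) -> dict:
    """Stage 2: elicit step-by-step reasoning, then a self-critique of it."""
    cot = call_llm(f"Think step by step, then answer: {query['query']}")
    critique = call_llm(f"Critique this reasoning for errors: {cot}")
    return {**query, "chain_of_thought": cot, "self_critique": critique}

def passes_rubric(sample: dict) -> bool:
    """Stage 3: placeholder grading step; a real rubric would score
    coherence, correctness, and single-shot suitability."""
    grade = call_llm(f"Grade PASS or FAIL by rubric: {sample['chain_of_thought']}")
    return "PASS" in grade

dataset = []
for q in synthesize_queries(2):  # toy scale; the project targets 500 queries
    sample = generate_sample(q)
    if passes_rubric(sample):
        dataset.append(sample)
```

With a real model behind `call_llm`, the cleaning stage would discard incoherent samples rather than passing everything through, as the stub does here.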

Quick Start & Requirements

  • Install: The README provides no installation commands; the project is focused on dataset generation rather than distributable software.
  • Prerequisites: Access to LLMs capable of CoT reasoning (e.g., Claude) is implied for data synthesis.
  • Resources: Data synthesis and cleaning will require computational resources for running LLMs and processing text.

Highlighted Details

  • Focus on synthesizing complex user queries across diverse domains.
  • Generation of Chain-of-Thought (CoT) and self-critique data for LLM finetuning.
  • Goal to demonstrate near-State-of-the-Art (SOTA) performance on reasoning benchmarks.
  • Potential to release an open-source RL-trained model.

Maintenance & Community

The project is initiated by daveshap. Further community engagement and scaling plans are mentioned, including seeking funding via Manifund.

Licensing & Compatibility

  • License: MIT License.
  • Compatibility: Permissive MIT license allows for commercial use and integration into closed-source projects.

Limitations & Caveats

The project is described as a "toy dataset" and a "pilot" for proof of concept. Achieving near-SOTA performance is an ambitious goal for a small, toy dataset. The initial dataset size is 500 queries, which may be insufficient for robust finetuning across all targeted reasoning abilities.

Health Check

  • Last commit: 9 months ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

