Synthetic data generator for LLM pretraining
This project generates long-form, textbook-quality synthetic data for LLM pretraining. It is aimed at researchers and developers who need high-quality, structured text datasets, and it offers flexibility in both topic generation and LLM integration.
How It Works
The system orchestrates a multi-stage generation process: it first generates topics from a seed subject or a provided list, then augments the topic list and semantically deduplicates it, and finally generates full textbooks from those topics or from existing outlines. A key feature is the integrated retrieval mechanism, which by default uses services such as Serply or SerpAPI to enhance data quality; retrieval can also be disabled. The core architecture is extensible, allowing custom adaptors for new LLMs and retrieval backends.
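As a concrete illustration of the deduplication stage, topics can be embedded and near-duplicates dropped above a cosine-similarity threshold. This is a minimal sketch assuming the `sentence-transformers` package, not the project's actual implementation:

```python
# Minimal sketch of semantic topic deduplication; illustrative only,
# not the project's actual implementation.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim


def deduplicate_topics(topics: list[str], threshold: float = 0.9) -> list[str]:
    """Keep a topic only if it is not too similar to one already kept."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(topics, convert_to_tensor=True)

    kept: list[int] = []
    for i in range(len(topics)):
        # Accept topic i only if it clears the similarity threshold
        # against every previously accepted topic.
        if all(cos_sim(embeddings[i], embeddings[j]).item() < threshold
               for j in kept):
            kept.append(i)
    return [topics[i] for i in kept]


print(deduplicate_topics([
    "Introduction to linear algebra",
    "Linear algebra basics",  # near-duplicate candidate
    "Organic chemistry fundamentals",
]))
```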
Quick Start & Requirements
1. `git clone` the repository and `cd textbook_quality`
2. `poetry install`
3. Install PostgreSQL (`brew install postgres` on macOS) and create the database: `psql -c "create database textbook;"`
4. `invoke migrate-dev` to run database migrations

Set your API keys in a `local.env` file or as environment variables.
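For example, a `local.env` file might look like the following. The variable names here are illustrative assumptions (the project integrates OpenAI-style LLMs and Serply/SerpAPI retrieval); check the project's README or settings module for the authoritative names:

```
# Hypothetical local.env; variable names are assumptions, not confirmed
# by this summary. Verify against the project's settings.
OPENAI_KEY=sk-...
SERPLY_KEY=your-serply-key
# or, if using SerpAPI instead:
# SERPAPI_KEY=your-serpapi-key
```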
Maintenance & Community
The project is maintained by VikParuchuri. Contributions via Pull Requests are welcomed, with specific areas for extension noted in the README (LLM adaptors, retrieval methods, tasks).
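For illustration, a custom LLM adaptor amounts to implementing a small generation interface. The class and method names below are assumptions, not the project's real API; consult the adaptor modules in the repo for the actual contract:

```python
# Hypothetical shape of a custom LLM adaptor. Class and method names are
# illustrative assumptions, not the project's actual interface.
from abc import ABC, abstractmethod


class LLMAdaptor(ABC):
    """Minimal contract a new model backend would need to satisfy."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 2048,
                 temperature: float = 0.7) -> str:
        """Return the model's completion for a single prompt."""


class LocalModelAdaptor(LLMAdaptor):
    """Example stub for wiring in a self-hosted model."""

    def generate(self, prompt: str, max_tokens: int = 2048,
                 temperature: float = 0.7) -> str:
        # Call your own model or API client here and return the text.
        raise NotImplementedError("wire up your model client here")
```

A custom retrieval backend would be extended analogously, returning passages for a query rather than completions.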
Licensing & Compatibility
The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.
Limitations & Caveats
The project requires a PostgreSQL database and specific API keys for full functionality. While it supports context lengths up to 16k, optimal performance may depend on the LLM's capabilities. The README does not detail specific performance benchmarks or known limitations regarding data diversity or potential biases.