textbook_quality by VikParuchuri

Synthetic data generator for LLM pretraining

Created 2 years ago
505 stars

Top 61.7% on SourcePulse

Project Summary

This project generates long-form, textbook-quality synthetic data for LLM pretraining. It's designed for researchers and developers needing to create high-quality, structured text datasets, offering flexibility in topic generation and LLM integration.

How It Works

The system orchestrates a multi-stage generation process: first, it generates topics from a seed subject or a provided list. It then augments these topics and semantically deduplicates them. Finally, it generates full textbooks from the topics or from existing outlines. A key feature is an integrated retrieval step that grounds generation in search results, using Serply or SerpAPI by default; retrieval can also be disabled entirely. The core architecture is extensible, allowing custom adaptors for new LLMs and retrieval backends.

Quick Start & Requirements

  • Prerequisites: Python 3.9+ (3.11 recommended), PostgreSQL (install via brew install postgres on macOS), invoke.
  • Install: git clone the repository, cd textbook_quality, then poetry install.
  • Setup: create the database with psql -c "create database textbook;", then apply migrations with invoke migrate-dev.
  • Configuration: set API keys (OpenAI, Serply/SerpAPI) in a local.env file or as environment variables.
  • Docs: https://github.com/VikParuchuri/textbook_quality
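A local.env file might look like the fragment below. The exact variable names are assumptions here, not confirmed by this summary; check the project README for the names it actually reads:

```shell
# local.env — example only; variable names are assumptions, verify against the README
OPENAI_KEY="sk-..."    # OpenAI API key
SERPLY_KEY="..."       # Serply search key (or a SerpAPI key if using that backend)
```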

Highlighted Details

  • Can generate datasets of up to 70M tokens.
  • Supports parallel generation and custom API endpoints (e.g., vLLM).
  • Retrieval backend is pluggable (Serply, SerpAPI, or none).
  • Caches generated content per model and topic to avoid redundant API calls.
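The per-model, per-topic caching mentioned above can be sketched as follows. This is a hypothetical scheme for illustration, not the project's actual cache layout; cache_key, get_or_generate, and the .cache directory are all invented names:

```python
import hashlib
import json
from pathlib import Path

def cache_key(model: str, topic: str) -> str:
    # Deterministic key for a (model, topic) pair; sorted JSON keeps it stable.
    payload = json.dumps({"model": model, "topic": topic}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def get_or_generate(model: str, topic: str, generate, cache_dir: Path = Path(".cache")) -> str:
    # Return cached text if present; otherwise call the generator and cache it.
    cache_dir.mkdir(exist_ok=True)
    path = cache_dir / f"{cache_key(model, topic)}.txt"
    if path.exists():
        return path.read_text()   # cache hit: no API call
    text = generate(topic)        # cache miss: call the LLM
    path.write_text(text)
    return text
```

Keying on both model and topic means switching models regenerates content rather than serving stale output from a different model.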

Maintenance & Community

The project is maintained by VikParuchuri. Contributions via pull requests are welcome, with specific extension points noted in the README (LLM adaptors, retrieval methods, tasks).

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

The project requires a PostgreSQL database and specific API keys for full functionality. While it supports context lengths up to 16k, optimal performance may depend on the LLM's capabilities. The README does not detail specific performance benchmarks or known limitations regarding data diversity or potential biases.

Health Check

  • Last Commit: 1 year ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 1 star in the last 30 days
