textbook_quality by VikParuchuri

Synthetic data generator for LLM pretraining

created 1 year ago
502 stars

Top 62.8% on sourcepulse

View on GitHub
Project Summary

This project generates long-form, textbook-quality synthetic data for LLM pretraining. It's designed for researchers and developers needing to create high-quality, structured text datasets, offering flexibility in topic generation and LLM integration.

How It Works

The system orchestrates a multi-stage generation process: first, it generates topics from a seed subject or a provided list; next, it augments these topics and semantically deduplicates them; finally, it generates full textbooks from the topics or from existing outlines. A key feature is its integrated retrieval mechanism, which uses Serply or SerpAPI by default to improve data quality, though retrieval can be disabled. The core architecture is extensible, allowing custom adaptors for new LLMs and retrieval backends.
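The deduplication stage above can be sketched as follows. This is a minimal, hypothetical illustration, not the project's actual implementation: the real pipeline likely compares embedding vectors, while this sketch uses difflib's string-similarity ratio as a lightweight stand-in.

```python
from difflib import SequenceMatcher

def dedupe_topics(topics, threshold=0.9):
    """Drop topics that are near-duplicates of an already-kept topic.

    Stand-in for semantic dedup: compares normalized strings with
    difflib instead of embedding similarity.
    """
    kept = []
    for topic in topics:
        normalized = topic.lower().strip()
        is_dup = any(
            SequenceMatcher(None, normalized, k.lower().strip()).ratio() >= threshold
            for k in kept
        )
        if not is_dup:
            kept.append(topic)
    return kept

topics = [
    "Introduction to Linear Algebra",
    "An Introduction to Linear Algebra",  # near-duplicate, dropped
    "Organic Chemistry Basics",
]
print(dedupe_topics(topics))
```

Swapping the similarity function for cosine similarity over sentence embeddings would make this behave like a true semantic filter; the greedy keep-first structure stays the same.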

Quick Start & Requirements

  • Prerequisites: Python 3.9+ (3.11 recommended), PostgreSQL (install via brew install postgres on macOS), invoke.
  • Install: git clone the repository, cd textbook_quality, then poetry install.
  • Setup: create the database with psql -c "create database textbook;", then run invoke migrate-dev.
  • Configuration: Set API keys (OpenAI, Serply/SerpAPI) in a local.env file or as environment variables.
  • Docs: https://github.com/VikParuchuri/textbook_quality
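A local.env for the configuration step might look like the following. The variable names here are assumptions for illustration; check the project's README or settings module for the exact keys it reads:

```shell
# local.env — API keys read at startup (variable names are illustrative)
OPENAI_KEY=your-openai-api-key
SERPLY_KEY=your-serply-key     # or a SerpAPI key if using that backend instead
```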

Highlighted Details

  • Generates up to 70M token examples.
  • Supports parallel generation and custom API endpoints (e.g., vLLM).
  • Retrieval backend is pluggable (Serply, SerpAPI, or none).
  • Caches generated content per model and topic to avoid redundant API calls.
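The per-model, per-topic caching described above typically keys stored output on the (model, topic) pair so a rerun skips completed calls. A toy sketch, not the project's code (the real cache lives in the PostgreSQL database set up above; a dict stands in here to keep the idea self-contained):

```python
import hashlib

class GenerationCache:
    """In-memory cache keyed on (model, topic).

    Key = SHA-256 of model name + topic, so regenerating the same topic
    with a different model is a cache miss, while a rerun with the same
    model returns the stored text without another API call.
    """

    def __init__(self):
        self._store = {}
        self.calls = 0  # counts actual "API" invocations

    def _key(self, model, topic):
        return hashlib.sha256(f"{model}\x00{topic}".encode()).hexdigest()

    def generate(self, model, topic, llm_call):
        key = self._key(model, topic)
        if key not in self._store:
            self.calls += 1
            self._store[key] = llm_call(model, topic)
        return self._store[key]

cache = GenerationCache()
fake_llm = lambda model, topic: f"[{model}] textbook about {topic}"
cache.generate("gpt-3.5-turbo", "Linear Algebra", fake_llm)
cache.generate("gpt-3.5-turbo", "Linear Algebra", fake_llm)  # served from cache
print(cache.calls)  # → 1
```

The separator byte in the key prevents ambiguous concatenations (e.g. model "a" + topic "bc" colliding with model "ab" + topic "c").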

Maintenance & Community

The project is maintained by VikParuchuri. Contributions via pull requests are welcome, with specific extension points noted in the README (LLM adaptors, retrieval methods, tasks).

Licensing & Compatibility

The repository does not explicitly state a license in the provided README. Users should verify licensing for commercial use or integration with closed-source projects.

Limitations & Caveats

The project requires a PostgreSQL database and specific API keys for full functionality. While it supports context lengths up to 16k, optimal performance may depend on the LLM's capabilities. The README does not detail specific performance benchmarks or known limitations regarding data diversity or potential biases.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days

Explore Similar Projects

Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems), Daniel Han (Cofounder of Unsloth), and 1 more.

synthetic-data-kit by meta-llama

Synthetic data CLI tool for LLM fine-tuning. 1.6% · 1k stars · created 4 months ago · updated 1 week ago. Starred by Chip Huyen (Author of AI Engineering, Designing Machine Learning Systems) and Elie Bursztein (Cybersecurity Lead at Google DeepMind).

LightRAG by HKUDS

RAG framework for fast, simple retrieval-augmented generation. 1.0% · 19k stars · created 10 months ago · updated 20 hours ago.