GraphGen  by open-sciencelab

Framework for LLM fine-tuning with knowledge-driven synthetic data

Created 9 months ago
408 stars

Top 71.4% on SourcePulse

Project Summary

GraphGen is a framework for generating synthetic data to enhance supervised fine-tuning of Large Language Models (LLMs). It targets researchers and practitioners aiming to improve LLM performance, particularly in knowledge-intensive domains, by addressing knowledge gaps through targeted data creation.

How It Works

GraphGen constructs fine-grained knowledge graphs from source text to identify LLM knowledge gaps using expected calibration error. It prioritizes generating question-answering pairs for high-value, long-tail knowledge, employing multi-hop neighborhood sampling for complex relationships and style-controlled generation for data diversification.
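The two ideas named above can be sketched in a few lines. What follows is a minimal illustration, not GraphGen's own code: it assumes the textbook binned definition of expected calibration error and a plain breadth-first search for multi-hop neighborhoods. The function names, the adjacency-dict graph format, and the default of 10 bins are all hypothetical choices made for this example.

```python
from collections import deque


def expected_calibration_error(confidences, correct, n_bins=10):
    """Textbook binned ECE: sum over bins of (bin size / N) * |accuracy - mean confidence|.

    confidences: floats in [0, 1]; correct: matching booleans.
    This is the standard definition, assumed here for illustration only.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    # Assign each prediction to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece


def k_hop_neighborhood(graph, start, max_hops):
    """Breadth-first search returning {node: hop distance} within max_hops of start.

    graph: dict mapping a node to an iterable of neighbor nodes (hypothetical format).
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance
```

A high calibration error on statements drawn from one region of the graph would suggest a knowledge gap there, and the k-hop neighborhood around such a node is one plausible way to gather related facts for a multi-hop question.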

Quick Start & Requirements

  • Installation: Install uv (package manager), clone the repository, create a uv environment (uv venv --python 3.10), and install dependencies (uv pip install -r requirements.txt).
  • Running Demo: Execute uv run webui/app.py for a Gradio demo.
  • CLI Usage: Requires setting environment variables for synthesizer and trainee models (e.g., SYNTHESIZER_MODEL, TRAINEE_MODEL) and their base URLs/API keys. Run with graphg --output_dir cache.
  • Docker: Build image with docker build -t graphgen . and run with docker run -p 7860:7860 graphgen.
  • Prerequisites: Python 3.10 and the uv package manager. API keys for the synthesizer and trainee LLMs are required for generation.
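Because the CLI depends on environment variables, a small pre-flight check can catch a missing setting before any API call is made. The sketch below is an assumption-laden helper, not part of GraphGen: build_cli_command is a hypothetical name, and it checks only the two variables named above. The base URL and API key variables are not named in this summary, so they are deliberately left out.

```python
# Only the two variable names given in the summary are checked; the base URL
# and API key variable names are not stated here, so they are omitted.
REQUIRED_VARS = ("SYNTHESIZER_MODEL", "TRAINEE_MODEL")


def build_cli_command(env, output_dir="cache"):
    """Return the argument list for the CLI run, or raise if a variable is unset.

    env: a mapping such as os.environ. This helper is hypothetical.
    """
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # Mirrors the documented invocation: graphg --output_dir cache
    return ["graphg", "--output_dir", output_dir]
```

The returned list can be passed to subprocess.run(..., check=True) with os.environ as the mapping, so that the run fails fast with a readable message instead of partway through generation.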

Highlighted Details

  • Supports Google, Bing, Wikipedia, and UniProt as search back-ends for data gap filling.
  • Claims that a training mix in which over 50% of the SFT data comes from GraphGen improves performance on benchmarks such as SeedBench and GPQA-Diamond.
  • Offers a web UI for interactive use.

Maintenance & Community

  • Initial version released April 21, 2025.
  • Project acknowledges contributions from SiliconFlow and libraries like LightRAG and ROGRAG.
  • Community support is available by opening issues or through WeChat groups.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The framework relies on external LLM APIs for data generation, requiring API keys and potentially incurring costs. Performance gains are benchmark-dependent and may vary based on the quality of the knowledge graph construction and the chosen LLM models.

Health Check

  • Last Commit: 6 hours ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 14
  • Issues (30d): 1
  • Star History: 63 stars in the last 30 days
