GraphGen by open-sciencelab

Framework for LLM fine-tuning with knowledge-driven synthetic data

Created 10 months ago

577 stars

Top 56.0% on SourcePulse

1 Expert Loves This Project

Yaowei Zheng

Author of LLaMA-Factory

Project Summary

GraphGen is a framework for generating synthetic data to enhance supervised fine-tuning of Large Language Models (LLMs). It targets researchers and practitioners aiming to improve LLM performance, particularly in knowledge-intensive domains, by addressing knowledge gaps through targeted data creation.

How It Works

GraphGen constructs fine-grained knowledge graphs from source text to identify LLM knowledge gaps using expected calibration error. It prioritizes generating question-answering pairs for high-value, long-tail knowledge, employing multi-hop neighborhood sampling for complex relationships and style-controlled generation for data diversification.
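To make the calibration step concrete, here is a minimal sketch of the standard binned expected calibration error (ECE). The summary does not say how GraphGen computes the metric, so the function name, the equal-width binning, and the input format (one confidence and one correctness flag per judged statement) are assumptions for illustration, not the project's API.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted average of |mean confidence - accuracy| per bin.

    Hypothetical helper, not part of GraphGen. `confidences` are the model's
    stated probabilities in [0, 1]; `correct` flags whether each answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if any(c < 0.0 or c > 1.0 for c in confidences):
        raise ValueError("confidences must lie in [0, 1]")
    if not confidences:
        return 0.0

    # Assign each prediction to an equal-width bin; a confidence of 1.0
    # falls into the last bin.
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's calibration gap by its share of all predictions.
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece
```

A large gap for statements drawn from one region of the knowledge graph would suggest the trainee model is poorly calibrated on that knowledge, which is the kind of signal the description above says is used to prioritize question-answer generation.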

Quick Start & Requirements

Installation: Install uv (package manager), clone the repository, create a uv environment (uv venv --python 3.10), and install dependencies (uv pip install -r requirements.txt).
Running Demo: Execute uv run webui/app.py for a Gradio demo.
CLI Usage: Requires setting environment variables for synthesizer and trainee models (e.g., SYNTHESIZER_MODEL, TRAINEE_MODEL) and their base URLs/API keys. Run with graphg --output_dir cache.
Docker: Build image with docker build -t graphgen . and run with docker run -p 7860:7860 graphgen.
Prerequisites: Python 3.10 and the uv package manager. LLM API keys are required for generation.
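Because the CLI depends on environment variables, a small pre-flight check can catch a missing setting before any API call is made. The sketch below is hypothetical and not part of GraphGen: only SYNTHESIZER_MODEL and TRAINEE_MODEL are named in the summary above, and the variable names for the base URLs and API keys are not given there, so they are left for the caller to pass in.

```python
import os
from typing import List, Mapping, Optional, Sequence

# Only these two names appear in the summary; the base-URL and API-key
# variable names are not stated there, so add them yourself once known.
REQUIRED_VARS = ("SYNTHESIZER_MODEL", "TRAINEE_MODEL")


def missing_env_vars(
    required: Sequence[str] = REQUIRED_VARS,
    env: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Example with an explicit mapping; "my-synthesizer" is a placeholder value.
example_env = {"SYNTHESIZER_MODEL": "my-synthesizer", "TRAINEE_MODEL": ""}
print(missing_env_vars(env=example_env))  # prints ['TRAINEE_MODEL']
```

Calling missing_env_vars() with no arguments checks the real process environment; an empty result means the two named variables are set and graphg --output_dir cache can be tried.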

Highlighted Details

Supports Google, Bing, Wikipedia, and UniProt as search back-ends for data gap filling.
Reports improved results on benchmarks such as SeedBench and GPQA-Diamond when over 50% of the SFT data comes from GraphGen.
Offers a web UI for interactive use.

Maintenance & Community

Initial version released April 21, 2025.
Project acknowledges contributions from SiliconFlow and libraries like LightRAG and ROGRAG.
Community support is available through GitHub issues or WeChat groups.

Licensing & Compatibility

Licensed under the Apache License 2.0.
Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The framework relies on external LLM APIs for data generation, requiring API keys and potentially incurring costs. Performance gains are benchmark-dependent and may vary based on the quality of the knowledge graph construction and the chosen LLM models.

Health Check

Last Commit: 4 days ago
Responsiveness: Inactive

Star History

117 stars in the last 30 days