Framework for LLM fine-tuning with knowledge-driven synthetic data
Top 88.2% on SourcePulse
GraphGen is a framework for generating synthetic data to enhance supervised fine-tuning of Large Language Models (LLMs). It targets researchers and practitioners aiming to improve LLM performance, particularly in knowledge-intensive domains, by addressing knowledge gaps through targeted data creation.
How It Works
GraphGen constructs fine-grained knowledge graphs from source text and uses the trainee model's expected calibration error to identify its knowledge gaps. It prioritizes generating question-answer pairs for high-value, long-tail knowledge, using multi-hop neighborhood sampling to capture complex relationships and style-controlled generation to diversify the data.
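The two core signals can be illustrated with a minimal sketch, assuming a simple adjacency-set graph and a trainee that reports a confidence for each graph-derived statement; the names and data structures below are illustrative assumptions, not GraphGen's actual API.

```python
# Minimal sketch (not GraphGen's actual code) of two signals described above:
# expected calibration error (ECE) over the trainee model's judgments of
# graph-derived statements, and k-hop neighborhood sampling around a seed entity.
from dataclasses import dataclass


@dataclass
class Judgment:
    confidence: float  # trainee's stated probability that a statement is true
    correct: bool      # whether the trainee's answer matched the graph fact


def expected_calibration_error(judgments: list[Judgment], n_bins: int = 10) -> float:
    """Bin judgments by confidence and average the |confidence - accuracy| gap,
    weighted by bin size. A high ECE marks knowledge the trainee is miscalibrated on."""
    bins: list[list[Judgment]] = [[] for _ in range(n_bins)]
    for j in judgments:
        bins[min(int(j.confidence * n_bins), n_bins - 1)].append(j)
    total = len(judgments)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(j.confidence for j in bucket) / len(bucket)
        accuracy = sum(j.correct for j in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


def k_hop_neighborhood(graph: dict[str, set[str]], seed: str, k: int) -> set[str]:
    """Breadth-first expansion: all entities within k hops of a seed node,
    the raw material for multi-hop question-answer pairs."""
    frontier, visited = {seed}, {seed}
    for _ in range(k):
        frontier = {nbr for node in frontier for nbr in graph.get(node, set())} - visited
        visited |= frontier
    return visited
```

In this framing, statements where the trainee is confidently wrong contribute most to ECE and would be the ones prioritized for question-answer generation.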
Quick Start & Requirements
- Install the uv package manager, clone the repository, create a uv environment (uv venv --python 3.10), and install dependencies (uv pip install -r requirements.txt).
- Launch the Gradio demo with uv run webui/app.py.
- For the CLI, set the synthesizer and trainee models (SYNTHESIZER_MODEL, TRAINEE_MODEL) and their base URLs/API keys (a hedged configuration sketch follows this list), then run graphg --output_dir cache.
- For Docker, build the image with docker build -t graphgen . and run it with docker run -p 7860:7860 graphgen.
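As referenced above, here is a minimal sketch of assembling the model configuration from environment variables. Only SYNTHESIZER_MODEL and TRAINEE_MODEL appear in the quick start; the *_BASE_URL and *_API_KEY names are assumptions for illustration.

```python
# Hedged sketch: collecting the synthesizer/trainee configuration from the
# environment before running GraphGen. Only SYNTHESIZER_MODEL and TRAINEE_MODEL
# are documented names; the remaining variable names are assumptions.
import os


def model_config(prefix: str) -> dict[str, str | None]:
    return {
        "model": os.environ.get(f"{prefix}_MODEL"),
        "base_url": os.environ.get(f"{prefix}_BASE_URL"),  # assumed variable name
        "api_key": os.environ.get(f"{prefix}_API_KEY"),    # assumed variable name
    }


config = {
    "synthesizer": model_config("SYNTHESIZER"),  # model that builds the KG and writes QA pairs
    "trainee": model_config("TRAINEE"),          # model whose knowledge gaps are probed
}
```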
Requires the uv package manager (Python 3.10). API keys for the LLM models are required for generation.
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
The framework relies on external LLM APIs for data generation, requiring API keys and potentially incurring costs. Performance gains are benchmark-dependent and may vary based on the quality of the knowledge graph construction and the chosen LLM models.