GraphGen by open-sciencelab

Framework for LLM fine-tuning with knowledge-driven synthetic data

created 7 months ago
302 stars

Project Summary

GraphGen is a framework for generating synthetic data to enhance supervised fine-tuning of Large Language Models (LLMs). It targets researchers and practitioners aiming to improve LLM performance, particularly in knowledge-intensive domains, by addressing knowledge gaps through targeted data creation.

How It Works

GraphGen constructs fine-grained knowledge graphs from source text to identify LLM knowledge gaps using expected calibration error. It prioritizes generating question-answering pairs for high-value, long-tail knowledge, employing multi-hop neighborhood sampling for complex relationships and style-controlled generation for data diversification.
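
The two ideas above can be sketched in a few lines. This is an illustrative sketch only, not GraphGen's implementation: it uses the standard equal-width-bin definition of expected calibration error and a plain breadth-first k-hop neighborhood lookup. The function names, the bin count, and the adjacency-dict graph format are assumptions made for this example.

```python
from collections import deque
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumed bin count, chosen for illustration
) -> float:
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        return 0.0

    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; clamp so 1.0 lands in the last bin.
        idx = max(0, min(int(conf * n_bins), n_bins - 1))
        conf_sum[idx] += conf
        hits[idx] += int(ok)
        counts[idx] += 1

    ece = 0.0
    for s, h, k in zip(conf_sum, hits, counts):
        if k:
            ece += (k / total) * abs(h / k - s / k)
    return ece


def k_hop_neighborhood(graph: dict[str, set[str]], start: str, k: int) -> set[str]:
    """Return every node reachable from `start` in at most k hops (BFS)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue  # do not expand beyond k hops
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen
```

A high ECE for a group of statements suggests the model's confidence does not match its accuracy there, which is the kind of signal that could flag a knowledge gap; the k-hop set is the sort of subgraph from which a multi-hop question might be drawn.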

Quick Start & Requirements

  • Installation: Install uv (package manager), clone the repository, create a uv environment (uv venv --python 3.10), and install dependencies (uv pip install -r requirements.txt).
  • Running Demo: Execute uv run webui/app.py for a Gradio demo.
  • CLI Usage: Requires setting environment variables for synthesizer and trainee models (e.g., SYNTHESIZER_MODEL, TRAINEE_MODEL) and their base URLs/API keys. Run with graphg --output_dir cache.
  • Docker: Build image with docker build -t graphgen . and run with docker run -p 7860:7860 graphgen.
  • Prerequisites: Python 3.10 and the uv package manager. API keys for the LLMs are required for generation.
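
Because the CLI depends on environment variables, a small pre-flight check can catch a missing setting before a run starts. The sketch below is a hypothetical helper, not part of GraphGen: it checks only the two variable names given above (SYNTHESIZER_MODEL and TRAINEE_MODEL) and assembles the graphg command as an argument list. The base URL and API key variable names are not listed here, so they are left out; add them to REQUIRED_VARS once you know them.

```python
import os
import subprocess
from typing import Mapping, Sequence

# Only the names mentioned in the summary; extend with the base URL and
# API key variables used by your setup (their names are not given here).
REQUIRED_VARS = ("SYNTHESIZER_MODEL", "TRAINEE_MODEL")


def missing_vars(
    env: Mapping[str, str], required: Sequence[str] = REQUIRED_VARS
) -> list[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


def build_command(output_dir: str = "cache") -> list[str]:
    """Build the CLI invocation described above as an argument list."""
    return ["graphg", "--output_dir", output_dir]


def main() -> None:
    """Hypothetical entry point: check the environment, then launch the CLI."""
    missing = missing_vars(os.environ)
    if missing:
        raise SystemExit(f"Set these environment variables first: {', '.join(missing)}")
    subprocess.run(build_command(), check=True)
```

Calling main() from an activated uv environment would run the same command as typing graphg --output_dir cache by hand, with the check performed first.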

Highlighted Details

  • Supports Google, Bing, Wikipedia, and UniProt as search back-ends for data gap filling.
  • Reports that models whose SFT data is more than 50% GraphGen-generated improve on benchmarks such as SeedBench and GPQA-Diamond.
  • Offers a web UI for interactive use.

Maintenance & Community

  • Initial version released April 21, 2025.
  • Project acknowledges contributions from SiliconFlow and libraries like LightRAG and ROGRAG.
  • Community support is available through GitHub issues or WeChat groups.

Licensing & Compatibility

  • Licensed under the Apache License 2.0.
  • Permissive license suitable for commercial use and integration with closed-source projects.

Limitations & Caveats

The framework relies on external LLM APIs for data generation, requiring API keys and potentially incurring costs. Performance gains are benchmark-dependent and may vary based on the quality of the knowledge graph construction and the chosen LLM models.

Health Check

  • Last commit: 2 days ago.
  • Responsiveness: Inactive.
  • Pull requests (30d): 5.
  • Issues (30d): 1.
  • Star history: 55 stars in the last 30 days.
