DataArc-SynData-Toolkit  by DataArcTech

LLM synthetic data generation toolkit

Created 1 month ago
695 stars

Top 49.1% on SourcePulse

Project Summary

The DataArc SynData Toolkit is a modular platform for generating customized synthetic training data for LLMs. It targets engineers and researchers who want to simplify data synthesis through zero-code CLI and GUI options, and it claims significant performance improvements for downstream models trained on the generated data.

How It Works

The toolkit uses a modular pipeline architecture in which each step of synthetic data generation can be customized independently. Three core generation methods are supported: drawing on local corpora, pulling from Hugging Face datasets, and distilling from a model. Filtering and rewriting stages are also built in, so users can tailor the synthesized data to the requirements of the target model. Because the stages are decoupled, developers can extend the toolkit by subclassing its base classes to implement custom strategies.
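The decoupled design described above can be illustrated with a minimal sketch. The class names used here (`Step`, `TemplateGenerator`, `MinLengthFilter`, `Pipeline`) are hypothetical and are not taken from the toolkit's codebase; they only show how a custom strategy might subclass a shared base class and slot into a pipeline.

```python
from abc import ABC, abstractmethod


class Step(ABC):
    """Hypothetical base class: one stage of a synthetic-data pipeline."""

    @abstractmethod
    def run(self, samples: list[str]) -> list[str]:
        """Transform a batch of samples and return the result."""


class TemplateGenerator(Step):
    """Hypothetical generation step: expands seed topics into prompts."""

    def __init__(self, template: str) -> None:
        self.template = template

    def run(self, samples: list[str]) -> list[str]:
        return [self.template.format(topic=s) for s in samples]


class MinLengthFilter(Step):
    """Hypothetical filtering step: drops samples shorter than min_chars."""

    def __init__(self, min_chars: int) -> None:
        self.min_chars = min_chars

    def run(self, samples: list[str]) -> list[str]:
        return [s for s in samples if len(s) >= self.min_chars]


class Pipeline:
    """Runs steps in order, feeding each step's output to the next."""

    def __init__(self, steps: list[Step]) -> None:
        self.steps = steps

    def run(self, seeds: list[str]) -> list[str]:
        samples = seeds
        for step in self.steps:
            samples = step.run(samples)
        return samples


pipeline = Pipeline([
    TemplateGenerator("Explain {topic} to a first-year student."),
    MinLengthFilter(min_chars=38),
])
result = pipeline.run(["tax law", "AI"])
print(result)
```

Swapping in a different strategy only requires another `Step` subclass; the `Pipeline` itself does not change. The real toolkit's base classes and method names may differ from this sketch.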

Quick Start & Requirements

  1. Install: Clone the repository, install uv with pip install uv, then run uv sync to install the dependencies. Refer to the dependency and installation guide for detailed hardware and software prerequisites.
  2. Configuration: Modify provided YAML configuration files (e.g., configs/example.yaml) based on specific needs.
  3. Environment: Create a .env file specifying necessary API keys (e.g., OPENAI_API_KEY).
  4. Run CLI: Execute synthesis via uv run sdg configs/example.yaml.
  5. Run GUI: Launch the Gradio interface with uv run python app.py.
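
Before running the CLI or GUI, it can help to confirm that the .env file from step 3 actually contains the expected keys. The following is a standalone preflight sketch, not part of the toolkit: the helper names `load_env_file` and `missing_keys` are hypothetical, the parser handles only simple KEY=VALUE lines, and the toolkit itself may load .env differently. The demo writes a placeholder file to a temporary directory so that it runs anywhere.

```python
import tempfile
from pathlib import Path


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(env: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [k for k in required if not env.get(k)]


# Demo with a throwaway .env file; the key value is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("# demo\nOPENAI_API_KEY=sk-placeholder\n", encoding="utf-8")
    env = load_env_file(env_path)

print(missing_keys(env, ["OPENAI_API_KEY"]))
```

In a real checkout, `load_env_file(Path(".env"))` would be pointed at the project root, and the list of required keys would depend on which model provider the YAML configuration selects.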

Highlighted Details

  • Performance Gains: Claims over 20% performance improvements in domains like Medical, Finance, and Law when models are trained with its synthetic data.
  • Multi-faceted Synthesis: Supports data generation from local corpora, Hugging Face datasets, and model distillation.
  • Broad Compatibility: Offers multilingual support (English, low-resource languages) and works with various model providers (local, OpenAI API).
  • User-Friendly Interfaces: Provides both a zero-code CLI and an interactive Gradio GUI for ease of use.

Maintenance & Community

Contributions are welcomed. Specific community channels (like Discord or Slack), roadmap details, or notable contributors/sponsorships are not detailed in the provided documentation.

Licensing & Compatibility

The repository's license is not explicitly stated in the provided README, which presents a significant adoption blocker and requires clarification for commercial use or integration into closed-source projects.

Limitations & Caveats

The toolkit is under active development. Arabic language support and an integrated model fine-tuning module are planned for future releases and are not yet available. The absence of clear licensing information remains a critical caveat for potential adopters.

Health Check

  • Last Commit: 2 days ago
  • Responsiveness: Inactive
  • Pull Requests (30d): 1
  • Issues (30d): 1
  • Star History: 493 stars in the last 30 days
