DataArc-SynData-Toolkit by DataArcTech

LLM synthetic data generation toolkit

Created 3 months ago

1,526 stars

Top 26.7% on SourcePulse

Project Summary

The DataArc SynData Toolkit is a modular platform for generating customized synthetic datasets for LLM training. It targets engineers and researchers who want to run data synthesis without writing code, through either a CLI or a GUI, and the project claims performance improvements for downstream models trained on the generated data.

How It Works

The toolkit uses a modular pipeline in which each step of synthetic data generation can be customized independently. Data can be generated from three sources: local corpora, Huggingface datasets, and model distillation. Built-in filtering and rewriting steps let users tailor the synthesized data to the target model's requirements. Because the steps are decoupled, developers can add custom strategies by inheriting from the provided base classes.
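The decoupled design described above can be illustrated with a minimal sketch. All class and function names here (Sample, Generator, Filter, Rewriter, run_pipeline, and the toy subclasses) are hypothetical and are not taken from the toolkit's code; the order of the steps (generate, then filter, then rewrite) is also an assumption.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Sample:
    """One synthetic training example (hypothetical schema)."""
    prompt: str
    response: str


class Generator(ABC):
    """Hypothetical base class for a generation strategy."""

    @abstractmethod
    def generate(self, n: int) -> list[Sample]:
        ...


class Filter(ABC):
    """Hypothetical base class for a filtering strategy."""

    @abstractmethod
    def keep(self, sample: Sample) -> bool:
        ...


class Rewriter(ABC):
    """Hypothetical base class for a rewriting strategy."""

    @abstractmethod
    def rewrite(self, sample: Sample) -> Sample:
        ...


class LocalCorpusGenerator(Generator):
    """Toy stand-in: builds samples from passages without calling an LLM."""

    def __init__(self, passages: list[str]):
        self.passages = passages

    def generate(self, n: int) -> list[Sample]:
        return [Sample(f"Summarize: {p}", p) for p in self.passages[:n]]


class MinLengthFilter(Filter):
    """Drops samples whose response is shorter than min_chars."""

    def __init__(self, min_chars: int):
        self.min_chars = min_chars

    def keep(self, sample: Sample) -> bool:
        return len(sample.response) >= self.min_chars


class StripRewriter(Rewriter):
    """Trims surrounding whitespace from both fields."""

    def rewrite(self, sample: Sample) -> Sample:
        return Sample(sample.prompt.strip(), sample.response.strip())


def run_pipeline(generator, filters, rewriters, n):
    """Generate n candidates, keep those passing every filter, then rewrite."""
    samples = generator.generate(n)
    kept = [s for s in samples if all(f.keep(s) for f in filters)]
    for rewriter in rewriters:
        kept = [rewriter.rewrite(s) for s in kept]
    return kept
```

In this sketch, swapping in a different strategy means writing one new subclass and passing it to run_pipeline; the other steps stay untouched.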

Quick Start & Requirements

Install: Clone the repository, install uv with pip install uv, then install the project dependencies with uv sync. Refer to the dependency and installation guide for detailed hardware and software prerequisites.
Configuration: Modify provided YAML configuration files (e.g., configs/example.yaml) based on specific needs.
Environment: Create a .env file specifying necessary API keys (e.g., OPENAI_API_KEY).
Run CLI: Execute synthesis via uv run sdg configs/example.yaml.
Run GUI: Launch the Gradio interface with uv run python app.py.
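Before running the CLI, it can help to confirm that the .env file actually contains the keys the run will need. The following is a minimal preflight sketch, not part of the toolkit: the helper names (parse_env, missing_keys, cli_command) are hypothetical, the parser assumes simple KEY=VALUE lines, and treating OPENAI_API_KEY as required is an assumption that only applies when the OpenAI API is the chosen provider.

```python
import shlex
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def missing_keys(env, required=("OPENAI_API_KEY",)):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


def cli_command(config_path="configs/example.yaml"):
    """Build the CLI invocation given in the quick start."""
    return ["uv", "run", "sdg", config_path]


if __name__ == "__main__":
    env_file = Path(".env")
    env = parse_env(env_file.read_text()) if env_file.exists() else {}
    missing = missing_keys(env)
    if missing:
        print("Missing in .env:", ", ".join(missing))
    else:
        print("Ready to run:", shlex.join(cli_command()))
```

The script only reports what it finds and prints the command; it does not launch the synthesis itself.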

Highlighted Details

Performance Gains: Claims over 20% performance improvements in domains like Medical, Finance, and Law when models are trained with its synthetic data.
Multi-faceted Synthesis: Supports data generation from local corpora, Huggingface, and model distillation.
Broad Compatibility: Offers multilingual support (English, low-resource languages) and works with various model providers (local, OpenAI API).
User-Friendly Interfaces: Provides both a zero-code CLI and an interactive Gradio GUI for ease of use.

Maintenance & Community

Contributions are welcomed. Specific community channels (like Discord or Slack), roadmap details, or notable contributors/sponsorships are not detailed in the provided documentation.

Licensing & Compatibility

The provided README does not state the repository's license. This is a significant adoption blocker: the terms need clarification before commercial use or integration into closed-source projects.

Limitations & Caveats

The toolkit is under active development. Arabic language support and an integrated model fine-tuning module are planned for future releases and are not yet available. The missing license information is a critical caveat for potential adopters.

Explore Similar Projects

awesome-synthetic-datasets by davanstrien

ToolkenGPT by Ber666

loong by camel-ai

textbook_quality by VikParuchuri

DataDreamer by datadreamer-dev

DataDesigner by NVIDIA-NeMo

synthetic-data-generator by argilla-io

Awesome-LLM-Synthetic-Data by wasiahmad

train-llm-from-scratch by FareedKhan-dev

LLM-PowerHouse-A-Curated-Guide-for-Large-Language-Models-with-Custom-Training-and-Inferencing by ghimiresunil

LLM-workshop-2024 by rasbt

instructlab by instructlab