Data toolkit for custom LLM creation using open-source AI
Augmentoolkit is an open-source toolkit that streamlines the creation of custom datasets for training Large Language Models (LLMs). It lets users, from hobbyists to professionals, generate high-quality, domain-specific data quickly and cost-effectively, without depending on expensive proprietary services such as those from OpenAI.
How It Works
Augmentoolkit employs a modular pipeline architecture, allowing users to select and configure different data generation strategies. Key pipelines include QA generation for factual instruction tuning, RPToolkit for creative roleplaying data, and a classifier creator for training text classifiers. The system leverages open-source LLMs and offers extensive configuration options for prompts, models, and API providers, facilitating customization and efficient data processing.
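The modular design described above can be pictured as a registry of interchangeable pipelines that share one configuration object. The following is a minimal sketch under stated assumptions, not Augmentoolkit's actual code: PipelineConfig, register, run, and the field names are all hypothetical, and the pipelines only build prompts instead of calling a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical configuration: these field names are illustrative only and
# are not taken from Augmentoolkit's real config files.
@dataclass
class PipelineConfig:
    model: str            # name of the open-source LLM to use
    api_base: str         # base URL of the chosen API provider
    prompt_template: str  # template containing a {text} placeholder


Pipeline = Callable[[List[str], PipelineConfig], List[dict]]

# Registry mapping a pipeline name to the function that implements it.
PIPELINES: Dict[str, Pipeline] = {}


def register(name: str) -> Callable[[Pipeline], Pipeline]:
    """Decorator that adds a pipeline function to the registry."""
    def decorator(fn: Pipeline) -> Pipeline:
        PIPELINES[name] = fn
        return fn
    return decorator


@register("qa")
def qa_pipeline(chunks: List[str], config: PipelineConfig) -> List[dict]:
    # A real pipeline would send each prompt to an LLM; this sketch only
    # prepares the requests so that it runs without network access.
    return [
        {
            "pipeline": "qa",
            "model": config.model,
            "prompt": config.prompt_template.format(text=chunk),
        }
        for chunk in chunks
    ]


@register("classifier")
def classifier_pipeline(chunks: List[str], config: PipelineConfig) -> List[dict]:
    # Placeholder for a classifier-data pipeline: one unlabeled record per chunk.
    return [{"pipeline": "classifier", "text": chunk, "label": None} for chunk in chunks]


def run(name: str, chunks: List[str], config: PipelineConfig) -> List[dict]:
    """Look up a pipeline by name and run it with the shared config."""
    if name not in PIPELINES:
        raise KeyError(f"Unknown pipeline: {name!r}")
    return PIPELINES[name](chunks, config)
```

With this layout, switching strategies means changing only the name passed to run, while prompts, models, and providers stay in the config object.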
Quick Start & Requirements
Install dependencies: pip install -r requirements.txt
Prerequisites: tesseract is required. API keys for chosen LLM providers are necessary.
Run: python run_augmentoolkit.py (for terminal) or python streamlit_app.py (for Web UI).
Highlighted Details
Maintenance & Community
Licensing & Compatibility
Limitations & Caveats
COMPLETION_MODE is noted as out-of-date for the QA pipeline and not supported for RPToolkit.