augmentoolkit  by e-p-armstrong

Data toolkit for custom LLM creation using open-source AI

Created 2 years ago
1,797 stars

Top 23.8% on SourcePulse

Project Summary

Augmentoolkit is an open-source toolkit for creating custom datasets to train Large Language Models (LLMs). It lets users, from hobbyists to professionals, generate domain-specific data quickly and at low cost using open-source models, without relying on paid proprietary APIs such as OpenAI's.

How It Works

Augmentoolkit employs a modular pipeline architecture, allowing users to select and configure different data generation strategies. Key pipelines include QA generation for factual instruction tuning, RPToolkit for creative roleplaying data, and a classifier creator for training text classifiers. The system leverages open-source LLMs and offers extensive configuration options for prompts, models, and API providers, facilitating customization and efficient data processing.

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.11+ recommended. PDF processing requires Tesseract. API keys are needed for whichever LLM providers you choose.
  • Run: python run_augmentoolkit.py (for terminal) or python streamlit_app.py (for Web UI).
  • Resources: Can run on consumer hardware for cost-effectiveness, or leverage cloud GPU services like Runpod for larger tasks.
  • Docs: Quickstart Guide, Video Tutorials
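
The prerequisites above can be checked before a run with a small preflight script. This is a minimal sketch and not part of Augmentoolkit: the function name preflight and its parameters are hypothetical, and it only assumes that the Tesseract binary is called tesseract and must be on PATH.

```python
import shutil
import sys


def preflight(need_pdf=False, version=None):
    """Return a list of warning strings; an empty list means nothing was flagged.

    `version` is a (major, minor) tuple; it defaults to the running interpreter
    and is only a parameter so the check can be exercised in isolation.
    """
    if version is None:
        version = tuple(sys.version_info[:2])
    warnings = []
    # The listing recommends Python 3.11 or newer.
    if version < (3, 11):
        warnings.append(
            f"Python 3.11+ is recommended; found {version[0]}.{version[1]}"
        )
    # Tesseract is only needed when PDFs are part of the input.
    # Assumption: the executable is named "tesseract" and lives on PATH.
    if need_pdf and shutil.which("tesseract") is None:
        warnings.append("tesseract not found on PATH; PDF processing needs it")
    return warnings


if __name__ == "__main__":
    for message in preflight(need_pdf=True):
        print("warning:", message)
```

API keys are deliberately not checked here, since where they are stored depends on the provider and configuration you choose.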

Highlighted Details

  • Supports multiple data generation pipelines: QA, RPToolkit, and Classifier Creator.
  • Offers both terminal and Web UI interfaces for accessibility.
  • Features robust auto-resume functionality for interrupted runs.
  • Provides extensive video documentation and a Discord community for support.
  • Enables fine-tuning LLMs with generated data using provided Axolotl configs.
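
The listing does not say how auto-resume is implemented. One common pattern, sketched here as an illustration only, is to write each result to disk under a key derived from its input and skip any input whose output file already exists. Every name below (process_with_resume, generate, the output layout) is hypothetical and is not taken from Augmentoolkit.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def process_with_resume(chunks, generate, out_dir):
    """Run `generate` on each text chunk, reusing results saved by earlier runs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for chunk in chunks:
        # A stable key derived from the input lets a restarted run find old output.
        key = hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:16]
        path = out_dir / f"{key}.json"
        if path.exists():
            # Already done in a previous (possibly interrupted) run: load and skip.
            results.append(json.loads(path.read_text(encoding="utf-8")))
            continue
        result = generate(chunk)
        path.write_text(json.dumps(result), encoding="utf-8")
        results.append(result)
    return results


if __name__ == "__main__":
    calls = []

    def fake_generate(chunk):
        # Stand-in for an LLM call; records how often it is invoked.
        calls.append(chunk)
        return {"question": f"What does this say: {chunk}?"}

    with tempfile.TemporaryDirectory() as tmp:
        process_with_resume(["alpha", "beta"], fake_generate, tmp)
        process_with_resume(["alpha", "beta", "gamma"], fake_generate, tmp)
        print(len(calls))  # only new chunks trigger generation on the second run
```

The point of the design is that the expensive step (the model call) runs once per input, so an interrupted job repeats only the work that had not been saved.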

Maintenance & Community

  • Active development with recent updates (Sept 12th, 2024).
  • A collaboration with AlignmentLab AI on future pipelines is mentioned.
  • Community hub via Discord server.
  • YouTube channel for tutorials and AI content.

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • COMPLETION_MODE is noted as out of date for the QA pipeline and is not supported for RPToolkit.
  • The Classifier Creator pipeline currently only supports binary classification, though multiclass support is planned.
  • RPToolkit's depth-first processing might give a misleading impression of slow progress.
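
The depth-first caveat can be illustrated with a toy scheduler. This is a sketch of one plausible reading, not RPToolkit's code: the step names are hypothetical, and the only assumption is that depth-first means each input goes through every step before the next input starts.

```python
def depth_first(items, steps):
    """Each item runs through every step before the next item begins."""
    return [(item, step) for item in items for step in steps]


def breadth_first(items, steps):
    """Every item completes a step before any item moves to the next step."""
    return [(item, step) for step in steps for item in items]


def items_started(order, n):
    """Distinct items that have had at least one step run after n units of work."""
    return len({item for item, _ in order[:n]})


def items_finished(order, n, steps):
    """Items that have completed all steps after n units of work."""
    done = {}
    for item, step in order[:n]:
        done.setdefault(item, set()).add(step)
    return sum(1 for seen in done.values() if len(seen) == len(steps))


if __name__ == "__main__":
    items = ["a", "b", "c", "d"]
    # Hypothetical step names, used only for illustration.
    steps = ["draft_scene", "write_story", "rate_story"]
    for name, order in [("depth-first", depth_first(items, steps)),
                        ("breadth-first", breadth_first(items, steps))]:
        print(name, items_started(order, 3), items_finished(order, 3, steps))
```

After the same amount of work, the depth-first schedule has touched fewer inputs but finished more of them, so a counter of inputs started looks slow even though total work done is identical.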

Health Check

  • Last commit: 2 months ago
  • Responsiveness: 1 week
  • Pull requests (30d): 1
  • Issues (30d): 2
  • Star history: 19 stars in the last 30 days
