augmentoolkit by e-p-armstrong

Data toolkit for custom LLM creation using open-source AI

Created 1 year ago
1,736 stars

Top 24.6% on SourcePulse

Project Summary

Augmentoolkit is an open-source toolkit for building custom datasets to train Large Language Models (LLMs). It lets users, from hobbyists to professionals, generate high-quality, domain-specific data quickly and cheaply with open-source models, without relying on expensive proprietary providers such as OpenAI.

How It Works

Augmentoolkit employs a modular pipeline architecture, allowing users to select and configure different data generation strategies. Key pipelines include QA generation for factual instruction tuning, RPToolkit for creative roleplaying data, and a classifier creator for training text classifiers. The system leverages open-source LLMs and offers extensive configuration options for prompts, models, and API providers, facilitating customization and efficient data processing.
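
The modular design described above can be pictured with a small sketch. This is not Augmentoolkit's actual code or API: the registry, the `run` function, the config keys, and the model name are all hypothetical, and the pipeline bodies are stubs that only show how a configuration might select one pipeline among several.

```python
from typing import Callable, Dict, List

# Hypothetical registry; names and signatures are illustrative only,
# not taken from Augmentoolkit itself.
PIPELINES: Dict[str, Callable[[List[str], dict], List[dict]]] = {}


def register(name: str):
    """Decorator that adds a pipeline function to the registry under `name`."""
    def decorator(fn):
        PIPELINES[name] = fn
        return fn
    return decorator


@register("qa")
def qa_pipeline(chunks: List[str], settings: dict) -> List[dict]:
    # A real pipeline would prompt an LLM here; this stub only shows the data shape.
    return [{"pipeline": "qa", "model": settings["model"], "source": c} for c in chunks]


@register("classifier")
def classifier_pipeline(chunks: List[str], settings: dict) -> List[dict]:
    return [{"pipeline": "classifier", "model": settings["model"], "source": c} for c in chunks]


def run(config: dict, chunks: List[str]) -> List[dict]:
    """Look up the pipeline named in the config and run it over the input chunks."""
    pipeline = PIPELINES[config["pipeline"]]
    return pipeline(chunks, config["settings"])


# Assumed config layout: one key picks the pipeline, the rest is passed through.
config = {"pipeline": "qa", "settings": {"model": "example-open-model"}}
rows = run(config, ["First passage.", "Second passage."])
print(rows)
```

Keeping pipeline choice in configuration rather than in code is one common way to let the same entry point serve several data generation strategies.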

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.11+ is recommended. tesseract is required only for PDF processing. API keys are required for whichever LLM providers you choose.
  • Run: python run_augmentoolkit.py (for terminal) or python streamlit_app.py (for Web UI).
  • Resources: Can run on consumer hardware for cost-effectiveness, or leverage cloud GPU services like Runpod for larger tasks.
  • Docs: Quickstart Guide, Video Tutorials
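
The prerequisites listed above can be checked before a first run with a short script. This is a sketch under stated assumptions, not part of Augmentoolkit: the `preflight` helper is hypothetical, and the `LLM_API_KEY` environment variable is a placeholder, since the project may expect keys in its own configuration rather than in the environment.

```python
import os
import shutil
import sys


def preflight(api_key_var="LLM_API_KEY", env=None):
    """Return a list of human-readable warnings about missing prerequisites.

    `api_key_var` is a placeholder name (an assumption); substitute whatever
    your chosen provider or configuration actually uses.
    """
    env = os.environ if env is None else env
    problems = []
    if sys.version_info < (3, 11):
        problems.append("Python 3.11+ is recommended")
    if shutil.which("tesseract") is None:
        problems.append("tesseract not found on PATH (needed only for PDF processing)")
    if not env.get(api_key_var):
        problems.append(f"environment variable {api_key_var} is not set")
    return problems


for problem in preflight():
    print("warning:", problem)
```

An empty result means all three checks passed. The script only reports problems and never installs or modifies anything.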

Highlighted Details

  • Supports multiple data generation pipelines: QA, RPToolkit, and Classifier Creator.
  • Offers both terminal and Web UI interfaces for accessibility.
  • Features robust auto-resume functionality for interrupted runs.
  • Provides extensive video documentation and a Discord community for support.
  • Enables fine-tuning LLMs with generated data using provided Axolotl configs.
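
To illustrate what auto-resume means in practice, here is a generic checkpointing sketch. It is not Augmentoolkit's implementation: the `process_with_resume` function, the JSON Lines checkpoint file, and the stand-in `fake_work` step are assumptions chosen to show the general idea of skipping work that already finished before an interruption.

```python
import json
import tempfile
from pathlib import Path


def process_with_resume(items, work, checkpoint_path):
    """Process (id, payload) pairs, skipping ids already recorded in the checkpoint."""
    checkpoint = Path(checkpoint_path)
    done = {}
    if checkpoint.exists():
        with checkpoint.open() as f:
            for line in f:
                record = json.loads(line)
                done[record["id"]] = record["result"]
    with checkpoint.open("a") as f:
        for item_id, payload in items:
            if item_id in done:
                continue  # finished in an earlier run
            result = work(payload)
            # Append and flush after every item so an interruption loses at most one.
            f.write(json.dumps({"id": item_id, "result": result}) + "\n")
            f.flush()
            done[item_id] = result
    return done


calls = []


def fake_work(text):
    """Stand-in for an expensive LLM call; records each invocation."""
    calls.append(text)
    return text.upper()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "progress.jsonl"
    items = [("a", "alpha"), ("b", "beta"), ("c", "gamma")]
    process_with_resume(items[:2], fake_work, path)  # simulate a run cut short
    results = process_with_resume(items, fake_work, path)  # resumed run

print(calls)
print(results)
```

On the second call only the third item reaches `fake_work`, because the first two are read back from the checkpoint file.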

Maintenance & Community

  • Active development with recent updates (Sept 12th, 2024).
  • Collaboration with AlignmentLab AI mentioned for future pipelines.
  • Community hub via Discord server.
  • YouTube channel for tutorials and AI content.

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • COMPLETION_MODE is noted as out of date for the QA pipeline and is not supported by RPToolkit.
  • The Classifier Creator pipeline currently only supports binary classification, though multiclass support is planned.
  • RPToolkit's depth-first processing might give a misleading impression of slow progress.
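
The depth-first caveat above can be made concrete with a toy comparison. This sketch assumes that "depth-first" means each input is carried through every stage before the next input begins. The stage names and helper functions are hypothetical and do not reflect RPToolkit's real steps.

```python
STAGES = ["stage_1", "stage_2", "stage_3"]  # hypothetical stage names


def depth_first(inputs):
    """Finish every stage for one input before touching the next input."""
    return [(item, stage) for item in inputs for stage in STAGES]


def breadth_first(inputs):
    """Run one stage across all inputs before moving to the next stage."""
    return [(item, stage) for stage in STAGES for item in inputs]


def items_started_after(order, steps):
    """Count distinct inputs that have been touched within the first `steps` steps."""
    return len({item for item, _ in order[:steps]})


inputs = ["doc_a", "doc_b", "doc_c"]
df_order = depth_first(inputs)
bf_order = breadth_first(inputs)

print("depth-first, inputs started after 3 steps:", items_started_after(df_order, 3))
print("breadth-first, inputs started after 3 steps:", items_started_after(bf_order, 3))
print("total steps either way:", len(df_order), len(bf_order))
```

Both orders perform the same nine steps, but after three steps the depth-first order has touched only one input while the breadth-first order has touched all three. Under this assumption, a progress display counting inputs started would look slow even though the total work is identical.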

Health Check

  • Last commit: 1 month ago
  • Responsiveness: 1 week
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 28 stars in the last 30 days
