augmentoolkit by e-p-armstrong

Data toolkit for custom LLM creation using open-source AI

Created 2 years ago

1,797 stars

Top 23.8% on SourcePulse


8 Experts Love This Project

Didier Lopes

Founder of OpenBB

Vincent Weisser

Cofounder of Prime Intellect

Wing Lian

Founder of Axolotl AI

Michael Han

Cofounder of Unsloth

and 4 more!

Project Summary

Augmentoolkit is an open-source toolkit for creating custom datasets to train Large Language Models (LLMs). It lets users, from hobbyists to professionals, generate domain-specific data quickly and at low cost using open-source models, without relying on paid proprietary services such as OpenAI's API.

How It Works

Augmentoolkit uses a modular pipeline architecture: users select a data generation pipeline and configure it. The main pipelines are QA generation for factual instruction tuning, RPToolkit for creative roleplaying data, and a classifier creator for training text classifiers. Pipelines run on open-source LLMs, and prompts, models, and API providers are all configurable.
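
The modular design described above can be pictured as a registry of named pipelines selected by a config value. This is a minimal sketch, not Augmentoolkit's actual code: `PipelineConfig`, `register`, `run`, and the placeholder pipeline bodies are all hypothetical names chosen for illustration, and a real pipeline would call an LLM where the placeholders build dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PipelineConfig:
    # Hypothetical config shape; the real project's config keys may differ.
    pipeline: str          # which pipeline to run, e.g. "qa"
    model: str             # model name passed to the provider
    api_base: str          # base URL of the chosen API provider
    prompts_dir: str = "prompts"


# Registry mapping a pipeline name to the function that implements it.
PIPELINES: Dict[str, Callable[[PipelineConfig, List[str]], List[dict]]] = {}


def register(name: str):
    """Decorator that adds a pipeline function to the registry under `name`."""
    def wrap(fn):
        PIPELINES[name] = fn
        return fn
    return wrap


@register("qa")
def qa_pipeline(cfg: PipelineConfig, chunks: List[str]) -> List[dict]:
    # Placeholder: a real QA pipeline would prompt an LLM for each chunk.
    return [{"pipeline": "qa", "model": cfg.model, "source": c} for c in chunks]


@register("classifier")
def classifier_pipeline(cfg: PipelineConfig, chunks: List[str]) -> List[dict]:
    # Placeholder: labels are left empty because no model is called here.
    return [{"pipeline": "classifier", "model": cfg.model, "text": c, "label": None}
            for c in chunks]


def run(cfg: PipelineConfig, chunks: List[str]) -> List[dict]:
    """Look up the configured pipeline and run it over the input chunks."""
    if cfg.pipeline not in PIPELINES:
        raise ValueError(
            f"unknown pipeline {cfg.pipeline!r}; choose from {sorted(PIPELINES)}"
        )
    return PIPELINES[cfg.pipeline](cfg, chunks)
```

Under this arrangement, adding a pipeline means registering one more function, and switching pipelines means changing one config field.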

Quick Start & Requirements

Install: pip install -r requirements.txt
Prerequisites: Python 3.11+ recommended; tesseract is required for PDF processing; an API key is needed for each LLM provider you use.
Run: python run_augmentoolkit.py (for terminal) or python streamlit_app.py (for Web UI).
Resources: Can run on consumer hardware for cost-effectiveness, or leverage cloud GPU services like Runpod for larger tasks.
Docs: Quickstart Guide, Video Tutorials
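
Before installing, it can help to confirm the prerequisites listed above. The following is a small sketch using only the Python standard library; `check_prerequisites` is a hypothetical helper, not part of Augmentoolkit, and it assumes the tesseract executable is named `tesseract` on your PATH.

```python
import shutil
import sys


def check_prerequisites(version=None, which=shutil.which):
    """Return a list of warnings; an empty list means both checks passed.

    `version` and `which` are injectable so the checks can be exercised
    without changing the real interpreter or PATH.
    """
    version = tuple(version or sys.version_info[:2])
    warnings = []
    if version < (3, 11):
        warnings.append(
            f"Python 3.11+ is recommended; found {version[0]}.{version[1]}"
        )
    if which("tesseract") is None:
        warnings.append("tesseract not found on PATH; PDF processing will not work")
    return warnings


if __name__ == "__main__":
    for warning in check_prerequisites():
        print("warning:", warning)
```

API keys are not checked here because the variable or config field that holds them depends on the provider you choose.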

Highlighted Details

Supports multiple data generation pipelines: QA, RPToolkit, and Classifier Creator.
Offers both terminal and Web UI interfaces for accessibility.
Features robust auto-resume functionality for interrupted runs.
Provides extensive video documentation and a Discord community for support.
Enables fine-tuning LLMs with generated data using provided Axolotl configs.
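
To illustrate what auto-resume means in practice, here is one common way to implement it: write each finished item to an append-only JSONL file and skip items already recorded there on the next run. This is a generic sketch under that assumption; the listing does not describe how Augmentoolkit implements resuming, and `run_with_resume` and the file layout are hypothetical.

```python
import json
from pathlib import Path


def run_with_resume(items, process, out_path):
    """Process (item_id, payload) pairs, skipping ids already saved in out_path.

    Returns the number of items processed during this call.
    """
    out_path = Path(out_path)
    existing = out_path.read_text(encoding="utf-8") if out_path.exists() else ""

    # Collect ids of items finished in earlier runs; ignore unreadable lines.
    done = set()
    for line in existing.splitlines():
        try:
            done.add(json.loads(line)["id"])
        except (json.JSONDecodeError, KeyError, TypeError):
            continue

    processed = 0
    with out_path.open("a", encoding="utf-8") as f:
        # If a previous run was cut off mid-line, start on a fresh line.
        if existing and not existing.endswith("\n"):
            f.write("\n")
        for item_id, payload in items:
            if item_id in done:
                continue
            result = process(payload)
            f.write(json.dumps({"id": item_id, "result": result}) + "\n")
            f.flush()  # push each record out so completed work survives a crash
            processed += 1
    return processed
```

Calling the function again with the same inputs after an interruption only processes the items that have no record yet.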

Maintenance & Community

Active development with recent updates (Sept 12th, 2024).
Collaboration with AlignmentLab AI mentioned for future pipelines.
Community hub via Discord server.
YouTube channel for tutorials and AI content.

Licensing & Compatibility

MIT License.
Compatible with commercial use and closed-source linking.

Limitations & Caveats

COMPLETION_MODE is noted as out of date for the QA pipeline and is not supported by RPToolkit.
The Classifier Creator pipeline currently only supports binary classification, though multiclass support is planned.
RPToolkit processes its inputs depth-first, which can make a run look like it is progressing more slowly than it actually is.

Health Check

Last Commit

2 months ago

Responsiveness

1 week

Star History

19 stars in the last 30 days