augmentoolkit by e-p-armstrong

Data toolkit for custom LLM creation using open-source AI

created 1 year ago
1,692 stars

Top 25.7% on sourcepulse

Project Summary

Augmentoolkit is an open-source toolkit for creating custom datasets to train Large Language Models (LLMs). It lets users, from hobbyists to professionals, generate domain-specific data quickly and at low cost, without relying on expensive proprietary services such as OpenAI.

How It Works

Augmentoolkit uses a modular pipeline architecture in which users select and configure a data generation strategy. The main pipelines are QA generation for factual instruction tuning, RPToolkit for creative roleplaying data, and a classifier creator for training text classifiers. The system runs on open-source LLMs and exposes configuration options for prompts, models, and API providers, so each pipeline can be adapted to a given domain.
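
As a rough illustration of the modular design described above, the sketch below shows one way a pipeline selector could be structured. It is not Augmentoolkit's actual code: the `PipelineConfig` fields, the `PIPELINES` registry, and the stub pipeline functions are hypothetical names chosen for this example, and the stubs do not call any model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PipelineConfig:
    # Hypothetical settings; the real project defines its own config format.
    pipeline: str              # which generation strategy to run
    model: str                 # name of the LLM to call
    api_base: str              # base URL of the chosen API provider
    prompt_dir: str = "prompts"


def qa_pipeline(chunks: List[str], cfg: PipelineConfig) -> List[dict]:
    # Stub: a real pipeline would prompt the LLM for question/answer pairs.
    return [{"pipeline": "qa", "model": cfg.model, "source": c} for c in chunks]


def rp_pipeline(chunks: List[str], cfg: PipelineConfig) -> List[dict]:
    # Stub: a real pipeline would generate roleplay data from each chunk.
    return [{"pipeline": "rp", "model": cfg.model, "source": c} for c in chunks]


def classifier_pipeline(chunks: List[str], cfg: PipelineConfig) -> List[dict]:
    # Stub: a real pipeline would label chunks to train a classifier.
    return [{"pipeline": "classifier", "model": cfg.model, "source": c} for c in chunks]


# Registry mapping a config value to the function that implements it.
PIPELINES: Dict[str, Callable[[List[str], PipelineConfig], List[dict]]] = {
    "qa": qa_pipeline,
    "rp": rp_pipeline,
    "classifier": classifier_pipeline,
}


def run(chunks: List[str], cfg: PipelineConfig) -> List[dict]:
    # Look up the selected pipeline and fail clearly on an unknown name.
    try:
        pipeline = PIPELINES[cfg.pipeline]
    except KeyError:
        raise ValueError(f"unknown pipeline: {cfg.pipeline!r}") from None
    return pipeline(chunks, cfg)
```

Keeping the pipelines behind a single registry means a new strategy can be added without touching the code that reads the configuration.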

Quick Start & Requirements

  • Install: pip install -r requirements.txt
  • Prerequisites: Python 3.11+ recommended. For PDF processing, tesseract is required. API keys for chosen LLM providers are necessary.
  • Run: python run_augmentoolkit.py (for terminal) or python streamlit_app.py (for Web UI).
  • Resources: Can run on consumer hardware for cost-effectiveness, or leverage cloud GPU services like Runpod for larger tasks.
  • Docs: Quickstart Guide, Video Tutorials
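
The prerequisites above can be checked before a first run. The following is a minimal sketch, not part of Augmentoolkit: the function name `check_prerequisites` is hypothetical, and reading an API key from an environment variable is an assumption made for illustration (the project may expect keys in its own configuration instead).

```python
import os
import shutil
import sys
from typing import List, Mapping, Optional, Tuple


def check_prerequisites(
    need_pdf: bool = False,
    api_key_var: Optional[str] = None,
    version: Tuple[int, int] = sys.version_info[:2],
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    # Python 3.11+ is recommended by the summary above.
    if version < (3, 11):
        problems.append(
            f"Python {version[0]}.{version[1]} found; 3.11+ is recommended"
        )
    # tesseract is only needed when PDF processing is used.
    if need_pdf and shutil.which("tesseract") is None:
        problems.append("tesseract not found on PATH (needed for PDF processing)")
    # Assumption: the key lives in an environment variable named by the caller.
    if api_key_var and not env.get(api_key_var):
        problems.append(f"environment variable {api_key_var} is not set")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites(need_pdf=True):
        print("warning:", problem)
```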

Highlighted Details

  • Supports multiple data generation pipelines: QA, RPToolkit, and Classifier Creator.
  • Offers both terminal and Web UI interfaces for accessibility.
  • Features robust auto-resume functionality for interrupted runs.
  • Provides extensive video documentation and a Discord community for support.
  • Enables fine-tuning LLMs with generated data using provided Axolotl configs.
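
The auto-resume behavior mentioned above can be pictured as checkpointing finished work to disk and skipping it on the next run. The sketch below shows one common way to do this with a JSON Lines file; it is an assumption about the general technique, not a description of how Augmentoolkit implements resuming, and `process_with_resume` and the record layout are hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def process_with_resume(
    items: Iterable[str],
    work: Callable[[str], dict],
    out_path: Path,
) -> List[dict]:
    """Append one JSON line per finished item; skip items already on disk."""
    done: Dict[str, dict] = {}
    # Load results saved by an earlier, possibly interrupted, run.
    # Assumes every saved line was written whole.
    if out_path.exists():
        with out_path.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    record = json.loads(line)
                    done[record["input"]] = record
    with out_path.open("a", encoding="utf-8") as f:
        for item in items:
            if item in done:
                continue  # finished in an earlier run
            record = {"input": item, "output": work(item)}
            f.write(json.dumps(record) + "\n")
            f.flush()  # push the record out before starting the next item
            done[item] = record
    return list(done.values())
```

Calling the function a second time with the same output path only invokes `work` for inputs that have no saved record, which is what makes an interrupted run cheap to restart.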

Maintenance & Community

  • Active development with recent updates (Sept 12th, 2024).
  • Collaboration with AlignmentLab AI mentioned for future pipelines.
  • Community hub via Discord server.
  • YouTube channel for tutorials and AI content.

Licensing & Compatibility

  • MIT License.
  • Compatible with commercial use and closed-source linking.

Limitations & Caveats

  • The COMPLETION_MODE setting is noted as out of date for the QA pipeline and is not supported by RPToolkit.
  • The Classifier Creator pipeline currently only supports binary classification, though multiclass support is planned.
  • RPToolkit's depth-first processing might give a misleading impression of slow progress.
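
The depth-first caveat above is easier to see with a toy model. The sketch below compares the order of work under breadth-first and depth-first scheduling in a simplified, strictly sequential setting; the story names and stage names are hypothetical, and the real pipeline's stages and concurrency are not modeled.

```python
from typing import List, Sequence, Tuple

Step = Tuple[str, str]  # (item, stage)


def breadth_first(items: Sequence[str], stages: Sequence[str]) -> List[Step]:
    # Finish one stage for every item before moving to the next stage.
    return [(item, stage) for stage in stages for item in items]


def depth_first(items: Sequence[str], stages: Sequence[str]) -> List[Step]:
    # Take one item through every stage before starting the next item.
    return [(item, stage) for item in items for stage in stages]


def items_started(order: Sequence[Step], steps_done: int) -> int:
    # Number of distinct items touched within the first `steps_done` steps.
    return len({item for item, _ in order[:steps_done]})


stories = ["story_a", "story_b", "story_c"]  # hypothetical inputs
stages = ["extract", "outline", "write"]     # hypothetical stage names

print(items_started(breadth_first(stories, stages), 3))  # 3
print(items_started(depth_first(stories, stages), 3))    # 1
```

In this model, after three steps the depth-first order has touched only one story but has already finished it, while the breadth-first order has touched all three and finished none. A progress display that counts items per stage would therefore look slower under depth-first ordering even though the same amount of work is done.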

Health Check

  • Last commit: 1 week ago
  • Responsiveness: 1 week
  • Pull requests (30d): 2
  • Issues (30d): 4
  • Star history: 278 stars in the last 90 days
