This repository provides a structured taxonomy for contributing "skills" and "knowledge" used to train Large Language Models (LLMs) with InstructLab's Large-Scale Alignment for Chatbots (LAB) method. It targets researchers, developers, and power users who want to enhance LLM capabilities with synthetic data generated from curated community contributions.
How It Works
The core of the project is a hierarchical directory structure representing domains and subdomains, inspired by the Dewey Decimal Classification system. Contributions are made via qna.yaml files at the leaf nodes, containing question-answer pairs and optional context. Skills are performative or instructional, while knowledge contributions are fact-based, referencing external documents stored in a separate Git repository. This structured approach enables the generation of targeted synthetic data for LLM alignment tuning.
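As a rough illustration of the layout described above, the following sketch walks a directory tree and lists every leaf node that holds a qna.yaml file. The helper names and the directory names in the demo tree are hypothetical, chosen only for this example; the real taxonomy's domain names may differ.

```python
from pathlib import Path
import tempfile


def find_leaf_nodes(taxonomy_root: Path) -> list[Path]:
    """Return the directories under taxonomy_root that contain a qna.yaml file,
    as paths relative to the root, in sorted order."""
    return sorted(
        qna.parent.relative_to(taxonomy_root)
        for qna in taxonomy_root.rglob("qna.yaml")
    )


def build_demo_tree(root: Path) -> None:
    """Create a tiny stand-in taxonomy. The domain and subdomain names
    below are assumptions made for illustration only."""
    for leaf in (
        "knowledge/science/astronomy",
        "compositional_skills/writing/poetry",
    ):
        leaf_dir = root / leaf
        leaf_dir.mkdir(parents=True)
        (leaf_dir / "qna.yaml").write_text("# placeholder\n", encoding="utf-8")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_tree(root)
        for leaf in find_leaf_nodes(root):
            print(leaf.as_posix())
```

Each printed path corresponds to one domain/subdomain chain ending in a leaf, which is where a contribution file would live.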
Quick Start & Requirements
- Contribution: Follow the "Fork and Pull" model. Detailed guides are available in the CONTRIBUTING.md file.
- Knowledge Contribution: Requires a separate Git repository for markdown or PDF documents (PDF support requires InstructLab v0.21.0+).
- Resources: No specific compute requirements are listed for contributing to the taxonomy itself, but retraining models requires significant compute infrastructure.
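The version requirement for PDF documents can be expressed as a small check. This is a minimal sketch, not part of InstructLab: the function names are hypothetical, and it assumes version strings made of plain dotted integers with an optional leading "v".

```python
# Minimum InstructLab version that accepts PDF knowledge documents,
# as stated in the requirements above.
PDF_MIN_VERSION = (0, 21, 0)


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '0.21.0' or 'v0.21.0' into (0, 21, 0).

    Assumes plain dotted integers; pre-release suffixes are not handled.
    Short versions such as '0.21' are padded with zeros.
    """
    parts = [int(part) for part in version.lstrip("v").split(".")]
    parts += [0] * (3 - len(parts))
    return tuple(parts)


def allowed_knowledge_formats(instructlab_version: str) -> set[str]:
    """Return the document formats usable for knowledge contributions."""
    formats = {"markdown"}
    if parse_version(instructlab_version) >= PDF_MIN_VERSION:
        formats.add("pdf")
    return formats
```

For example, `allowed_knowledge_formats("0.20.1")` yields only markdown, while any version from 0.21.0 onward also includes pdf.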
Highlighted Details
- Supports both "skills" (performative, e.g., writing poetry) and "knowledge" (factual, e.g., astronomy) contributions.
- Skills can be "grounded" (requiring context) or "ungrounded."
- Knowledge contributions can reference PDF documents (v0.21.0+) or markdown files.
- The taxonomy structure directly influences data generation and LLM prompt engineering.
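The grounded/ungrounded distinction above comes down to whether a question-answer pair carries context. The sketch below classifies a single example on that basis. The key names "question", "answer", and "context" are assumptions made for illustration, as is the function name; consult the project's own schema for the authoritative field names.

```python
def classify_skill_example(example: dict) -> str:
    """Return 'grounded' if the example has non-empty context, else 'ungrounded'.

    The keys 'question', 'answer', and 'context' are assumed names
    used only for this illustration.
    """
    for key in ("question", "answer"):
        if not str(example.get(key, "")).strip():
            raise ValueError(f"seed example is missing {key!r}")
    context = example.get("context")
    if context and str(context).strip():
        return "grounded"
    return "ungrounded"


# An ungrounded skill: the answer needs no supporting material.
poem = {"question": "Write a haiku about rain.", "answer": "Soft rain on the roof"}

# A grounded skill: the answer depends on the supplied context.
summary = {
    "context": "Meeting notes: the release moved to Friday.",
    "question": "When is the release?",
    "answer": "Friday.",
}

print(classify_skill_example(poem))
print(classify_skill_example(summary))
```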
Maintenance & Community
- Contributions are integrated into InstructLab models, with retraining intended to occur at least weekly.
- Community contributions are encouraged for rapid model iteration.
- Further details on contributing can be found in the CONTRIBUTING.md file.
Licensing & Compatibility
- The repository is licensed under the Apache License, Version 2.0.
- This license permits forking and internal use for training private models.
Limitations & Caveats
- Knowledge contributions require a separate Git repository and are subject to longer review times due to volume.
- YAML formatting is strict; indentation and special characters require careful handling.
- InstructLab versions older than v0.21.0 support only markdown documents for knowledge contributions.
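Some of the YAML formatting pitfalls mentioned above can be caught before submitting with a quick pre-check. The sketch below flags two common ones: tab characters in indentation, which YAML does not allow, and trailing whitespace, which is treated here as a style problem by assumption. The function name is hypothetical, and this does not replace a real YAML parser or the project's own linter.

```python
def lint_yaml_text(text: str) -> list[str]:
    """Report tab indentation and trailing whitespace, line by line."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace is whatever lstrip removes from the front.
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            problems.append(f"line {number}: tab in indentation")
        if line != line.rstrip():
            problems.append(f"line {number}: trailing whitespace")
    return problems


sample = "seed_examples:\n\t- question: What is a quasar?\n    answer: A bright galactic core. \n"
for problem in lint_yaml_text(sample):
    print(problem)
```

A clean file produces an empty list, so the check can be used as a simple gate in a local script.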