taxonomy by instructlab

Taxonomy for LLM alignment tuning via synthetic data generation

Created 2 years ago

292 stars

Top 90.6% on SourcePulse

Project Summary

This repository provides a structured taxonomy for contributing "skills" and "knowledge" used to train Large Language Models (LLMs) with InstructLab's Large-Scale Alignment for Chatbots (LAB) method. It targets researchers, developers, and power users who want to extend LLM capabilities with synthetic data generated from curated community contributions.

How It Works

The core of the project is a hierarchical directory structure representing domains and subdomains, inspired by the Dewey Decimal Classification system. Contributions are made via qna.yaml files at the leaf nodes, containing question-answer pairs and optional context. Skills are performative or instructional, while knowledge contributions are fact-based, referencing external documents stored in a separate Git repository. This structured approach enables the generation of targeted synthetic data for LLM alignment tuning.
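To make the directory layout concrete, here is a minimal sketch that walks a taxonomy tree and lists every leaf qna.yaml together with the domain path implied by its parent directories. The function names (`find_qna_files`, `build_demo_tree`) and the example directory names are hypothetical and chosen only for illustration; this is not InstructLab's own tooling.

```python
from pathlib import Path
import tempfile


def find_qna_files(root: Path) -> list[tuple[str, Path]]:
    """Return (taxonomy_path, file) pairs for every qna.yaml under root."""
    results = []
    for qna in sorted(root.rglob("qna.yaml")):
        # The directories between root and the file name the domain and subdomains.
        taxonomy_path = "/".join(qna.parent.relative_to(root).parts)
        results.append((taxonomy_path, qna))
    return results


def build_demo_tree(root: Path) -> None:
    """Create a tiny illustrative tree (directory names are assumptions)."""
    for leaf in ("knowledge/science/astronomy", "compositional_skills/writing/poetry"):
        leaf_dir = root / leaf
        leaf_dir.mkdir(parents=True)
        (leaf_dir / "qna.yaml").write_text("# placeholder\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo_root = Path(tmp)
        build_demo_tree(demo_root)
        for taxonomy_path, _ in find_qna_files(demo_root):
            print(taxonomy_path)
```

Deriving the domain from the path, rather than from a field inside the file, mirrors the idea that a contribution's position in the tree is what classifies it.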

Quick Start & Requirements

Contribution: Follow the "Fork and Pull" model. Detailed guides are available in the CONTRIBUTING.md file.
Knowledge Contribution: Requires a separate Git repository for markdown or PDF documents (PDF support requires InstructLab v0.21.0+).
Resources: No specific compute requirements are listed for contributing to the taxonomy itself, but retraining models requires significant compute infrastructure.

Highlighted Details

Supports both "skills" (performative, e.g., writing poetry) and "knowledge" (factual, e.g., astronomy) contributions.
Skills can be "grounded" (requiring context) or "ungrounded."
Knowledge contributions can reference PDF documents (v0.21.0+) or markdown files.
The taxonomy structure directly influences data generation and LLM prompt engineering.
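The grounded versus ungrounded distinction can be sketched as a small data structure: a question-answer pair with optional context, where the presence of context marks the example as grounded. The class and field names below (`SeedExample`, `question`, `answer`, `context`) are assumptions for illustration and may not match the actual qna.yaml schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SeedExample:
    """One question-answer pair; field names are hypothetical."""

    question: str
    answer: str
    context: Optional[str] = None  # present only for grounded examples

    @property
    def grounded(self) -> bool:
        # An example counts as grounded when it carries context to answer from.
        return self.context is not None


ungrounded = SeedExample("Write a haiku about autumn.", "Leaves drift on cold wind...")
grounded = SeedExample(
    question="Summarize the passage.",
    answer="It describes the phases of the Moon.",
    context="The Moon goes through phases as it orbits Earth.",
)
```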

Maintenance & Community

Contributions are integrated into InstructLab models, with retraining intended to occur at least weekly.
Community contributions are encouraged for rapid model iteration.
Further details on contributing can be found in the documentation.

Licensing & Compatibility

The repository is licensed under the Apache License, Version 2.0.
This license permits forking and internal use for training private models.

Limitations & Caveats

Knowledge contributions require a separate Git repository and are subject to longer review times due to volume.
YAML formatting is strict; indentation and special characters require careful handling.
InstructLab versions earlier than v0.21.0 support only markdown for knowledge contributions.
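Because formatting mistakes are easy to make, a quick pre-submission check can help. The sketch below flags two common problems using only the standard library: tabs in indentation (which YAML does not allow) and trailing whitespace. The function name `lint_yaml_text` and the choice of checks are assumptions for illustration; they are not the project's actual lint rules, and a real YAML linter would catch far more.

```python
def lint_yaml_text(text: str) -> list[str]:
    """Return human-readable problems found in YAML source text."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Leading whitespace of the line, whatever mix of characters it contains.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(f"line {number}: tab in indentation")
        if line != line.rstrip():
            problems.append(f"line {number}: trailing whitespace")
    return problems


clean = "seed_examples:\n  - question: What is a star?\n"
broken = "seed_examples:\n\t- question: What is a star? \n"
```

Calling `lint_yaml_text(clean)` returns an empty list, while `lint_yaml_text(broken)` reports both problems on line 2.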

Health Check

Last Commit

6 months ago

Responsiveness

1 week

Star History

1 star in the last 30 days