taxonomy by instructlab

Taxonomy for LLM alignment tuning via synthetic data generation

created 1 year ago
274 stars

Top 95.2% on sourcepulse

Project Summary

This repository provides a structured taxonomy through which contributors add "skills" and "knowledge" used to train Large Language Models (LLMs) with InstructLab's Large-scale Alignment for chatBots (LAB) method. It targets researchers, developers, and power users who want to extend LLM capabilities with synthetic data generated from curated community contributions.

How It Works

The core of the project is a hierarchical directory structure representing domains and subdomains, inspired by the Dewey Decimal Classification system. Contributions are made via qna.yaml files at the leaf nodes, containing question-answer pairs and optional context. Skills are performative or instructional, while knowledge contributions are fact-based, referencing external documents stored in a separate Git repository. This structured approach enables the generation of targeted synthetic data for LLM alignment tuning.
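
The directory hierarchy described above can be explored with a short script. The following is a minimal sketch, not part of the project's tooling: it assumes only that each leaf node holds a file named qna.yaml, and the function name `find_qna_files` and the example folder names are hypothetical.

```python
from pathlib import Path


def find_qna_files(taxonomy_root):
    """Return the taxonomy position of every qna.yaml under taxonomy_root.

    Each position is a tuple of directory names leading from the root to
    the leaf, e.g. ("knowledge", "science", "astronomy") (illustrative names).
    """
    root = Path(taxonomy_root)
    leaves = []
    # rglob walks the whole tree; sorting keeps the output deterministic.
    for qna in sorted(root.rglob("qna.yaml")):
        # The chain of directories between the root and the file is the
        # node's place in the domain/subdomain hierarchy.
        leaves.append(qna.parent.relative_to(root).parts)
    return leaves
```

Calling `find_qna_files("path/to/taxonomy")` on a local checkout would list one tuple per contribution, which makes it easy to see which domains already have leaf nodes.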

Quick Start & Requirements

  • Contribution: Follow the "Fork and Pull" model. Detailed guides are available in the CONTRIBUTING.md file.
  • Knowledge Contribution: Requires a separate Git repository for markdown or PDF documents (PDF support requires InstructLab v0.21.0+).
  • Resources: No specific compute requirements are listed for contributing to the taxonomy itself, but retraining models requires significant compute infrastructure.

Highlighted Details

  • Supports both "skills" (performative, e.g., writing poetry) and "knowledge" (factual, e.g., astronomy) contributions.
  • Skills can be "grounded" (requiring context) or "ungrounded."
  • Knowledge contributions can reference PDF documents (v0.21.0+) or markdown files.
  • The taxonomy structure directly influences data generation and LLM prompt engineering.
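
The grounded/ungrounded distinction above comes down to whether the examples carry context. As a rough sketch, assuming each example is represented as a dictionary with `question`, `answer`, and an optional `context` key (these key names and the function `classify_skill` are illustrative assumptions, not the project's confirmed schema):

```python
def classify_skill(seed_examples):
    """Label a skill by whether its examples supply context.

    Assumes each example is a dict with "question" and "answer" keys and
    an optional "context" key (hypothetical field names).
    """
    if not seed_examples:
        raise ValueError("at least one example is required")
    # An example counts as having context only if the value is non-empty.
    has_context = [bool(example.get("context")) for example in seed_examples]
    if all(has_context):
        return "grounded"
    if not any(has_context):
        return "ungrounded"
    # Some examples have context and some do not; flag for manual review.
    return "mixed"
```

The "mixed" result is a design choice of this sketch: it surfaces inconsistent contributions instead of guessing which category was intended.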

Maintenance & Community

  • Contributions are integrated into InstructLab models, with retraining intended to occur at least weekly.
  • Community contributions are encouraged for rapid model iteration.
  • Further details on contributing can be found in the documentation.

Licensing & Compatibility

  • The repository is licensed under the Apache License, Version 2.0.
  • This license permits forking and internal use for training private models.

Limitations & Caveats

  • Knowledge contributions require a separate Git repository and are subject to longer review times due to volume.
  • YAML formatting is strict; indentation and special characters require careful handling.
  • InstructLab versions before v0.21.0 support only markdown for knowledge contributions.
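
Because YAML formatting is strict, a quick whitespace check before opening a pull request can catch some common mistakes. The sketch below is an illustration only and is not the project's own linter: it flags tab characters in indentation (which YAML does not allow) and trailing whitespace (a style check assumed here, not a documented project rule). The function name `lint_yaml_text` is hypothetical.

```python
def lint_yaml_text(text):
    """Return (line_number, message) pairs for common whitespace problems."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        # Everything before the first non-whitespace character is indentation.
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append((number, "tab in indentation"))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
    return problems
```

An empty result means none of these particular problems were found; it does not prove the file is valid YAML or that it satisfies the taxonomy's schema.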

Health Check

  • Last commit: 5 days ago
  • Responsiveness: 1 week
  • Pull requests (30d): 11
  • Issues (30d): 0
  • Star history: 13 stars in the last 90 days
