airoboros by jondurbin

Self-instruct tool for LLM finetuning

created 2 years ago
1,048 stars

Top 36.5% on sourcepulse

Project Summary

Airoboros is a tool for automating the creation of high-quality datasets for fine-tuning large language models (LLMs). It addresses data scarcity by letting users generate diverse, task-specific instruction datasets with LLMs. The aim is to let individuals and smaller organizations build specialized AI models without the prohibitive cost of large-scale data curation or reliance on proprietary models.

How It Works

Airoboros implements a modified "self-instruct" approach, using LLMs to generate synthetic instruction-following data. Key differentiators include support for OpenAI's chat completion API (enabling cost-effective use of GPT-3.5-turbo and GPT-4), customizable topic generation, and an in-memory vector database (Chroma) for efficient similarity comparisons. The system uses asyncio to generate data concurrently with configurable batch sizes, and it employs specialized "instructors" that each create data for a particular use case, such as reasoning, role-playing, or function calling, so that the output stays relevant to that context.
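The generation loop described above can be sketched as follows. This is a minimal illustration and not the project's code. It assumes the similarity comparisons are used to drop near-duplicate instructions, swaps a toy bag-of-words cosine similarity in for Chroma embeddings, and replaces the OpenAI call with a hypothetical `fake_llm` stub. The names `embed`, `generate`, `batch_size`, and `max_similarity` are all invented for the example.

```python
import asyncio
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


async def fake_llm(topic: str, i: int) -> str:
    """Hypothetical stand-in for an async chat-completion request."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    templates = [
        "Explain {t} to a beginner.",
        "Explain {t} to a beginner.",  # deliberate duplicate
        "Write a short quiz about {t}.",
    ]
    return templates[i % len(templates)].format(t=topic)


async def generate(topics, batch_size=3, max_similarity=0.9):
    """Generate instructions in concurrent batches, dropping near-duplicates."""
    kept, vectors = [], []
    jobs = [(topic, i) for topic in topics for i in range(3)]
    for start in range(0, len(jobs), batch_size):
        batch = jobs[start:start + batch_size]
        # Requests within one batch run concurrently.
        results = await asyncio.gather(*(fake_llm(t, i) for t, i in batch))
        for text in results:
            vec = embed(text)
            # Keep the instruction only if it is not too close to any kept one.
            if all(cosine(vec, seen) < max_similarity for seen in vectors):
                kept.append(text)
                vectors.append(vec)
    return kept


if __name__ == "__main__":
    for line in asyncio.run(generate(["recursion", "binary search"])):
        print(line)
```

In this sketch the batch size bounds how many requests are in flight at once, and the similarity threshold trades dataset diversity against yield: a lower threshold rejects more candidates.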

Quick Start & Requirements

  • Install: pip install --no-build-isolation airoboros, or pip install --no-build-isolation -e ./airoboros from a cloned repository.
  • Prerequisites: Python and an OpenAI API key (for data generation). Flash attention may require a PyTorch nightly build (pip install -U --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu118).
  • Setup: Configuration is managed via YAML files for instruction and topic generation.
  • Docs: https://github.com/jondurbin/airoboros

Highlighted Details

  • Generates data through either OpenAI API endpoint: /v1/completions or /v1/chat/completions.
  • Features an ensemble of "experts" (LoRA adapters) dynamically loaded based on request similarity.
  • Offers multiple routing mechanisms: FAISS similarity search (default), or an agent-based router.
  • Includes scripts for segmenting data and fine-tuning expert adapters (requires a fork of qlora).
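Similarity-based routing to an expert can be illustrated with a small sketch. It is not the project's implementation: a brute-force cosine search stands in for FAISS, the expert names and three-dimensional vectors in `EXPERT_SAMPLES` are made up, and the scoring rule (mean of the top `k` sample similarities) is an assumption. Loading the chosen LoRA adapter is left out.

```python
import math

# Hypothetical experts, each with toy 3-d "embeddings" of its training samples.
EXPERT_SAMPLES = {
    "code": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "reasoning": [[0.1, 0.9, 0.1], [0.2, 0.8, 0.0]],
    "creative": [[0.0, 0.1, 0.9], [0.1, 0.2, 0.8]],
}


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(query_vec, k=2):
    """Return the expert whose k most similar samples score highest on average."""
    best_name, best_score = None, -1.0
    for name, samples in EXPERT_SAMPLES.items():
        top = sorted((cosine(query_vec, s) for s in samples), reverse=True)[:k]
        score = sum(top) / len(top)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


if __name__ == "__main__":
    # A query vector close to the "code" samples routes to the "code" expert.
    print(route([1.0, 0.0, 0.1]))
```

A real router would embed the incoming request with the same model used for the training samples and would search an index rather than scanning every sample.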

Maintenance & Community

  • Project is sponsored by a16z.
  • Author: jondurbin.
  • Support links: bmc.link/jondurbin, ETH/BTC donation addresses provided.

Licensing & Compatibility

  • The README does not explicitly state a license. Some models are marked "research use only", and data generation depends on the OpenAI API, so commercial use or closed-source integration may be restricted and should be reviewed carefully.

Limitations & Caveats

The project relies heavily on OpenAI's API for its core data generation functionality, which may not suit fully open-source or air-gapped environments. Some components, such as the vllm server, are noted as giving "not quite as good" results. The "research use only" designation on certain models and datasets needs careful consideration before commercial use.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull requests (30d): 0
  • Issues (30d): 0
  • Star history: 12 stars in the last 90 days
