bagel by jondurbin

Fine-tuning pipeline for language models, "with everything."

created 1 year ago
323 stars

Top 85.3% on sourcepulse

Project Summary

This repository provides a comprehensive dataset and fine-tuning scripts for creating instruction-following language models. It targets researchers and developers aiming to build highly capable LLMs by leveraging a diverse collection of supervised fine-tuning (SFT) and direct preference optimization (DPO) data, along with flexible prompting strategies.

How It Works

Bagel constructs a composite dataset by merging numerous SFT and DPO sources, spanning instruction-following, coding, reasoning, and role-playing data. Duplicates are removed by computing a UUID v5 of each instruction, with the copy from the higher-confidence source taking precedence. Each data point is rendered in four distinct prompt formats (Vicuna, Llama-2, Alpaca, ChatML), with the aim of improving model generalization across different conversational styles.
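The UUID v5 deduplication described above can be sketched roughly as follows. This is a minimal illustration, not bagel's actual code: the namespace constant and record fields are assumptions, and it simply assumes records arrive ordered from highest- to lowest-confidence source so the first copy seen wins.

```python
import uuid

# Hypothetical namespace; bagel's actual namespace UUID may differ.
NAMESPACE = uuid.NAMESPACE_DNS

def dedupe(records):
    """Keep one record per instruction, by UUID v5 of the instruction text.

    `records` is assumed to be ordered from highest- to lowest-confidence
    source, so the earlier (more trusted) copy is the one retained.
    """
    seen = set()
    unique = []
    for record in records:
        key = uuid.uuid5(NAMESPACE, record["instruction"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"instruction": "Explain UUIDs.", "source": "high-confidence"},
    {"instruction": "Explain UUIDs.", "source": "low-confidence"},
    {"instruction": "Write a haiku.", "source": "low-confidence"},
]
print([r["source"] for r in dedupe(records)])
```

Hashing the instruction rather than the full record means the same prompt paired with different responses still counts as a duplicate, which is what lets a higher-confidence answer shadow a lower-confidence one.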

Quick Start & Requirements

  • Dataset Preparation: python -m bagel.data
  • Prerequisites: Python, accelerate, deepspeed, wandb, flash-attention-2. Requires significant disk space for datasets.
  • Fine-tuning: Uses accelerate launch with provided example scripts for SFT and DPO phases.
  • Resources: Training examples use bf16 (bfloat16) precision and deepspeed for distributed training, indicating a need for a multi-GPU setup.
  • Links: Dataset preparation, SFT example, DPO example.

Highlighted Details

  • Supports a wide array of 30+ SFT and DPO datasets, including specialized ones like ai2_arc, evol-instruct, glaive-function-calling-v2, sql-create-context, and toxic-dpo.
  • Implements a multi-prompt formatting strategy, converting each instruction into Vicuna, Llama-2, Alpaca, and ChatML formats.
  • Includes decontamination steps using cosine similarity and approximate nearest neighbor search to mitigate benchmark contamination.
  • Fine-tuning scripts leverage deepspeed for efficient distributed training and flash-attention-2 for performance.
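The multi-prompt formatting strategy above can be illustrated with a minimal converter. The templates below are simplified approximations of each format's convention, not bagel's exact templates (Llama-2 in particular also uses system-prompt markup omitted here):

```python
def format_prompt(style, instruction, response):
    """Render one instruction/response pair in a given chat format.

    Templates are simplified sketches of each convention.
    """
    if style == "vicuna":
        return f"USER: {instruction}\nASSISTANT: {response}"
    if style == "llama-2":
        return f"[INST] {instruction} [/INST] {response}"
    if style == "alpaca":
        return ("Below is an instruction that describes a task. "
                "Write a response that appropriately completes the request.\n\n"
                f"### Instruction:\n{instruction}\n\n### Response:\n{response}")
    if style == "chatml":
        return (f"<|im_start|>user\n{instruction}<|im_end|>\n"
                f"<|im_start|>assistant\n{response}<|im_end|>")
    raise ValueError(f"unknown style: {style}")

# Emitting every example once per format is what quadruples
# the effective training data per epoch.
for style in ("vicuna", "llama-2", "alpaca", "chatml"):
    print(format_prompt(style, "What is 2 + 2?", "4"))
    print("---")
```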

Maintenance & Community

The project is maintained by jondurbin. No specific community channels or roadmap links are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. The dataset sources themselves may have varying licenses. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a personal endeavor with example scripts that may require further testing and adaptation. The "toxic-dpo" dataset contains "highly toxic and potentially illegal content" for academic and lawful purposes only. The multi-prompt strategy effectively quadruples the training data size per epoch.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days
