bagel by jondurbin

Fine-tuning pipeline for language models, "with everything."

created 1 year ago
323 stars

Top 85.3% on sourcepulse

Project Summary

This repository provides a comprehensive dataset and fine-tuning scripts for creating instruction-following language models. It targets researchers and developers aiming to build highly capable LLMs by leveraging a diverse collection of supervised fine-tuning (SFT) and direct preference optimization (DPO) data, along with flexible prompting strategies.

How It Works

Bagel constructs a composite dataset by merging numerous SFT and DPO sources, spanning instruction-following, coding, reasoning, and role-playing data. Duplicates are removed by computing a UUID v5 of each instruction, with the copy from the higher-confidence source taking precedence. Each data point is rendered in four distinct prompt formats (Vicuna, Llama-2, Alpaca, ChatML), with the aim of improving model generalization across different conversational styles.
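The UUID v5 deduplication described above can be sketched roughly as follows. This is a minimal illustration, not bagel's actual code: the namespace constant and record fields are assumptions, and it simply assumes records arrive ordered from highest- to lowest-confidence source so the first copy seen wins.

```python
import uuid

# Hypothetical namespace; bagel's actual namespace UUID may differ.
NAMESPACE = uuid.NAMESPACE_DNS

def dedupe(records):
    """Keep one record per instruction, by UUID v5 of the instruction text.

    `records` is assumed to be ordered from highest- to lowest-confidence
    source, so the earlier (more trusted) copy is the one retained.
    """
    seen = set()
    unique = []
    for record in records:
        key = uuid.uuid5(NAMESPACE, record["instruction"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique

records = [
    {"instruction": "Explain UUIDs.", "source": "high-confidence"},
    {"instruction": "Explain UUIDs.", "source": "low-confidence"},
    {"instruction": "Write a haiku.", "source": "low-confidence"},
]
print([r["source"] for r in dedupe(records)])
```

Hashing the instruction rather than the full record means the same prompt paired with different responses still counts as a duplicate, which is what lets a higher-confidence answer shadow a lower-confidence one.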

Quick Start & Requirements

  • Dataset Preparation: python -m bagel.data
  • Prerequisites: Python, accelerate, deepspeed, wandb, flash-attention-2. Requires significant disk space for datasets.
  • Fine-tuning: Uses accelerate launch with provided example scripts for SFT and DPO phases.
  • Resources: Training examples use bf16 (bfloat16) precision and deepspeed for distributed training, indicating a need for a multi-GPU setup.
  • Links: Dataset preparation, SFT example, DPO example.

Highlighted Details

  • Supports a wide array of 30+ SFT and DPO datasets, including specialized ones like ai2_arc, evol-instruct, glaive-function-calling-v2, sql-create-context, and toxic-dpo.
  • Implements a multi-prompt formatting strategy, converting each instruction into Vicuna, Llama-2, Alpaca, and ChatML formats.
  • Includes decontamination steps using cosine similarity and approximate nearest neighbor search to mitigate benchmark contamination.
  • Fine-tuning scripts leverage deepspeed for efficient distributed training and flash-attention-2 for performance.
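The multi-prompt formatting strategy above can be illustrated with a minimal converter. The templates below are simplified approximations of each format's convention, not bagel's exact templates (Llama-2 in particular also uses system-prompt markup omitted here):

```python
def format_prompt(style, instruction, response):
    """Render one instruction/response pair in a given chat format.

    Templates are simplified sketches of each convention.
    """
    if style == "vicuna":
        return f"USER: {instruction}\nASSISTANT: {response}"
    if style == "llama-2":
        return f"[INST] {instruction} [/INST] {response}"
    if style == "alpaca":
        return ("Below is an instruction that describes a task. "
                "Write a response that appropriately completes the request.\n\n"
                f"### Instruction:\n{instruction}\n\n### Response:\n{response}")
    if style == "chatml":
        return (f"<|im_start|>user\n{instruction}<|im_end|>\n"
                f"<|im_start|>assistant\n{response}<|im_end|>")
    raise ValueError(f"unknown style: {style}")

# Emitting every example once per format is what quadruples
# the effective training data per epoch.
for style in ("vicuna", "llama-2", "alpaca", "chatml"):
    print(format_prompt(style, "What is 2 + 2?", "4"))
    print("---")
```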

Maintenance & Community

The project is maintained by jondurbin. No specific community channels or roadmap links are provided in the README.

Licensing & Compatibility

The README does not explicitly state a license. The dataset sources themselves may have varying licenses. Compatibility for commercial use or closed-source linking is not specified.

Limitations & Caveats

The project is presented as a personal endeavor with example scripts that may require further testing and adaptation. The "toxic-dpo" dataset contains "highly toxic and potentially illegal content" for academic and lawful purposes only. The multi-prompt strategy effectively quadruples the training data size per epoch.

Health Check

  • Last commit: 1 year ago
  • Responsiveness: 1 day
  • Pull Requests (30d): 0
  • Issues (30d): 0
  • Star History: 4 stars in the last 90 days
