Fine-tuning pipeline for language models, "with everything."
This repository provides a comprehensive dataset and fine-tuning scripts for creating instruction-following language models. It targets researchers and developers aiming to build highly capable LLMs by leveraging a diverse collection of supervised fine-tuning (SFT) and direct preference optimization (DPO) data, along with flexible prompting strategies.
How It Works
Bagel constructs a composite dataset by integrating numerous SFT and DPO sources, including instruction-following, coding, reasoning, and role-playing data. It deduplicates by computing a UUID v5 of each instruction's text, so that when the same instruction appears in multiple sources, the copy from the higher-confidence source is kept. The project uniquely supports four distinct prompt formats (Vicuna, Llama-2, Alpaca, ChatML) for each data point, aiming to enhance model generalization across different conversational styles.
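As an illustrative sketch of that deduplication scheme (the helper and namespace below are assumptions for illustration, not bagel's actual code): sources are walked in descending confidence order, and any instruction whose UUID v5 has already been seen is skipped.

```python
import uuid

# Hypothetical namespace; bagel's actual UUID scheme may differ.
NAMESPACE = uuid.NAMESPACE_DNS

def dedupe_by_instruction(sources):
    """Merge datasets, keeping the first copy of each instruction.

    `sources` is an iterable of (name, samples) pairs ordered from
    highest- to lowest-confidence, where each sample is a dict with
    an "instruction" key. Duplicates are detected via a deterministic
    UUID v5 of the instruction text.
    """
    seen = set()
    merged = []
    for name, samples in sources:
        for sample in samples:
            key = uuid.uuid5(NAMESPACE, sample["instruction"])
            if key in seen:
                continue  # already covered by a higher-confidence source
            seen.add(key)
            merged.append({**sample, "source": name})
    return merged

# Usage: list the highest-confidence source first so its copies win collisions.
data = dedupe_by_instruction([
    ("curated", [{"instruction": "Explain UUID v5.", "response": "..."}]),
    ("scraped", [{"instruction": "Explain UUID v5.", "response": "..."}]),
])
assert len(data) == 1
```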
Quick Start & Requirements
Generate the composite dataset with python -m bagel.data. Dependencies include accelerate, deepspeed, wandb, and flash-attention-2, and significant disk space is required for the datasets. Training is run with accelerate launch using the provided example scripts for the SFT and DPO phases; the scripts use bf16 (16-bit precision) and deepspeed for distributed training, indicating a need for multi-GPU setups.
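The example scripts themselves are not reproduced in the README; as a rough sketch of what a bf16 + deepspeed training configuration commonly looks like with Hugging Face transformers (the paths and values here are placeholders, not bagel's actual settings):

```python
from transformers import TrainingArguments

# Rough illustration only; bagel's example scripts define their own arguments.
args = TrainingArguments(
    output_dir="out/sft",            # placeholder output path
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    bf16=True,                       # 16-bit brain-float precision
    deepspeed="ds_config.json",      # placeholder deepspeed config path
)
```

A script built on arguments like these is then launched across GPUs, e.g. via accelerate launch.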
Highlighted Details
Data sources include ai2_arc, evol-instruct, glaive-function-calling-v2, sql-create-context, and toxic-dpo. The pipeline uses deepspeed for efficient distributed training and flash-attention-2 for performance.
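For context, flash-attention-2 is typically enabled at model load time; a minimal sketch with Hugging Face transformers, using a placeholder model id (not necessarily how bagel's scripts wire it up):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "some/base-model",                        # placeholder model id
    torch_dtype=torch.bfloat16,               # flash-attention-2 needs fp16/bf16
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```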
Maintenance & Community
The project is maintained by jondurbin. No specific community channels or roadmap links are provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. The dataset sources themselves may have varying licenses. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is presented as a personal endeavor, and the example scripts may require further testing and adaptation. The "toxic-dpo" dataset contains "highly toxic and potentially illegal content" and is intended for academic and lawful purposes only. The multi-prompt strategy effectively quadruples the training data size per epoch.
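To make that quadrupling concrete, here is a simplified sketch of one instruction/response pair rendered in all four formats; the templates below are the commonly cited shapes and may differ in detail (system prompts, whitespace) from bagel's exact implementation:

```python
def render_all_formats(instruction, response):
    """Render one SFT pair in four prompt formats (simplified templates)."""
    return {
        "vicuna": f"USER: {instruction}\nASSISTANT: {response}",
        "llama2": f"[INST] {instruction} [/INST] {response}",
        "alpaca": (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
        ),
        "chatml": (
            f"<|im_start|>user\n{instruction}<|im_end|>\n"
            f"<|im_start|>assistant\n{response}<|im_end|>"
        ),
    }

# One source sample becomes four training samples, hence ~4x data per epoch.
samples = list(render_all_formats("Name a palindrome.", "Level.").values())
assert len(samples) == 4
```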