jondurbin — Fine-tuning pipeline for language models, "with everything."
This repository provides a comprehensive dataset and fine-tuning scripts for creating instruction-following language models. It targets researchers and developers aiming to build highly capable LLMs by leveraging a diverse collection of supervised fine-tuning (SFT) and direct preference optimization (DPO) data, along with flexible prompting strategies.
How It Works
Bagel constructs a composite dataset by integrating numerous SFT and DPO sources, including instruction-following, coding, reasoning, and role-playing data. It employs a deduplication strategy based on UUID v5 of instructions to prioritize higher-confidence data sources. The project uniquely supports four distinct prompt formats (Vicuna, Llama-2, Alpaca, ChatML) for each data point, aiming to enhance model generalization across different conversational styles.
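The UUID-v5 deduplication described above can be sketched as follows. This is an illustrative reconstruction, not bagel's actual code: the helper name `dedupe_instructions`, the record layout, and the choice of `NAMESPACE_DNS` are assumptions; the key idea is that each instruction hashes to a deterministic UUID, and processing higher-confidence sources first means their copy of a duplicate instruction is the one kept.

```python
import uuid

def dedupe_instructions(items):
    """Keep the first occurrence of each instruction, identified by its UUID v5.

    `items` is assumed to be a list of dicts with an "instruction" key,
    ordered so higher-confidence sources come first (a hypothetical layout).
    """
    seen = set()
    result = []
    for item in items:
        # UUID v5 is deterministic: the same instruction text always maps
        # to the same ID, so duplicates across sources collide here.
        key = uuid.uuid5(uuid.NAMESPACE_DNS, item["instruction"])
        if key not in seen:
            seen.add(key)
            result.append(item)
    return result

data = [
    {"instruction": "Explain DPO.", "source": "high-confidence"},
    {"instruction": "Explain DPO.", "source": "low-confidence"},
    {"instruction": "Write a SQL query.", "source": "low-confidence"},
]
deduped = dedupe_instructions(data)
print(len(deduped))  # 2 — the low-confidence duplicate was dropped
```

Because the winner of each collision is simply whichever copy arrives first, source priority is expressed entirely by input ordering rather than by explicit scoring.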
Quick Start & Requirements
- Generate the dataset with python -m bagel.data.
- Dependencies: accelerate, deepspeed, wandb, flash-attention-2. Requires significant disk space for the datasets.
- Train via accelerate launch with the provided example scripts for the SFT and DPO phases.
- Uses bf16 (16-bit precision) and deepspeed for distributed training, indicating a need for multi-GPU setups.
Highlighted Details
- Data sources include ai2_arc, evol-instruct, glaive-function-calling-v2, sql-create-context, and toxic-dpo.
- Uses deepspeed for efficient distributed training and flash-attention-2 for performance.
Maintenance & Community
The project is maintained by jondurbin. No specific community channels or roadmap links are provided in the README.
Licensing & Compatibility
The README does not explicitly state a license. The dataset sources themselves may have varying licenses. Compatibility for commercial use or closed-source linking is not specified.
Limitations & Caveats
The project is presented as a personal endeavor with example scripts that may require further testing and adaptation. The "toxic-dpo" dataset contains "highly toxic and potentially illegal content" for academic and lawful purposes only. The multi-prompt strategy effectively quadruples the training data size per epoch.
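The quadrupling effect of the multi-prompt strategy can be sketched as below. The templates are simplified illustrations of the four formats named above (Vicuna, Llama-2, Alpaca, ChatML), not bagel's exact prompt strings, and the helper `expand_formats` is a hypothetical name: the point is that each instruction/response pair yields one training example per format.

```python
# Simplified stand-ins for the four prompt formats; real templates differ.
TEMPLATES = {
    "vicuna": "USER: {instruction}\nASSISTANT: {response}",
    "llama-2": "[INST] {instruction} [/INST] {response}",
    "alpaca": "### Instruction:\n{instruction}\n\n### Response:\n{response}",
    "chatml": (
        "<|im_start|>user\n{instruction}<|im_end|>\n"
        "<|im_start|>assistant\n{response}<|im_end|>"
    ),
}

def expand_formats(instruction, response):
    """Render one (instruction, response) pair in every prompt format,
    producing four training examples from a single data point."""
    return [
        tmpl.format(instruction=instruction, response=response)
        for tmpl in TEMPLATES.values()
    ]

examples = expand_formats("What is SFT?", "Supervised fine-tuning.")
print(len(examples))  # 4 — one example per format
```

Since every data point is rendered four ways, one pass over the expanded set is effectively four epochs' worth of tokens over the underlying data, which is worth keeping in mind when budgeting training steps.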