Synthetic data pipeline for LLM alignment (ICLR 2025 paper)
Magpie is an open-source pipeline for synthesizing high-quality alignment data for Large Language Models (LLMs) from scratch. It targets researchers and developers aiming to democratize AI by providing transparent and scalable methods for creating instruction-following and preference datasets, reducing reliance on costly human annotation or proprietary data.
How It Works
Magpie leverages the auto-regressive nature of aligned LLMs to generate both user queries and model responses by simply providing the LLM's pre-query template. This "prompting with nothing" approach bypasses the need for manual prompt engineering or seed questions, enabling efficient, large-scale data generation. The pipeline includes steps for data filtering, tagging, and conversion to standard formats like ShareGPT.
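The core idea can be sketched in a few lines. This is a hedged illustration, not the repository's actual code (the real pipeline lives in the project's scripts): the helper names below are hypothetical, but the template tokens are the standard Llama-3-Instruct special tokens. Because an aligned model was fine-tuned on this chat format, giving it only the prefix up to the user header makes it auto-regressively "fill in" a plausible user query, which is then fed back to obtain a response.

```python
# Hypothetical sketch of Magpie-style "prompting with nothing".
# The pre-query template ends right after the user header, so the model's
# continuation IS the synthetic user instruction.

def build_pre_query_template(system_prompt: str = "") -> str:
    """Llama-3-Instruct chat format up to (and including) the user header."""
    tpl = "<|begin_of_text|>"
    if system_prompt:
        tpl += ("<|start_header_id|>system<|end_header_id|>\n\n"
                + system_prompt + "<|eot_id|>")
    tpl += "<|start_header_id|>user<|end_header_id|>\n\n"
    return tpl

def extract_query(completion: str, stop_token: str = "<|eot_id|>") -> str:
    """Truncate the model's continuation at the end-of-turn token to
    recover the synthetic user instruction."""
    return completion.split(stop_token)[0].strip()
```

In a full pipeline, `build_pre_query_template()` would be passed to the LLM's generate call; `extract_query` is then applied to the raw completion before the query is sent back to the same model for a response.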
Quick Start & Requirements
Create a conda environment (conda create -n magpie python=3.10), activate it (conda activate magpie), and install the requirements (pip install -r requirements.txt). Log in to Hugging Face (huggingface-cli login). The pipeline requires Python 3.10 and GPUs (an RTX 4090 24G was tested for 8B models; larger models may require multiple A100s). Try demo.ipynb for a walkthrough, or run bash scripts/magpie.sh for single-turn data generation.
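The pipeline's final step converts generated pairs into standard formats such as ShareGPT. A minimal sketch of that conversion, assuming the common ShareGPT schema (the function name and id scheme here are illustrative, not the repository's own):

```python
# Hedged sketch: wrap a synthesized (instruction, response) pair in the
# widely used ShareGPT conversation schema.
def to_sharegpt(instruction: str, response: str, conv_id: str) -> dict:
    return {
        "id": conv_id,
        "conversations": [
            {"from": "human", "value": instruction},
            {"from": "gpt", "value": response},
        ],
    }
```

Multi-turn data extends the `conversations` list with alternating human/gpt entries.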
Maintenance & Community
The project is associated with the ICLR 2025 paper "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing." Recent updates include new datasets for Llama-3.3, Qwen2.5, and reasoning tasks, as well as the release of MagpieLM-Chat models.
Licensing & Compatibility
The repository's README does not state a license. The project aims to democratize AI and builds on Llama-3 models, which carry their own specific licenses; commercial use or closed-source linking would require clarifying the repository's license terms.
Limitations & Caveats
While Magpie demonstrates strong performance, some models (e.g., Gemma-1.1, Mistral, Yi) are marked as requiring additional logit processors or filters for optimal results, indicating that extra quality tuning may be needed for those backbones. The README does not detail hardware requirements for every supported model, and generation with larger models may require significant GPU resources.
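To make the logit-processor caveat concrete, here is a hedged, framework-free sketch of what such a processor does (the function is illustrative, not the project's implementation): at each decoding step it masks a set of banned token ids, e.g. special tokens that some models would otherwise emit inside a synthetic query.

```python
# Hypothetical minimal logits processor: assign -inf to banned token ids
# so they can never be sampled during query generation.
def ban_tokens(logits: list[float], banned_ids: set[int]) -> list[float]:
    return [float("-inf") if i in banned_ids else x
            for i, x in enumerate(logits)]
```

A real pipeline would apply this per decoding step, typically via the inference framework's logits-processor hook.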